AMD: R9xx Speculation

Major architectural changes reach the market several years after the designers predicted which changes would pay off in the final design.
If the chip design pipeline waits until a feature demonstrably pays off, the chip won't come out for another two years or so.
The choice is to pick possible winners as best you can, well before you know the answer, or to guarantee you are late to the party.

This is a popular line around here, but I think it's largely a fallacy, and furthermore not really applicable to internal implementation details such as the Vec5 vs. Vec4 or balancing of ALU vs. TMUs vs. ROPs.
As you say, the chip design pipeline is at least 2-3 years when it comes to something like Cayman. (Anand gave a year of design that seemed reasonable.) Its new architectural features have been known internally for a long time, down to the nittiest of gritty details. So when the product is available for consumer purchase, unless the driver team has been completely asleep at the wheel, the drivers should pretty much exploit the new design. It's running DX11 code. It won't get more efficient with time.

And when we are talking about externally accessible features (which is not really the case with Cayman vs Barts or 58xx, since they are all accessed through DX11), we have to remember that we are not talking about an open market, but what is effectively a duopoly. There is no way that software is going to require new hardware features before they are available. And from availability in hardware to widespread use in software there is a significant lag. A few sponsored titles may make use of a new feature, but developers need to sell their games to a market, and if the installed base isn't there, a new feature will be either completely ignored or possibly optional. Or put another way, if a feature doesn't have the blessing of Microsoft and the backing of both IHVs, it's effectively dead in the water, and even with full support it won't see widespread use until there is a substantial installed base. Did lack of TruForm hurt nVidia much? Did ATI's lack of SM3 ever matter outside forum fanboi bickering?

Pretty much by definition, there is no way either ATI or nVidia can be surprised by, and therefore too late to market with an essential feature.


So no, when I look at Cayman, I don't see a design that is ahead of its time, I see a design that, for reasons that aren't obvious, didn't quite hit the mark. There is no substantial benefit to be seen; on the contrary, it even seems to lag behind Barts in efficiency. But then again, other aspects of it, such as the new power management and the new AA mode, are welcome additions over Barts. It would be a gross exaggeration to call the design a failure - it simply hasn't quite lived up to what could be expected of AMD's reengineering effort.
 
This is a popular line around here, but I think it's largely a fallacy, and furthermore not really applicable to internal implementation details such as the Vec5 vs. Vec4 or balancing of ALU vs. TMUs vs. ROPs.
Those tradeoffs are informed by simulation of workloads, some coming from current workloads and some based on projections of future ones.
I do not see how relying on projections of future workloads can be free of the risk of estimating wrong.
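To make that risk concrete, here is a minimal sketch (with made-up bundle-width distributions, not AMD's actual simulation data) of how a projected mix of independent ops per issue group translates into average slot utilization for a 5-wide versus a 4-wide ALU, and how a wrong projection moves the answer:

```python
# Hypothetical distributions: probability that the compiler finds n independent
# scalar ops to co-issue. The real numbers would come from workload simulation.
projected = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.25, 5: 0.15}   # design-time guess
actual    = {1: 0.15, 2: 0.30, 3: 0.30, 4: 0.20, 5: 0.05}   # what shipping games do

def avg_utilization(dist, width):
    """Average fraction of ALU slots doing useful work per issue cycle."""
    useful = sum(p * n for n, p in dist.items())
    # Groups wider than the machine spill into extra issue cycles: ceil(n / width).
    cycles = sum(p * -(-n // width) for n, p in dist.items())
    return useful / (cycles * width)

for width in (5, 4):
    print(f"VLIW{width}: projected {avg_utilization(projected, width):.0%}, "
          f"actual {avg_utilization(actual, width):.0%}")
```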

As you say, the chip design pipeline is at least 2-3 years when it comes to something like Cayman. (Anand gave a year of design that seemed reasonable.) Its new architectural features have been known internally for a long time, down to the nittiest of gritty details. So when the product is available for consumer purchase, unless the driver team has been completely asleep at the wheel, the drivers should pretty much exploit the new design. It's running DX11 code. It won't get more efficient with time.
This is patently untrue, as the never-ending stream of driver updates, still-unused instructions in the ISA, and notable examples of lousy code generation on the compute side indicate.
Just how exactly does the driver team evaluate their drivers in the wild without hardware out there to run them on?

Also, how do they predict what their competitors will achieve in the same time frame? It certainly seems that AMD is fighting a rearguard action in the press when it comes to geometry throughput.

Pretty much by definition, there is no way either ATI or nVidia can be surprised by, and therefore too late to market with an essential feature.
There is the question of whether a design supports a feature, and the question of whether it is well implemented or competitive.

So no, when I look at Cayman, I don't see a design that is ahead of its time, I see a design that, for reasons that aren't obvious, didn't quite hit the mark.
What is the mark you speak of?
The state of the market, process technology, and competition that was not known when the major changes were planned?
 
Pretty much by definition, there is no way either ATI or nVidia can be surprised by, and therefore too late to market with an essential feature.

Proof against: R600 was designed to be full-speed FP16 throughout. A decision, AMD later admitted, which came way too early for real-world workloads. And that despite them having designed the Xbox 360 chip themselves, so they could have gotten early hints at what formats would be used by games ported from the console to PC.
 
Those tradeoffs are informed by simulation of workloads, some coming from current workloads and some based on projections of future ones.
I do not see how relying on projections of future workloads can be free of the risk of estimating wrong.

I thought that was what I said - that the redesign (for whatever reason) didn't quite hit the mark.

This is patently untrue, as the never-ending stream of driver updates, still-unused instructions in the ISA, and notable examples of lousy code generation on the compute side indicate.
Just how exactly does the driver team evaluate their drivers in the wild without hardware out there to run them on?
The never-ending stream of driver updates typically addresses specific game issues or bugs, not making Direct3D in general run faster.
[I don't want to get into a GPGPU relevance discussion. It is orthogonal to gaming (but obviously not to efficiency, if GPGPU features come at the cost of gates/size/power).]
The driver team needs to translate DX API calls to GPU code, which is a relatively limited task. They know how Cayman differs from Cypress. They have had a couple of years or so. I believe them to be competent enough to have done their job. Of course there will be issues with certain applications that they have not been able to catch, but the basic interpretation of the DX11 API should be a done and well-functioning deal.


Also, how do they predict what their competitors will achieve in the same time frame? It certainly seems that AMD is fighting a rearguard action in the press when it comes to geometry throughput.
The press isn't the market, nor does it represent game developers. The press needs controversy to drive interest, to drive page-hits, to drive advertising revenue. Do results in Unigine Heaven "extreme" really have any predictive value as to how a game you pick from the shelf at GameStop will perform? Of course not.

There is the question of whether a design supports a feature, and the question of whether it is well implemented or competitive.
True.


What is the mark you speak of? The state of the market, process technology, and competition that was not known when the major changes were planned?

No, the mark I speak of is the state of the market, process technology, and competition as it actually is today, when the card is introduced to compete for consumer favor. As you implied yourself in your first paragraph, and as I indicated I felt myself, it seems as if they aimed for something sufficiently different from the reality we got that people end up disappointed. Personally I'm disappointed with the lack of advances in either performance/die area or performance/W over Barts. I had simply hoped for more in those areas, particularly since the PR surrounding Vec4 indicated that it should, on average, be more efficient than its predecessor.
 
Proof against: R600 was designed to be full-speed FP16 throughout. A decision, AMD later admitted, which came way too early for real-world workloads. And that despite them having designed the Xbox 360 chip themselves, so they could have gotten early hints at what formats would be used by games ported from the console to PC.
But as you say, that is an example of a feature being too early, not too late. :)
Too early, or simply irrelevant, is very different from an essential feature being too late.
 
CarstenS: I read an explanation somewhere that many DX10 requirements were withdrawn during its development, and standardized texture-filtering quality was one of them. I have no idea if this story is real, but it would explain the significant changes to the texture-filtering hardware on both sides.
 
The never-ending stream of driver updates typically addresses specific game issues or bugs, not making Direct3D in general run faster.
[I don't want to get into a GPGPU relevance discussion. It is orthogonal to gaming (but obviously not to efficiency, if GPGPU features come at the cost of gates/size/power).]
The driver team needs to translate DX API calls to GPU code, which is a relatively limited task. They know how Cayman differs from Cypress. They have had a couple of years or so. I believe them to be competent enough to have done their job. Of course there will be issues with certain applications that they have not been able to catch, but the basic interpretation of the DX11 API should be a done and well-functioning deal.

Given how often driver release notes claim a "10% increase in average fps for such and such application", I think general speed-ups for *specific* titles are indeed part of the constant driver optimization process. And since the performance across multiple titles is highly inconsistent, with some games getting near 580 levels and others barely being faster than the 5870, there's probably room to raise scores on the lower end of things, which would shift the average scores to a much more comfortable margin over the 570. Some of those issues are probably hardware bottlenecks, but it's unlikely that all of them are, especially since some of the low scores are occurring in areas where we should expect a speed-up.

Converting API calls directly into GPU code without taking into account the context in which those API calls are being used is never going to get you decent optimization. And having to optimize in context is a *lot* more complicated.
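A toy sketch of that point, with entirely made-up ops rather than any real ISA: a packer that only sees a linear stream and conservatively assumes every op depends on the previous one ends up issuing one op per bundle, while one that knows the actual dependences can fill the slots:

```python
# Toy instruction stream: (text, inputs, output). Entirely made up, not a real ISA.
ops = [
    ("mul r0, a, b",   {"a", "b"},   "r0"),
    ("mul r1, c, d",   {"c", "d"},   "r1"),
    ("add r2, r0, r1", {"r0", "r1"}, "r2"),
    ("mul r3, e, f",   {"e", "f"},   "r3"),
    ("add r4, r2, r3", {"r2", "r3"}, "r4"),
]
LIVE_INPUTS = {"a", "b", "c", "d", "e", "f"}   # values known before the shader runs

def pack(ops, width, know_deps):
    """Greedy packing into bundles of at most `width` slots."""
    bundles, ready = [], set(LIVE_INPUTS)
    remaining = list(ops)
    while remaining:
        bundle, produced = [], set()
        for op in list(remaining):
            text, inputs, out = op
            # Without dependence info we must assume each op needs the previous
            # one's result, so only the first op of a bundle is safe to co-issue.
            ok = inputs <= ready if know_deps else not bundle
            if ok and len(bundle) < width:
                bundle.append(text)
                produced.add(out)
                remaining.remove(op)
        ready |= produced
        bundles.append(bundle)
    return bundles

for label, know in (("context-free", False), ("dependence-aware", True)):
    result = pack(ops, width=4, know_deps=know)
    print(f"{label}: {len(result)} bundles")
```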

But as you say, that is an example of a feature being too early, not too late. :)
Too early, or simply irrelevant, is very different from an essential feature being too late.

I would simply point you at the case of the NV30. It *can* definitely happen. Usually it's because something went *very* wrong, but it can happen.
 
But as you say, that is an example of a feature being too early, not too late. :)
Too early, or simply irrelevant, is very different from an essential feature being too late.

Irrelevance is an after-the-fact determination.
As an example, R600's TMUs could fetch four filtered texels and four point samples.
If things had gone the way ATI was hoping, it wouldn't have quietly dropped the capability in later designs.
 
For me the most inconsistent result is the Civ5 texture decompression benchmark, as shown here for example:
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/23
The HD 6970 trails the HD 5870 (very significantly), while the HD 6950 is actually even slightly slower than the HD 6870...
This is a compute workload, and VLIW4 is obviously a big fail here. I'd certainly hope this is due to drivers...
I'm unsure, though, what to expect generally from driver updates. The non-scaling of the SIMDs could mean that even if a shader gets compiled to fewer VLIW4 instructions, it might ultimately not really matter, as the bottlenecks are elsewhere anyway. Some of the tessellation demos, though, don't quite seem to show the gains they should, so I guess driver updates could increase performance there.
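A quick back-of-the-envelope for when fewer VLIW4 instructions can't help: compare the kernel's arithmetic intensity against the machine balance. The peak numbers below are ballpark HD 6970 figures from the public specs; the kernel's bytes and flops per item are purely made up for illustration, not Civ5's actual shader.

```python
# Ballpark HD 6970 peaks (public specs, approximate).
peak_flops = 2.7e12   # ~2.7 TFLOPS single precision (1536 ALUs * 880 MHz * 2)
peak_bw    = 176e9    # ~176 GB/s (256-bit * 5.5 Gbps)

# Hypothetical kernel: work done per element.
bytes_per_item = 64.0
flops_per_item = 40.0

intensity = flops_per_item / bytes_per_item   # flops per byte the kernel offers
balance   = peak_flops / peak_bw              # flops per byte needed to keep the ALUs busy

verdict = ("bandwidth-bound: tighter shader packing changes little"
           if intensity < balance else
           "ALU-bound: better VLIW packing should show up")
print(f"intensity {intensity:.2f} flop/B vs balance {balance:.1f} flop/B -> {verdict}")
```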
 
I wonder if they would still be using VLIW5 if DX10 had adopted ATI's former version of tessellation. That would amplify the triangles before the vertex shader stage, presumably leading to better utilization of the 5 slots. Does anyone know if DX11 tessellation stages (domain, hull) use all five slots?
 
This is a popular line around here, but I think it's largely a fallacy, and furthermore not really applicable to internal implementation details such as the Vec5 vs. Vec4 or balancing of ALU vs. TMUs vs. ROPs.
As you say, the chip design pipeline is at least 2-3 years when it comes to something like Cayman. (Anand gave a year of design that seemed reasonable.) Its new architectural features have been known internally for a long time, down to the nittiest of gritty details. So when the product is available for consumer purchase, unless the driver team has been completely asleep at the wheel, the drivers should pretty much exploit the new design. It's running DX11 code. It won't get more efficient with time.

To add to the untruth, Anandtech wrote that AMD found that, on average, their shaders were only using 3.4 of the 5 slots.

AMD could not have known this until after a few years of seeing where development had gone, so they most likely learned it early in Cypress's lifetime.
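For what that figure implies, a quick back-of-the-envelope (assuming, purely hypothetically, that the same packed ops would occupy a 4-wide unit at a similar absolute rate):

```python
avg_slots_used = 3.4                  # Anandtech's quoted average for VLIW5
util_vliw5 = avg_slots_used / 5       # ~68% of a 5-wide unit doing useful work
util_vliw4 = avg_slots_used / 4       # ~85% if the same ops filled a 4-wide unit
# The promised win is the area/power of the mostly-idle fifth (t) slot,
# not more absolute ALU work per clock from the same shader code.
print(f"VLIW5 ~{util_vliw5:.0%} utilized, hypothetical VLIW4 ~{util_vliw4:.0%}")
```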
 
This is quite a telling picture from computerbase.de:
2.jpg


We'll see how much drivers are going to help at smaller resolutions and settings.
 
Hmm, just realized AMD still does not downvolt the GDDR5 memory at idle, nor downclock/downvolt the memory when using multiple monitors or just playing back Blu-ray. This is quite disappointing, and consequently the HD 6970 posts a new "record" for Blu-ray power consumption, easily surpassing such energy-efficient cards as the GTX 480 or HD 4890 (which held the previous single-card record) for that task:
http://ht4u.net/reviews/2010/amd_radeon_hd_6970_6950_cayman_test/index23.php

By the way, the only voltage numbers I've found so far:
http://ht4u.net/reviews/2010/amd_radeon_hd_6970_6950_cayman_test/index17.php
So those last 80 MHz for the HD 6970 definitely take their toll on voltage (and hence power consumption).
Also, I was right in speculating that AMD uses the same voltage (1.6 V) for both the 6 Gbps chips on the HD 6970 (specced for 1.6 V, but run at a 10% slower clock than spec) and the 5 Gbps chips on the HD 6950 (specced at 1.5 V, but run at spec clock). So it is indeed not surprising that HD 6950 memory will overclock to (nearly) HD 6970 levels (since the 6 Gbps chips are probably just barely better), though CCC limits that.
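For a rough feel of what downclocking/downvolting the memory would be worth, a minimal sketch under the usual dynamic-power assumption P ∝ f·V² (the idle state below is invented for illustration; only the 3D-state numbers come from the review):

```python
def rel_dynamic_power(freq_mhz, volts, ref_freq_mhz, ref_volts):
    """Dynamic power relative to a reference state, assuming P ~ f * V^2."""
    return (freq_mhz / ref_freq_mhz) * (volts / ref_volts) ** 2

full_3d   = (1375, 1.60)   # HD 6970 memory: 1375 MHz (5.5 Gbps), 1.6 V per ht4u
idle_hypo = (300, 1.35)    # hypothetical downclocked/downvolted idle state

frac = rel_dynamic_power(*idle_hypo, *full_3d)
print(f"hypothetical idle memory dynamic power: ~{frac:.0%} of the 3D state")
```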
 
This is quite a telling picture from computerbase.de:
2.jpg


We'll see how much drivers are going to help at smaller resolutions and settings.
Well yes, it tells you there's definitely some benefit to the 2GB of memory. I think, though, that if you look at the individual results you'd see that even at that high resolution and AA setting, only some games benefit (and others completely tank). Actually, it tells you that 1.5GB is "enough" :).
Of course, with Eyefinity this makes sense - three full-HD displays have even more resolution than that (6 MP vs. 4 MP) and you'd see a difference even with 4xAA, but I guess the majority is still only using a single full-HD display.
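For a rough sense of where the memory goes at settings like that, a simplified sketch counting only one MSAA color + depth target (ignoring compression, extra render targets, and textures; the resolutions are just examples):

```python
def framebuffer_mb(width, height, msaa, bytes_color=4, bytes_z=4):
    """Very simplified color + depth footprint of one MSAA render target, in MB."""
    samples = width * height * msaa
    return samples * (bytes_color + bytes_z) / 2**20

print(f"2560x1600, 8xAA:             ~{framebuffer_mb(2560, 1600, 8):.0f} MB")
print(f"3x1920x1080 Eyefinity, 4xAA: ~{framebuffer_mb(3 * 1920, 1080, 4):.0f} MB")
```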
 
This is quite a telling picture from computerbase.de:
2.jpg


We'll see how much drivers are going to help at smaller resolutions and settings.

I think Nvidia are missing a trick by not having a 3GB card, but I guess it's a pretty difficult proposition getting a GF110 and 12 2Gbit GDDR5 chips onto a single board with adequate cooling and low enough power requirements. Maybe Nvidia should work on their own power limiter that isn't app based.
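For what it's worth, the board-level arithmetic for such a card is straightforward under the standard 32-bit-per-GDDR5-chip assumption:

```python
bus_width_bits  = 384    # GF110's memory interface
bits_per_chip   = 32     # each GDDR5 chip drives one 32-bit channel
chip_density_gb = 2      # 2 Gbit parts

chips = bus_width_bits // bits_per_chip            # 12 chips
capacity_gbyte = chips * chip_density_gb / 8       # 24 Gbit = 3 GB
print(f"{chips} chips x {chip_density_gb} Gbit = {capacity_gbyte:.0f} GB")
```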
 
I think Nvidia are missing a trick by not having a 3GB card, but I guess it's a pretty difficult proposition getting a GF110 and 12 2Gbit GDDR5 chips onto a single board with adequate cooling and low enough power requirements. Maybe Nvidia should work on their own power limiter that isn't app based.

I don't think the 2Gbit chips really use more power (or at least not a lot more) than the 1Gbit chips did. That said, there doesn't appear to be a datasheet for the Hynix 2Gbit GDDR5 part (there is one for the 1Gbit Hynix part), hence I have no evidence for this - but I think it's generally true, as most of the power comes from the active parts of the RAM chip, and there's obviously no difference there between a 1Gbit and a 2Gbit chip.
 
I really wonder how the 2GB variant of the HD 5870 would fare in this test…

Techreport includes results for the 5870 2GB version; it's generally the same speed or 1 fps faster than the 5870 1GB at 2560x1600.

So it wouldn't be very different from the 5870 on that graph. I think the main benefit of the 2GB on the 5870 was in Eyefinity setups.

Regards,
SB
 
If it is the VLIW5->4 drivers causing the confusing results... will AMD continue to support/patch Cayman until they work properly, or will AMD drop them the moment 28nm cards are out?
 