Predicting GPU Performance for AMD and Nvidia

dkanter

Regular
My friend Willard mentioned this in the news section, but I want to share the article in the forums as well.

Modern graphics processors are incredibly complex, but understanding their performance is essential, as they become an increasingly important component of computer systems. In this report at RWT, we use a set of benchmark results to build accurate performance models for AMD and Nvidia GPUs. We verify that our model can predict performance within roughly 6-8% for many desktop graphics cards and show how Nvidia’s microarchitecture and drivers achieve roughly 2X higher utilization than AMD’s VLIW5 design.
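
For anyone who wants to try this at home, a minimal sketch of the kind of single-factor fit described above might look like the snippet below. The card specs and scores in it are placeholders, not the data set from the article.

Code:
import numpy as np

# placeholder (peak GFLOP/s, benchmark score) pairs per vendor -- NOT the article's data
cards = {
    "nvidia": [(500.0, 3000.0), (1000.0, 5800.0), (1350.0, 7600.0)],
    "amd":    [(1400.0, 4100.0), (2000.0, 5600.0), (2700.0, 7300.0)],
}

for vendor, samples in cards.items():
    gflops = np.array([g for g, _ in samples])
    scores = np.array([s for _, s in samples])
    slope, intercept = np.polyfit(gflops, scores, 1)   # least-squares line per vendor
    predicted = slope * gflops + intercept
    err = np.abs(predicted - scores) / scores * 100.0  # per-card error in percent
    print(f"{vendor}: score ~ {slope:.2f}*GFLOPS + {intercept:.1f}, "
          f"mean abs error {err.mean():.1f}%")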

For those interested in graphics technology, this is a fascinating read and you are sure to learn quite a bit about GPU performance and how to make your own predictions and analysis:

http://www.realworldtech.com/page.cfm?ArticleID=RWT041111203710


Enjoy,


David
 
Interesting, yet obvious :p. Given the variance of the extrapolated AMD results and the limited number of AMD cards it has been extrapolated from, I think it's a little generous to call it accurate for AMD cards. It still could be, but to me you just need more data points to verify.


cheers
 
There is a decent cluster of results above the fitted lines further down on the performance scale, particularly for Nvidia.
The performance seems to track most tightly with the model at the highest FLOP and score counts.

There could be some interesting second-order effects that might determine where the change in the slope goes.
The graph doesn't distinguish between architectures, though I'm not sure how to make that distinction without making the plot unreadable.
 

Yeah, it would be nice to have models for each of the different architectures, but that would have been more work : )

Also, I will be revisiting the topic in the future.

David
 
Interesting article Dave but I hope it's just a taste of more to come. It's a single factor model so it's no real surprise the regression came back so strong on flops. Would be far more interesting to see a multiple regression with at least flops, texturing and bandwidth in the model. Also a more diverse set of workloads would be nice but you mentioned that already :)
 

Start simple : ) My model assumes that NV/AMD aren't going to fuck up the bandwidth, texturing, etc.

As you said though, a better model would include more factors.
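
For the curious, a multi-factor fit along those lines might look something like the sketch below; the specs and scores in it are placeholders, not real measurements.

Code:
import numpy as np

# rows: peak GFLOP/s, texture fill rate (GTexel/s), bandwidth (GB/s), score -- placeholder numbers
data = np.array([
    [ 700.0, 30.0,  90.0, 4200.0],
    [1090.0, 45.0, 130.0, 6300.0],
    [1345.0, 42.0, 177.0, 7600.0],
    [2000.0, 50.0, 128.0, 5600.0],
    [2720.0, 68.0, 154.0, 7200.0],
])

X = np.column_stack([data[:, :3], np.ones(len(data))])   # add an intercept column
y = data[:, 3]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)           # ordinary least squares
print("score ~ %.2f*GFLOPS + %.2f*GTexel/s + %.2f*GB/s + %.1f" % tuple(coeffs))

With real cards the three inputs tend to be strongly correlated, which is exactly the problem raised further down the thread.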

David
 
Oooh, econometrics, finally something I will be able to fully understand. :D Looking forward to what else David has up his sleeve; I am sure he realises what a mess all this model building can become. So far, IMO, what has been shown is that manufacturers balance their GPUs quite well, except at the low end, which isn't a big discovery. ;) If anything is above the line it works more efficiently than average; if it's below, it's bottlenecked somewhere. Further study will see many problems ahead, even if we stick to one application.

First, different architectures will give different results. David omitted the Cayman cards for this reason, but he included both the regular Fermi (GF100) and the modified version (GF104). And look what happens: the GTX 460 gets overrated and the GTX 5x0s get underrated by the model. Interestingly, this doesn't quite show up in other Fermi-based cards; there might be other reasons why not. :)

Second, there will be strong correlation between the parameters of the cards, because most cards are usually well balanced, i.e. more shading power comes along with more texturing power, etc. It will be hard to tell precisely what influences the end result and by how much, particularly when one cannot live without the other. Some results can be simulated with one card, but I doubt you can, for example, shut down ROPs via BIOS or whatever and see what happens to performance.

Third, the response to each factor is not going to be linear - diminishing returns from adding more of one thing without touching the rest, possibly hitting a wall and giving no improvement at all if there's a bottleneck somewhere else. That alone is going to make model building tricky; traditional linear regression won't cut it.
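
For example, instead of a straight line one could try a bottleneck-style model, where the score is capped by whichever resource runs out first. The sketch below uses made-up coefficients and placeholder card specs, just to show the shape of the problem:

Code:
import numpy as np
from scipy.optimize import minimize

# rows: peak GFLOP/s, GTexel/s, GB/s, observed score -- placeholder numbers
data = np.array([
    [ 700.0, 30.0,  90.0, 4200.0],
    [1345.0, 42.0, 177.0, 7600.0],
    [2000.0, 50.0, 128.0, 5600.0],
    [2720.0, 68.0, 154.0, 7200.0],
])

def predict(params, specs):
    a, b, c = params
    # score is limited by the scarcest resource
    return np.minimum.reduce([a * specs[:, 0], b * specs[:, 1], c * specs[:, 2]])

def sse(params):
    return np.sum((predict(params, data) - data[:, 3]) ** 2)

# Nelder-Mead copes with the non-smooth min(); start from a rough guess
fit = minimize(sse, x0=[5.0, 150.0, 50.0], method="Nelder-Mead")
print("fitted coefficients:", fit.x)
print("predicted scores:   ", predict(fit.x, data))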

Like I said, very curious how David is going to tackle it all. :)

PS. How well does a 475GFLOP part with a P3335 (unfortunately no graphics score) fit in? Should be very close, right?
 
Nice and interesting, hope to see more of it soon.

I agree with the use of a simple model. If forecasting is the end goal, then a simpler model usually gives more accurate results (based at least on my experience at university, which is not that extensive).

If you can get the forecast to be within 5% of the actual result, then I would be pretty happy.
 

If he could get within 5%, he would possess psychic powers sufficiently good that he would be whisked off the face of the earth by the illuminati never to be seen again. :)
These kinds of exercises are IMHO mostly useful for sorting out the major contributing factors and gaining perspective on a particular area. Simplifying is part of that process, but it also means that the predictive ability is even lower. Then, if you use the model to make predictions outside the problem set where you collected your data.....
 
Well, to be accurate, it is more like 4x.... Of course, if half the units support full speed DP, it isn't quite so bad after shader clocks are considered. Otherwise, the actual real world performance of the cards would be quite baffling indeed.
 
Originally Posted by dkanter
Nvidia’s microarchitecture and drivers achieve roughly 2X higher utilization than AMD’s VLIW5 design.

what does that mean in plain english (for the hard of thinking like myself)
does vliw5 spend a lot of time doing nothing ?
 
what does that mean in plain english (for the hard of thinking like myself)
does vliw5 spend a lot of time doing nothing ?

I think that means:

(NV real flops / NV peak flops) = 2 × (AMD real flops / AMD peak flops).


That doesn't mean AMD's VLIW5 spends a lot of time doing nothing, but it sure spends a lot of time under-utilised.
 
what does that mean in plain english (for the hard of thinking like myself)
does vliw5 spend a lot of time doing nothing ?
I'd interpret it as meaning that if X & Y have the same number of floating point Adders/Multipliers, then maybe X (who perhaps use a narrower vector or maybe just scalar ops) might, on average, have fewer sitting idle per clock than Y <shrug>

[Edited due to Florin's comment below]
 
I'd interpret it as meaning if they both have the same number of floating point Adders/Multipliers, then NV are claiming that, on average, they have fewer sitting idle per clock than AMD <shrug>

Well, David is claiming something, not Nvidia. Methinks.
 
It will be interesting to see how things look on next generation engines where compute shaders become far more prominent. The issue with VLIW is that when a unit is idle there are a lot of flops going to waste.

Take this snippet for example from DICE's BF3 DX11 presentation. I'm not sure how much work the intersects function does but in batches with active lights, any thread processing a "null" light is tossing a lot of flops away in a VLIW design. I'm assuming any batch within the work group that consists of all "null" lights will just branch around it at little cost.

The other thing to note is the use of shared memory atomics at several stages of DICE's approach. Not sure if AMD's significant advantage there will result in tangible benefits in BF3 but it's something to look out for.

Code:
uint threadCount = BLOCK_SIZE*BLOCK_SIZE; 
uint passCount = (lightCount+threadCount-1) / threadCount;
 
for (uint passIt = 0; passIt < passCount; ++passIt) 
{
   uint lightIndex = passIt*threadCount + groupIndex;
 
   // prevent overrun by clamping to a last "null" light
   lightIndex = min(lightIndex, lightCount); 
 
   if (intersects(lights[lightIndex], tile)) 
   {
       uint offset;
       InterlockedAdd(visibleLightCount, 1, offset);
       visibleLightIndices[offset] = lightIndex;
   } 
}
 
what does that mean in plain english (for the hard of thinking like myself)
does vliw5 spend a lot of time doing nothing ?

Actually your interpretation is pretty close. What it means is that on average you probably only see 2 useful operations per VLIW5 bundle, i.e. 2 of the 5 issue slots, so you're talking about 40% utilization. Nvidia is probably closer to 80%.

Of course, the reality is that AMD also has WAY more shaders, so their performance overall looks quite good.
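
A quick back-of-the-envelope illustration of that: take the approximate published single-precision peaks for an HD 5870 and a GTX 480 and apply the rough utilization figures above.

Code:
# approximate published SP peaks (GFLOP/s) and the rough utilization figures from this thread
peaks = {"HD 5870 (VLIW5)": 2720.0, "GTX 480 (Fermi)": 1345.0}
utilization = {"HD 5870 (VLIW5)": 0.40, "GTX 480 (Fermi)": 0.80}

for card, peak in peaks.items():
    print(f"{card}: ~{peak * utilization[card]:.0f} effective GFLOP/s")

Both land around 1.1 TFLOP/s of delivered math, which is why the overall scores end up comparable.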

David
 
I'd interpret it as meaning that if X & Y have the same number of floating point Adders/Multipliers, then maybe X (who perhaps use a narrower vector or maybe just scalar ops) might, on average, have fewer sitting idle per clock than Y <shrug>

[Edited due to Florin's comment below]

I think the right way to conceptualize it is that because AMD's shader units require instruction level parallelism (ILP), they are usually underutilized compared to Nvidia's shader units. However, they are also more power and area efficient.
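
A toy way to see why the ILP requirement hurts: greedily pack a stream of operations into 5-wide bundles, where an op cannot share a bundle with anything it depends on. The pack_efficiency helper here is purely illustrative, not AMD's actual compiler.

Code:
def pack_efficiency(ops, deps, width=5):
    """Greedily pack ops into width-wide bundles; an op can't share a
    bundle with an op it depends on. Returns the fraction of slots used."""
    bundles, current = [], []
    for op in ops:
        if len(current) == width or any(d in current for d in deps.get(op, ())):
            bundles.append(current)
            current = []
        current.append(op)
    bundles.append(current)
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)

ops = [f"op{i}" for i in range(10)]
no_deps = {}
chain_deps = {f"op{i}": {f"op{i-1}"} for i in range(1, 10)}   # each op needs the previous one

print("independent ops:", pack_efficiency(ops, no_deps))      # 1.0 -> full 5-wide bundles
print("dependent chain:", pack_efficiency(ops, chain_deps))   # 0.2 -> one op per bundle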

David
 
Just to clarify my understanding: please correct me if I have misunderstood.

If going by an overall average, saying that AMD's design in the end only reaches ~40% of its ALU capability sounds plausible.
Saying that AMD is only able to find 2 worthwhile instructions per bundle, to my mind, equates the utilization of issue slots with the utilization of the hardware. While that provides a theoretical bound, AMD has claimed higher packing efficiency for its compiler than 2 per bundle.
My interpretation of the situation would be that it is other sources of stalls and harsher penalties for branch divergence that reduce utilization, and these would be largely orthogonal to what is in a VLIW bundle. Utilization could be a problem even with a code stream devoid of NOPs.
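
As a rough sketch of that distinction: packing efficiency inside the bundle is only one factor, and the other stall sources multiply on top of it. All the numbers below are made up purely for illustration.

Code:
slots_per_bundle = 5
avg_slots_filled = 3.5          # e.g. the better-than-2-per-bundle packing AMD has claimed
issue_cycle_fraction = 0.6      # fraction of cycles a wavefront can actually issue (made up)
divergence_efficiency = 0.9     # fraction of lanes doing useful work (made up)

packing = avg_slots_filled / slots_per_bundle
overall = packing * issue_cycle_fraction * divergence_efficiency
print(f"packing bound: {packing:.0%}, overall ALU utilization: {overall:.0%}")

That is how you can end up near the ~40% overall figure even with decent packing.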
 