Predicting GPU Performance for AMD and Nvidia

The issue with VLIW is that when a unit is idle there are a lot of flops going to waste.

Take this snippet for example from DICE's BF3 DX11 presentation. I'm not sure how much work the intersects function does but in batches with active lights, any thread processing a "null" light is tossing a lot of flops away in a VLIW design. I'm assuming any batch within the work group that consists of all "null" lights will just branch around it at little cost.
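
Roughly, the pattern in question looks like this (a paraphrase from memory, not the actual DICE listing; the names and the radius test standing in for the intersection check are my assumptions):

Code:
/* Rough paraphrase of the tiled-lighting pattern under discussion
   (not the actual DICE code; names and the radius test standing in
   for the intersection check are placeholders). */
typedef struct { float pos[3]; float radius; } Light;

float shade_tile(const Light *lights, const int *tileLightIndices,
                 int lightCount, const float surfacePos[3])
{
    float accum = 0.0f;
    for (int i = 0; i < lightCount; ++i) {
        Light l = lights[tileLightIndices[i]];
        /* "null" (padding) lights fail this test; on a SIMD machine the
           lanes holding them sit masked out while the others keep shading */
        if (l.radius > 0.0f) {
            float dx = l.pos[0] - surfacePos[0];
            float dy = l.pos[1] - surfacePos[1];
            float dz = l.pos[2] - surfacePos[2];
            /* stand-in for the real (much larger) shading math */
            accum += l.radius / (1.0f + dx*dx + dy*dy + dz*dz);
        }
    }
    return accum;
}
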
AMD's VLIW approach doesn't change anything with that "null light" in the code. What is more important there is the wavefront or the warp size, respectively.
Actually your interpretation is pretty close. What it means is that on average you probably only see 2 useful operations per VLIW5 bundle. So you're talking about 40% utilization. Nvidia is probably closer to 80%.

Of course, the reality is that AMD also has WAY more shaders, so their performance overall looks quite good.
I really doubt the average utilization is only 40%, i.e. 2 of 5 slots. Actually, I think it is hard to find useful code with such low utilization. The average is probably closer to 60%+.

But one has to keep in mind that there is a lot more to game performance than shader power (it is often not the bottleneck!). And it is well known that AMD GPUs keep a lot of things much simpler than nvidia GPUs, giving up some performance for the sake of a higher raw unit count (for a certain die size). That pertains to the shader units, the scheduling, the cache structure, and is quite significant in the case of the geometry and raster engines. Simply claiming that the inability to reach a high average utilization of the shader units is the reason for the lower performance is a way oversimplified picture.
 
AMD's VLIW approach doesn't change anything with that "null light" in the code. What is more important there is the wavefront or the warp size, respectively.

Of course it does. An idle AMD thread wastes 10 flops per cycle, an idle nVidia thread burns only 2. The fact that AMD's wavefront size is twice nVidia's just makes it even worse.
 
Of course it does. An idle AMD thread wastes 10 flops per cycle, an idle nVidia thread burns only 2.
Relatively speaking, both are wasting the same amount (*). In fact, if there are short and simple IF-ELSE clauses which get transformed to conditional moves, the utilization of the VLIW units may even peak (because the two branches are executed in parallel within a single thread). That can (partly) compensate for the wasted flops: no (or fewer) cycles are wasted.

(*):
Imagine an IF-ELSE clause where one thread is masked for one branch in which 12 operations are executed. Both architectures lose those 12 operations as useful work. It may take 12 or maybe 14 cycles on nvidia and maybe 4 VLIW instructions (and only 4 cycles) on AMD (which could potentially hold 20 flops), but relative to the average utilization, both lose the same fraction of their total shader power. Deviations for individual cases lie within the limits of the normal shader utilization variations.
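
To make the conditional-move case concrete, a minimal sketch (my illustration, not actual compiler output):

Code:
/* If-conversion sketch: both sides are computed unconditionally and a
   select picks the result, so no lanes get masked and the two independent
   sides can be packed into the same VLIW bundle. */
float branchy(float c, float a, float b)
{
    if (c > 0.0f) return a * a + 1.0f;   /* lanes on the other path sit idle  */
    else          return b * b - 1.0f;
}

float if_converted(float c, float a, float b)
{
    float t0 = a * a + 1.0f;             /* side A, computed unconditionally  */
    float t1 = b * b - 1.0f;             /* side B, independent of side A     */
    return (c > 0.0f) ? t0 : t1;         /* conditional move selects a result */
}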
 
There are certainly cases where it would balance out but on average we know that AMD's 4x (Cayman) flops advantage per thread per clock doesn't translate into 4x fewer cycles to complete the same work. Normalizing for unit counts and clock speeds, a 6970 should be somewhere around 1.7x faster than a 580 at compute stuff on average.
 
There are certainly cases where it would balance out but on average we know that AMD's 4x (Cayman) flops advantage per thread per clock doesn't translate into 4x fewer cycles to complete the same work. Normalizing for unit counts and clock speeds, a 6970 should be somewhere around 1.7x faster than a 580 at compute stuff on average.
GTX580: 512 ALUs @ 1550 MHz => 1.59 TFlop/s theoretical peak
HD6970: 1536 ALUs @ 880 MHz => 2.70 TFlop/s theoretical peak

That means your 1.7x number is only valid for a workload perfectly fitting AMD's VLIW architecture. Otherwise it is significantly less (assuming 3/4 utilization of the slots, it gets reduced to <1.3). And in pure shader arithmetic, AMD's HD6970 is often at least as fast as nvidia's GTX580 (and it definitely needs fewer cycles for that ;)). But as I said before (second part), there is a lot more important stuff for high gaming performance than just shader power. It is simply not bottlenecked by shader power for quite a bit of the time needed to render a frame.
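
For reference, the arithmetic behind those peak numbers and ratios (a quick sketch; the usual 2 flops per ALU per clock from the MAD is assumed):

Code:
#include <stdio.h>

int main(void)
{
    double gtx580 = 512  * 2 * 1.55e9;   /* 512 ALUs @ 1550 MHz -> ~1.59 TFlop/s */
    double hd6970 = 1536 * 2 * 0.88e9;   /* 1536 ALUs @ 880 MHz -> ~2.70 TFlop/s */

    printf("peak ratio           : %.2f\n", hd6970 / gtx580);         /* ~1.70 */
    printf("ratio at 3/4 slot use: %.2f\n", hd6970 * 0.75 / gtx580);  /* ~1.27 */
    return 0;
}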

In a certain scenario, a HD6970 may need 10 ms for a complete frame but is bottlenecked by the shaders for, let's say, only 3 ms of that. For the same workload, a GTX580 may spend 4 ms in a shader-limited state. But that doesn't mean much for the total frame rendering time, which may be only 9 ms for the GTX580 and thus faster than the HD6970 despite the shader power disadvantage (which is of course highly dependent on the workload).

But that has nothing to do with my statement that the VLIW architecture does not lose more relative shader power with branches than a scalar one ;)

An even simpler example for that:

Code:
if (threadID is odd) {
    do 100 flops
} else {
    do some other 100 flops
}

If taking either branch is equally distributed across threads, both GPUs simply waste half of their instructions, because each executes both halves and half of the SIMD lanes have to be masked out for each side of the branch. The utilization, or the question of scalar vs. VLIW, doesn't change the fact that both architectures waste half of their compute power. How much that is in absolute terms, or what restrictions apply, simply doesn't matter.
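
Put into numbers (my sketch, assuming the masked lanes would otherwise have issued full bundles): the absolute waste differs, but the waste relative to each architecture's own capacity is identical.

Code:
#include <stdio.h>

int main(void)
{
    int width  = 16;   /* SIMD lanes issuing per clock               */
    int masked = 8;    /* lanes masked out on one side of the branch */

    for (int slots = 1; slots <= 5; slots += 4) {   /* 1 = "scalar", 5 = VLIW5  */
        int total  = width  * slots;                /* issue capacity per clock */
        int wasted = masked * slots;                /* capacity of masked lanes */
        printf("%d-slot units: waste %2d of %2d ops/clock = %.0f%%\n",
               slots, wasted, total, 100.0 * wasted / total);
    }
    return 0;   /* prints 50% for both cases */
}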
 
But that has nothing to do with my statement that the VLIW architecture does not lose more relative shader power with branches than a scalar one ;)

An even simpler example for that:

Code:
if (threadID is odd) {
    do 100 flops
} else {
    do some other 100 flops
}

If taking either branch is equally distributed across threads, both GPUs simply waste half of their instructions, because each executes both halves and half of the SIMD lanes have to be masked out for each side of the branch. The utilization, or the question of scalar vs. VLIW, doesn't change the fact that both architectures waste half of their compute power. How much that is in absolute terms, or what restrictions apply, simply doesn't matter.

But this is not very realistic. It's generally possible to "shuffle" this work into contiguous threadID groups so there is no need to waste half the FLOPs. However, if it's like this:

Code:
if(some_condition(threadID)) {
   do something
}
else {
   do some other things
}

In this case, a smaller warp size (or wavefront size) will be better on average.
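
For the odd/even example above, the "shuffle" could look roughly like this (my illustration): remap the thread IDs so every contiguous warp/wavefront takes only one side of the branch. That works because the predicate is a known function of threadID; a data-dependent some_condition() can't be remapped this way, which is where the smaller warp size helps.

Code:
/* Remap so the first half of the new IDs gets the odd work and the
   second half the even work; no warp then diverges (sketch only). */
int remap(int tid, int n)            /* n = total thread count, assumed even */
{
    return (tid < n / 2) ? (2 * tid + 1) : (2 * (tid - n / 2));
}

float work(int tid, int n)
{
    int id = remap(tid, n);
    if (id & 1) return 100.0f;       /* "do 100 flops"            */
    else        return -100.0f;      /* "do some other 100 flops" */
}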
 
GTX580: 512 ALUs @ 1550 MHz => 1.59 TFlop/s theoretical peak
HD6970: 1536 ALUs @ 880 MHz => 2.70 TFlop/s theoretical peak

That means your 1.7x number is only valid for a workload perfectly fitting AMD's VLIW architecture. Otherwise it is significantly less (assuming 3/4 utilization of the slots, it gets reduced to <1.3). And in pure shader arithmetic, AMD's HD6970 is often at least as fast as nvidia's GTX580 (and it definitely needs fewer cycles for that ;)). But as I said before (second part), there is a lot more important stuff for high gaming performance than just shader power. It is simply not bottlenecked by shader power for quite a bit of the time needed to render a frame.

In a certain scenario, a HD6970 may need 10 ms for a complete frame but is bottlenecked by the shaders for, let's say, only 3 ms of that. For the same workload, a GTX580 may spend 4 ms in a shader-limited state. But that doesn't mean much for the total frame rendering time, which may be only 9 ms for the GTX580 and thus faster than the HD6970 despite the shader power disadvantage (which is of course highly dependent on the workload).

Completely agree with the above but it's irrelevant since we're discussing compute only. :)

Gipsel said:
But that has nothing to do with my statement that the VLIW architecture does not lose more relative shader power with branches than a scalar one ;)

An even simpler example for that:

Code:
if (threadID is odd) {
    do 100 flops
} else {
    do some other 100 flops
}

If taking either branch is equally distributed across threads, both GPUs simply waste half of their instructions, because each executes both halves and half of the SIMD lanes have to be masked out for each side of the branch. The utilization, or the question of scalar vs. VLIW, doesn't change the fact that both architectures waste half of their compute power. How much that is in absolute terms, or what restrictions apply, simply doesn't matter.

The scenario I described in the DICE code is more like this:

Code:
if (condition) {
    do 100 flops
} else {
    do nothing
}

In that case your bigger VLIW threadgroups will waste a greater percentage of theoretical throughput.
 
But this is not very realistic. It's generally possible to "shuffle" this work into contiguous threadID groups so there is no need to waste half the FLOPs. However, if it's like this:

Code:
if(some_condition(threadID)) {
   do something
}
else {
   do some other things
}

In this case, a smaller warp size (or wavefront size) will be better on average.
Which was exactly my point:
AMD's VLIW approach doesn't change anything with that "null light" in the code. What is more important there is the wavefront or the warp size, respectively.
;)
 
Completely agree with the above but it's irrelevant since we're discussing compute only. :)
I thought otherwise, or to quote from the article linked in the starting post:
DK's article said:
For performance, we will use the graphics score of 3DMark Vantage.
:p
The scenario I described in the DICE code is more like this:

Code:
if (condition) {
    do 100 flops
} else {
    do nothing
}

In that case your bigger VLIW threadgroups will waste a greater percentage of theoretical throughput.
One side of the branch being empty doesn't change anything in that matter.
And in principle, the threadgroup size is not a function of the VLIW size. A VLIW unit does not execute instructions from 4 threads in one clock, but only from one, same as nvidia's "scalar" units. So a VLIW architecture by itself doesn't change anything about the "branching waste ratio".
What is true, is that it depends on the (logical) width of the SIMD units. That is what is called warp or wavefront size which has an influence as asserted in my first post on this topic. ;)

You are right when you say that medium/high-end AMD cards have a larger wavefront size than nvidia GPUs. But traditionally AMD has been flexible on that point, and lower-end GPUs often had smaller wavefront sizes (it's coupled to the width of the SIMDs, i.e. the ALU/TEX ratio since R700).
 
I thought otherwise, or to quote from the article linked in the starting post:
:p

Was referring to my post that you replied to that started this thread of conversation. Nothing to do with Vantage :)

One side of the branch being empty doesn't change anything in that matter.
And in principle, the threadgroup size is not a function of the VLIW size. A VLIW unit does not execute instructions from 4 threads in one clock, but only from one, same as nvidia's "scalar" units. So a VLIW architecture by itself doesn't change anything about the "branching waste ratio".

We've started going in circles, so let me restate my position. A VLIW setup is more wasteful if the following inequality is false: (# cycles to complete the task on VLIW hardware) <= (# cycles to complete the task on scalar hardware) / (VLIW width). It's that simple, and that's before taking wavefront sizes into consideration.
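
Written out, that criterion is just (my restatement):

Code:
#include <stdbool.h>

/* VLIW is the more wasteful setup whenever it fails to be
   vliw_width times faster, in cycles, than the scalar machine. */
bool vliw_more_wasteful(double cycles_vliw, double cycles_scalar, int vliw_width)
{
    return !(cycles_vliw <= cycles_scalar / vliw_width);
}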
 
Maybe I don't understand the point, but how is a VLIW doing nothing more wasteful than scalar doing nothing?
 
Maybe I don't understand the point, but how is a VLIW doing nothing more wasteful than scalar doing nothing?
It isn't. That's my point.
Each thread (lane of a SIMD engine) masked for one side of a branch creates an idle unit. If 30% of the threads are masked, 30% of the units will sit idle. One wastes 30% of the total shader power, completely irrespective of whether you have VLIW or scalar units.
 
We've started going in circles, so let me restate my position. A VLIW setup is more wasteful if the following inequality is false: (# cycles to complete the task on VLIW hardware) <= (# cycles to complete the task on scalar hardware) / (VLIW width). It's that simple, and that's before taking wavefront sizes into consideration.
But that brings you nowhere, as it completely neglects the increased area and power efficiency of the VLIW units. You have to set the losses in relation to the total available shader power. And then it is a wash for all your scenarios.

If a GPU with "scalar" SIMD engines (:lol:) loses 30% of its available shader power due to a branch, an AMD-style setup with SIMD engines (same width) consisting of VLIW units will also lose ~30% of its available shader power in most scenarios. Otherwise you could always claim a smaller/thinner GPU is better, as it will lose a smaller absolute number of flops. But that metric simply doesn't make much sense in this respect.

And that is valid for all kinds of shaders, also compute shaders ;)
 
But that brings you nowhere, as it completely neglects the increased area and power efficiency of the VLIW units. You have to set the losses in relation to the total available shader power. And then it is a wash for all your scenarios.

If a GPU with "scalar" SIMD engines (:lol:) loses 30% of its available shader power due to a branch, an AMD-style setup with SIMD engines (same width) consisting of VLIW units will also lose ~30% of its available shader power in most scenarios. Otherwise you could always claim a smaller/thinner GPU is better, as it will lose a smaller absolute number of flops. But that metric simply doesn't make much sense in this respect.

And that is valid for all kinds of shaders, also compute shaders ;)

Ok, I see what you mean, but in many cases a branch block is actually not that long. Since AMD's VLIW can't span across blocks (as a VLIW bundle has only one predicate), its utilization rate will be lower in such cases. Compilers can sometimes help a bit though, such as in a case like:

Code:
if(condition) {
   do A
}
else {
   do B
}

do C

can be made into

Code:
if(condition) {
   do A and C
}
else {
   do B and C
}

to increase utilization.
 
Ok, I see what you mean, but in many cases a branch block is actually not that long. Since AMD's VLIW can't span across blocks (as a VLIW bundle has only one predicate), its utilization rate will be lower in such cases. Compilers can sometimes help a bit though, such as in a case like:
[..]
can be made into
[..]to increase utilization.
Basically there are three possibilities:

(i) very short blocks of code in the branches
=> What you just described (I mentioned it already here ;)) helps to increase the utilization, thus creating the opportunity that the VLIW architecture actually loses less than the scalar architecture.

(ii) longer blocks of code in the branches
=> both architectures lose the same amount of effective processing power

(iii) short blocks of code in the branches which cannot be transformed by the compiler
=> potential for a loss, as there is only one branch per VLIW instruction, which means the VLIW instructions in the branches may use fewer slots than the average in other parts of the code.

But considering how probable it is that a branch consists of no more than two VLIW instructions (the effect largely disappears in the noise already with more than a single instruction, as we approach scenario (ii)) and cannot be transformed by the compiler (I admit AMD's shader compiler may not be the most clever one, but that is not the question here), I would conclude that the possibilities (i) and (iii) cancel each other out.
 
But considering how probable it is that a branch consists of no more than two VLIW instructions (the effect largely disappears in the noise already with more than a single instruction, as we approach scenario (ii)) and cannot be transformed by the compiler (I admit AMD's shader compiler may not be the most clever one, but that is not the question here), I would conclude that the possibilities (i) and (iii) cancel each other out.

Personally I don't think VLIW is a bad idea, as it's always a trade-off. It's just that 5D or even 4D VLIW seems a bit too "big" in my book. Of course, a 5D VLIW made sense when most shaders were of a graphics nature, but now even games use a lot of non-graphics (at least not directly related) shaders for various effects. Right now AMD's OpenCL compiler is already quite good at finding ILP, and you can simply write a scalar program and get pretty good results (although in some cases a vectorized version is still much faster). So sometimes I think a 2D or 3D VLIW would probably be a more manageable size.

This somehow makes GPU design more and more like CPU design, and that's not a bad thing IMHO. After all, what matters most is not peak performance, nor efficiency per peak performance, but performance per cost and power.

In a sense, NVIDIA is also trying this path. In G8X there is a "mul" unit right after the normal unit, but it's rarely used in actual shaders. And later GF104/GF114 also have a similar "super-scalar" design, with slightly lower (sometimes much lower) utilization than their scalar cousins. So I think it's safe to say that ILP within shaders is still a target worth pursuing. :)
 
4D may be a good stopping point for as long as AMD wants to keep it so that each ALU cluster can perform most of the more complex graphics functions in a single clock, such as the multi-component vector operations and transcendentals (the setup code needed to keep the trig function arguments in the right range aside).
In that case, for the sake of simplifying scheduling by having enough hardware in the units, the VLIW offers the "bonus" of exposing the subcomponents that are normally hidden by a decoder at the front end.

Once the hardware is there, I think the rationale may be that it is a modest expense to add a bit of extra logic so that the control signals can be used to synthesize multiple scalar operations.

It may have a subtle effect on the utilization comparison between Nvidia and AMD (pre-Cayman).
The T-unit is lumped in as a full FLOP, while the missing mul is not counted for Nvidia. If Nvidia were VLIW, it would have exposed the hardware more fully.
 
Maybe I don't understand the point, but how is a VLIW doing nothing more wasteful than scalar doing nothing?

Well, it's more wasteful in an absolute sense for sure, but that's mitigated by higher density and unit counts. I agree with Gipsel's point that an idle VLIW unit is just as wasteful as an idle scalar unit relative to each architecture's theoretical throughput. However, when you consider the nops within active VLIW bundles, the equation shifts in favor of the scalar setup - there's no such thing as a partially utilized scalar ALU. :)

A 16-wide scalar processor with 2 idle threads is at 87.5% utilization. A 16-wide VLIW5 with 2 idle threads is running at a maximum of 87.5% and minimum of 17.5% depending on VLIW occupancy. Reality is somewhere in between but it can never be higher than the scalar setup on average.
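
The arithmetic behind those two figures, for reference (my quick check):

Code:
#include <stdio.h>

int main(void)
{
    int width = 16, idle = 2;
    double scalar    = (double)(width - idle) / width;                 /* 14 of 16 lanes  */
    double vliw5_max = (double)(width - idle) * 5.0 / (width * 5);     /* full bundles    */
    double vliw5_min = (double)(width - idle) * 1.0 / (width * 5);     /* 1-slot bundles  */

    printf("scalar   : %.1f%%\n", scalar    * 100);   /* 87.5 */
    printf("VLIW5 max: %.1f%%\n", vliw5_max * 100);   /* 87.5 */
    printf("VLIW5 min: %.1f%%\n", vliw5_min * 100);   /* 17.5 */
    return 0;
}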
 
Personally I don't think VLIW is a bad idea, as it's always a trade-off. It's just that 5D or even 4D VLIW seems a bit too "big" in my book. Of course, a 5D VLIW made sense when most shaders were of a graphics nature, but now even games use a lot of non-graphics (at least not directly related) shaders for various effects. Right now AMD's OpenCL compiler is already quite good at finding ILP, and you can simply write a scalar program and get pretty good results (although in some cases a vectorized version is still much faster). So sometimes I think a 2D or 3D VLIW would probably be a more manageable size.
I dunno, looks to me like 2D would be more "LIW" than "VLIW" :).
I think 4D could be a good compromise, though we don't really know how much die space a 4D vs. a 2D arrangement needs, if you include scheduling etc. It'll also allow for "natural" half-rate (unlike the VLIW5 arrangement) 2D FP64 if that should become more important (and if it's not important, allows for cheap 1D-2D FP64 operation as it is now).
 