AMD's VLIW approach doesn't change anything with that "null light" in the code. What is more important there is the wavefront or the warp size, respectively.
The issue with VLIW is that when a unit is idle there are a lot of flops going to waste.
Take this snippet for example from DICE's BF3 DX11 presentation. I'm not sure how much work the intersects function does but in batches with active lights, any thread processing a "null" light is tossing a lot of flops away in a VLIW design. I'm assuming any batch within the work group that consists of all "null" lights will just branch around it at little cost.
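For concreteness, here is a minimal sketch of that branching pattern, written as CUDA rather than the presentation's actual HLSL; the kernel and every name in it (shadeTile, tileLightCount and so on) are invented for illustration and not taken from DICE's code.
Code:
    // A minimal sketch of the pattern under discussion, NOT DICE's shader:
    // one thread per candidate light, with indices past the tile's light
    // count acting as "null" lights that skip the real work.
    #include <cuda_runtime.h>

    __global__ void shadeTile(const float4 *lights, const int *tileLightCount,
                              float4 *out)
    {
        int lightIdx   = threadIdx.x;                        // one thread per light slot
        bool nullLight = lightIdx >= tileLightCount[blockIdx.x];

        if (!nullLight) {
            // stand-in for the intersection/attenuation math ("~100 flops");
            // threads holding a null light sit out this block while the rest
            // of their warp/wavefront executes it
            float4 L = lights[lightIdx];
            out[blockIdx.x * blockDim.x + lightIdx] =
                make_float4(L.x * L.w, L.y * L.w, L.z * L.w, 1.0f);
        }
        // a warp/wavefront in which every thread holds a null light can skip
        // the whole block at little cost, as assumed above
    }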
I really doubt the average utilization is only 40%, i.e. 2/5 slots. Actually, I think it is quite hard to find useful code with such low utilization. The average is probably closer to 60%+.
Actually your interpretation is pretty close. What it means is that on average you probably only see 2 useful operations per VLIW5 bundle. So you're talking about 40% utilization. Nvidia is probably closer to 80%.
Of course, the reality is that AMD also has WAY more shaders, so their performance overall looks quite good.
AMD's VLIW approach doesn't change anything with that "null light" in the code. What is more important there is the wavefront or the warp size, respectively.
Relatively speaking, both are wasting the same amount (*). In fact, if there are short and simple IF ELSE clauses which are transformed to conditional moves, the utilization of the VLIW units may peak (because the two branches are executed in parallel in a single thread). That can (partly) compensate for the wasted flops, so no (or fewer) cycles are wasted.
Of course it does. An idle AMD thread wastes 10 flops per cycle, an idle nVidia thread burns only 2.
GTX580: 512 ALUs @ 1550 MHz => 1.59 TFlop/s theoretical peak
There are certainly cases where it would balance out, but on average we know that AMD's 4x (Cayman) flops advantage per thread per clock doesn't translate into 4x fewer cycles to complete the same work. Normalizing for unit counts and clock speeds, a 6970 should be somewhere around 1.7x faster than a 580 at compute stuff on average.
But that has nothing to do with my statement that the VLIW architecture does not lose more relative shader power with branches than a scalar one.
An even simpler example for that:
if (threadID is odd) {
    do 100 flops
} else {
    do some other 100 flops
}
Which branch is taken is equally distributed, so both GPUs simply waste half of their instructions: each always executes both halves, and half of the SIMD lanes have to be masked out for each side of the branch. Neither the utilization nor the scalar/VLIW question changes the fact that both architectures waste half of their compute power. How much that is in absolute terms, or what restrictions apply, simply doesn't matter.
if (some_condition(threadID)) {
    do something
} else {
    do some other things
}
GTX580: 512 ALUs @ 1550 MHz => 1.59 TFlop/s theoretical peak
HD6970: 1536 ALUs @ 880 MHz => 2.70 TFlop/s theoretical peak
That means your 1.7x number is only valid for a workload perfectly fitting AMD's VLIW architecture. Otherwise it is significantly less (assuming a 3/4 utilization of the slots, it gets reduced to <1.3). And in pure shader arithmetic, AMD's HD6970 is often at least as fast as nvidia's GTX580 (and it definitely needs fewer cycles for that). But as I said before (second part), there is a lot of stuff more important than raw shader power for high gaming performance. A frame simply isn't bottlenecked by shader power for quite a bit of the time needed to render it.
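As a quick sanity check of those numbers, a small host-side snippet (assuming 2 flops per ALU per clock for a multiply-add, and treating the 3/4 slot utilization as the rough assumption it is):
Code:
    // Rough check of the peak-flops ratio quoted above (host code only).
    // Peak = ALUs * clock * 2 flops per multiply-add; the 0.75 slot
    // utilization is an assumption from the discussion, not a measurement.
    #include <cstdio>

    int main()
    {
        double gtx580 = 512.0  * 1.55e9 * 2.0;   // ~1.59 TFlop/s
        double hd6970 = 1536.0 * 0.88e9 * 2.0;   // ~2.70 TFlop/s

        double peak_ratio = hd6970 / gtx580;     // ~1.70 with all VLIW slots filled
        double at_3_of_4  = peak_ratio * 0.75;   // ~1.28, i.e. the "<1.3" above

        printf("peak ratio %.2f, at 3/4 slot utilization %.2f\n",
               peak_ratio, at_3_of_4);
        return 0;
    }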
In a certain scenario, an HD6970 may need 10 ms for a complete frame but be bottlenecked by the shaders for, let's say, only 3 ms of that. For the same workload, a GTX580 may spend 4 ms in a shader-limited state. But that says little about the total frame rendering time, which may be only 9 ms for the GTX580, making it faster than the HD6970 despite the shader power disadvantage (all of which is of course highly dependent on the workload).
Gipsel said: But that has nothing to do with my statement that the VLIW architecture does not lose more relative shader power with branches than a scalar one.
An even simpler example for that:
if (threadID is odd) {
    do 100 flops
} else {
    do some other 100 flops
}
Which branch is taken is equally distributed, so both GPUs simply waste half of their instructions: each always executes both halves, and half of the SIMD lanes have to be masked out for each side of the branch. Neither the utilization nor the scalar/VLIW question changes the fact that both architectures waste half of their compute power. How much that is in absolute terms, or what restrictions apply, simply doesn't matter.
Which was exactly my point:
But this is not very realistic. It's generally possible to "shuffle" half of this work into a contiguous threadID group, so there is no need to waste half of the FLOPs. However, if it's like this:
Code:
if (some_condition(threadID)) {
    do something
} else {
    do some other things
}
In this case, a smaller warp size (or wavefront size) will be better on average.
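As a hedged illustration of that "shuffling" idea, here is a CUDA sketch with invented names: the work is re-indexed by threadID so that whole warps/wavefronts take the same side of the odd/even branch.
Code:
    // Sketch of the re-indexing ("shuffling") idea: instead of branching on
    // odd/even thread IDs, the first half of the threads takes all the "odd"
    // work items and the second half all the "even" ones, so whole warps or
    // wavefronts follow the same path. The *2.0f / +1.0f bodies are mere
    // stand-ins for "do 100 flops" / "do some other 100 flops".
    __global__ void oddEvenRemapped(float *a, float *b, int n)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int half = n / 2;

        if (tid < half) {
            a[2 * tid + 1] *= 2.0f;        // all odd-indexed work, contiguous threads
        } else if (tid < n) {
            b[2 * (tid - half)] += 1.0f;   // all even-indexed work, contiguous threads
        }
        // as long as 'half' is a multiple of the warp/wavefront size, no SIMD
        // lanes have to be masked out on either architecture
    }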
AMD's VLIW approach doesn't change anything with that "null light" in the code. What is more important there is the wavefront or the warp size, respectively.
I thought otherwise, or to quote from the article linked in the starting post:
DK's article said: For performance, we will use the graphics score of 3DMark Vantage.
Completely agree with the above, but it's irrelevant since we're discussing compute only.
One side of the branch being empty doesn't change anything in that matter.
The scenario I described in the DICE code is more like this:
if (condition) {
    do 100 flops
} else {
    do nothing
}
In that case your bigger VLIW threadgroups will waste a greater percentage of theoretical throughput.
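To put rough numbers on how the batch size interacts with that empty-else pattern, here is a small host-side estimate; it assumes each thread is independently "active" with probability p, which is only a crude model, and it captures the warp/wavefront-size effect rather than anything specific to the VLIW width.
Code:
    // Rough model of the "do 100 flops or nothing" branch above, assuming
    // each thread is independently active with probability p (a crude
    // stand-in for how null lights are distributed). A batch (warp or
    // wavefront) can only skip the branch if none of its lanes are active.
    #include <cstdio>
    #include <cmath>

    // fraction of batches that cannot branch around the work
    double mustExecute(int batch, double p)
    {
        return 1.0 - std::pow(1.0 - p, batch);
    }

    // of the lane-cycles actually spent in the branch, the fraction that is
    // masked out and therefore wasted
    double wastedFraction(int batch, double p)
    {
        double executedLanes = mustExecute(batch, p) * batch;  // expected per batch
        double activeLanes   = p * batch;
        return (executedLanes - activeLanes) / executedLanes;
    }

    int main()
    {
        const double ps[] = {0.05, 0.2, 0.5};
        for (double p : ps) {
            printf("p=%.2f: warp32 executes %.0f%% of batches (%.0f%% lanes wasted), "
                   "wave64 executes %.0f%% (%.0f%% wasted)\n",
                   p,
                   100.0 * mustExecute(32, p), 100.0 * wastedFraction(32, p),
                   100.0 * mustExecute(64, p), 100.0 * wastedFraction(64, p));
        }
        return 0;
    }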
I thought otherwise, or to quote from the article linked in the starting post:
One side of the branch being empty doesn't change anything in that matter.
And in principle, the threadgroup size is not a function of the VLIW size. A VLIW unit does not execute instructions from 4 threads in a clock, but only from one, the same as nvidia's "scalar" units. So a VLIW architecture by itself doesn't change anything about the "branching waste ratio".
It isn't. That's my point.
Maybe I don't understand the point, but how is a VLIW doing nothing more wasteful than scalar doing nothing?
But that brings you nowhere as it completely neglects the increased area and power efficiency of the VLIW units. You have to set the losses in relation to the total available shader power. And then it is a wash for all your scenarios.
We've started going in circles, so let me restate my position again. A VLIW setup is more wasteful whenever the following inequality is false: (cycles to complete the task on VLIW hardware) <= (cycles to complete the task on scalar hardware) / (VLIW width). It's that simple, and that's before taking wavefront sizes into consideration.
But that brings you nowhere as it completely neglects the increased area and power efficiency of the VLIW units. You have to set the losses in relation to the total available shader power. And then it is a wash for all your scenarios.
If a GPU with "scalar" SIMD engines (lol) loses 30% of its available shader power due to a branch, an AMD-style setup with SIMD engines (same width) consisting of VLIW units will also lose ~30% of its available shader power in most scenarios. Otherwise you could always claim a smaller/thinner GPU is better because it loses a smaller absolute amount of flops. But that metric simply doesn't make much sense in this respect.
And that is valid for all kinds of shaders, compute shaders included.
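A hedged numerical restatement of that relative-vs-absolute point, with made-up unit counts purely for illustration:
Code:
    // Illustration of the relative-vs-absolute-loss argument above, using
    // invented unit counts. Both chips lose the same 30% fraction of whatever
    // shader power they have; only the absolute number of lost flops differs.
    #include <cstdio>

    int main()
    {
        const double branch_loss = 0.30;           // fraction lost in the branch

        double scalar_peak = 512.0 * 2.0;          // flops/clock, "scalar" SIMD lanes
        double vliw_peak   = 512.0 * 4.0 * 2.0;    // same SIMD width, VLIW4 units

        printf("scalar: loses %.0f of %.0f flops/clock (%.0f%%)\n",
               scalar_peak * branch_loss, scalar_peak, 100.0 * branch_loss);
        printf("VLIW4 : loses %.0f of %.0f flops/clock (%.0f%%)\n",
               vliw_peak * branch_loss, vliw_peak, 100.0 * branch_loss);
        return 0;
    }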
if (condition) {
    do A
} else {
    do B
}
do C
if (condition) {
    do A and C
} else {
    do B and C
}
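To make the transformation above concrete, a small hedged device-code example (the arithmetic is arbitrary and only stands in for "do A", "do B" and "do C"):
Code:
    // Concrete version of the abstract transformation above: once "do C" is
    // duplicated into both branches, its operations are independent of A/B
    // and live under the same predicate, so a VLIW compiler can pack them
    // into the same bundles.
    __device__ float branchMerged(float x, float y, bool cond)
    {
        float a, c;
        if (cond) {
            a = x * x + 1.0f;      // "do A"
            c = y * 0.5f + x;      // "do C", independent of A, can share a bundle
        } else {
            a = x - 3.0f;          // "do B"
            c = y * 0.5f + x;      // the same "do C", duplicated into this side
        }
        return a + c;
    }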
Basically there are three possibilities:
Ok, I see what you mean, but in many cases a branch block is actually not that long. Since AMD's VLIW can't span across blocks (a VLIW bundle has only one predicate), its utilization rate will be lower in such cases. Compilers can sometimes help a bit though, such as in a case like:
[..]
can be made into
[..] to increase utilization.
But considering how probable it is that a branch consists of no more than two VLIW instructions (the effect largely disappears into the noise already with more than a single instruction, as we approach scenario (ii)) and cannot be transformed by the compiler (I admit AMD's shader compiler may not be the most clever one, but that is not the question here), I would conclude that possibilities (i) and (iii) cancel each other out.
Maybe I don't understand the point, but how is a VLIW doing nothing more wasteful than scalar doing nothing?
I dunno, looks to me like 2D would be more "LIW" rather than "VLIW".
Personally I don't think VLIW is a bad idea, as it's always a trade-off. It's just that 5D or even 4D VLIW seems a bit too "big" in my book. Of course, a 5D VLIW made sense when most shaders were of a graphics nature, but now even games use a lot of non-graphics (or at least not directly graphics-related) shaders for various effects. Right now AMD's OpenCL compiler is already quite good at finding ILP, and you can simply write a scalar program and get pretty good results (although in some cases a vectorized version is still much faster). So sometimes I think a 2D or 3D VLIW is probably a more manageable size.
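As an illustration of that scalar-versus-vectorized point, here is a hedged sketch written as CUDA (the post is talking about OpenCL on AMD hardware; the kernels and names are invented):
Code:
    // Scalar vs. hand-vectorized formulations of the same trivial work.
    #include <cuda_runtime.h>

    // scalar formulation: one element per thread; the compiler/hardware has
    // to find the ILP needed to fill VLIW slots on its own
    __global__ void scaleScalar(const float *in, float *out, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * k + 1.0f;
    }

    // hand-vectorized formulation: four elements per thread, which explicitly
    // hands a VLIW4/VLIW5 unit four independent operations at a time
    __global__ void scaleVec4(const float4 *in, float4 *out, float k, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = in[i];
            out[i] = make_float4(v.x * k + 1.0f, v.y * k + 1.0f,
                                 v.z * k + 1.0f, v.w * k + 1.0f);
        }
    }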