Predicting GPU Performance for AMD and Nvidia

there's no such thing as a partially utilised scalar ALU.
So, if we have a scalar ALU with fully pipelined FP multiplier and FP adder units, and we perform a floating-point multiply instruction, what is the utilisation of the adder?
 
So sometimes I think a 2- or 3-slot VLIW is probably a more manageable size.
But then you will be forced either to double the wavefront size (which is bad for obvious reasons) or to double the amount of scheduling hardware, which probably costs more area and power for the same performance gain than just adding a few more SIMD engines.
You are completely right here:
After all, what matters most is neither peak performance nor efficiency relative to peak, but performance per cost and per watt.

A 16-wide scalar processor with 2 idle threads is at 87.5% utilization. A 16-wide VLIW5 with 2 idle threads is running at a maximum of 87.5% and minimum of 17.5% depending on VLIW occupancy. Reality is somewhere in between but it can never be higher than the scalar setup on average.
It doesn't have to be. A VLIW utilization of 60% to 70% is completely enough to deliver higher shader performance in the same or a lower area/power envelope. And that is easy to get on average.
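To put rough numbers on the exchange above (a quick Python sketch; the 3.5-slot average fill is just an assumed figure in the 60-70% range mentioned, not a measured one):

```python
# Back-of-the-envelope utilization comparison (illustrative only).
# Assumes a 16-wide SIMD engine with 2 idle threads; a scalar lane is
# one slot, a VLIW5 lane is five slots.

ACTIVE_THREADS = 14
WIDTH = 16

# Scalar: one slot per lane, so utilization is simply active/width.
scalar_util = ACTIVE_THREADS / WIDTH                      # 0.875

# VLIW5: utilization also depends on how many of the 5 slots the
# compiler fills per instruction group ("fill" = average slots used).
def vliw_util(fill, slots=5):
    return (ACTIVE_THREADS / WIDTH) * (fill / slots)

print(f"scalar:          {scalar_util:.1%}")              # 87.5%
print(f"VLIW5, fill=5:   {vliw_util(5):.1%}")             # 87.5% (best case)
print(f"VLIW5, fill=1:   {vliw_util(1):.1%}")             # 17.5% (worst case)
print(f"VLIW5, fill=3.5: {vliw_util(3.5):.1%}")           # ~61% (the 60-70% regime)
```

The argument is that the ~60-70% regime can still win on performance per area and per watt, because each VLIW lane packs several ALUs behind a single thread's worth of scheduling hardware.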

If you accept the workload-dependent variation of VLIW utilization as an inherent feature of an architecture that exploits ILP, you will see that two idle threads in a 16-wide SIMD engine always give you two idle lanes, irrespective of the architecture. If the average VLIW utilization for the code in question is x/5 (or x/4), the scalar units will be idle for (at least) x times as many cycles as the VLIW units. It simply doesn't matter how many threads are idle (or whether any are idle at all); the (code-dependent) relative shader performance isn't changed by idle threads.
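A small sketch of that proportionality argument (hypothetical numbers; "lane" means one slot of the 16-wide SIMD engine, so one VLIW5 lane groups five ALUs):

```python
# Idle-lane-cycles caused by 2 idle threads, scalar vs VLIW5.
# Assumes N instructions per thread and an average VLIW fill of x slots
# per bundle, so the VLIW machine needs N/x issue cycles where the
# scalar machine needs N.

N = 1000          # instructions per thread (arbitrary)
x = 3.5           # average VLIW slots filled per bundle (assumed)
idle_lanes = 2

scalar_cycles = N
vliw_cycles = N / x

scalar_idle_lane_cycles = idle_lanes * scalar_cycles    # 2000
vliw_idle_lane_cycles = idle_lanes * vliw_cycles        # ~571

print(scalar_idle_lane_cycles / vliw_idle_lane_cycles)  # == x
# Idle threads cost both architectures in the same proportion, so they
# don't change the relative shader throughput.
```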
 
So, if we have a scalar ALU with fully pipelined FP multiplier and FP adder units, and we perform a floating-point multiply instruction, what is the utilisation of the adder?

The throughput of the unit is one instruction per clock and that multiply satisfies my definition of 100% utilization. Looking at which specific transistors are idling is entering the realm of extreme silliness :) Of course you can choose to define throughput in flops which would only give you 100% in the case of an all MADD workload.
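To make the two definitions concrete (an illustrative sketch, not tied to any specific GPU):

```python
# Two ways to define "utilization" for a fully pipelined unit that can
# retire one MUL, ADD, or MADD per clock.

PEAK_INSTR_PER_CLK = 1   # instruction-throughput definition
PEAK_FLOPS_PER_CLK = 2   # flop definition: a MADD counts as 2 flops

FLOPS_PER_OP = {"MUL": 1, "ADD": 1, "MADD": 2}

def instr_utilization(ops):
    # One op issued per clock -> always 100% by the instruction metric.
    return len(ops) / (len(ops) * PEAK_INSTR_PER_CLK)

def flop_utilization(ops):
    return sum(FLOPS_PER_OP[o] for o in ops) / (len(ops) * PEAK_FLOPS_PER_CLK)

stream = ["MUL", "MADD", "ADD", "MADD"]
print(instr_utilization(stream))   # 1.0  -> "100% utilized"
print(flop_utilization(stream))    # 0.75 -> only an all-MADD stream hits 1.0
```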

@Gipsel, yeah, when you include practical factors like measured performance and die size, you get a different picture.
 
The throughput of the unit is one instruction per clock and that multiply satisfies my definition of 100% utilization. Looking at which specific transistors are idling is entering the realm of extreme silliness :)
Perhaps you aren't cut out for the world of performance per clock :)
Seriously, though, my post was in response to the debate that VLIW would be underutilised while scalar systems would be fully utilised. In terms of instructions, that may seem to be the situation, but in terms of what is actually being used in the hardware, it might not be that different at all.
 
Seriously, though, my post was in response to the debate that VLIW would be underutilised while scalar systems would be fully utilised. In terms of instructions, that may seem to be the situation, but in terms of what is actually being used in the hardware, it might not be that different at all.
If you look just at the ALUs, a VLIW architecture will have more transistors sitting idle on average. The ALUs in AMD's VLIW units have basically the same functionality as those needed for a scalar architecture. So if you have a multiplication in one slot, the adder sits idle, same as with Nvidia. But if a VLIW instruction isn't completely filled, whole ALUs stay idle (which is the case for roughly 25% of the ALUs on average). This means VLIW starts from a lowered baseline, so to speak. But this is mitigated by the fact that you waste far fewer transistors on scheduling and data-distribution hardware.
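A rough sketch of that "lowered baseline" (the 25% figure is the average slot fill quoted above; the multiplier/adder split and the share of non-MADD ops are purely illustrative assumptions):

```python
# How much of the ALU hardware sits busy, scalar vs VLIW, assuming
# ~75% average VLIW slot occupancy and that a plain MUL or ADD uses
# only half of an ALU's multiplier+adder pair.

avg_slot_fill = 0.75        # ~25% of VLIW ALUs idle from unfilled bundles
frac_single_op = 0.5        # assumed share of non-MADD ops (illustrative)
subunit_idle = frac_single_op * 0.5   # idle multiplier/adder halves

scalar_alu_busy = 1.0 * (1 - subunit_idle)        # no slot-fill penalty
vliw_alu_busy   = avg_slot_fill * (1 - subunit_idle)

print(f"scalar ALU hardware busy: {scalar_alu_busy:.0%}")  # 75%
print(f"VLIW ALU hardware busy:   {vliw_alu_busy:.0%}")    # ~56%
# The gap is the lowered baseline; the offsetting saving is in the
# scheduling/operand-distribution hardware, which this sketch ignores.
```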
 
Guys, utilization alone is a worthless metric.

Here's a better way to look at it: Cayman and GF114 are almost the same size, are both 256-bit, and both have 384 shader units. The difference is that each Cayman unit can do up to 4 ops per clock at 900 MHz, if they can be packed, while each GF114 unit can do 1 op per clock at 1645 MHz. Saying that a scalar stream results in 25% utilization on AMD and that it is therefore inefficient is disingenuous.
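Putting those numbers side by side (peak-rate arithmetic only, using the unit counts and clocks quoted in the post above; actual clocks vary by SKU):

```python
# Peak ALU op rates implied by the numbers above, in Gops/s.
cayman_peak = 384 * 4 * 0.900e9       # ~1382 Gops/s at full VLIW4 fill
gf114_peak  = 384 * 1 * 1.645e9       # ~ 632 Gops/s

# A purely scalar instruction stream fills only 1 of Cayman's 4 slots:
cayman_scalar_stream = 384 * 1 * 0.900e9   # ~346 Gops/s

print(f"Cayman peak:           {cayman_peak / 1e9:.0f} Gops/s")
print(f"GF114 peak:            {gf114_peak / 1e9:.0f} Gops/s")
print(f"Cayman, scalar stream: {cayman_scalar_stream / 1e9:.0f} Gops/s")
# "25% utilization" on Cayman is measured against a much higher peak,
# which is why quoting that percentage in isolation is misleading.
```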

I think AMD made the right decision. Their problem is not with the shader architecture. It's with either geometry or state changes.
 