Historical GPU FLOPs performance

Also, you can probably argue that before GT200, Nvidia's unified GPUs could only issue a MAD per clock, whereas from GT200 onwards it's MAD+MUL. Hence 346 GFLOPs for the 8800 GTX.
And even that is too simple if you wanted to be perfectly honest: G80 can use half the MUL in CUDA and none in 3D, while GT200 can use all the MUL in CUDA but only half in 3D. Yay? :D
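For reference, the 346 figure falls straight out of ALUs * flops per ALU per clock * shader clock. Here's a quick C sketch; the 128 SPs @ 1.35 GHz (8800 GTX) and 240 SPs @ 1.296 GHz (GTX 280) are the usual published specs rather than anything stated in this thread, so treat them as assumptions:

#include <stdio.h>

/* peak programmable-shader GFLOPs = ALU count * flops issued per ALU per clock * clock in GHz */
static double peak_gflops(int alus, int flops_per_alu_per_clock, double clock_ghz)
{
    return alus * flops_per_alu_per_clock * clock_ghz;
}

int main(void)
{
    /* G80 counts MAD only (2 flops); GT200 counts MAD+MUL (3 flops), as per the post above */
    printf("8800 GTX (G80, MAD only):  %.1f GFLOPs\n", peak_gflops(128, 2, 1.35));  /* ~345.6 */
    printf("GTX 280 (GT200, MAD+MUL): %.1f GFLOPs\n", peak_gflops(240, 3, 1.296)); /* ~933 */
    return 0;
}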
Also, oops, I realized I made a mistake wrt R300's PS and forgot the free ADD; so just multiply that by 1.5x!
 
John Owens at UC Davis has historical data and a graph for this that he shows regularly.
 
Aren't those double-precision numbers? I thought all those chips were double that in single precision.
Only the Core 2 numbers should be doubled. They can execute MUL and ADD in parallel while Pentiums only have a single execution port for both (and only half the SIMD width).
 
Only the Core 2 numbers should be doubled. They can execute MUL and ADD in parallel while Pentiums only have a single execution port for both (and only half the SIMD width).
The P4 number needs doubling too, as it is able to execute 2 MULs and 2 ADDs per cycle (single precision). Having a single execution port for SSEx instructions doesn't become a bottleneck (if the MULs and ADDs are interleaved), as each of the execution pipes (MUL and ADD) only accepts one 128-bit operation every two cycles.
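To spell that out: 2 SP MULs + 2 SP ADDs per clock is 4 SP flops per clock. A trivial C check (the 1.5 GHz clock is just an example, matching the figure in the P4 paper quoted further down):

#include <stdio.h>

int main(void)
{
    /* per the post above: each SSE pipe accepts one 128-bit uop every 2 clocks,
       so interleaved MULPS/ADDPS average 2 SP muls + 2 SP adds per clock */
    double flops_per_clock = 2.0 + 2.0;
    double clock_ghz = 1.5;   /* example clock; scale for other P4 speeds */
    printf("P4 @ %.1f GHz: %.1f SP GFLOPS peak\n", clock_ghz, flops_per_clock * clock_ghz); /* 6.0 */
    return 0;
}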
 
According to the book "Opening the Xbox", page 270:
"The Xbox had 21.6 gigaflops of computing power"

Obviously the lion's share of that came from the NV2x-based NV2A GPU, and I believe that roughly 20 GFLOPs is a reasonably accurate figure (and thus not 'NvFlops'). Compare that with what Microsoft originally announced the Xbox GPU would do at GDC 2000: 140 GFLOPs. In 2001, when the final downgraded Xbox GPU was done, Nvidia said the NV2A does 80 GFLOPs.

So obviously GeForce 3's 76 GFLOPs is 'NvFlops'. Given that the GeForce 3 (NV20) only has 1 vertex shader, it's only going to produce around half the flops per clock cycle of the NV2A, which has 2 vertex shaders. So something like 10 GFLOPs might be more accurate.

The Flipper GPU in GameCube produces around 8.6 GFLOPs, given that the entire GameCube is rated at 10.5 GFLOPs and the Gekko CPU does 1.9 GFLOPs.
 
Megadrive: Actually, in terms of programmable flops, NV2A is only 4.66 GFLOPs! (2 vertex pipelines * 5 units * 2 ops via MADD * 233 MHz) - I don't know where that number comes from; maybe it counts the FP32 texture addressing in the pixel shaders or the FP-based triangle setup engine? Either way, that is not exposed to the programmer.
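Spelled out, that number is just the product of the factors in the brackets; a minimal C version using only the figures from the post above:

#include <stdio.h>

int main(void)
{
    /* 2 VS pipelines * 5 units each * 2 flops per unit per clock (MADD) * 233 MHz */
    double gflops = 2.0 * 5.0 * 2.0 * 233e6 / 1e9;
    printf("NV2A programmable VS peak: %.2f GFLOPs\n", gflops); /* 4.66 */
    return 0;
}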

Crystall: Yes, I actually corrected that in a later post, but I probably should have edited my original one too - oops! :) Done that now, cheers.
 
NV30 is terrible when it comes to flops. You really need to know what you count into this (shader flops, texture filtering flops, ROP flops, ...) to make ANY sense out of it. And even then it's more apples and oranges than anything else.
What I found interesting with NV3x was the similar performance of the 5200 vs the 5600 at the same clock speed, especially given the ~47M vs ~80M transistor difference. (NV36 sorted the VS & clock issues of NV31.)
 
Radeon 8500 had two VS engines, but I can't find anywhere whether they were Vec4 or Vec5; presumably the latter, like R300+. PS-wise it's the same as NV2x though: 0 flops... Radeon 9000 was the same but with only 1 VS engine.
Note that those VS engines aren't "true" Vec5. There's a Vec4 unit and a scalar unit. R200 cannot dual-issue an instruction to both at the same time, thus it's irrelevant for a peak flop number. R300-R500 can dual-issue to both; however, for peak flops it's still irrelevant, since they can't dual-issue when the vector engine executes a MAD (the scalar instruction is encoded in the 3rd source operand field for dual-issue). The scalar unit is limited to 2 source operands (so only MUL, no MAD) too.
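To make the peak-flop point concrete: given the constraints above, the best legal dual-issue is a Vec4 MUL or ADD plus a scalar MUL, which still yields fewer flops per clock than a Vec4 MAD on its own. A quick count in C; the MAD = 2 flops, MUL/ADD = 1 flop per component accounting is my assumption here:

#include <stdio.h>

int main(void)
{
    /* flops per clock for one R300-style VS engine (Vec4 unit + scalar unit) */
    int vec4_mad          = 4 * 2;     /* Vec4 MAD, no co-issue possible         -> 8 */
    int vec4_mul_plus_sca = 4 * 1 + 1; /* Vec4 MUL/ADD co-issued with scalar MUL -> 5 */
    printf("Vec4 MAD alone:        %d flops/clock\n", vec4_mad);
    printf("Vec4 MUL + scalar MUL: %d flops/clock\n", vec4_mul_plus_sca);
    /* so the peak figure comes from the MAD alone; the scalar unit never raises it */
    return 0;
}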
 
The P4 number needs doubling too, as it is able to execute 2 MULs and 2 ADDs per cycle (single precision). Having a single execution port for SSEx instructions doesn't become a bottleneck (if the MULs and ADDs are interleaved), as each of the execution pipes (MUL and ADD) only accepts one 128-bit operation every two cycles.
I've heard that claim before, but never found a conclusive confirmation. Indeed having multiple execution units bound to the same port doesn't prevent them from simultaneous interleaved execution, but according to the Intel documentation FP-MUL and FP-ADD are subunits within the same FP execution unit. Subunits have no separate control as far as I know. So the ADDPS and MULPS instructions can't execute simultaneously like that. Interestingly this doesn't apply to the integer SSE instructions. PADDW executes on the MMX-ALU subunit which is part of the MMX execution unit (sharing the same port as the FP unit) while PMULLW executes on the FP-MUL subunit.
 
What I found interesting with NV3x was the similar performance of the 5200 vs the 5600 at the same clock speed, especially given the ~47M vs ~80M transistor difference. (NV36 sorted the VS & clock issues of NV31.)
I've never had a good answer to that, FWIW; I'm sure the "off-shader" parts were sufficiently different to save a lot of transistors (things like compression but not just that), but still... It might ironically indicate NV30's performance/mm² inefficiency was not where you would expect it to be.

mczak: Very interesting, never knew that - cheers! :)
 
I've heard that claim before, but never found a conclusive confirmation. Indeed having multiple execution units bound to the same port doesn't prevent them from simultaneous interleaved execution, but according to the Intel documentation FP-MUL and FP-ADD are subunits within the same FP execution unit. Subunits have no separate control as far as I know.
IIRC the subunits can execute different instructions in parallel (i.e. they have separate control). Anyway I can try this theory out as I've got a few P4s lying around, both Northwood and Prescott.
So the ADDPS and MULPS instructions can't execute simultaneously like that. Interestingly this doesn't apply to the integer SSE instructions. PADDW executes on the MMX-ALU subunit which is part of the MMX execution unit (sharing the same port as the FP unit) while PMULLW executes on the FP-MUL subunit.
That seems to confirm my memories that the different subunits can execute independently.
 
IIRC the subunits can execute different instructions in parallel (i.e. they have separate control). Anyway I can try this theory out as I've got a few P4s lying around, both Northwood and Prescott.
That would be great, thanks.
That seems to confirm my memories that the different subunits can execute independently.
No, note that MMX-ALU and FP-MUL are part of different execution units (namely MMX and FP), while FP-MUL and FP-ADD are part of the same execution unit. So it only confirms that execution units bound to the same port can execute independently, not subunits within an execution unit.
 
I've heard that claim before, but never found a conclusive confirmation. Indeed having multiple execution units bound to the same port doesn't prevent them from simultaneous interleaved execution, but according to the Intel documentation FP-MUL and FP-ADD are subunits within the same FP execution unit. Subunits have no separate control as far as I know. So the ADDPS and MULPS instructions can't execute simultaneously like that. Interestingly this doesn't apply to the integer SSE instructions. PADDW executes on the MMX-ALU subunit which is part of the MMX execution unit (sharing the same port as the FP unit) while PMULLW executes on the FP-MUL subunit.
Not sure how, but running FP-MUL and FP-ADD simultaneously should indeed work. From a P4 paper, p4.pdf (found it here: http://www-cse.ucsd.edu/classes/sp02/cse241/p4.pdf):
"Many FP/multi-media applications have a fairly balanced set of multiplies and adds. The machine can usually keep busy interleaving a multiply and an add every two clock cycles at much less cost than fully pipelining all the FP/SSE execution hardware. In the Pentium 4 processor, the FP adder can execute one Extended-Precision (EP) addition, one Double-Precision (DP) addition, or two Single-Precision (SP) additions every clock cycle. This allows it to complete a 128-bit SSE/SSE2 packed SP or DP add uop every two clock cycles. The FP multiplier can execute either one EP multiply every two clocks, or it can execute one DP multiply or two SP multiplies every clock.
This allows it to complete a 128-bit IEEE SSE/SSE2 packed SP or DP multiply uop every two clock cycles giving a peak 6 GFLOPS for single precision or 3 GFLOPS for double precision floating-point at 1.5GHz. "
Otherwise the numbers wouldn't add up...
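If anyone does dig out one of those P4s, here's a minimal test sketch in C with SSE intrinsics (compile with something like gcc -O2 -msse). The helper names and iteration count are mine, and clock()-based timing is crude, so treat the output as indicative only. The idea: several independent MULPS chains, then the same plus interleaved ADDPS chains; if the FP-MUL and FP-ADD pipes really do overlap, the mixed loop (twice the uops) should take roughly the same time as the MUL-only one rather than twice as long.

#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_set1_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

#define ITERS 50000000L

static volatile float sink;  /* keeps results live so the optimizer can't drop the loops */

/* MULPS only: four independent chains to hide latency, 4 muls per iteration */
static double mul_only(void)
{
    __m128 m = _mm_set1_ps(1.0000001f);
    __m128 a0 = m, a1 = m, a2 = m, a3 = m;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++) {
        a0 = _mm_mul_ps(a0, m);
        a1 = _mm_mul_ps(a1, m);
        a2 = _mm_mul_ps(a2, m);
        a3 = _mm_mul_ps(a3, m);
    }
    clock_t t1 = clock();
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3)));
    sink = out[0];
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* same MULPS chains plus four independent ADDPS chains interleaved */
static double mul_add_interleaved(void)
{
    __m128 m = _mm_set1_ps(1.0000001f);
    __m128 a0 = m, a1 = m, a2 = m, a3 = m;
    __m128 b0 = m, b1 = m, b2 = m, b3 = m;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++) {
        a0 = _mm_mul_ps(a0, m);  b0 = _mm_add_ps(b0, m);
        a1 = _mm_mul_ps(a1, m);  b1 = _mm_add_ps(b1, m);
        a2 = _mm_mul_ps(a2, m);  b2 = _mm_add_ps(b2, m);
        a3 = _mm_mul_ps(a3, m);  b3 = _mm_add_ps(b3, m);
    }
    clock_t t1 = clock();
    float out[4];
    __m128 s = _mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3));
    s = _mm_add_ps(s, _mm_add_ps(_mm_add_ps(b0, b1), _mm_add_ps(b2, b3)));
    _mm_storeu_ps(out, s);
    sink = out[0];
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("MULPS only:        %.2f s\n", mul_only());
    printf("MULPS+ADDPS mixed: %.2f s\n", mul_add_interleaved());
    /* roughly equal times => the MUL and ADD pipes overlap; ~2x => they don't */
    return 0;
}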
 