Historical GPU arithmetic performance

Arun · Jan 12, 2009

Jawed said:
Also you can prolly argue that before GT200, NVidia's unified GPUs could only issue a MAD per clock, whereas from GT200 onwards it's MAD+MUL. Hence 346GFLOPs for 8800GTX.

And even that is too simple if you wanted to be perfectly honest: G80 can use half the MUL in CUDA and none in 3D, while GT200 can use all the MUL in CUDA but only half in 3D. Yay?
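For reference, the 346GFLOPs figure is just ALUs x shader clock x flops per ALU per clock. A rough Python sketch, assuming the 8800GTX's 128 scalar ALUs and 1350MHz shader clock (neither number is stated in this thread). It reads "half the MUL" as 0.5 extra flops per ALU per clock on average, which is only one interpretation. `peak_gflops` is a made-up name for illustration:

```python
def peak_gflops(alus, clock_mhz, mul_fraction=0.0):
    """Peak GFLOPs for a MAD (2 flops) plus a partly usable extra MUL (1 flop)."""
    flops_per_clock = 2.0 + mul_fraction
    return alus * clock_mhz * flops_per_clock / 1000.0

# Assumed 8800GTX specs (not stated in the thread): 128 ALUs at 1350 MHz.
G80_ALUS, G80_MHZ = 128, 1350

mad_only = peak_gflops(G80_ALUS, G80_MHZ)         # 3D on G80: no MUL usable
cuda_half = peak_gflops(G80_ALUS, G80_MHZ, 0.5)   # CUDA on G80: half the MUL
mad_mul = peak_gflops(G80_ALUS, G80_MHZ, 1.0)     # full MAD+MUL counting

print(f"{mad_only:.1f} {cuda_half:.1f} {mad_mul:.1f}")  # 345.6 432.0 518.4
```

The MAD-only result rounds to the 346 quoted above; the other two show how much the answer moves depending on how much of the MUL you count.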

Also, oops, I realized I made a mistake wrt R300's PS and forgot the free ADD; so just multiply that by 1.5x!

mhouston · Jan 12, 2009

John Owens at UC Davis has historical data and a graph for this that he shows regularly.

Nick · Jan 12, 2009

pjbliverpool said:
Aren't they double precision numbers? I thought all those chips were double that in single precision.

Only the Core 2 numbers should be doubled. They can execute MUL and ADD in parallel while Pentiums only have a single execution port for both (and only half the SIMD width).

KonKort · Jan 12, 2009

Well, look: here is a list of all Nvidia and ATI GPUs since GeForce 2 / Radeon 7000, with their flops.
But do not compare the values 1:1. There are many architectural differences.

Nvidia list

ATI list

caboosemoose · Jan 13, 2009

@ Kon Kort

Thanks - that is exactly what I was after.

Davros · Jan 13, 2009

Are the numbers for the P4s with hyperthreading taken into account?

crystall · Jan 13, 2009

Nick said:
Only the Core 2 numbers should be doubled. They can execute MUL and ADD in parallel while Pentiums only have a single execution port for both (and only half the SIMD width).

The P4 number needs doubling too as it is able to execute 2 MULs and 2 ADDs per cycle (single precision). Having a single execution port for SSEx instructions doesn't become a bottleneck (if the MULs and ADDs are interleaved) as the execution pipes (MUL and ADD) each only accept one 128-bit operation every two cycles.

crystall · Jan 13, 2009

Arun said:
R300 had 4 VS, but the PS had 8*4*2 flops available to it (FP24 obviously).

Wasn't the R300 able to do 4 MADs and 4 ADDs per quad? That should make it 8*4*3. That holds true for R400 and R500 IIRC.
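A quick sketch of the per-clock arithmetic being argued here. It assumes the factors in 8*4*2 mean 8 pixel pipelines, 4 vector components and 2 flops per MAD, which is my reading rather than something spelled out in the thread. The constant and function names are made up for illustration:

```python
PIPES = 8        # pixel pipelines (assumed meaning of the "8")
COMPONENTS = 4   # vector components per pipeline (assumed meaning of the "4")
MAD_FLOPS = 2    # one multiply plus one add
ADD_FLOPS = 1    # the extra ADD in question

def ps_flops_per_clock(flops_per_component):
    """Pixel shader flops per clock for a given per-component flop count."""
    return PIPES * COMPONENTS * flops_per_component

mad_only = ps_flops_per_clock(MAD_FLOPS)
mad_plus_add = ps_flops_per_clock(MAD_FLOPS + ADD_FLOPS)

print(mad_only, mad_plus_add, mad_plus_add / mad_only)  # 64 96 1.5
```

The 1.5 ratio is the same "multiply that by 1.5x" correction Arun mentioned earlier.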

Megadrive1988 · Jan 13, 2009

According to the book "Opening the Xbox", page 270:

"The Xbox had 21.6 gigaflops of computing power"

Obviously the lion's share of that came from the NV2x-based NV2A GPU, and I believe that roughly 20 or so GFLOPs is a reasonably accurate figure (and thus not 'NvFlops'). By comparison, when Microsoft originally announced the Xbox at GDC 2000, they said the GPU would do 140 GFLOPs. In 2001, when the final downgraded Xbox GPU was done, Nvidia said NV2A does 80 GFLOPs.

So obviously GeForce 3's 76 GFLOPs is 'NvFlops'. Given that GeForce 3 (NV20) only has 1 vertex shader, it's only going to produce around 1/2 the flops per clock cycle of the NV2A, which has 2 vertex shaders. So something like 10 GFLOPs might be more accurate.

The Flipper GPU in GameCube produces around 8.6 GFLOPs, given that the entire GameCube is rated at 10.5 GFLOPs and the Gekko CPU does 1.9 GFLOPs.

Arun · Jan 13, 2009

Megadrive: Actually, in terms of programmable flops, NV2A is only 4.66GFlops! (2 vertex pipelines * 5 units * 2 Ops via MADD * 233MHz) - I don't know where that number comes from, maybe it counts the FP32 texture addressing in the pixel shaders or the FP-based triangle setup engine? Either way that is not exposed to the programmer.
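Spelled out as a small Python sketch, using only the factors given above (`programmable_gflops` is a made-up helper name):

```python
def programmable_gflops(pipelines, units_per_pipeline, ops_per_unit, clock_mhz):
    """Peak programmable GFLOPs = pipelines * units * ops per unit * clock."""
    return pipelines * units_per_pipeline * ops_per_unit * clock_mhz / 1000.0

# NV2A: 2 vertex pipelines * 5 units * 2 ops (MADD) * 233 MHz
nv2a = programmable_gflops(2, 5, 2, 233)
print(f"{nv2a:.2f} GFLOPs")  # 4.66 GFLOPs
```

That is less than a quarter of the ~20 GFLOPs figure, so whatever that number counts, most of it has to be fixed-function work.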

Crystall: Yes, I actually corrected that in a later post, but I probably should have edited my original one too - oops!

Done that now, cheers.

stevem · Jan 13, 2009

MDolenc said:
NV30 is terrible when it comes to flops. You really need to know what you count into this (shader flops, texture filtering flops, ROP flops,...) to make ANY sense out of it. And even then it's more apples and oranges than anything else.

What I found interesting with NV3x was the similar performance of 5200 vs 5600 at the same clock speed, esp. given the ~47m vs ~80m transistor difference. (NV36 sorted the VS & clock issues of NV31).

mczak · Jan 13, 2009

Arun said:
Radeon 8500 had two VS engines, but I can't find whether they were Vec4 or Vec5 anywhere; presumably the latter like R300+. Same as for NV2x PS-wise though, 0 flops... Radeon 9000 was the same but only 1 VS engine.

Note that those VS engines aren't "true" Vec5. There's a Vec4 unit and a scalar unit. R200 cannot dual issue an instruction to both at the same time, thus it's irrelevant for a peak flop number. R300-R500 can dual issue to both, however for peak flop it's still irrelevant since it can't dual issue when the vector engine executes a mad (since the encoding of the scalar unit is done in the 3rd source operand field for dual-issue). Scalar unit is limited to 2 source operands (so only mul and no mad) too.

Nick · Jan 13, 2009

Davros said:
Are the numbers for the P4s with hyperthreading taken into account?

HyperThreading doesn't affect peak performance. It merely lets two threads share the same execution units.

Davros · Jan 13, 2009

Ahh, yes, now that I've thought about it a bit more, I see.

Nick · Jan 13, 2009

crystall said:
The P4 number needs doubling too as it is able to execute 2 MULs and 2 ADDs per cycle (single precision). Having a single execution port for SSEx instructions doesn't become a bottleneck (if the MULs and ADDs are interleaved) as the execution pipes (MUL and ADD) each only accept one 128-bit operation every two cycles.

I've heard that claim before, but never found a conclusive confirmation. Indeed having multiple execution units bound to the same port doesn't prevent them from simultaneous interleaved execution, but according to the Intel documentation FP-MUL and FP-ADD are subunits within the same FP execution unit. Subunits have no separate control as far as I know. So the ADDPS and MULPS instructions can't execute simultaneously like that. Interestingly this doesn't apply to the integer SSE instructions. PADDW executes on the MMX-ALU subunit which is part of the MMX execution unit (sharing the same port as the FP unit) while PMULLW executes on the FP-MUL subunit.

Arun · Jan 13, 2009

stevem said:
What I found interesting with NV3x was the similar performance of 5200 vs 5600 at the same clock speed, esp. given the ~47m vs ~80m transistor difference. (NV36 sorted the VS & clock issues of NV31).

I've never had a good answer to that, FWIW; I'm sure the "off-shader" parts were sufficiently different to save a lot of transistors (things like compression but not just that), but still... It might ironically indicate NV30's performance/mm² inefficiency was not where you would expect it to be.

mczak: Very interesting, never knew that - cheers!

crystall · Jan 13, 2009

Nick said:
I've heard that claim before, but never found a conclusive confirmation. Indeed having multiple execution units bound to the same port doesn't prevent them from simultaneous interleaved execution, but according to the Intel documentation FP-MUL and FP-ADD are subunits within the same FP execution unit. Subunits have no separate control as far as I know.

IIRC the subunits can execute different instructions in parallel (i.e. they have separate control). Anyway I can try this theory out as I've got a few P4s lying around, both Northwood and Prescott.

Nick said:
So the ADDPS and MULPS instructions can't execute simultaneously like that. Interestingly this doesn't apply to the integer SSE instructions. PADDW executes on the MMX-ALU subunit which is part of the MMX execution unit (sharing the same port as the FP unit) while PMULLW executes on the FP-MUL subunit.

That seems to confirm my memories that the different subunits can execute independently.

mczak · Jan 13, 2009

Arun said:
mczak: Very interesting, never knew that - cheers!

Well, this stuff isn't a big secret anymore - you can find all that information here: http://www.x.org/docs/AMD/R5xx_Acceleration_v1.3.pdf. It also has sections covering differences to R3xx and even R2xx (well for VS at least) so you can see how it evolved.

Nick · Jan 13, 2009

crystall said:
IIRC the subunits can execute different instructions in parallel (i.e. they have separate control). Anyway I can try this theory out as I've got a few P4s lying around, both Northwood and Prescott.

That would be great, thanks.

crystall said:
That seems to confirm my memories that the different subunits can execute independently.

No, note that MMX-ALU and FP-MUL are part of different execution units (namely MMX and FP), while FP-MUL and FP-ADD are part of the same execution unit. So it only confirms that execution units bound to the same port can execute independently, not subunits within an execution unit.

mczak · Jan 13, 2009

Nick said:
I've heard that claim before, but never found a conclusive confirmation. Indeed having multiple execution units bound to the same port doesn't prevent them from simultaneous interleaved execution, but according to the Intel documentation FP-MUL and FP-ADD are subunits within the same FP execution unit. Subunits have no separate control as far as I know. So the ADDPS and MULPS instructions can't execute simultaneously like that. Interestingly this doesn't apply to the integer SSE instructions. PADDW executes on the MMX-ALU subunit which is part of the MMX execution unit (sharing the same port as the FP unit) while PMULLW executes on the FP-MUL subunit.

Not sure how but running fpmul and fpadd simultaneously should indeed work. From a p4.pdf (found it here: http://www-cse.ucsd.edu/classes/sp02/cse241/p4.pdf)
"Many FP/multi-media applications have a fairly balanced set of multiplies and adds. The machine can usually keep busy interleaving a multiply and an add every two clock cycles at much less cost than fully pipelining all the FP/SSE execution hardware. In the Pentium 4 processor, the FP adder can execute one Extended-Precision (EP) addition, one Double-Precision (DP) addition, or two Single-Precision (SP) additions every clock cycle. This allows it to complete a 128-bit SSE/SSE2 packed SP or DP add uop every two clock cycles. The FP multiplier can execute either one EP multiply every two clocks, or it can execute one DP multiply or two SP multiplies every clock. This allows it to complete a 128-bit IEEE SSE/SSE2 packed SP or DP multiply uop every two clock cycles giving a peak 6 GFLOPS for single precision or 3 GFLOPS for double precision floating-point at 1.5GHz."
Otherwise the numbers wouldn't add up...
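To make the "add up" part concrete, here is a small Python sketch using only the per-clock rates from the quote. The function name and the `overlap` flag are made up for illustration; `overlap=False` models the reading where the adder and multiplier cannot run at the same time:

```python
def p4_peak_gflops(clock_ghz, adds_per_clock, muls_per_clock, overlap=True):
    """Peak GFLOPs given per-clock adder and multiplier throughput.

    overlap=True:  adder and multiplier work concurrently (interleaved issue).
    overlap=False: only one of the two subunits is busy at any time.
    """
    if overlap:
        per_clock = adds_per_clock + muls_per_clock
    else:
        per_clock = max(adds_per_clock, muls_per_clock)
    return clock_ghz * per_clock

sp = p4_peak_gflops(1.5, 2, 2)                        # single precision
dp = p4_peak_gflops(1.5, 1, 1)                        # double precision
sp_no_overlap = p4_peak_gflops(1.5, 2, 2, overlap=False)

print(sp, dp, sp_no_overlap)  # 6.0 3.0 3.0
```

Only the overlapping case reproduces the quoted 6 GFLOPS / 3 GFLOPS peaks at 1.5GHz; without overlap the single precision peak would be half that.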
