Historical GPU arithmetic performance

I would much rather expect Extended Precision (32-bit mantissa, I-don't-remember-how-many-exponent-bits) to become standard rather than FP64, if we are to ever move away from FP32 (which I'm slightly skeptical about, frankly, unless we're thinking more than 5 years down the road?) - this is because it allows significant synergies with 32-bit INT for the execution units, although data paths still need to be wider obviously.
Isn't this how they implement DirectX 10 float and int support already? I can't imagine having separate units for them. Or are integers really stored in floats and thus limited to a 24-bit integer range?
 
Extended precision would mean non power of two alignment in memory.
Loading a value from memory wouldn't be quite as simple, or data structures would be padded to maintain alignment at the cost of space.
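
To put rough numbers on the alignment point, here is a minimal sketch. It assumes a hypothetical 40-bit (5-byte) extended format, since the thread never pins down the exponent width, and compares tightly packed storage against padding each value to 8 bytes. The sizes and function names are illustrative only.

```python
# Hypothetical sizes: the thread does not say how wide an "extended" value is.
EXT_BYTES = 5      # assumed 40-bit packed value
PADDED_BYTES = 8   # next power of two, used when padding for alignment


def packed_offsets(count, size=EXT_BYTES):
    """Byte offset of each element in a tightly packed array."""
    return [i * size for i in range(count)]


def misaligned(offsets, word=4):
    """Offsets that do not start on a `word`-byte boundary."""
    return [off for off in offsets if off % word != 0]


def padding_overhead(size=EXT_BYTES, padded=PADDED_BYTES):
    """Fraction of storage wasted when each value is padded."""
    return (padded - size) / padded


offsets = packed_offsets(8)
print(offsets)              # byte offsets of eight packed values
print(misaligned(offsets))  # the ones that straddle 32-bit boundaries
print(padding_overhead())   # share of memory lost to padding
```

Under these assumed sizes, six of every eight packed values start off a 32-bit boundary, while padding to 8 bytes wastes 3 of every 8 bytes.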
 
Would it also be good to have FP-precision monitors? (I can't believe we can see just 16M colors at most.)

Well, there's always analog CRT.
Also, the Matrox Parhelia supported 10-bit per colour output; GigaColor, they called it.
And my 9800 Pro (maybe others) apparently had 10-bit DACs, but I never found out how to enable 30-bit colour, or 40-bit if you include the alpha channel.
 
The mode was A2R10G10B10 and was available to full-screen D3D applications; not sure about OpenGL. Destination alpha is rarely used, so only having 2 bits is not a big deal. Lost Planet is one game that supports 2101010, and I'm sure there are others (TRAOD did ages ago).
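
For anyone wondering how 2101010 fits into 32 bits, here is a small sketch of the packing. The bit layout (alpha in the top two bits, then red, green, and blue in the lowest ten) is my understanding of how the A2R10G10B10 name reads from most to least significant bit, so treat it as an assumption. The function names are made up for illustration.

```python
def pack_a2r10g10b10(a, r, g, b):
    """Pack 2-bit alpha and 10-bit R, G, B into one 32-bit integer.

    Assumed layout: A in bits 31-30, R in 29-20, G in 19-10, B in 9-0.
    """
    if not 0 <= a <= 3:
        raise ValueError("alpha must fit in 2 bits")
    for channel in (r, g, b):
        if not 0 <= channel <= 1023:
            raise ValueError("colour channels must fit in 10 bits")
    return (a << 30) | (r << 20) | (g << 10) | b


def unpack_a2r10g10b10(value):
    """Inverse of pack_a2r10g10b10: returns (a, r, g, b)."""
    return (
        (value >> 30) & 0x3,
        (value >> 20) & 0x3FF,
        (value >> 10) & 0x3FF,
        value & 0x3FF,
    )


pixel = pack_a2r10g10b10(3, 1023, 0, 512)
print(hex(pixel))
print(unpack_a2r10g10b10(pixel))
print(2 ** 30, "colours versus", 2 ** 24, "at 8 bits per channel")
```

So the pixel is still 32 bits wide; the extra colour precision comes out of the alpha channel.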
 
Megadrive: Actually, in terms of programmable flops, NV2A is only 4.66GFlops! (2 vertex pipelines * 5 units * 2 Ops via MADD * 233MHz) - I don't know where that number comes from, maybe it counts the FP32 texture addressing in the pixel shaders or the FP-based triangle setup engine? Either way that is not exposed to the programmer.

Crystall: Yes, I actually corrected that in a later post, but I probably should have edited my original one too - oops! :) Done that now, cheers.
Hi, sorry for reviving such an old thread (I'm 15 years late, I know).
I was looking around the internet about the 20GFLOPS number of the original Xbox and I never knew where it came from, but I thought it was a bit excessive. The 4.66GFLOPS number seems more likely, and I now know that the GeForce3 architecture didn't have a completely programmable Pixel Pipeline, so, we only count the Vertex pipeline FLOPS.

There is only one thing I need to know: where does the 10 FLOPs per cycle figure come from? Is there any source for it? Some users at Wikipedia don't want the number to be put there because there is no reliable way to arrive at it (they don't count "random" (in their words) forums as a source), but if we can get the 10 FLOPs per cycle figure from some trusted site, then the calculation is just simple math to arrive at the 4.66GFLOPS figure.

Thanks, and sorry if it's not allowed to ask in such an old thread. I'm aware it has been a really long time since any activity happened here.
 
The common vertex ALU architecture in the old GPUs could handle vec4 + scalar MADD ops, which would make 5xADD + 5xMUL for a total of 10 FLOPs per cycle.

This source should be reputable enough for Wikipedia: https://link.springer.com/chapter/10.1007/978-3-031-14047-1_1
 
Wow, this is a trip back in time :)

I am not actually sure that 10 flops is correct anymore; possibly I got that from friends who used to develop on Xbox (but I have no memory of being told this specifically) or more likely I assumed that it was the same as NV40/PS3(/NV30?) for which a lot more public information is available…

Digging a bit, they had separate Vec4 MADD/dot product and special function (reciprocal etc.) pipelines, but there are no reliable indications those could be co-issued, or that there was a separate scalar multiply-add. So it looks an awful lot like 8 flops to me rather than 10…!
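
For reference, the arithmetic behind both candidate figures can be sketched as below. The pipeline count and clock are the ones quoted earlier in the thread (2 vertex pipelines at 233MHz); the function name is made up for illustration, and this only computes theoretical peaks, not anything measured.

```python
def peak_gflops(pipelines, flops_per_cycle, clock_mhz):
    """Theoretical peak: pipelines * flops per cycle * clock, in GFLOPS."""
    return pipelines * flops_per_cycle * clock_mhz / 1000.0


VEC4_MADD = 4 * 2    # four lanes, one multiply and one add each
SCALAR_MADD = 1 * 2  # the extra scalar multiply-add, if it exists

# Vec4 + scalar MADD per pipeline (the 10 flops/cycle assumption)
print(peak_gflops(2, VEC4_MADD + SCALAR_MADD, 233))

# Vec4 MADD only (the 8 flops/cycle reading)
print(peak_gflops(2, VEC4_MADD, 233))
```

The first case reproduces the 4.66 figure; the second comes out at roughly 3.73 GFLOPS.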

I’d also caution against focusing too much on these numbers, because arguably the thing that mattered most was the programmable fixed point operations for the Pixel Shader, which are “ops” rather than “flops” but no less important.



BTW… since the time I first replied in this thread 15 years ago, I worked at Imagination/PowerVR, and there are two weird flops-related factoids I can’t resist revealing, because they really highlight how difficult it is to get this kind of information right about some of the crazy exotic architectures from before everyone just standardised on scalar SIMT!

SGX-XT (543/544/554) was a SIMD architecture with dual-issue but a lot of constraints, partly because it stuck to the 64-bit instruction encoding length of the original SGX, partly due to register bandwidth/register type constraints, and partly because of some small but critical things that were missed because a proper compiler was only developed wayyy after the HW. Even though it encouraged FP16, which was much more power efficient and made decent efficiency much easier to reach, every ALU was FP32 capable!

There was a separate HW unit to calculate “x*x+y*y+z*z” for FP32/FP16 normalisation (i.e. 5 flops: 3 multiplies and 2 adds). It wasn’t well documented that it was separate from the main ALU, but under the right set of constraints, it could be co-issued with a Vec4 FP32 MADD! That means under the right magical unrealistic circumstances (I did see it happen once or twice across a lot of shaders I think hah), it could achieve up to 13 flops per cycle in FP32!

The public marketing was 9 flops (Vec4 + special function which was the much more common dual-issue, although even that was difficult to achieve in practice) - thankfully very few people knew about this (not sure even architecture team did tbh!) and the 13 flops was never used as a marketing number anywhere… As I said, even 9 flops was very optimistic, the average flops/clk for real workloads was much lower even after the compiler matured (and the original compiler that was used for the early customer evaluations practically didn’t even do any vectorisation, needless to say that didn’t go well against ARM/Mali).
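
As a quick sanity check on how those per-cycle totals add up, here is a sketch that tallies the co-issue combinations. The per-unit counts are the usual convention (a multiply-add counts as 2 flops per lane, a special function as 1), and counting “x*x+y*y+z*z” as three multiplies plus two adds is what makes the total come out at 13. The names are illustrative only.

```python
VEC4_MADD = 4 * 2        # four lanes, multiply + add each
SPECIAL_FUNCTION = 1     # one op such as a reciprocal
SUM_OF_SQUARES = 3 + 2   # x*x, y*y, z*z plus two adds


def flops_per_cycle(*units):
    """Total flops when the listed units are issued in the same cycle."""
    return sum(units)


marketing = flops_per_cycle(VEC4_MADD, SPECIAL_FUNCTION)
best_case = flops_per_cycle(VEC4_MADD, SUM_OF_SQUARES)
print(marketing, best_case)
```

These are peak tallies only; as noted above, real workloads averaged far lower.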

Similarly, the FP16 unit on some later Rogue GPUs including 8XE and all subsequent XE cores supported a Vec4 multiply-add with a “1 minus” for one of the operands, i.e. “a*(1-b)+c” which is arguably 12 flops rather than 8, and it was/is actively used for some common blending modes, and the compiler will randomly use it when it finds an opportunity.
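
To make the “a*(1-b)+c” case concrete, here is a small sketch of the operation applied per lane, used for a premultiplied-alpha style blend. It runs in ordinary Python floats rather than FP16, and the function name and sample values are made up for illustration. It is not the hardware's or the compiler's actual implementation.

```python
def one_minus_madd(a, b, c):
    """Per-lane a * (1 - b) + c: a subtract, a multiply and an add."""
    return [ai * (1.0 - bi) + ci for ai, bi, ci in zip(a, b, c)]


LANES = 4
FLOPS_PER_LANE = 3  # subtract, multiply, add
print(LANES * FLOPS_PER_LANE)  # versus 8 for a plain Vec4 multiply-add

# Blend: dst * (1 - src_alpha) + src, with src already premultiplied.
dst = [0.5, 0.5, 0.5, 1.0]
src_alpha = [0.25] * 4
src = [0.25, 0.0, 0.0, 0.25]
print(one_minus_madd(dst, src_alpha, src))
```

Counting the “1 minus” as its own operation is what takes the tally from 8 to 12 for a Vec4.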

That whole FP16 pipeline (aka SOPMAD in the public instruction set documents which exclude most of the details) was crazy, I helped optimise the compiler for it a lot, which I’m quite proud of as it was a key factor in making 8XE a big success when Imagination really needed it at a very rough time for all the other product lines. It wasn’t perfect but had some really good things going for it. It’s still actively being sold in some low-end IMG GPUs so I probably shouldn’t go into too much detail though :)



I think it’s fair to say the industry has done the right thing by converging towards scalar SIMT though (with the reasonable exception of Vec2 FP16). As interesting as some of these architectures were, the performance efficiency loss is just too high relative to SIMT, and the latter is actually much less control logic for wide enough warps, so…

I believe there’s still a *huge* amount of innovation possible in terms of instruction sets and instruction fetch/decode/scheduling logic and NVIDIA for example is leaving a fair bit of efficiency on the table there right now. But SIMD definitely isn’t part of it, thank goodness!
 