Wow, this is a trip back in time
I am not actually sure that 10 flops is correct anymore; possibly I got that from friends who used to develop on Xbox (though I have no memory of being told this specifically), or more likely I assumed it was the same as NV40/PS4(/NV30?), for which a lot more public information is available…
Digging a bit, they had separate Vec4 MADD/dot product and special function (reciprocal etc.) pipelines, but there are no reliable indications those could be co-issued, or that there was a separate scalar multiply-add. So it looks an awful lot like 8 flops to me rather than 10…!
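For anyone counting along, the convention is that a Vec4 multiply-add is 4 lanes of (multiply + add), hence 8 flops per cycle. A minimal sketch of that accounting (my own illustration, not actual hardware behaviour):

```python
# Flop-counting convention for a Vec4 MADD: each of the 4 lanes does
# one multiply and one add, so the pipe is worth 8 flops per cycle.
def vec4_madd(a, b, c):
    # d[i] = a[i] * b[i] + c[i]  ->  4 muls + 4 adds = 8 flops
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

# A co-issued scalar multiply-add would contribute 2 more flops
# (-> 10 total), which is exactly the capability I can't find
# any reliable evidence for.
```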
I’d also caution against focusing too much on these numbers, because arguably what mattered most were the programmable fixed-point operations in the Pixel Shader, which are “ops” rather than “flops” but no less important.
—
BTW… since the time I first replied in this thread 15 years ago, I worked at Imagination/PowerVR, and there are two weird flops-related factoids I can’t resist revealing, because they really highlight how difficult it is to get this kind of information right for some of the crazy exotic architectures from before everyone just standardised on scalar SIMT!
SGX-XT (543/544/554) was a SIMD architecture with dual-issue, but with a lot of constraints: partly because it stuck to the 64-bit instruction encoding length of the original SGX, partly due to register bandwidth/register type constraints, and partly due to some small but critical things that were missed because a proper compiler was only developed wayyy after the HW. And even though it encouraged FP16, which was much more power efficient and made it much easier to achieve decent efficiency, every ALU was FP32 capable!
There was a separate HW unit to calculate “x*x+y*y+z*z” for FP32/FP16 normalisation (i.e. 5 flops: 3 multiplies and 2 adds). It wasn’t well documented that it was separate from the main ALU, but under the right set of constraints, it could be co-issued with a Vec4 FP32 MADD! That means under the right magical unrealistic circumstances (I think I did see it happen once or twice across a lot of shaders hah), it could achieve up to 13 flops per cycle in FP32!
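Back-of-envelope accounting for that lucky co-issue case (the constant names are mine, purely for illustration):

```python
# Peak flops/cycle if the sum-of-squares unit co-issues with the Vec4 MADD.
VEC4_MADD_FLOPS = 4 * 2        # 4 lanes x (1 mul + 1 add) = 8
SUM_OF_SQUARES_FLOPS = 3 + 2   # x*x + y*y + z*z: 3 muls + 2 adds = 5

print(VEC4_MADD_FLOPS + SUM_OF_SQUARES_FLOPS)  # 13, in those magical cases
```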
The public marketing number was 9 flops (the Vec4 MADD plus the special function unit, i.e. 8 + 1, which was the much more common dual-issue, although even that was difficult to achieve in practice). Thankfully very few people knew about this (not sure even the architecture team did tbh!) and the 13 flops was never used as a marketing number anywhere… As I said, even 9 flops was very optimistic; the average flops/clk for real workloads was much lower even after the compiler matured (and the original compiler that was used for the early customer evaluations practically didn’t do any vectorisation at all, needless to say that didn’t go well against ARM/Mali).
Similarly, the FP16 unit on some later Rogue GPUs, including 8XE and all subsequent XE cores, supported a Vec4 multiply-add with a “1 minus” on one of the operands, i.e. “a*(1-b)+c”, which is arguably 12 flops rather than 8. It was (and still is) actively used for some common blending modes, and the compiler will use it whenever it spots an opportunity.
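To see why it maps so nicely onto blending: a premultiplied-alpha “over” blend is exactly one a*(1-b)+c per channel. A rough sketch (my own example, not the actual SOPMAD semantics):

```python
# "One minus" Vec4 multiply-add: out[i] = a[i] * (1 - b[i]) + c[i].
# Counting subtract + multiply + add per lane: 3 ops x 4 lanes = 12 flops.
def vec4_one_minus_mad(a, b, c):
    return [ai * (1.0 - bi) + ci for ai, bi, ci in zip(a, b, c)]

# Premultiplied-alpha "over" blending in a single instruction:
# dst_new = dst * (1 - src_alpha) + src_premultiplied
src = [0.5, 0.25, 0.0, 0.5]   # RGBA, already multiplied by alpha
dst = [0.2, 0.2, 0.2, 1.0]
alpha = [src[3]] * 4          # broadcast source alpha across the lanes
print(vec4_one_minus_mad(dst, alpha, src))
```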
That whole FP16 pipeline (aka SOPMAD in the public instruction set documents, which omit most of the details) was crazy. I helped optimise the compiler for it a lot, which I’m quite proud of, as it was a key factor in making 8XE a big success when Imagination really needed one at a very rough time for all the other product lines. It wasn’t perfect, but it had some really good things going for it. It’s still actively being sold in some low-end IMG GPUs, so I probably shouldn’t go into too much detail though.
—
I think it’s fair to say the industry has done the right thing by converging towards scalar SIMT though (with the reasonable exception of Vec2 FP16). As interesting as some of these architectures were, the performance/efficiency loss is just too high relative to SIMT, and the latter actually requires much less control logic for wide enough warps, so…
I believe there’s still a *huge* amount of innovation possible in terms of instruction sets and instruction fetch/decode/scheduling logic, and NVIDIA, for example, is leaving a fair bit of efficiency on the table there right now. But SIMD definitely isn’t part of it, thank goodness!