Heinrich, it's not even about raw power: there is a significant difference between marketing FLOPS and the real world.
Putting some architectural differences aside, we have 3 ways of counting FLOPS in the GPU world:
VLIW5
VLIW4
scalar (Nvidia and GCN).
As I was saying to KB-somker about sites that swallow the marketing Kool-Aid wrt the diminishing peak FLOPS figures of recent AMD GPUs and talk about efficiency: I'm iffy on whether they understand the difference in design.
The difference is not really efficiency, it's a sound architectural difference. You can literally remove 20% (possibly a bit more on average) from the peak of a VLIW5 design vs a scalar one, so from 2.1 TFLOPS, for example, you go down to 1.68 TFLOPS.
It's not efficiency, it's design.
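To put a number on that, here is a back-of-the-envelope sketch in Python. The function name `derated_peak` and the flat 20% haircut are only my illustration of the figure above, not anything AMD publishes.

```python
def derated_peak(peak_tflops, lost_fraction=0.20):
    """Scale a marketing peak down by the fraction of ALU slots that sit idle.

    lost_fraction=0.20 is the rough figure quoted in the post (an assumption).
    """
    return peak_tflops * (1.0 - lost_fraction)


# The example from the post: a 2.1 TFLOPS VLIW5 part with a 20% haircut.
print(f"{derated_peak(2.1):.2f} TFLOPS")  # 1.68 TFLOPS
```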
In previous AMD VLIW5 GPUs, the base of the design is not the stream processor as touted by marketing materials... but a group of 5 ALUs.
Those ALUs are not all equal: you have 4 standard ALUs and a special one (in charge of trigonometry, etc.; the RYS unit, as people call it here).
A SIMD is not 80 SPs acting in a vectorized fashion, not at all.
It's really 16 five-wide units acting in a vectorized fashion, which is why hardware.fr / behardware.com calls them 16 Vec5 (16 Vec4 for Cayman).
In fact those VLIW units are organized in bigger blocks, quads of VLIW5 units, with regard to the register files (a massive amount of register files).
Now, each of those 5-wide blocks is a MIMD unit working in a VLIW fashion, so it's up to the compiler to extract parallelism and make sure those ALUs are busy.
All 16 blocks in a SIMD receive the same VLIW instruction but work on different data.
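To show what "it's up to the compiler" means, here is a toy sketch in Python. It is a naive greedy packer I made up for illustration (the names `pack_vliw` and `avg_slots_filled` and the op names are hypothetical); it is nothing like AMD's real shader compiler and it ignores the special ALU, register port limits and latencies.

```python
def pack_vliw(deps, width):
    """Greedily pack ops into bundles of at most `width` slots.

    `deps` maps each op name to the ops whose results it needs. An op can
    only go into a bundle once all of its inputs sit in earlier bundles.
    """
    done, bundles = set(), []
    pending = list(deps)  # keep program order
    while pending:
        ready = [op for op in pending if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        pending = [op for op in pending if op not in done]
    return bundles


def avg_slots_filled(bundles):
    """Average number of ops issued per bundle."""
    return sum(len(b) for b in bundles) / len(bundles)


# Four independent ops (think x, y, z, w): one bundle, 4 of 5 slots used.
independent = {"x": [], "y": [], "z": [], "w": []}
# A dependent chain: every op needs the previous result, so no ILP at all.
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

print(pack_vliw(independent, width=5))              # [['x', 'y', 'z', 'w']]
print(avg_slots_filled(pack_vliw(chain, width=5)))  # 1.0
```

The width of the unit is the same in both cases; only the dependencies in the code decide how many slots get filled.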
At this point you will notice that most of the work done by a GPU is on four elements. Basically the fifth ALU, the transcendental one, is not there to be used all the time but to make sure specific operations execute fast. The fifth ALU was a cheap way to achieve that, as ALUs are cheap.
On average the utilization of a VLIW5 unit is 3.8 instructions per cycle, I believe (I should check, but it's a bit below 4).
There are reasons for that in graphics workloads, as a matter of fact, but there are also architectural ones: you have five ALUs, and the registers can be accessed by all of them at the same time. This is complicated, but there are plenty of posts in the forum that explain why it sets a limitation on the design.
That's the reason why AMD first moved to a VLIW4 design. It removes headaches in the register port design (that's if I understand properly...) as well as the pressure it puts on the compiler.
There was a trade-off: transcendental operations are slower. Thing is, on graphics workloads the IPC is mostly the same as in a VLIW5 design (and still below 4).
That's where marketing FLOPS kick in: the FLOPS were always given including the T unit and based on the SP count, as if all SPs were equal. BS so-called hardware sites don't give a shit, they care about clicks, not accuracy. It was an irrelevant figure. Now AMD has moved away from it and is using marketing parlance for people who are not interested in tech (I can understand that some people are just gamers, and that's not a sin). In marketing parlance it's "more efficient".
Whereas this diminishing number of SPs and FLOPS has no impact, or a really marginal one, on the design, but when you have fed people SP and FLOPS counts as the metrics for performance you have to come up with something.
For me it's not efficiency: the compiler can't extract more ILP (instruction-level parallelism) with the new design, it's just easier to avoid conflicts in register access. Neither can the hardware; VLIW hardware is by design as dumb as can be.
Then why did AMD move to a scalar / pure SIMD design like Nvidia? Because in some situations (it doesn't happen much in graphics) the IPC that can be extracted by the compiler is way below 3.8.
By design it can go as low as 1 or 2. In effect, using marketing parlance, you don't have 80 SPs in your SIMD but 16 and 32 respectively... a massive hit in efficiency.
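The "16 and 32" figures are just the 16 units per SIMD multiplied by the IPC. A minimal sketch of that arithmetic (the names `busy_alus` and `VLIW_UNITS_PER_SIMD` are mine, for illustration only):

```python
VLIW_UNITS_PER_SIMD = 16  # 16 five-wide units per SIMD, as described above


def busy_alus(avg_ipc, units=VLIW_UNITS_PER_SIMD):
    """ALUs doing useful work per clock when each unit issues `avg_ipc` ops."""
    return units * avg_ipc


for ipc in (1, 2):
    print(f"IPC {ipc}: {busy_alus(ipc)} of 80 SPs busy")
# IPC 1: 16 of 80 SPs busy
# IPC 2: 32 of 80 SPs busy
```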
That's why AMD moved to a scalar / plain SIMD design. Extracting ILP in graphics was pretty easy, but as compute gets relevant it turns into a double-edged sword. There are cases where there is simply not that much ILP to extract, and when that happens the architecture (even the refined VLIW4) fails.
With GCN, AMD gave up on leveraging instruction-level parallelism, that's it.
A nice effect is that, comparing a CU (so 64 ALUs) and a Cayman SIMD (64 ALUs too), the former always achieves 64 operations per cycle (it's an incorrect way to put it, but that's pretty much the figure looking from a distance), whereas on average a Cayman SIMD will do 3.8*16 in graphics and can end up well below that in other cases. Even a VLIW5 design would not beat GCN: it would push 3.8*16, pretty much like Cayman.
EDIT: ALU utilization is a more correct way to put it, as instructions are likely to take more than one cycle to execute. So you have 100% ALU utilization on one side (plain SIMD) and 3.8/5*100 or 3.8/4*100 on the other.
/EDIT
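Working out those percentages, as a quick Python sketch. The 3.8 average is the figure I quoted from memory above, and the 100% for plain SIMD is the idealized "looking from a distance" number, so treat both as assumptions; `alu_utilization` is just a name I picked.

```python
def alu_utilization(avg_ipc, slots_per_unit):
    """Percentage of ALU slots filled, given average ops issued per unit per clock."""
    return 100.0 * avg_ipc / slots_per_unit


designs = {
    "VLIW5": (3.8, 5),             # 3.8 is the from-memory average in the post
    "VLIW4 (Cayman)": (3.8, 4),
    "plain SIMD (GCN)": (1.0, 1),  # idealized: one op per lane, no empty slots
}
for name, (ipc, slots) in designs.items():
    print(f"{name}: {alu_utilization(ipc, slots):.0f}%")
# VLIW5: 76%
# VLIW4 (Cayman): 95%
# plain SIMD (GCN): 100%
```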
I have no clear understanding of this low-level stuff, but that's the best description I can give you of "not taking marketing and FLOPS figures... at face value". Hope it helps.
Other members can correct approximations or things I got wrong if they want, or provide even more information.