Yes, but I was thinking lately maybe efficiency drops if you have "too many" simds per shader engine.
Technically possible, since a shader engine has a limited number of hardware threads in flight and those limits probably get split according to possible shader types. Never seen any tests exploring this.
And if you don't like the 3 shader engines, what about 4 instead? Though I agree only 6 per dispatch processor would be quite low - I want the chip to have 28 simds in a 4x7 arrangement, with the pro being 4x6 instead. (Barts also has only 7 simds in a group, though they are of course VLIW-5.)
4 would be fine. Though if Pro has 22 SIMDs, that's a bust.
This makes sense. Even if you assume a VLIW-4 simd could match the performance of a VLIW-5 one (which is a bit of a stretch), 24 is only 20% more simds, so any performance improvement beyond that has to come from elsewhere. Also note there's an obvious difference between utilization of alu slots and alu instructions issued per clock: since transcendentals now require 3 slots, even serially dependent transcendentals show 75% utilization, but obviously they aren't any faster than the same sequence at 20% utilization on Evergreen.
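To put rough numbers on the slot-versus-instruction distinction, here's a minimal Python sketch. It assumes a chain of serially dependent transcendentals issues exactly one per bundle on both designs, taking 1 of 5 slots on VLIW-5 and 3 of 4 slots on VLIW-4. The function and variable names are made up for illustration.

```python
def slot_utilisation(slots_used_per_bundle, lanes):
    """Fraction of ALU slots occupied when every bundle fills the given number of slots."""
    return slots_used_per_bundle / lanes

# Assumption: one dependent transcendental per bundle on either design.
vliw5 = slot_utilisation(1, 5)  # Evergreen: transcendental sits in the T lane only
vliw4 = slot_utilisation(3, 4)  # Cayman: transcendental spread over 3 of the 4 lanes

# Instructions issued per bundle are identical, so the chain runs no faster.
transcendentals_per_bundle = {"VLIW-5": 1, "VLIW-4": 1}

# SIMD count ratio discussed in the thread: 24 versus 20.
simd_ratio = 24 / 20

print(f"VLIW-5 slot utilisation: {vliw5:.0%}")
print(f"VLIW-4 slot utilisation: {vliw4:.0%}")
print(f"24 vs 20 simds: {simd_ratio - 1:.0%} more")
```

Slot utilisation nearly quadruples while the issue rate of the dependent chain stays the same, which is why the utilisation figure alone says little about speed here.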
Agreed to all that.
The utilisation question is a bit thorny with this change. e.g. referring back to the clause that forms the body of the loop in the code Gipsel and I were playing with:
- Cypress 38 scalar ops in 16 cycles = 48% utilisation
- Cayman 38 scalar ops in 19 cycles = 50% utilisation
- Cayman 64 scalar ops (including all the portions of transcendentals) in 19 cycles = 84% utilisation
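For anyone wanting to check the figures above, a quick Python sketch. It assumes one VLIW bundle issues per cycle, with 5 lanes on Cypress and 4 on Cayman, so utilisation is scalar ops divided by (cycles x lanes). The `utilisation` helper is hypothetical, not from any tool.

```python
def utilisation(scalar_ops, cycles, lanes):
    """Occupied ALU slots as a fraction of all slots available over the clause."""
    return scalar_ops / (cycles * lanes)

# (label, scalar ops, cycles, lanes per bundle); lane counts are an assumption
cases = [
    ("Cypress", 38, 16, 5),
    ("Cayman", 38, 19, 4),
    ("Cayman incl. transcendental portions", 64, 19, 4),
]

for name, ops, cycles, lanes in cases:
    print(f"{name}: {utilisation(ops, cycles, lanes):.1%}")
```

The second and third cases differ only in whether the extra slots taken by each transcendental are counted as work.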
With more complex shaders it's going to be fairly fiddly to pick apart the transcendentals. But then it was fiddly to pick out the DOT4s that were being used for DOT3. And in truth, there was practically no-one who was counting.
So architectural balance is the name of the game, and VLIW-4 seems better. Still hard to say whether the FLOPS/mm² has suffered when considering the entire area dedicated to cores (i.e. including all scheduling overheads).
There's still no resolution to the question of Fermi's ALU organisation: does it have distinct int32 ALUs, implying that they are idle while fp32 operations are going?
24 simds also sounds low, though, if you consider that those vliw-4 units should be smaller than the vliw-5 ones. I have no good idea how much smaller (does distributing the tables from the t unit to xyz also make them smaller because they are backed by 3 alus instead of one?), but to me it sounds reasonable to assume 24 vliw-4 simds wouldn't need more die area than 20 vliw-5 ones.
I haven't worked out how they've got it down to only 3 lanes (instead of using 4 - even 4 left significant open questions).
I wonder if there's more of the old T lane functionality hanging around than the initial ideas about the deletion of T assumed. Perhaps the savings are limited solely to the 5th MAD, the int32 MUL, the circulating buffer of scalar registers (previous instruction registers for T) and the porting/wiring from the operand collector to T.
I've never seen an assessment of the proportion of the T lane that is specifically for transcendentals. We also don't know the proportion of per-core area consumed by the sequencer. Nor do we know the proportion of overall die area consumed by shader engine thread control, wiring etc., or how much of that scales with SIMD count.