AMD: R9xx Speculation

According to Muropaketti, the early Cayman samples had 1680 stream processors. But the same article states that AMD has a habit of sending the first samples with a deliberately reduced amount of stream processors.
1680 is a pretty odd number, try to find some sensible simd config with that...

Cayman is 4-VLIW arch. Pro has 22 SIMD and XT has 24. 32KB is the LDS size not L1 cache size.
Where did you get the pro simd number from?
 
However, we still need to make up for the lost efficiency in processing transcendental operations, so 1536 ALUs might be a little insufficient for that, and the need for a higher ALU count would be more pressing.
HD5870 has 35% more ALU capability than HD6870, but the difference in games is much lower.

A while back Gipsel and I hammered VLIW-4 with transcendentals:

http://forum.beyond3d.com/showpost.php?p=1483859&postcount=3435

that's 13 transcendentals + 25 other operations, and got a 19% slow-down. That's really about as bad as possible. With 20% more SIMDs the net effect on performance would be pretty much zero.
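
As a rough sanity check on that last claim, here is a minimal sketch. It assumes the 19% slow-down corresponds to the loop body going from 16 cycles on VLIW-5 to 19 cycles on VLIW-4, and that throughput scales perfectly linearly with SIMD count (an idealisation; the variable names are made up for illustration):

```python
# Cycle counts for the loop body: 16 on VLIW-5 (Cypress), 19 on VLIW-4 (Cayman).
CYCLES_VLIW5 = 16
CYCLES_VLIW4 = 19

# Per-SIMD slow-down of VLIW-4 relative to VLIW-5 on this worst-case clause.
slowdown = CYCLES_VLIW4 / CYCLES_VLIW5 - 1

# Assumption: 20% more SIMDs, with perfect linear scaling.
simd_scaling = 1.20
net_change = simd_scaling * CYCLES_VLIW5 / CYCLES_VLIW4 - 1

print(f"per-SIMD slow-down: {slowdown:.1%}")
print(f"net change with 20% more SIMDs: {net_change:+.1%}")
```

Under those assumptions the extra SIMDs roughly cancel the per-SIMD loss, which is the "pretty much zero" above.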
 
OK, I don't know if there's a simple answer to that question, but how common are transcendentals in games, anyway?
 
A while back Gipsel and I hammered VLIW-4 with transcendentals:

http://forum.beyond3d.com/showpost.php?p=1483859&postcount=3435

that's 13 transcendentals + 25 other operations, and got a 19% slow-down. That's really about as bad as possible. With 20% more SIMDs the net effect on performance would be pretty much zero.
Yeah, I remember that.

HD5870 has 35% more ALU capability than HD6870, but the difference in games is much lower.
True, I wonder where the increase in efficiency comes from. Maybe from the increased clocks? The HD 6870 has about 5% more fill rate because of its 900MHz frequency. Also, faster ALU clocks could mean better utilization.

OK, I don't know if there's a simple answer to that question, but how common are transcendentals in games, anyway?
I guess it depends on the complexity of the lighting engine.
 
Yes, but I was thinking lately maybe efficiency drops if you have "too many" simds per shader engine.
Technically possible, since a shader engine has a limited number of hardware threads in flight and those limits probably get split according to possible shader types. Never seen any tests exploring this.

And if you don't like the 3 shader engines, what about 4 instead? Though I agree only 6 per dispatch processor would be quite low - I want the chip to have 28 simds in a 4x7 arrangement, with the pro being 4x6 instead :). (Barts also has only 7 simds in a group, though they are of course VLIW-5.)
4 would be fine. Though if Pro has 22 SIMDs, that's a bust.

This makes sense. Even if you assume you could get the same performance out of a VLIW-4 simd compared to a VLIW-5 (which is a bit of a stretch) 24 is only 20% more simds however, so performance improvements beyond that have to come from elsewhere. Also note there's an obvious difference between utilization of alu slots and alu instructions issued per clock - since transcendentals now require 3 slots even serial dependent transcendentals have 75% utilization - but obviously they aren't any faster than the 20% utilization of the same sequence in Evergreen.
Agreed to all that.

The utilisation question is a bit thorny with this change. e.g. referring back to the clause that forms the body of the loop in the code Gipsel and I were playing with:
  • Cypress 38 scalar ops in 16 cycles = 48% utilisation
  • Cayman 38 scalar ops in 19 cycles = 50% utilisation
  • Cayman 64 scalar ops (including all the portions of transcendentals) in 19 cycles = 84% utilisation
With more complex shaders it's going to be fairly fiddly to pick-apart the transcendental. But then it was fiddly to pick out the DOT4s that were being used for DOT3. And in truth, there was practically no-one who was counting :p
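
For anyone who does want to count, the three figures above can be reproduced with a minimal sketch. It assumes 5 lanes per VLIW unit on Cypress, 4 on Cayman, and 3 slots per transcendental on Cayman, as discussed in this thread; the function and variable names are purely illustrative:

```python
def utilisation(scalar_ops, cycles, lanes):
    """Fraction of ALU slots filled over the given number of cycles."""
    return scalar_ops / (cycles * lanes)

TRANSCENDENTALS = 13
OTHER_OPS = 25
SLOTS_PER_TRANSCENDENTAL = 3  # assumption: Cayman spreads each one over 3 lanes

ops = TRANSCENDENTALS + OTHER_OPS                                   # 38 scalar ops
slots = TRANSCENDENTALS * SLOTS_PER_TRANSCENDENTAL + OTHER_OPS      # 64 slots

cypress = utilisation(ops, cycles=16, lanes=5)
cayman_ops = utilisation(ops, cycles=19, lanes=4)
cayman_slots = utilisation(slots, cycles=19, lanes=4)

for name, value in [("Cypress", cypress),
                    ("Cayman (ops)", cayman_ops),
                    ("Cayman (slots)", cayman_slots)]:
    print(f"{name}: {value:.1%}")
```

The gap between the last two numbers is the point: counting occupied slots flatters Cayman, while counting instructions actually retired does not.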

So architectural balance is the name of the game, and VLIW-4 seems better. Still hard to say whether the FLOPS/mm² has suffered when considering the entire area dedicated to cores (i.e. including all scheduling overheads).

There's still no resolution to the question of Fermi's ALU organisation: does it have distinct int32 ALUs, implying that they are idle while fp32 operations are going?

24 simds though also sound low if you consider that those vliw-4 units should be smaller than the vliw-5 ones - I have no good idea how much smaller (does distributing the tables from the t unit to xyz also make them smaller cause they are backed by 3 alus instead of one?) but to me it sounds reasonable to assume 24 vliw-4 simds wouldn't need more die area than 20 vliw-5 ones.
I haven't worked out how they've got it down to only 3 lanes (instead of using 4 - even 4 left significant open questions).

I wonder if there's more of the old T lane functionality hanging around than initial ideas about the deletion of T considered. Perhaps as far as limiting the savings to being solely in terms of the 5th MAD, the int32 MUL, the deletion of the circulating buffer of scalar registers (previous instruction registers for T) and the porting/wiring from the operand collector to T.

I've never seen an assessment of the proportion of the T lane that is specifically for transcendentals. We also don't know the proportion of per-core area consumed by the sequencer. Nor the proportion of overall die area consumed by shader engine thread control and wiring etc. and how much of that scales with SIMD count?
 
Cormorant:

"I can hold my breath for over two minutes!"


cormorant_icon.jpg

Wtf? :???:

EDITED BITS: Then again, it does look strangely familiar:

digs_avatar_150x150.jpg
 
So the family is called Northern Islands, but the chips are named after islands in the south of the Northern Hemisphere, and individual products are named after… random birds?

What the hell?
 
So much for Fuad's 'biggest GPU ATI ever made'. This thing will be below 380mm² and performance close to GTX580.
 
1680 is a pretty odd number, try to find some sensible simd config with that.
It would fit this configuration:

3 shader engines
shader engine consists of 8 SIMDs
SIMD consists of 20 VLIW4 units

3*8*20*4 = 1920 SPs

if you deactivate one SIMD per shader engine (symmetrical redundancy), it will result in 1680 SPs (560 SPs * 3)... only the wavefront size wouldn't match current rumours... Anyway, this configuration isn't very likely :)
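
The arithmetic for this hypothetical layout, as a quick sketch. The 20-wide SIMD is the speculative part; the 16-wide comparison assumes the SIMD width of current parts, and the helper name is made up for illustration:

```python
def stream_processors(engines, simds_per_engine, units_per_simd, lanes=4):
    """Total SP count for a given shader engine / SIMD / VLIW layout."""
    return engines * simds_per_engine * units_per_simd * lanes

full_chip = stream_processors(3, 8, 20)   # all SIMDs enabled
cut_down = stream_processors(3, 7, 20)    # one SIMD disabled per shader engine

# Assumption: with the usual 16 VLIW units per SIMD, 1680 SPs would not
# divide into a whole number of VLIW-4 SIMDs.
simds_if_16_wide = 1680 / (16 * 4)

print(full_chip, cut_down, simds_if_16_wide)
```

So 1680 only works out cleanly with the unusual 20-wide SIMD, which is why the number looks odd in the first place.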
 