AMD: R9xx Speculation

mczak · Nov 6, 2010

Miksu said:
According to Muropaketti, the early Cayman samples had 1680 stream processors. But the same article states that AMD has a habit of sending the first samples with a deliberately reduced amount of stream processors.

1680 is a pretty odd number, try to find some sensible simd config with that...

Forrest said:
Cayman is 4-VLIW arch. Pro has 22 SIMD and XT has 24. 32KB is the LDS size not L1 cache size.

Where did you get the pro simd number from?

Forrest · Nov 6, 2010

mczak said:
Where did you get the pro simd number from?

I have the gpu!

MarkoIt · Nov 6, 2010

24 SIMD for XT.. so
16-way -> 1536sp
or
20-way-> 1920sp

Jawed · Nov 6, 2010

DavidGraham said:
However, we still need to make up for the lost efficiency of processing transcendental operations, so 1536 ALUs might be a little insufficient for that, and the need for a higher ALU count would be more pressing.

HD5870 has 35% more ALU capability than HD6870, but the difference in games is much lower.
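As a rough check of the 35% figure, peak MAD throughput can be compared directly. The stream processor counts and clocks below (1600 at 850MHz for HD5870, 1120 at 900MHz for HD6870) are the public launch specs rather than numbers from this thread, and the helper name is made up for illustration.

```python
def peak_gflops(stream_processors, clock_mhz):
    # One MAD per stream processor per clock is counted as 2 FLOPs.
    return stream_processors * 2 * clock_mhz / 1000.0

hd5870 = peak_gflops(1600, 850)   # assumed launch spec
hd6870 = peak_gflops(1120, 900)   # assumed launch spec
advantage = hd5870 / hd6870 - 1.0

print(f"HD5870 {hd5870:.0f} GFLOPS, HD6870 {hd6870:.0f} GFLOPS, "
      f"advantage {advantage:.1%}")
```

This only compares theoretical peaks; it says nothing about how well either chip fills its ALU slots in games.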

A while back Gipsel and I hammered VLIW-4 with transcendentals:

http://forum.beyond3d.com/showpost.php?p=1483859&postcount=3435

that's 13 transcendentals + 25 other operations, and got a 19% slow-down. That's really about as bad as possible. With 20% more SIMDs the net effect on performance would be pretty much zero.
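A back-of-envelope version of that last sentence, assuming performance scales linearly with SIMD count and that a 19% slow-down means the same work takes 1.19x as long. Both are simplifications, and the function name is hypothetical.

```python
def net_speedup(simd_ratio, time_ratio):
    # More SIMDs raise throughput; a longer run time per SIMD lowers it.
    return simd_ratio / time_ratio

# 20% more SIMDs against the worst-case 19% slow-down
worst_case = net_speedup(1.20, 1.19)
print(f"net effect: {worst_case:.3f}x")
```

Anything less transcendental-heavy than this test would land above that floor, under the same assumptions.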

Alexko · Nov 6, 2010

OK, I don't know if there's a simple answer to that question, but how common are transcendentals in games, anyway?

Forrest · Nov 6, 2010

MarkoIt said:
24 SIMD for XT.. so
16-way -> 1536sp
or
20-way-> 1920sp

Wavefront size is 64 so it's still 16-way.

DavidGraham · Nov 6, 2010

Jawed said:
A while back Gipsel and I hammered VLIW-4 with transcendentals:

http://forum.beyond3d.com/showpost.php?p=1483859&postcount=3435

that's 13 transcendentals + 25 other operations, and got a 19% slow-down. That's really about as bad as possible. With 20% more SIMDs the net effect on performance would be pretty much zero.

Yeah, I remember that.

HD5870 has 35% more ALU capability than HD6870, but the difference in games is much lower.

True, I wonder where the increase in efficiency comes from. Maybe from the increased clocks? HD 6870 is 5% more powerful in fill rate because of its 900MHz frequency. Also, faster ALU clocks could mean better utilization.

Alexko said:
OK, I don't know if there's a simple answer to that question, but how common are transcendentals in games, anyway?

I guess it depends on the complexity of the lighting engine.

MarkoIt · Nov 6, 2010

Forrest said:
Wavefront size is 64 so it's still 16-way.

So it's 1536sp/96TMUs for the XT and 1408sp/88TMUs for the PRO.
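Those totals follow from simple multiplication. A quick sketch, assuming 16 VLIW-4 units per SIMD and 4 TMUs per SIMD; the TMU ratio is inferred from the 96/24 figure above rather than confirmed, and the names are illustrative only.

```python
VLIW_WIDTH = 4        # ALU slots per VLIW-4 unit
TMUS_PER_SIMD = 4     # assumption: inferred from 96 TMUs / 24 SIMDs

def config(simds, units_per_simd=16):
    # Returns (stream processors, TMUs) for a given SIMD count.
    sps = simds * units_per_simd * VLIW_WIDTH
    tmus = simds * TMUS_PER_SIMD
    return sps, tmus

for name, simds in (("XT", 24), ("PRO", 22)):
    sps, tmus = config(simds)
    print(f"{name}: {simds} SIMDs -> {sps}sp / {tmus} TMUs")

# The 20-way alternative raised earlier in the thread
print("20-way XT:", config(24, units_per_simd=20)[0], "sp")
```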

Jawed · Nov 6, 2010

mczak said:
Yes, but I was thinking lately maybe efficiency drops if you have "too many" simds per shader engine.

Technically possible, since a shader engine has a limited number of hardware threads in flight and those limits probably get split according to possible shader types. Never seen any tests exploring this.

And if you don't like the 3 shader engines, what about 4 instead? Though I agree only 6 per dispatch processor would be quite low - I want the chip to have 28 simds in a 4x7 arrangement, with the pro being 4x6 instead. (Barts also has only 7 simds in a group, though they are of course VLIW-5.)

4 would be fine. Though if Pro has 22 SIMDs, that's a bust.

This makes sense. Even if you assume you could get the same performance out of a VLIW-4 simd compared to a VLIW-5 (which is a bit of a stretch) 24 is only 20% more simds however, so performance improvements beyond that have to come from elsewhere. Also note there's an obvious difference between utilization of alu slots and alu instructions issued per clock - since transcendentals now require 3 slots even serial dependent transcendentals have 75% utilization - but obviously they aren't any faster than the 20% utilization of the same sequence in Evergreen.

Agreed to all that.

The utilisation question is a bit thorny with this change. e.g. referring back to the clause that forms the body of the loop in the code Gipsel and I were playing with:

Cypress 38 scalar ops in 16 cycles = 48% utilisation
Cayman 38 scalar ops in 19 cycles = 50% utilisation
Cayman 64 scalar ops (including all the portions of transcendentals) in 19 cycles = 84% utilisation
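For reference, those percentages are scalar ops divided by issue slots (cycles times VLIW width). The 64-op figure falls out if each of the 13 transcendentals is counted as 3 slots alongside the 25 other operations. A small sketch; the function name is hypothetical.

```python
def utilisation(scalar_ops, cycles, vliw_width):
    # Fraction of ALU issue slots that hold a useful scalar op.
    return scalar_ops / (cycles * vliw_width)

transcendentals, other_ops = 13, 25
plain = transcendentals + other_ops           # each transcendental counted once
expanded = transcendentals * 3 + other_ops    # each transcendental counted as 3 slots

print(f"Cypress, {plain} ops in 16 cycles: {utilisation(plain, 16, 5):.1%}")
print(f"Cayman,  {plain} ops in 19 cycles: {utilisation(plain, 19, 4):.1%}")
print(f"Cayman,  {expanded} ops in 19 cycles: {utilisation(expanded, 19, 4):.1%}")
```

The Cypress figure comes out at 47.5%, which is the 48% quoted above after rounding.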

With more complex shaders it's going to be fairly fiddly to pick-apart the transcendental. But then it was fiddly to pick out the DOT4s that were being used for DOT3. And in truth, there was practically no-one who was counting

So architectural balance is the name of the game, and VLIW-4 seems better. Still hard to say whether the FLOPS/mm² has suffered when considering the entire area dedicated to cores (i.e. including all scheduling overheads).

There's still no resolution to the question of Fermi's ALU organisation: does it have distinct int32 ALUs, implying that they are idle while fp32 operations are going?

24 simds though also sound low if you consider that those vliw-4 units should be smaller than the vliw-5 ones - I have no good idea how much smaller (does distributing the tables from the t unit to xyz also make them smaller cause they are backed by 3 alus instead of one?) but to me it sounds reasonable to assume 24 vliw-4 simds wouldn't need more die area than 20 vliw-5 ones.

I haven't worked out how they've got it down to only 3 lanes (instead of using 4 - even 4 left significant open questions).

I wonder if there's more of the old T lane functionality hanging around than initial ideas about the deletion of T considered. Perhaps as far as limiting the savings to being solely in terms of the 5th MAD, the int32 MUL, the deletion of the circulating buffer of scalar registers (previous instruction registers for T) and the porting/wiring from the operand collector to T.

I've never seen an assessment of the proportion of the T lane that is specifically for transcendentals. We also don't know the proportion of per-core area consumed by the sequencer. Nor the proportion of overall die area consumed by shader engine thread control and wiring etc. and how much of that scales with SIMD count?

digitalwanderer · Nov 6, 2010

Cormorant:

"I can hold my breath for over two minutes!"

Wtf?

EDITED BITS: Then again, it does look strangely familiar:

neliz · Nov 6, 2010

Jawed said:
"Cormorant" code name?

Barts Pro was Buzzard.. Birds!

LordEC911 · Nov 6, 2010

neliz said:
Barts Pro was Buzzard.. Birds!

Well, at least it's better than all the plant stuff from last silly season.

Alexko · Nov 6, 2010

So the family is called Northern Islands, but the chips are named after islands in the south of the Northern Hemisphere, and individual products are named after… random birds?

What the hell?

Tchock · Nov 6, 2010

Jawed said:
4 would be fine. Though if Pro has 22 SIMDs, that's a bust.

Sure that the reduction can't be asymmetrical?

digitalwanderer · Nov 6, 2010

Oooh, there is the possibility of a card codenamed the shag then!

Or is shag just another name for cormorant?

Jawed · Nov 6, 2010

Tchock said:
Sure that the reduction can't be asymmetrical?

Existing dual shader-engine cards have symmetrical (in terms of count) coarse-redundancy. Can't add any more to that.

This design might have a single shader engine.

chavvdarrr · Nov 6, 2010

How high would Cayman's clock have to be in order to compete with the 580?
It looks like AMD will lose this round.

SimBy · Nov 6, 2010

So much for Fuad's 'biggest GPU ATI ever made'. This thing will be below 380mm² and perform close to GTX580.

MarkoIt · Nov 6, 2010

Jawed said:
Existing dual shader-engine cards have symmetrical (in terms of count) coarse-redundancy. Can't add any more to that.

This design might have a single shader engine.

Or two 12-SIMD shader engines.

no-X · Nov 6, 2010

mczak said:
1680 is a pretty odd number, try to find some sensible simd config with that.

It would fit this configuration:

3 shader engines
shader engine consists of 8 SIMDs
SIMD consists of 20 VLIW4 units

3*8*20*4 = 1920 SPs

if you deactivate one SIMD per shader engine (symmetrical redundancy), it will result in 1680 SPs (560 SPs * 3)... only the wavefront size wouldn't match current rumours... Anyway, this configuration isn't very likely
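The arithmetic for that configuration, as a quick sketch. The wavefront mismatch assumes a wavefront is issued over 4 cycles per SIMD, as on existing chips, so 20 units per SIMD would imply 80 rather than 64. Names here are illustrative only.

```python
def stream_processors(engines, simds_per_engine, units_per_simd, vliw_width=4):
    return engines * simds_per_engine * units_per_simd * vliw_width

full = stream_processors(3, 8, 20)     # all SIMDs enabled
cut = stream_processors(3, 7, 20)      # one SIMD disabled per shader engine
per_engine = cut // 3

# Assumption: wavefront size = units per SIMD x 4 issue cycles
implied_wavefront = 20 * 4

print(full, cut, per_engine, implied_wavefront)
```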
