AMD: R9xx Speculation

mczak · Apr 22, 2010

psolord said:
My objection to what Neliz said is that he talked about the 6700 being 20% faster than the GTX 480. Not the 6800 series. Obviously I objected due to the naming scheme of which we know nothing about. I just based my objection on the current naming scheme, ie the 6770 should be the higher end of the mainstream cards which in turn should be something much less than the supposed 6870.

Now for the 6870 to be 20% faster than the GTX 480, that is perfectly understandable.

Ah yes didn't see there's also a HD6800 I completely agree.

For all we know, ATI could release just a 6770/6750 on this bastard architecture, which could be based on a 2.4 billion transistor chip, and release the rest of the cards, i.e. the 68XX, at 28nm with the new architecture and a chip of about 4 billion transistors. That way, the naming and the 20% increase over the GTX 480 would fit nicely. It would also put a ZOMG scare factor into Nvidia, knowing their high end baby got beaten by the mainstream card of ATI's next series.

Goes a bit against the idea of releasing a whole series in a short timeframe, but I guess it would make sense. Arguably, Cypress could benefit the most from a refresh (it scales badly compared to Juniper, and the slow tessellation should be much less of a problem for Juniper and below). Whatever AMD took out of the initial Cypress for time-to-market reasons they could also put back in (it's unclear to me whether those scrapped parts would have also affected the lower end parts).
Indeed the 68xx on 40nm with 4 times the flops of gtx480 seems rather unlikely - would be twice the flops of Cypress, that would be one massive chip on 40nm.
If however 67xx is on 40nm and 68xx on 28nm, I'm missing the "pipe cleaner" part on 28nm. I don't see AMD starting on 28nm with such a massive chip, but maybe there could be some 66xx series or so before the 68xx on 28nm?
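
As a rough check of the FLOPS comparison above, here is a small Python sketch. The shader counts and clocks are the published reference specs for the GTX 480 (480 cores at 1401 MHz) and the HD 5870 (1600 lanes at 850 MHz), not figures from this thread, and the helper name is purely illustrative.

```python
def peak_gflops(lanes, clock_mhz, flops_per_lane_per_clock=2):
    """Peak single-precision GFLOPS, counting a multiply-add as 2 FLOPs."""
    return lanes * clock_mhz * flops_per_lane_per_clock / 1000.0

# Reference specs (assumed, not taken from the thread).
gtx480 = peak_gflops(480, 1401)
cypress = peak_gflops(1600, 850)

# The rumour as read in the post: 4x the FLOPS of a GTX 480.
rumoured = 4 * gtx480
ratio_vs_cypress = rumoured / cypress

print(f"GTX 480: {gtx480:.0f} GFLOPS, Cypress: {cypress:.0f} GFLOPS")
print(f"4x GTX 480 = {rumoured:.0f} GFLOPS, {ratio_vs_cypress:.2f}x Cypress")
```

Under those assumptions, 4x a GTX 480 lands just under twice Cypress, which is the "twice the flops of Cypress" figure used in the post.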

no-X · Apr 22, 2010

If the HD6770 is ~420mm² at 40nm, it would measure about 220mm² at 28nm. That could be enough for a 256bit GDDR5 interface (without sideport). The 28nm part could later replace the initial 40nm model as a slightly faster HD6790.
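
The shrink estimate can be sanity-checked with a minimal sketch, assuming area scales with the square of the node ratio. That is an idealised model (I/O and analog blocks typically shrink less), and the function name is hypothetical.

```python
def ideal_shrink_area(area_mm2, old_node_nm, new_node_nm):
    """Die area after an ideal shrink: scales with the square of the node ratio."""
    return area_mm2 * (new_node_nm / old_node_nm) ** 2

# ~420 mm2 at 40nm moved to 28nm, under the ideal-scaling assumption.
ideal = ideal_shrink_area(420, 40, 28)
print(f"ideal 28nm area: {ideal:.0f} mm2")
```

The ideal figure comes out a little under the ~220mm² quoted, which is plausible if part of the die does not scale perfectly.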

leoneazzurro · Apr 22, 2010

mczak said:
Indeed the 68xx on 40nm with 4 times the flops of gtx480 seems rather unlikely - would be twice the flops of Cypress, that would be one massive chip on 40nm.

And what if the new chip has less shaders but different clock domains for shaders and the rest of the chip? I mean, if really it has double the FLOPs with less SP it should have crazy high clocks. And to improve tessellation dramatically, it should have an approach similar to Fermi's.

So, it could be maybe a chip with SIMD clusters with a setup and tessellation engine for each cluster (maybe four "big" clusters with more than one SIMD inside? :???:) and the ALUs are not scalar but 4-way VLIW with a clock domain that is 2x the texturing-ROPs-etc.
Also the scheduling should have changed substantially, I think.

Hmm.. a lot of rumors and a lot of possibilities...

ferro · Apr 22, 2010

I think neliz got his information from tweakers.net rumors. A rough translation/interpretation from tweakers.net:

Hybrid Evergreen/NI
Much improved stream processor architecture
Much improved tessellation unit with 3 or 4 times Cypress performance
Enhanced rasterizer for improved efficiency
Improved UVD unit
Improved cache architecture for better GPGPU performance
6600 has 1 "SP module", 40nm, planned for Q4 2010
6700 has 2 "SP modules", 40nm, 10-20% faster than GF100 with 512SP, 400-440mm2, planned for Q3 2010
6800 has 4 "SP modules", 28nm, 512 bit memory bus, planned for Q1 2011

CRoland · Apr 22, 2010

ferro said:
I think neliz got his information from tweakers.net rumors. A rough translation/interpretation from tweakers.net:

Quite a lot of changes for a supposedly unplanned generation.

TKK · Apr 22, 2010

My take on the matter of SI using 4D instead of 5D is (something like this was actually suggested by someone else already, not sure whether it was here or on S|A though):
SP-functionality is getting removed from the T-unit, which now serves as a pure Special Function Unit instead of being a jack-of-all-trades. This reduces the 'official' number of SPs by 1/5th per SP cluster.

Number of SIMDs per block increased from 10 to 12, resulting in a total of 24 for the fastest 40nm part. This would result in 'only' 1536 SPs (slightly below Cypress), but 20% more TMUs and ex-T-units-now-SFUs.

Add those uncore improvements to the equation, and ~20-40% more performance (depending on game and workload) compared to Cypress seem absolutely possible.
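
The SP and TMU arithmetic in this 4D proposal can be sketched as follows. The 16 lanes and 4 TMUs per SIMD are Cypress's known layout; carrying them over to the new chip is an assumption, as is the helper name.

```python
LANES_PER_SIMD = 16  # thread processors per SIMD on Cypress
TMUS_PER_SIMD = 4    # texture units per SIMD on Cypress

def sp_count(simds, alus_per_lane):
    """'Official' SP count: SIMDs x lanes per SIMD x ALUs per lane."""
    return simds * LANES_PER_SIMD * alus_per_lane

cypress_sps = sp_count(20, 5)   # 5D (XYZWT) lanes, 20 SIMDs
proposed_sps = sp_count(24, 4)  # 4D (XYZW) lanes, 24 SIMDs

# TMUs (and the ex-T units turned SFUs) scale with the SIMD count.
tmu_gain = (24 * TMUS_PER_SIMD) / (20 * TMUS_PER_SIMD) - 1

print(cypress_sps, proposed_sps, f"{tmu_gain:.0%} more TMUs")
```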

Jawed · Apr 22, 2010

Mindfury said:
nApoleon said R9xx chip would have less stream processors than Cypress.

So it could be 24 SIMD or 20 SIMD.

24 could be reasonable I suppose, 1536 lanes. That would leave real world FLOPS pretty much unchanged, the ALUs would take about the same space as Cypress's (i.e. assuming a 20% saving per SIMD by deleting T, as I described earlier) and all the extra die space would be dedicated to the swathe of efficiency improvements - the stuff that was left out of Cypress due to the 40nm problems :smile:

Though I wouldn't bet against only 20 SIMDs, 1280 lanes.

All this in a "HD6770" now that would be neat. EDIT: Hmm, if the modules are 8 SIMDs, then we could be looking at 16 SIMDs, 1024 lanes.

I wonder if Hecatoncheires was the original name for Evergreen + 1, but when 40nm troubles arose AMD split Evergreen + 1 into SI and NI: SI being on a process older than 28nm (was going to be 32nm, but is now 40nm) and NI being 28nm. So many different ways to argue this

Jawed

mczak · Apr 22, 2010

leoneazzurro said:
And what if the new chip has less shaders but different clock domains for shaders and the rest of the chip? I mean, if really it has double the FLOPs with less SP it should have crazy high clocks.

You can't just increase clock (a lot) for the alus without changing them (very substantially), hence that seems unlikely. Might not be worth it (based on transistor count, power) anyway.

ferro said:
[*]Much improved tessellation unit with 3 or 4 times Cypress performance

That wouldn't be that impressive, though I guess might be "good enough".

[*]Improved UVD unit

Hmm, I wonder what it's missing today.

[*]Improved cache architecture for better GPGPU performance

Might also fix the supposedly low internal bandwidth problem I guess, hence might be beneficial to graphics too?

[*]6600 has 1 "SP module", 40nm, planned for Q4 2010
[*]6700 has 2 "SP modules", 40nm, 10-20% faster than GF100 with 512SP, 400-440mm2, planned for Q3 2010
[*]6800 has 4 "SP modules", 28nm, 512 bit memory bus, planned for Q1 2011

That would look quite good if true.

TKK said:
My take on the matter of SI using 4D instead of 5D is (something like this was actually suggested by someone else already, not sure whether it was here or on S|A though):
SP-functionality is getting removed from the T-unit, which now serves as a pure Special Function Unit instead of being a jack-of-all-trades. This reduces the 'official' number of SPs by 1/5th per SP cluster.

That would make sense since nvidia did something similar (no more mul/interpolation done in sfu), though nvidia also reduced the number of sfus (if t unit loses normal operations, that'll give a normal/sfu rate of 1:4, nvidia has 1:8 now).
The only "non-special" instructions which can only be done in the t unit currently are those related to the 32bit int multiplies I think. I'd guess if the t unit is going to be less jack-of-all-trades it'll lose the multiplier. 32bit int muls though should be doable by combining the normal units.

Number of SIMDs per block increased from 10 to 12, resulting in a total of 24 for the fastest 40nm part. This would result in 'only' 1536 SPs (slightly below Cypress), but 20% more TMUs and ex-T-units-now-SFUs.

I wonder how that compares in area-efficiency. I guess the expectation is this will perform better per transistor count? In any case losing a tiny bit of flops shouldn't be that big of a deal (we've seen Fermi being faster overall with half the flops). Should also make utilization of ALUs a bit higher I guess.

Add those uncore improvements to the equation, and ~20-40% more performance (depending on game and workload) compared to Cypress seem absolutely possible.

Uncore could probably be more important than the shader changes as outlined above imho, though I guess no matter how you look at it if it's 25% bigger it should indeed be also quite a bit faster.

racca · Apr 22, 2010

Jawed said:
24 could be reasonable I suppose, 1536 lanes. That would leave real world FLOPS pretty much unchanged, the ALUs would take about the same space as Cypress's (i.e. assuming a 20% saving per SIMD by deleting T, as I described earlier) and all the extra die space would be dedicated to the swathe of efficiency improvements - the stuff that was left out of Cypress due to the 40nm problems :smile:

Though I wouldn't bet against only 20 SIMDs, 1280 lanes.

All this in a "HD6770" now that would be neat. EDIT: Hmm, if the modules are 8 SIMDs, then we could be looking at 16 SIMDs, 1024 lanes.

I wonder if Hecatoncheires was the original name for Evergreen + 1, but when 40nm troubles arose AMD split Evergreen + 1 into SI and NI: SI being on a process older than 28nm (was going to be 32nm, but is now 40nm) and NI being 28nm. So many different ways to argue this

Jawed

There's no way you could've been right.

1. SIMD is NOT just ALUs, it's bundled with (quite large) caches, registers, TMUs, etc. If you take the ALU out of the T-units, it would save you much less than 20% even just for the ALU parts -- it's 4xALU+(ALU+SFU)->4xALU+SFU, you'd be lucky to get a 5% shrink per SIMD core. Counting the 20% increase in SIMD units, you should spend around 15% more on the shader core.

2. 20 SIMD would be unacceptable, unless AMD can find a way to allow SI to run at a much higher clock, ie 25%+, there's no way 1280sp parts could outperform HD5870/GTX480, let alone 1024sp. So what's the point in having those parts if they can't even beat 5870.
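
The two area estimates being argued over, plus the clock figure in point 2, can be put side by side in a small sketch. The per-SIMD savings (5% and 20%) are the posters' guesses, and the clock figure assumes performance tracks raw SP count times clock, which is a simplification.

```python
def relative_shader_area(simd_ratio, per_simd_saving):
    """Shader-core area relative to Cypress: more SIMDs, each slightly smaller."""
    return simd_ratio * (1 - per_simd_saving)

# 24 SIMDs instead of 20, with the two competing per-SIMD savings.
with_5pct_saving = relative_shader_area(24 / 20, 0.05)   # about 1.14
with_20pct_saving = relative_shader_area(24 / 20, 0.20)  # about 0.96

# Clock multiplier a 1280 SP part would need to match 1600 SPs on raw throughput.
clock_needed = 1600 / 1280

print(f"{with_5pct_saving:.2f} {with_20pct_saving:.2f} {clock_needed:.2f}")
```

With a 5% saving the shader core grows by roughly 14%, close to the "around 15%" above; with a 20% saving it stays about the same size as Cypress's.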

Jawed · Apr 22, 2010

racca said:
1. SIMD is NOT just ALUs, it's bundled with (quite large) caches, registers, TMUs, etc.

The SIMDs are ALUs, registers, LDS and buses to get data in and out. I'm specifically excluding TMUs and L1s - in fact LDS should be excluded since it's fixed size per core regardless of the ALU count. You can call the whole lot clusters or cores.

My guesstimate of 20% saving going from XYZWT to XYZW is excluding registers and LDS (which wouldn't change) but includes savings in routing and buffering that T requires.

On the other hand you're right, I'm mistakenly ignoring the fact that, say, 24 SIMDs adds TMUs, registers and all the other stuff. So on that side of the equation 24 SIMDs would see an overall increase in the area consumed by the cores, i.e. cutting into the supposed ~20% die size budget increase.

It's hard to tell but the cores in Cypress are probably in the region of 30-40% of the die. But that's pretty woolly - RV770's cores are ~41% of the die, while the SIMDs are ~29%. Anyway, increasing the count of TMUs, L1s and other stuff that scales with core count would cost some extra area.

In RV770 registers are 29% versus 71% for the ALUs when looking at just the computation part of the ALUs (i.e. excluding redundancy and LDS).

2. 20 SIMD would be unacceptable, unless AMD can find a way to allow SI to run at a much higher clock, ie 25%+, there's no way 1280sp parts could outperform HD5870/GTX480, let alone 1024sp. So what's the point in having those parts if they can't even beat 5870.

If this is really HD6770, then it's a cut-down part which, due to the demise of 32nm, is a bit larger than it should have been... In my view that makes it likely to have no more ALUs. And if I'm right about the way things pan out with an XYZW configuration, performance won't be notably affected.

We'll be arguing about its ALU count right up until the moment we know precisely what it is

The real replacement for HD5870, HD6870, theoretically only appears once 28nm is ready. Something like 18 months after HD5870?

Though this does raise the question: "what happened to the refresh of HD5870?". Dunno, still mulling that over.

Jawed

neliz · Apr 22, 2010

CRoland said:
Quite a lot of changes for a supposedly unplanned generation.

There was supposed to be an Evergreen refresh in March.

Alexko · Apr 22, 2010

neliz said:
There was supposed to be an Evergreen refresh in March.

Any idea what happened to that?

neliz · Apr 22, 2010

Alexko said:
Any idea what happened to that?

A launch turned fizzle?

Alexko · Apr 22, 2010

neliz said:
A launch turned fizzle?

Sure, but if the refresh is ready, why not release it?

GZ007 · Apr 22, 2010

Does it make sense to waste money on a refresh on the same troubled TSMC 40nm :?:

Wasn't the refresh supposed to be on TSMC 32nm, so that when 32nm was cancelled the refresh was killed too and changed to SI :?:

leoneazzurro · Apr 22, 2010

mczak said:
You can't just increase clock (a lot) for the alus without changing them (very substantially), hence that seems unlikely. Might not be worth it (based on transistor count, power) anyway.

It depends: the rumors say that the shaders are indeed changed, and if the SP count is lower than Cypress but FLOPS are much higher, then it would be plausible to suppose that the ALU clock is much higher. But running the whole chip at high speed seems unlikely, hence the idea that there could be a different clock domain for shaders. Or maybe they are EXTREME+ shaders, who knows

MfA · Apr 22, 2010

Jawed said:
The key question is: can you make spilling registers the target of latency-hiding

My problem is that at the size of their register sets they are very similar to caches already ... the access method differs somewhat, the access of caches takes more pipeline stages and power for instance, but at the size of GPU's register sets they start behaving more like caches than "normal" registers in access time (exemplified by the PV/PS registers). It's not like the "normal" register sets like say Larrabee has. I'm not convinced however that what Larrabee is doing makes sense, I think the majority of L1 cache accesses will be pushing and popping while running shaders.

If the GPU approach makes sense, but you start adding caches isn't there another compromise ... might it be possible to just combine the registers with the cache in a unified pool? (Along with local storage.)

Let's say that instead of normal registers we give each SC a window on the cache? The method of access would simply use direct indexing, so it would bypass the normal tag comparisons etc. which make the accesses expensive time and power wise. NVIDIA already did this for local storage, why not for registers as well?

3dilettante · Apr 22, 2010

MfA said:
My problem is that at the size of their register sets they are very similar to caches already ... the access method differs somewhat, the access of caches takes more pipeline stages and power for instance, but at the size of GPU's register sets they start behaving more like caches than "normal" registers in access time (exemplified by the PV/PS registers).

Isn't this more an artifact of having a VLIW architecture? Standard pipelines also have pipeline registers and bypass, a VLIW would just expose what a more complex chip does automatically.

jaredpace · Apr 22, 2010

ferro said:
6700 has 2 "SP modules", 40nm, 10-20% faster than GF100 with 512SP, 400-440mm2, planned for Q3 2010

6800 has 4 "SP modules", 28nm, 512 bit memory bus, planned for Q1 2011

My guess is a "SP module" is a block of 10 SIMD. 4x16x10 = 640 SP in a Module. 6700 = 1280 SP and 6800 = 2560 SP. Edit: If a block of 12, then 1536 SP & 3072 SP.
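
The module guess above works out as in this sketch. Module counts per part come from the rumour list quoted earlier in the thread; the 4 ALUs x 16 lanes per SIMD layout is the poster's assumption, and the names are illustrative.

```python
def sps_per_module(simds_per_module, lanes=16, alus_per_lane=4):
    """SPs in one 'SP module' under the guessed 4 x 16 x SIMDs layout."""
    return alus_per_lane * lanes * simds_per_module

# "SP modules" per part, as listed in the rumour.
MODULES = {"6600": 1, "6700": 2, "6800": 4}

# SP totals for both guessed module sizes (10 or 12 SIMDs per module).
table = {
    simds: {part: n * sps_per_module(simds) for part, n in MODULES.items()}
    for simds in (10, 12)
}

for simds, totals in table.items():
    print(f"{simds} SIMDs per module: {totals}")
```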

LordEC911 · Apr 23, 2010

no-X said:
If the HD6770 is ~420mm² at 40nm, it would measure about 220mm² at 28nm. That could be enough for a 256bit GDDR5 interface (without sideport). The 28nm part could later replace the initial 40nm model as a slightly faster HD6790.

I was thinking the same thing. ~420mm2 on 40nm w/ 2.8-3bil trannies. Shrink to 28nm in Q1 and effectively cut the die size almost in half.

jaredpace said:
My guess is a "SP module" is a block of 10 SIMD. 4x16x10 = 640 SP in a Module. 6700 = 1280 SP and 6800 = 2560 SP. Edit: If a block of 12, then 1536 SP & 3072 SP.

I would say 12SIMDs makes more sense with neliz's comment of 4x the flops of, my interpretation, GTX480.
The real question is, will the 6800 be a single 28nm GPU, a 3072SP part w/ a 512bit bus on 400-480mm2? Or is it just a dual GPU part made of two 6700s shrunk to 28nm?
