AMD: R9xx Speculation

chavvdarrr · Sep 30, 2010

Squilliam said:
Funny thing is, given the shape of the first number, I'm tempted to think that, if it's true, it is showing 640 rather than 480 stream processors. Perhaps this is a nod to their professional / HPC markets, in that they are giving the one SKU a large number of stream processors because it is relevant to these markets as well. Barts doesn't have to cross into the same markets and therefore can stick with a more balanced architecture.

and I see "4" for Cayman RPEs :S
Still TMU/Rops look like double digit.

jimbo75 · Sep 30, 2010

Squilliam said:
Funny thing is, given the shape of the first number, I'm tempted to think that, if it's true, it is showing 640 rather than 480 stream processors. Perhaps this is a nod to their professional / HPC markets, in that they are giving the one SKU a large number of stream processors because it is relevant to these markets as well. Barts doesn't have to cross into the same markets and therefore can stick with a more balanced architecture.

I agree that looks like a 640 to me. That could be a 500mm2 chip if so, and I don't know if AMD would go there again (esp. not with a ring bus).

Kaotik · Sep 30, 2010

chavvdarrr said:
and I see "4" for Cayman RPEs :S
Still TMU/Rops look like double digit.

That's a clear 3 IMO.
Gah, should just fire up PhotoShop and figure out which kinda blur they've used.

Squilliam · Sep 30, 2010

chavvdarrr said:
and I see "4" for Cayman RPEs :S
Still TMU/Rops look like double digit.

jimbo75 said:
I agree that looks like a 640 to me. That could be a 500mm2 chip if so, and I don't know if AMD would go there again (esp not with a ring bus ).

The thing is 480 stream processors is only ~33% larger than Barts and it is also quite likely that with it being a larger chip it won't clock as high so all things being considered there probably wouldn't be much difference between the two.

I know I could be, and probably am, wrong!

Anyway, what are the chances they have developed an architecture which decouples the TMU quantity from the ROP and stream processor SIMD count?

Kaotik · Oct 1, 2010

Squilliam said:
Anyway, what are the chances they have developed an architecture which decouples the TMU quantity from the ROP and stream processor SIMD count?

Wasn't R600 exactly like that? TMUs were separate from the SP SIMDs and could be used by any SP SIMD?

Shtal · Oct 1, 2010

I am confident that ATI/AMD will not go over 400mm2 die size.
Specs like 480(X4) processors, 48 ROPs, 96 TMUs and 30 SIMDs... look more legit.

Cayman with 640(X3) or 640(X4) processors would be as big as Nvidia's Fermi GPU, over 500mm2.
AMD should / will avoid building such a large chip. "AMD will NOT be like Nvidia".

Shtal · Oct 1, 2010

Looking at an old benchmark here for the ATI 6870: http://www.google.com/imgres?imgurl...mages?q=ati+6870+gpu-z&hl=en&gbv=2&tbs=isch:1
A default GPU clock of 850MHz clearly tells me that a GPU clocked this high will NOT be over 400mm2.

EDIT: Keeping the Cayman GPU under 200W TDP at a high clock speed will ONLY be possible with a smaller die size.

UniversalTruth · Oct 1, 2010

Shtal said:
I am confident that ATI/AMD will not go over 400mm2 die size.
Specs like 480(X4) processors, 48 ROPs, 96 TMUs and 30 SIMDs... look more legit.

Cayman with 640(X3) or 640(X4) processors would be as big as Nvidia's Fermi GPU, over 500mm2.
AMD should / will avoid building such a large chip. "AMD will NOT be like Nvidia".

No, 2560 SPs will not result in such a big chip, over 500 mm2. I don't know where your estimates come from.
Even if Cayman ships with 1920 active SPs, that doesn't necessarily mean the chip doesn't physically have more than 2000.

neliz · Oct 1, 2010

Shtal said:
Looking at an old benchmark here for the ATI 6870: http://www.google.com/imgres?imgurl...mages?q=ati+6870+gpu-z&hl=en&gbv=2&tbs=isch:1
A default GPU clock of 850MHz clearly tells me that a GPU clocked this high will NOT be over 400mm2.

EDIT: Keeping the Cayman GPU under 200W TDP at a high clock speed will ONLY be possible with a smaller die size.

You know what would be funny? If the original rumors turned out to be true, where "6700" (now Barts, 6800) would indeed be faster than a GTX480 and do 3TFlop.

and re: Die size.. don't worry!

UniversalTruth · Oct 1, 2010

OMG! That wouldn't be funny, that would be great and would be a great explanation why Barts is 6800 series.

I will be more than happy.

MarkoIt · Oct 1, 2010

Shtal said:
Cayman with 640(X3) or 640(X4) processors would be as big as Nvidia's Fermi GPU, over 500mm2.
AMD should / will avoid building such a large chip. "AMD will NOT be like Nvidia".

It's possible to guesstimate the die size of both configurations from the Cypress die size and the rumors.
Cypress has a die of 324 mm^2. About 1/3 of that is taken by the SIMDs (a figure that comes from RV770), so in Cypress 108 mm^2 are taken by 1600 SPs. If the rumored 25% increase in space efficiency is true, then in N.I. 1280 SPs (320x4) can fit in 81 mm^2 and perform like the 1600 SPs of Cypress. Doubling that gives 162 mm^2 for the SIMDs. Adding a 20% increase in complexity for the uncore/fixed-function units (TMUs/ROPs) gives a die size of about 420-430 mm^2 for 2560 SPs/96 TMUs/48 ROPs and 380-390 mm^2 for 1920 SPs/96 TMUs/48 ROPs.
By the way, Cayman ROPs can't be 32, because 32/3 gives 10.67 ROPs per RPE.

And the same goes for the SIMD number: 640 can't be split evenly across 3 RPEs.
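For anyone who wants to replay the guesstimate with different inputs, here is a minimal sketch of the arithmetic. Every constant is one of the rumoured or assumed figures from this post (1/3 of Cypress taken by SIMDs, 25% area saving, 20% uncore growth), not a measured value, and the function and constant names are made up for illustration.

```python
# Back-of-envelope die-size estimate; all inputs are rumours/assumptions.
CYPRESS_DIE_MM2 = 324.0
SIMD_FRACTION = 1.0 / 3.0   # share of the die taken by SIMDs (RV770-derived guess)
AREA_SAVING = 0.25          # rumoured area saving for a Cypress-equivalent SIMD block
UNCORE_GROWTH = 0.20        # assumed growth of TMU/ROP/fixed-function area

cypress_simd_mm2 = CYPRESS_DIE_MM2 * SIMD_FRACTION         # ~108 mm^2 for 1600 SPs
cypress_uncore_mm2 = CYPRESS_DIE_MM2 - cypress_simd_mm2    # ~216 mm^2
area_per_1280sp_mm2 = cypress_simd_mm2 * (1 - AREA_SAVING) # ~81 mm^2
uncore_mm2 = cypress_uncore_mm2 * (1 + UNCORE_GROWTH)      # ~259 mm^2


def estimate_die_mm2(stream_processors):
    """Scale SIMD area linearly with SP count and add the grown uncore."""
    return stream_processors / 1280 * area_per_1280sp_mm2 + uncore_mm2


def splits_evenly(units, rpes):
    """True if `units` can be divided equally among `rpes` RPEs."""
    return units % rpes == 0


for sps in (1920, 2560):
    print(sps, "SPs ->", round(estimate_die_mm2(sps), 1), "mm^2")
print("32 ROPs over 3 RPEs:", splits_evenly(32, 3))
print("48 ROPs over 3 RPEs:", splits_evenly(48, 3))
print("640 units over 3 RPEs:", splits_evenly(640, 3))
```

The linear scaling of SIMD area with SP count is the weakest assumption here; it ignores any per-RPE overhead.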

Arnold Beckenbauer · Oct 1, 2010

Jawed said:
That's no different from 64 TMUs and 320x4 ALU lanes.

In both cases it seems an RPE consists of 10 SIMDs. With 32 TMUs.

Which would imply TMUs are shared within an RPE by all the SIMDs ...

... or something along the lines of the patents I've been talking about, where TMUs are shared by SIMDs. The patents talk about a "processor" producing two filtered results independently and also sharing texel data (unfiltered texels, not texel results) amongst L1s, with 2 TMUs seemingly sharing an L1. Those two concepts would appear to tally with this peculiar setup.

R600 shares only results, I think, not original texel data (or, if you prefer, texel data isn't shared amongst L1s, only amongst L2s). Though it would be funny if a ring-bus appeared.

Barts Pro, presumably, has SIMDs turned off. It presumably also has TMUs turned off. So with SIMDs being much larger than TMUs, it's likely that while only 1 quad-TMU per RPE is turned off, 2 or more SIMDs would be turned off.

e.g. 1024 ALU lanes and 56 TMUs.

I have to admit I've got a queasy feeling about the "non-integer-multiple" SIMD:quad-TMU thing going on here.

Crap. You rule.
But how did HD2900GT work? It had 12 TMUs.

fellix · Oct 1, 2010

R600 doesn't need to disable TMU along with a SIMD.

Jawed · Oct 1, 2010

Arnold Beckenbauer said:
Crap. You rule.

You asked the right question though, as it's a useful comparison

But how did HD2900GT work? It had 12 TMUs.

R600 architecture matches the count of quad-ALUs per SIMD to the count of quad-TMUs, as in R6xx TMU count says nothing about ALU SIMD count. So 240 ALU lanes is 4 SIMDs each containing 3 quad VLIW-5 (60 ALU lanes per SIMD). So 3 quad-TMUs, where each quad-TMU feeds its results solely to the same numbered quad-ALU inside all 4 of the SIMDs.

Code:

  TMU1  TMU2  TMU3
    |     |     |
  ALU1  ALU2  ALU3  - SIMD 1
    |     |     |
  ALU1  ALU2  ALU3  - SIMD 2
    |     |     |
  ALU1  ALU2  ALU3  - SIMD 3
    |     |     |
  ALU1  ALU2  ALU3  - SIMD 4
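The same arrangement as a small sketch, in case the counting is easier to follow that way. It only encodes what is described above (quad-TMU n feeds quad-ALU n in every SIMD); the function name and the dictionary layout are made up for illustration and have nothing to do with how the hardware is actually wired.

```python
VLIW_WIDTH = 5   # VLIW-5: five ALU lanes per pixel
QUAD = 4         # four pixels (or four TMUs) per quad


def r6xx_config(simds, quads_per_simd):
    """Return (ALU lanes, TMU count, feed map) for an R6xx-style layout.

    The feed map says which (simd, quad_alu) pairs each quad-TMU serves:
    quad-TMU n serves quad-ALU n in every SIMD.
    """
    alu_lanes = simds * quads_per_simd * QUAD * VLIW_WIDTH
    tmus = quads_per_simd * QUAD  # TMU count follows SIMD width, not SIMD count
    feeds = {
        tmu: [(simd, tmu) for simd in range(1, simds + 1)]
        for tmu in range(1, quads_per_simd + 1)
    }
    return alu_lanes, tmus, feeds


lanes, tmus, feeds = r6xx_config(simds=4, quads_per_simd=3)  # HD2900GT case
print(lanes, "ALU lanes,", tmus, "TMUs")
print("quad-TMU 2 feeds:", feeds[2])
```

Changing `simds` leaves the TMU count untouched, which is the point of the comparison.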

So in that sense this new design is like R600. So the question is: what kind of bus architecture, what kind of inter-cache sharing is used, and how far do texturing results travel?

Because R6xx never exceeded 4 SIMDs, we don't know what TMU sharing arrangement would have been implemented with more SIMDs and more TMUs. e.g. it appears likely these new GPUs have TMUs localised to an RPE, whereas R600 shares them across all SIMDs.

R5xx and R6xx don't match ROPs to MCs, whereas R7xx does. R7xx therefore keeps a vast amount of bandwidth local to an MC and so it's easier to get away without a ring bus. Though in fact the internal bandwidth from L2s->L1s is ~4x higher than off-die bandwidth. Being unidirectional helps, I guess (obviously commands go the other way, but not a huge bandwidth).

There's also shader export data that needs to go to the MCs (GS export or PS export) but that's relatively low bandwidth, e.g. 32 4-byte results per clock.
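To put a rough number on that export figure, here is a quick sketch. The 850MHz clock is purely an assumption borrowed from the rumoured 6870 clock mentioned earlier in the thread, and the function name is made up.

```python
def export_bandwidth_gb_s(results_per_clock=32, bytes_per_result=4, clock_mhz=850):
    """Bytes exported per clock times clock rate, in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_clock = results_per_clock * bytes_per_result
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


print(round(export_bandwidth_gb_s(), 1), "GB/s at an assumed 850MHz")
```

That is 128 bytes per clock, so the result scales directly with whatever clock you plug in.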

Barts with 64 TMUs might only have 8 L1s, which is an improvement on the 10 of Juniper. Each L1 in this setup would be dual-ported, feeding two quad-TMUs, which obviously adds local complexity.

The other side of the bandwidth question is making read/write UAV access efficient. Global atomics are very efficient in Evergreen, but more generic operations are un-cached. Really they should be cached (this would also help register spill). The patents I've been referring to seem to hint that they will hold data written to them by SIMDs, not merely that they will hold texel data.

In the extreme it's possible they've implemented a distributed coherent read/write L1, which would make read-write UAV work really fast. L2, in this scenario, wouldn't do coherency, it merely increases efficiency of MC and ROP operations. I suppose...

fellix · Oct 1, 2010

Here's a rather oversimplified diagram of what I think Cayman is:

Each SIMD array is flanked by two 16xTMU clusters, in R600 "style".

jimbo75 · Oct 1, 2010

Shtal said:
I am confident that ATI/AMD will not go over 400mm2 die size.
Specs like 480(X4) processors, 48 ROPs, 96 TMUs and 30 SIMDs... look more legit.

Cayman with 640(X3) or 640(X4) processors would be as big as Nvidia's Fermi GPU, over 500mm2.
AMD should / will avoid building such a large chip. "AMD will NOT be like Nvidia".

While I tend to agree, there is no reason why AMD couldn't go with a 640-shader behemoth, coming in at around 50% faster than the GTX 480. They won't do it, but they could.

Entropy · Oct 1, 2010

Jawed said:
In the extreme it's possible they've implemented a distributed coherent read/write L1

Impact on L1 latency?

Jawed · Oct 1, 2010

Entropy said:
Impact on L1 latency?

Probably.

mczak · Oct 1, 2010

Jawed said:
R600 architecture matches the count of quad-ALUs per SIMD to the count of quad-TMUs, as in R6xx TMU count says nothing about ALU SIMD count. So 240 ALU lanes is 4 SIMDs each containing 3 quad VLIW-5 (60 ALU lanes per SIMD). So 3 quad-TMUs, where each quad-TMU feeds its results solely to the same numbered quad-ALU inside all 4 of the SIMDs.

I completely forgot that SIMD width (and hence wavefront size) can not only vary between chips, but can actually also be varied within a specific chip by disabling parts.
So, since that's apparently easy, wouldn't that make it more likely AMD has indeed increased wavefront size to 80, and an RPE is just 8 SIMDs, with a more traditional TMU arrangement (a quad-TMU per SIMD)? So the TMUs are still local to SIMDs, and an RPE just includes the rasterizer plus some more stuff (tessellator plus some more).
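A quick sanity check of that idea against the rumoured numbers, as a sketch only. It assumes a wavefront of 80 means a 20-wide SIMD (the same 4-clock cadence as the current 16-wide, wavefront-64 design), that "(x4)" means four ALU lanes per unit, 8 SIMDs per RPE and one quad-TMU per SIMD. None of that is confirmed, and the function name is made up.

```python
VLIW_WIDTH = 4  # assumed meaning of the "(x4)" in the rumoured specs


def rpe_layout(vliw_units, simd_width=20, simds_per_rpe=8, tmus_per_simd=4):
    """Derive SIMD, RPE and TMU counts from a rumoured unit count.

    Raises ValueError if the units do not divide evenly into SIMDs and RPEs.
    """
    simds, rem = divmod(vliw_units, simd_width)
    if rem:
        raise ValueError("units do not fill whole SIMDs")
    rpes, rem = divmod(simds, simds_per_rpe)
    if rem:
        raise ValueError("SIMDs do not fill whole RPEs")
    return {
        "simds": simds,
        "rpes": rpes,
        "tmus": simds * tmus_per_simd,
        "alu_lanes": vliw_units * VLIW_WIDTH,
    }


for units in (320, 480, 640):
    print(units, rpe_layout(units))
```

Under these assumptions 320 units give 2 RPEs with 64 TMUs and 480 units give 3 RPEs with 96 TMUs, which lines up with the figures being discussed; 640 units would need 4 RPEs.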
