AMD: R9xx Speculation

MarkoIt · Sep 29, 2010

One RPE could also be 8 SIMDs, each with 80 SPs and a quad TMU attached.

Oh, what about MC?
Caicos for sure 128bit
Barts for sure 256bit
But Cayman? 256 or 384bit? To me it seems to read 48 for Cayman's ROPs.
Edit: I forgot about the card picture with 8 chips... but maybe it was a fake!

With those specs, Cayman should on paper be 50% faster than Barts.
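A quick check of the 50% figure, as a sketch only: it assumes the shader counts read off the blurry slide elsewhere in this thread (480x4 for Cayman, 320x4 for Barts) and equal clocks, none of which is confirmed. The helper name `alu_lanes` is made up for illustration.

```python
# Assumed figures: shader counts as read off the leaked slide in this thread
# (480x4 for Cayman, 320x4 for Barts). These are rumours, not confirmed specs.

def alu_lanes(vliw_units: int, width: int) -> int:
    """Total ALU lanes = number of VLIW units times lanes per unit."""
    return vliw_units * width

cayman = alu_lanes(480, 4)
barts = alu_lanes(320, 4)

# Ratio of raw ALU lanes; only meaningful if both chips run at the same clock.
ratio = cayman / barts
print(f"Cayman {cayman} lanes vs Barts {barts} lanes: {ratio:.2f}x at equal clocks")
```

This only compares raw ALU lanes. TMU, ROP and memory bandwidth ratios could pull the real-world gap in either direction.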

Jawed · Sep 29, 2010

One solution to the SIMD<->TMU mapping problem is if the TMUs are actually with the ROPs alongside the memory controllers. Cayman appears to have 32 ROPs. Could it have 64 TMUs?

In RV770 the TMUs and LDS were located together and LDS and TMUs seemed to share data paths (or at the very least timings). With Evergreen, LDS became independent.

After that, you could argue the TMUs could move anywhere.

The problem with this, though, is the vast bandwidth that's needed from TMUs to SIMDs. This would be a step in the wrong direction in comparison with Evergreen. On the other hand, these data paths need to exist for colour buffer writes (and other export functions) and also for global atomics.

So putting the TMUs near the MCs with the ROPs would mean TMUs and ROPs are sharing a bus to the SIMDs. Additionally, since all SIMDs need to talk to all MCs/L2s/atomics, the TMUs would end up being shared globally too.

So, ahem, what kind of bus/crossbar is going to do that?

Ring bus 2.0?

fellix · Sep 29, 2010

Ring-bus seems fine, but with that many clients on it, could the round-trip latency become an issue?

no-X · Sep 29, 2010

Isn't this slide too ugly to be real?

Alexko · Sep 29, 2010

fellix said:
Ring-bus seems fine, but with that many clients on it, could the round-trip latency become an issue?

Intel seems to think a ring bus is just fine for 32 CPUs on Larrabee, for what it's worth…

Kaotik · Sep 29, 2010

Jawed said:
One solution to the SIMD<->TMU mapping problem is if the TMUs are actually with the ROPs alongside the memory controllers. Cayman appears to have 32 ROPs. Could it have 64 TMUs?

The blurred number suggests 96 (the white blur is a bit weaker in the lower-left and upper-right corners, consistent with how a 9 and a 6 should look when blurred).

3dilettante · Sep 29, 2010

Alexko said:
Intel seems to think a ring bus is just fine for 32 CPUs on Larrabee, for what it's worth…

It's fine for up to 16.
For 32, there are two linked rings.
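To put a rough number on the latency worry, here is a toy model, not a description of Larrabee or any AMD design: it counts shortest-path hops between stops on a single bidirectional ring, ignoring contention, per-stop latency and any linking between rings. Both function names are hypothetical.

```python
def avg_hops_bidirectional_ring(n_stops: int) -> float:
    """Mean shortest-path hop count between two distinct stops on a bidirectional ring."""
    total = 0
    for d in range(1, n_stops):
        # A message can travel either way round, so take the shorter direction.
        total += min(d, n_stops - d)
    return total / (n_stops - 1)

def worst_hops_bidirectional_ring(n_stops: int) -> int:
    """Hop count to the stop furthest away (directly opposite on the ring)."""
    return n_stops // 2

for n in (16, 32):
    avg = avg_hops_bidirectional_ring(n)
    worst = worst_hops_bidirectional_ring(n)
    print(f"{n} stops: average {avg:.2f} hops, worst case {worst} hops")
```

In this model, going from 16 to 32 stops roughly doubles both the average and the worst-case hop count, which is one plausible reason to split a large design into two linked rings rather than one big one.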

racca · Sep 30, 2010

no-X said:
I speak about high-end / midrange.

You said nothing about mid/high range. XX70 fits ALL. Your statement is false, period.
If you said X870/X770, I would have agreed, but you didn't.

There's not a single reason to expect, that HD6850 and 6870 will be based on different GPUs.

I didn't say I was expecting that, did I?
It's true that I think the 6850/6870 naming scheme is a bit silly, but it doesn't mean I'm expecting what others have said.
IMHO, BartsXT would be better off with a 6830 label on it, alongside BartsPRO as 6770, and possibly a Cypress (1280SP/32ROP/~700MHz) as 6750/6730.

AMD would have a chance to rebrand the 5000 series (Cypress and Juniper with different specs) entirely as a stop-gap solution, and devote more man-hours to a full-fledged 28nm NI family instead.

racca · Sep 30, 2010

Jawed said:
After that, you could argue the TMUs could move anywhere.

So in essence, a decoupled TMU cluster per ROP/RBE block or per SIMD block? (with 96 TMUs, the latter would be more likely to be true)

fellix said:
Ring-bus seems fine, but with that many clients on it, could the round-trip latency become an issue?

Well, I'm sure that if AMD decides to use it this time, they won't make the same mistake all over again.

DavidGraham said:
doesn't TMU sharing add latency and conflicts?

Not if you do it right. With more filter functions moved to the ALUs, a shared TMU cluster might be just the right solution.

GZ007 · Sep 30, 2010

Some TMU sharing could make sense. Not all pixels need the same fixed ALU/TEX ratio and bandwidth. Some pixels can have several high-resolution textures while others have none. A fixed ALU/TEX ratio can help in theoretical texel-rate benchmarks, but in real games, if you could watch each pixel's rendering time within a single second, there could be a lot of differences (and also bottlenecks from other parts of the GPU). So the TMU disadvantage is gone after a second of rendering (GTX 480 vs 5800).

DeF · Sep 30, 2010

fellix said:
Is it me, or there's a potential imbalance -- an opposite case to Fermi -- in Cayman's spec's with only 32 ROPs but 30 SIMDs (48 pixels?), regarding pixel throughput from the fragment pipeline to the back-end?

p.s.:

In the pic above this is what I see for Cayman:
480(x4)
96
32
3

I wonder what Barts' and Cayman's die sizes would be with those specs.

no-X · Sep 30, 2010

racca said:
You said nothing about mid/high range. XX70 fits ALL. Your statement is false, period.
If you said X870/X770, I would have agreed, but you didn't

You could notice, that we're discussing future midrange / high-end for weeks. There's no reason to imply in every single post, that the discussion isn't related to low-end

CarstenS · Sep 30, 2010

While I find all the speculation about resurrecting yesteryears concepts very interesting, allow me to point one thing out: Quite likely Islands development has started after AMD took over and maybe already with fusion concepts in mind. So maybe we need to take more possibilities into account?

flopper · Sep 30, 2010

CarstenS said:
While I find all the speculation about resurrecting yesteryears concepts very interesting, allow me to point one thing out: Quite likely Islands development has started after AMD took over and maybe already with fusion concepts in mind. So maybe we need to take more possibilities into account?

so amd implemented several smaller cores already?

neliz · Sep 30, 2010

flopper said:
so amd implemented several smaller cores already?

If Fusion were anything close to modular, it would be relatively "easy" to insert an updated DX11 core into an existing design, right? Something like UVD3, an N.I. feature that's also available on Fusion, suggests that.

Jawed · Sep 30, 2010

I think TMUs are probably 96, making them local to the RPEs, and not globally shared. Much as I'd like to see the TMUs and ROPs sharing L1 and some ALUs, I don't think it's gonna happen here.

racca · Sep 30, 2010

If said speculation were true, ie. improvements over TMU-sharing/setup/rasterizer/$, and 4D is quite close to 5D in terms of throughput, then perhaps Barts can beat Cypress clock for clock after all.
Not quite justifying the 6870 name, but it's a start.

racca · Sep 30, 2010

no-X said:
You could notice, that we're discussing future midrange / high-end for weeks. There's no reason to imply in every single post, that the discussion isn't related to low-end

No. We have been discussing the (damn near still unclear) future architecture for weeks. High/mid-end parts get more attention for sure, but you can't rule anything out.
Plus I listed many firsts back there; who's to say 6800 isn't gonna be the next?
Say, if Barts were to be named 6800, that's got to be a first anyway, isn't it? AMD would not be following their "tradition", hence your argument has no ground.
And BTW you don't have to specify in every post, because most of the posts either can apply to mid to low end parts or have a code/product name in it.

So let's just stop here, alright.

Squilliam · Sep 30, 2010

fellix said:
Is it me, or there's a potential imbalance -- an opposite case to Fermi -- in Cayman's spec's with only 32 ROPs but 30 SIMDs (48 pixels?), regarding pixel throughput from the fragment pipeline to the back-end?

p.s.:

Funny thing is, given the shape of the first number, I'm tempted to think that, if it's true, it is showing 640 rather than 480 stream processors. Perhaps this is a nod to their professional / HPC markets, in that they are giving the one SKU a large number of stream processors because it is relevant to these markets as well. Barts doesn't have to cross into the same markets and therefore can stick with a more balanced architecture.

mczak · Sep 30, 2010

racca said:
If said speculation were true, ie. improvements over TMU-sharing/setup/rasterizer/$, and 4D is quite close to 5D in terms of throughput, then perhaps Barts can beat Cypress clock for clock after all.

Well, if that's really 320x4 shaders, why not - after all the chip already seems to have the same ROP capabilities as Cypress.
But the rumors are still conflicting, and I haven't seen any credible die size numbers either - those should give some more indication of what performance might be expected.
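The "4D close to 5D" argument can be framed as a utilization break-even. The sketch below assumes the rumoured 320x4 layout for Barts and takes Cypress as 320 five-wide units (1600 stream processors). Clocks are assumed equal and everything outside the ALUs is ignored. The function name is hypothetical.

```python
# Assumptions: Barts as 320 four-wide units (rumoured in this thread),
# Cypress as 320 five-wide units (1600 stream processors), equal clocks.
BARTS_UNITS, BARTS_WIDTH = 320, 4
CYPRESS_UNITS, CYPRESS_WIDTH = 320, 5

def break_even_cypress_utilization(barts_utilization: float) -> float:
    """Cypress slot utilization at which both chips do the same ALU work per clock."""
    barts_lanes = BARTS_UNITS * BARTS_WIDTH
    cypress_lanes = CYPRESS_UNITS * CYPRESS_WIDTH
    return barts_utilization * barts_lanes / cypress_lanes

# If Barts kept all four slots busy, Cypress would need to fill this
# fraction of its five slots on average to do the same work per clock.
print(break_even_cypress_utilization(1.0))
```

Under these assumptions the break-even is 0.8, i.e. four of five slots filled on average. Barts would pull ahead on ALU work per clock only if typical shader code packs the 5D units less densely than that, relative to how well it packs the 4D ones.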
