AMD: R9xx Speculation

Alexko · Sep 29, 2010

Isn't that Caicos on the left? Looks like 32 TMUs and 16 ROPs. Can't really make out the number of SPs, though, but I guess 160×4 is a safe bet.

Man from Atlantis · Sep 29, 2010

So RPE is AMD's version of GPCs... that means twice the triangle power and twice the tessellation power of the previous 5xxx.

Nakai · Sep 29, 2010

Turks? Where is Turks?

Alexko said:
Can't really make out the number of SPs, though, but I guess 160×4 is a safe bet.

I would say anything between 100 and 160. It doesn't look like all RPEs must have the same number of SPs.

RPE sounds very much like marketing...

Arnold Beckenbauer · Sep 29, 2010

jaredpace said:
RPE?

My idea: One RPE is one "ultra threaded dispatch processor" and one rasterizer (and one Shader/TMU block)..

Kaotik · Sep 29, 2010

Alexko said:
Isn't that Caicos on the left? Looks like 32 TMUs and 16 ROPs. Can't really make out the number of SPs, though, but I guess 160×4 is a safe bet.

That would mean Turks is the slowest chip, which doesn't make much sense, but it's just a name, so who knows.

edit:
IMO it looks like this
Chip: Caicos - Barts - Cayman
SPs: 160x4 - 320x4 - 480x4
TMUs: 32 - 64 - 96
ROPs: 16 - 32 - 48 (could be 32 too but 48 would "fit" the rest better)
RPEs: 1 - 2 - 3

Psycho · Sep 29, 2010

RPE could just be like Cypress's dual setup, it doesn't have to be all the way like the GPCs.
Is Cayman 96/32/3 behind the blur? It's not 4 at least.

fellix · Sep 29, 2010

If Cayman is 480 4-way SPs, that makes 30x16-way SIMDs or 10 SIMDs per RPE, for 3xRPEs.
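The division above can be checked with a quick Python sketch. Every figure in it is speculation from this thread (480 four-way SPs, 16 SPs per SIMD, 3 RPEs), and the helper name `simd_layout` is made up for illustration.

```python
# Sketch of the SIMD arithmetic from the post above.
# All figures are speculative numbers from this thread, not confirmed specs.
VLIW_WIDTH = 4        # ALU lanes per SP ("4-way")
SPS_PER_SIMD = 16     # "16-way" SIMD
FOUR_WAY_SPS = 480    # rumoured Cayman SP count
RPES = 3              # rumoured Cayman RPE count


def simd_layout(four_way_sps, sps_per_simd, rpes):
    """Return (total SIMDs, SIMDs per RPE); raise if either split is uneven."""
    simds, rem = divmod(four_way_sps, sps_per_simd)
    if rem:
        raise ValueError("SP count is not a whole number of SIMDs")
    per_rpe, rem = divmod(simds, rpes)
    if rem:
        raise ValueError("SIMDs do not divide evenly across RPEs")
    return simds, per_rpe


simds, per_rpe = simd_layout(FOUR_WAY_SPS, SPS_PER_SIMD, RPES)
alu_lanes = FOUR_WAY_SPS * VLIW_WIDTH
print(simds, per_rpe, alu_lanes)  # 30 10 1920
```

Swapping in 160 or 320 four-way SPs with 1 or 2 RPEs gives the same 10 SIMDs per RPE, which is what makes the three-chip lineup look regular.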

Jawed · Sep 29, 2010

Arnold Beckenbauer said:
My idea: One RPE is one "ultra threaded dispatch processor" and one rasterizer (and one Shader/TMU block)..

Which is very much the same as Evergreen, only called "shader engine".

DavidGraham · Sep 29, 2010

Psycho said:
RPE could just be like Cypress's dual setup, it doesn't have to be all the way like the GPCs.

It could; however, it is more logical that an RPE is like a GPC, because it scales perfectly with core count:

Caicos : 1 RPE = 640 SPs
Barts : 2 RPE = 1280 SPs (640x2)
Cayman : 3 RPE = 1920 SPs (640x3)

Arnold Beckenbauer · Sep 29, 2010

Jawed said:
Which is very much the same as Evergreen, only called "shader engine".

Yeah. But the problem is still the same: 32 TMUs and 160x4 TPs - this can't work (RV770 style). Or R600 is back.

Psycho · Sep 29, 2010

As long as the slides are sufficiently blurred, we can make sensible configurations out of them, instead of calling fake due to inconsistent numbers.

Sontin · Sep 29, 2010

DavidGraham said:
It could; however, it is more logical that an RPE is like a GPC, because it scales perfectly with core count:

Caicos : 1 RPE = 640 SPs
Barts : 2 RPE = 1280 SPs (640x2)
Cayman : 3 RPE = 1920 SPs (640x3)

The problem is you know what nvidia did with Fermi. Yet you think that AMD will do the same.

DavidGraham · Sep 29, 2010

Psycho said:
As long as the slides are sufficiently blurred, we can make sensible configurations out of them, instead of calling fake due to inconsistent numbers.

Makes me wonder... why blur the rest of the specs? I understand the need for blurring the superior and inferior parts, but why the specs?

fellix said:
If Cayman is 480 4-way SPs, that makes 30x16-way SIMDs or 10 SIMDs per RPE, for 3xRPEs.

At least that took care of the wavefront problem; it now sits at 64, as it should.

Arnold Beckenbauer · Sep 29, 2010

fellix said:
If Cayman is 480 4-way SPs, that makes 30x16-way SIMDs or 10 SIMDs per RPE, for 3xRPEs.

Don't forget the TMUs. 96 TMUs and 30 SIMDs - this can't work.

Or we go back to the R600 style but this time with 2 clocks latency.
So one RPE then has: one TMU-SIMD (32 TMUs) and 5 32-way-SIMDs.

DavidGraham · Sep 29, 2010

Arnold Beckenbauer said:
Or we go back to the R600 style but this time with 2 clocks latency.
So one RPE then has: one TMU-SIMD (32 TMUs) and 10 32-way-SIMDs.

Wasn't R600 just like RV770? I.e., it used 4 shader clusters (80 SPs each), with a texture quad block for each cluster?

fellix · Sep 29, 2010

Yeah, the TMU count is still a two-digit number.

Jawed · Sep 29, 2010

Arnold Beckenbauer said:
Yeah. But the problem is still the same: 32 TMUs and 160x4 TPs - this can't work (RV770 style).

That's no different from 64 TMUs and 320x4 ALU lanes.

In both cases it seems an RPE consists of 10 SIMDs. With 32 TMUs.

Which would imply TMUs are shared within an RPE by all the SIMDs ...

Arnold Beckenbauer said:
Or R600 is back.

... or something along the lines of the patents I've been talking about, where TMUs are shared by SIMDs. The patents talk about a "processor" producing two filtered results independently and also sharing texel data (unfiltered texels, not texel results) amongst L1s, with 2 TMUs seemingly sharing an L1. Those two concepts would appear to tally with this peculiar setup.

R600 shares only results, I think, not original texel data (or, if you prefer, texel data isn't shared amongst L1s, only amongst L2s). Though it would be funny if a ring-bus appeared.

Barts Pro, presumably, has SIMDs turned off. It presumably also has TMUs turned off. So with SIMDs being much larger than TMUs, it's likely that while only 1 quad-TMU per RPE is turned off, 2 or more SIMDs would be turned off.

e.g. 1024 ALU lanes and 56 TMUs.

I have to admit I've got a queasy feeling about the "non-integer-multiple" SIMD:quad-TMU thing going on here.
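To put a number on the "non-integer-multiple" worry, here is a rough Python sketch of the SIMD:quad-TMU ratio. It assumes 64 ALU lanes per SIMD (16 SPs × 4-way, as speculated earlier in the thread) and 4 TMUs per quad; the function name and the config labels are made up for illustration.

```python
from fractions import Fraction

# Assumptions (thread speculation, not confirmed specs):
LANES_PER_SIMD = 64   # 16 SPs x 4-way
TMUS_PER_QUAD = 4     # one quad-TMU


def simds_per_quad_tmu(alu_lanes, tmus):
    """Exact ratio of SIMD count to quad-TMU count for a configuration."""
    simds = Fraction(alu_lanes, LANES_PER_SIMD)
    quads = Fraction(tmus, TMUS_PER_QUAD)
    return simds / quads


configs = {
    "one RPE (640 lanes, 32 TMUs)": (640, 32),
    "two RPEs (1280 lanes, 64 TMUs)": (1280, 64),
    "cut-down example (1024 lanes, 56 TMUs)": (1024, 56),
}

for name, (lanes, tmus) in configs.items():
    ratio = simds_per_quad_tmu(lanes, tmus)
    whole = ratio.denominator == 1
    print(f"{name}: {ratio} SIMDs per quad-TMU, integer multiple: {whole}")
```

Under these assumptions the full configurations come out at 5/4 and the cut-down example at 8/7, so none of them maps a whole number of SIMDs onto each quad-TMU.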

Arnold Beckenbauer · Sep 29, 2010

DavidGraham said:
Wasn't R600 just like RV770? I.e., it used 4 shader clusters (80 SPs each), with a texture quad block for each cluster?

This is the main difference: The R600 design had a decoupled TMU-SIMD. So the R600 had five SIMDs: one TMU-SIMD and four Shader-SIMDs.

fellix · Sep 29, 2010

Is it just me, or is there a potential imbalance -- an opposite case to Fermi -- in Cayman's specs, with only 32 ROPs but 30 SIMDs (48 pixels?), regarding pixel throughput from the fragment pipeline to the back-end?

DavidGraham · Sep 29, 2010

Arnold Beckenbauer said:
This is the main difference: The R600 design had a decoupled TMU-SIMD. So the R600 had five SIMDs: one TMU-SIMD and four Shader-SIMDs.

I see, thanks for the heads up.

Is there a possibility that we are tackling the wrong side of the problem? Cayman could have 24 SIMDs (80 SPs each) and maintain the right texture arrangement (4x24 = 96)?

I know the wavefront problem would persist, but what is more likely: a change in wavefront size (with subsequent load on the compiler, possibly degrading performance) or a change in the texture quad arrangement? Which is the least harmful option? Doesn't TMU sharing add latency and conflicts?
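The trade-off between the two layouts can be laid out in a small Python sketch. It assumes a wavefront is issued across a SIMD over 4 clocks (as I understand earlier AMD parts to work), 4-way SPs, and the rumoured 1920 lanes and 96 TMUs; the function name `describe` is made up for illustration.

```python
# Assumptions (thread speculation, not confirmed specs):
CLOCKS_PER_WAVEFRONT = 4   # one wavefront issues over 4 clocks
VLIW_WIDTH = 4             # ALU lanes per SP
TOTAL_LANES = 1920         # rumoured Cayman lane count
TOTAL_TMUS = 96            # rumoured Cayman TMU count


def describe(simd_count):
    """Per-SIMD figures for a hypothetical split into simd_count SIMDs."""
    lanes_per_simd = TOTAL_LANES // simd_count
    sps_per_simd = lanes_per_simd // VLIW_WIDTH
    return {
        "lanes_per_simd": lanes_per_simd,
        "wavefront": sps_per_simd * CLOCKS_PER_WAVEFRONT,
        "tmus_per_simd": TOTAL_TMUS / simd_count,
    }


for count in (30, 24):
    print(count, describe(count))
```

With 30 SIMDs the wavefront comes out at 64 but the TMUs do not split into whole quads per SIMD; with 24 SIMDs each SIMD gets exactly one quad-TMU but the wavefront grows to 80. That is the choice being asked about, if the "wavefront problem" is read as a wavefront size other than 64.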
