AMD: R9xx Speculation

For one thing, they have high current power gating on SOI (I don't think anyone does it in bulk except for Intel).
 
I know, but at one point or another GPU functionality, at least for low-end GPUs, will have to move to SOI. Fusion is not that far away.

Is it out of the box to think that graphics functionality will come from the NI architecture?
Llano's GPU is an Evergreen derivative.

Fusion chips will be on SOI.
 
For one thing, they have high current power gating on SOI (I don't think anyone does it in bulk except for Intel).

With increasing leakage power, I think TSMC will have to find a way to do it in bulk too. Of course, things will get a *big* push at TSMC if GF can manage it. :smile:
 
My bets for Evergreen+1:

a) More local memory, 32->64 KB at minimum.

b) 28-32 SIMDs

c) Semi-coherent R/W cache hierarchy

d) Cached atomics.

e) Actually a wish: Unification of at least 2 out of the {reg file, local memory, L1 cache} pools. All 3 unified would be a Christmas gift. :smile:

f) Unified address space.
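
On (f), here's a minimal CUDA-flavoured sketch of what a unified address space buys you at the programming level: one helper can walk a pointer into either global or local/shared memory, with the hardware resolving the space instead of the instruction encoding. The function names are mine, purely illustrative:

[code]
#include <cstdio>

// With a unified (generic) address space, one helper works on a pointer
// into any memory space; without it you'd need a separate code path (or
// instruction encoding) per space. 'sum4' and 'kernel' are made-up names.
__device__ float sum4(const float* p)      // generic pointer: global OR shared
{
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void kernel(const float* in, float* out)
{
    __shared__ float tile[4];
    if (threadIdx.x < 4) tile[threadIdx.x] = in[threadIdx.x] * 2.0f;
    __syncthreads();

    if (threadIdx.x == 0) {
        out[0] = sum4(in);    // pointer into global memory
        out[1] = sum4(tile);  // pointer into shared (local) memory
    }
}

int main()
{
    float h_in[4] = {1, 2, 3, 4}, h_out[2];
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof h_in);
    cudaMalloc(&d_out, sizeof h_out);
    cudaMemcpy(d_in, h_in, sizeof h_in, cudaMemcpyHostToDevice);
    kernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof h_out, cudaMemcpyDeviceToHost);
    printf("global: %g  shared: %g\n", h_out[0], h_out[1]);  // 10 and 20
    return 0;
}
[/code]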
 
R/W caching will be there for sure, something like 32K per SIMD and 48~64K of LDS alone. The SRAM cell density on a 28nm bulk process should accommodate such expansion in a moderate die size.
 
e) Actually a wish: Unification of at least 2 out of the {reg file, local memory, L1 cache} pools. All 3 unified would be a Christmas gift. :smile:
While I understand why and how to unify the reg file and local memory OR local memory and the L1 cache, I can't understand why the L1 cache and reg file should be unified. What's the advantage? How would you do it?
 
While I understand why and how to unify the reg file and local memory OR local memory and the L1 cache, I can't understand why the L1 cache and reg file should be unified. What's the advantage? How would you do it?

The advantage is flexibility and higher overall utilization. It will make the performance cliffs much more gradual. I'd prefer to have the L1 and local mem unified, though I guess the reg file and local mem is an easier target.

How to do it? Well, the only answer to that I have right now is a variant of LRB1's memory system.
 
The advantage is flexibility and higher overall utilization. It will make the performance cliffs much more gradual. I'd prefer to have the L1 and local mem unified, though I guess the reg file and local mem is an easier target.

How to do it? Well, the only answer to that I have right now is a variant of LRB1's memory system.
One question though: how could they cope with simultaneous accesses to the L1, LDS and register file, which should happen quite often?

I think the LDS needs to be separate, and so does the reg file, else you'll end up with a bandwidth issue.
 
The LDS crossbars are nasty pieces of wiring ... do you really want to tie that kind of hardware to the GPRs (which are accessed at far greater throughput than the LDS)?

Combining it with the cache, as Fermi did, makes some sense; with the GPRs, I don't think so.
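
Some rough per-clock arithmetic to make that throughput gap concrete; the figures assume an Evergreen-style SIMD (16 VLIW5 stream cores, 3-source MADs, a 32-bank x 32-bit LDS) and are illustrative, not measured:

[code]
#include <cstdio>

int main()
{
    // Illustrative per-SIMD, per-clock figures (assumed, not measured):
    const int lanes        = 16;  // VLIW5 stream cores per SIMD
    const int slots        = 5;   // ALUs per stream core
    const int src_operands = 3;   // e.g. a MAD reads three sources
    const int word         = 4;   // bytes per 32-bit operand

    const int gpr_demand = lanes * slots * src_operands * word;  // 960 B/clk
    const int lds_supply = 32 * word;  // 32 banks x 32 bits = 128 B/clk

    printf("GPR demand: %d B/clk vs LDS supply: %d B/clk (%.1fx gap)\n",
           gpr_demand, lds_supply, (double)gpr_demand / lds_supply);
    return 0;
}
[/code]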
 
The key to GPRs in ATI is that their data is private to the owning ALU - though there's a bus that connects the GPRs to the TUs, LDS and import/export blocks. That owning-ALU constraint makes the high throughput of the GPRs possible. Full-speed GPR data from arbitrary locations in the register file to arbitrary ALUs would be a nightmare.

NVidia does this of course, because it has horizontal and vertical GPR addressing. This isn't such a nightmare because the ALUs are less greedy.
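
To illustrate the constraint from the programmer's side (CUDA syntax for familiarity, but the same holds for ATI's LDS): a lane simply has no way to address another lane's registers, so handing a per-lane value to a neighbour has to bounce through the shared/local store. A kernel-only sketch:

[code]
// Kernel-only sketch: 'v' lives in a register private to this lane, so
// the only way to hand it to a neighbouring lane is to stage it through
// a shared (LDS-like) buffer. 'neighbour_swap' is a made-up name.
__global__ void neighbour_swap(const float* in, float* out)
{
    float v = in[threadIdx.x];       // per-lane register, not addressable
                                     // by any other lane
    __shared__ float stage[64];
    stage[threadIdx.x] = v;          // register -> LDS
    __syncthreads();

    out[threadIdx.x] = stage[threadIdx.x ^ 1];  // read the neighbour's value
}
[/code]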

Jawed
 
The key to GPRs in ATI is that their data is private to the owning ALU - though there's a bus that connects the GPRs to the TUs, LDS and import/export blocks. That owning-ALU constraint makes the high throughput of the GPRs possible. Full-speed GPR data from arbitrary locations in the register file to arbitrary ALUs would be a nightmare.
Can you expand a bit more on "The key to GPRs in ATI is that their data is private to the owning ALU"?
 
Can you expand a bit more on "The key to GPRs in ATI is that their data is private to the owning ALU"?
http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf

You can see how the register files are disjoint.

http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Section 2.6, Figure 2.1 is a logical view of the architecture, showing that each of the 64 processors in a SIMD has 256 128-bit registers.
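
For scale, that works out to a quarter megabyte of register file per SIMD; a quick back-of-the-envelope check (my arithmetic, not from the doc):

[code]
#include <cstdio>

int main()
{
    // 64 processors x 256 registers x 128 bits, per the figures above:
    const long long bytes = 64LL * 256 * (128 / 8);
    printf("register file per SIMD: %lld bytes = %lld KiB\n",
           bytes, bytes / 1024);  // 262144 bytes = 256 KiB
    return 0;
}
[/code]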

The "global" shared registers are described in 2.6.2.1, which is at pains to point out that each such GPR is only available for "threads" (ALUs) on that lane. This restriction wouldn't apply if register data could be arbitrarily shared across ALUs. Also note that figure 2.2 shows how the register file is really a big pool shared by all wavefronts and split amongst global shared registers, clause temporary registers and bog-standard registers.

Additionally, of course, LDS is an entirely separate structure, distinct from the register files. It is also very low bandwidth, compared with the GPRs. Evergreen's LDS has better bandwidth as a direct result of having twice as many banks, and presumably at increased bus cost getting that data to/from the GPRs.

Technically it appears possible/likely that each VLIW ALU (set of x,y,z,w,t) has four 256-high x 128-bit register files. The old register file patents refer to 256-high blocks of memory, if I remember right. The whole thing is knitted together in a staggered timing cycle of 8 clocks (in two sets of 4 clocks) to ensure that all GPR clients get their share, centred upon the execution pipeline's 8-cycle interval, slide 9:

http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf

Page 10, too:

http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf

That should be all you need.

Jawed
 
One question though: how could they cope with simultaneous accesses to the L1, LDS and register file, which should happen quite often?
No. Fermi has a load-store ISA, and it doesn't need to do simultaneous access to any of the three memory pools.
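
To make the load-store point concrete, here's a hypothetical CUDA fragment with (rough, not actual Fermi SASS) comments showing how the shared-memory operand gets its own load instruction before the ALU op, so no instruction ever touches GPRs, LDS and L1 at once:

[code]
// Hypothetical fragment; the commented instruction sequence is a rough
// sketch of a load-store decomposition, not actual Fermi SASS.
__global__ void axpy_shared(float a, const float* x, float* y)
{
    __shared__ float xs[256];
    xs[threadIdx.x] = x[threadIdx.x];  // global -> register, then register -> shared
    __syncthreads();

    // Roughly:
    //   LDS  r0, [xs + tid*4]   ; shared -> register
    //   LD   r1, [y + tid*4]    ; global (via L1) -> register
    //   FFMA r2, a, r0, r1      ; the ALU op reads registers only
    //   ST   [y + tid*4], r2
    y[threadIdx.x] = a * xs[threadIdx.x] + y[threadIdx.x];
}
[/code]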
 
Sorry, I must have missed something; didn't LRB1 have separate reg files and caches?
LRB's register file is too small (32 16-float/512-bit vector registers per hw thread) to hold the context for the entire work group. The context is really stored in the per-core cache.
 
It will be interesting if they go for a third time with a 256-bit GDDR5 bus. ;)
It doesn't seem like GDDR will evolve into something even faster without a much wider bus in the future.
Something like ATI's Xenos eDRAM could be quite refreshing (although quite problematic for lower-end parts :p). And with Xenos-like additional logic, the main GPU die could be smaller or pack more SIMDs.
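
For what it's worth, the bandwidth arithmetic behind that worry (the data rate is assumed, purely illustrative):

[code]
#include <cstdio>

int main()
{
    // GB/s = (bus width in bits / 8) x per-pin data rate in Gbps
    const double rate = 4.8;  // assumed GDDR5 data rate, illustrative
    printf("256-bit @ %.1f Gbps: %.1f GB/s\n", rate, 256 / 8.0 * rate);
    printf("512-bit @ %.1f Gbps: %.1f GB/s\n", rate, 512 / 8.0 * rate);
    return 0;
}
[/code]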
 