AMD: R9xx Speculation

For one thing, they have high current power gating on SOI (I don't think anyone does it in bulk except for Intel).
 
I know, but at one point or another GPU functionality, at least for low-end GPUs, will have to move to SOI. Fusion is not that far away.

Is it out of the box to think that graphics functionality will come from the NI architecture?
Llano's GPU is an Evergreen derivative.

Fusion chips will be on SOI.
 
For one thing, they have high current power gating on SOI (I don't think anyone does it in bulk except for Intel).

With increasing leakage power, I think TSMC will have to find a way to do it in bulk too. Of course, things will get a *big* push at TSMC if GF can manage it. :smile:
 
My bets for Evergreen+1:

a) More local memory, 32->64 KB at minimum.

b) 28-32 SIMDs

c) Semi-coherent R/W cache hierarchy

d) Cached atomics.

e) Actually a wish: Unification of at least 2 out of the {reg file, local memory, L1 cache} pools. All 3 unified would be a Christmas gift. :smile:

f) Unified address space.
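
On (f), here's a minimal CUDA-flavoured sketch of what a unified address space buys you at the programming level: one helper can walk a pointer into either global or local/shared memory, with the hardware resolving the space instead of the instruction encoding. The function names are mine, purely illustrative:

[code]
#include <cstdio>

// With a unified (generic) address space, one helper works on a pointer
// into any memory space; without it you'd need a separate code path (or
// instruction encoding) per space. 'sum4' and 'kernel' are made-up names.
__device__ float sum4(const float* p)      // generic pointer: global OR shared
{
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void kernel(const float* in, float* out)
{
    __shared__ float tile[4];
    if (threadIdx.x < 4) tile[threadIdx.x] = in[threadIdx.x] * 2.0f;
    __syncthreads();

    if (threadIdx.x == 0) {
        out[0] = sum4(in);    // pointer into global memory
        out[1] = sum4(tile);  // pointer into shared (local) memory
    }
}

int main()
{
    float h_in[4] = {1, 2, 3, 4}, h_out[2];
    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof h_in);
    cudaMalloc(&d_out, sizeof h_out);
    cudaMemcpy(d_in, h_in, sizeof h_in, cudaMemcpyHostToDevice);
    kernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof h_out, cudaMemcpyDeviceToHost);
    printf("global: %g  shared: %g\n", h_out[0], h_out[1]);  // 10 and 20
    return 0;
}
[/code]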
 
R/W caching will be there for sure, something like 32K per SIMD and 48~64K of LDS alone. The SRAM cell density on a 28nm bulk process should accommodate such expansion in a moderate die size.
 
e) Actually a wish: Unification of at least 2 out of the {reg file, local memory, L1 cache} pools. All 3 unified would be a Christmas gift. :smile:
While I understand why and how to unify the reg file and local memory OR local memory and the L1 cache, I can't understand why the L1 cache and reg file should be unified. What's the advantage? How would you do it?
 
While I understand why and how to unify the reg file and local memory OR local memory and the L1 cache, I can't understand why the L1 cache and reg file should be unified. What's the advantage? How would you do it?

The advantage is flexibility and higher overall utilization. It will make the performance cliffs much more gradual. I'd prefer to have the L1 and local mem unified, though I guess the reg file and local mem is an easier target.

How to do it? Well, the only answer to that I have right now is a variant of LRB1's memory system.
 
The advantage is flexibility and higher overall utilization. It will make the performance cliffs much more gradual. I'd prefer to have the L1 and local mem unified, though I guess the reg file and local mem is an easier target.

How to do it? Well, the only answer to that I have right now is a variant of LRB1's memory system.
One question though: how could they cope with simultaneous accesses to the L1, LDS and register file, which should happen quite often?

I think the LDS needs to be separate, and so does the reg file, else you'll end up with a bandwidth issue.
 
The LDS crossbars are nasty pieces of wiring ... do you really want to tie that kind of hardware to the GPRs (which are accessed at far greater throughput than the LDS)?

Combining it with the cache, as Fermi did, makes some sense; with the GPRs, I don't think so.
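
Some rough per-clock arithmetic to make that throughput gap concrete; the figures assume an Evergreen-style SIMD (16 VLIW5 stream cores, 3-source MADs, a 32-bank x 32-bit LDS) and are illustrative, not measured:

[code]
#include <cstdio>

int main()
{
    // Illustrative per-SIMD, per-clock figures (assumed, not measured):
    const int lanes        = 16;  // VLIW5 stream cores per SIMD
    const int slots        = 5;   // ALUs per stream core
    const int src_operands = 3;   // e.g. a MAD reads three sources
    const int word         = 4;   // bytes per 32-bit operand

    const int gpr_demand = lanes * slots * src_operands * word;  // 960 B/clk
    const int lds_supply = 32 * word;  // 32 banks x 32 bits = 128 B/clk

    printf("GPR demand: %d B/clk vs LDS supply: %d B/clk (%.1fx gap)\n",
           gpr_demand, lds_supply, (double)gpr_demand / lds_supply);
    return 0;
}
[/code]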
 
The key to GPRs in ATI is that their data is private to the owning ALU - though there's a bus that connects the GPRs to the TUs, LDS and import/export blocks. That owning-ALU constraint makes the high throughput of the GPRs possible. Full-speed GPR data from arbitrary locations in the register file to arbitrary ALUs would be a nightmare.

NVidia does this of course, because it has horizontal and vertical GPR addressing. This isn't such a nightmare because the ALUs are less greedy.
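
To illustrate the constraint from the programmer's side (CUDA syntax for familiarity, but the same holds for ATI's LDS): a lane simply has no way to address another lane's registers, so handing a per-lane value to a neighbour has to bounce through the shared/local store. A kernel-only sketch:

[code]
// Kernel-only sketch: 'v' lives in a register private to this lane, so
// the only way to hand it to a neighbouring lane is to stage it through
// a shared (LDS-like) buffer. 'neighbour_swap' is a made-up name.
__global__ void neighbour_swap(const float* in, float* out)
{
    float v = in[threadIdx.x];       // per-lane register, not addressable
                                     // by any other lane
    __shared__ float stage[64];
    stage[threadIdx.x] = v;          // register -> LDS
    __syncthreads();

    out[threadIdx.x] = stage[threadIdx.x ^ 1];  // read the neighbour's value
}
[/code]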

Jawed
 
The key to GPRs in ATI is that their data is private to the owning ALU - though there's a bus that connects the GPRs to the TUs, LDS and import/export blocks. That owning-ALU constraint makes the high throughput of the GPRs possible. Full-speed GPR data from arbitrary locations in the register file to arbitrary ALUs would be a nightmare.
Can you expand a bit more on "The key to GPRs in ATI is that their data is private to the owning ALU"?
 
Can you expand a bit more on "The key to GPRs in ATI is that their data is private to the owning ALU"?
http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf

You can see how the register files are disjoint.

http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Section 2.6, Figure 2.1 is a logical view of the architecture, showing that each of the 64 processors in a SIMD has 256 128-bit registers.
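
For scale, that works out to a quarter megabyte of register file per SIMD; a quick back-of-the-envelope check (my arithmetic, not from the doc):

[code]
#include <cstdio>

int main()
{
    // 64 processors x 256 registers x 128 bits, per the figures above:
    const long long bytes = 64LL * 256 * (128 / 8);
    printf("register file per SIMD: %lld bytes = %lld KiB\n",
           bytes, bytes / 1024);  // 262144 bytes = 256 KiB
    return 0;
}
[/code]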

The "global" shared registers are described in 2.6.2.1, which is at pains to point out that each such GPR is only available for "threads" (ALUs) on that lane. This restriction wouldn't apply if register data could be arbitrarily shared across ALUs. Also note that figure 2.2 shows how the register file is really a big pool shared by all wavefronts and split amongst global shared registers, clause temporary registers and bog-standard registers.

Additionally, of course, LDS is an entirely separate structure, distinct from the register files. It is also very low bandwidth, compared with the GPRs. Evergreen's LDS has better bandwidth as a direct result of having twice as many banks, and presumably at increased bus cost getting that data to/from the GPRs.

Technically it appears possible/likely that each VLIW ALU (set of x,y,z,w,t) has four 256-high x 128-bit register files. The old register file patents refer to 256-high blocks of memory, if I remember right. The whole thing is knitted together in a staggered timing cycle of 8 clocks (in two sets of 4 clocks) to ensure that all GPR clients get their share, centred upon the execution pipeline's 8-cycle interval, slide 9:

http://gpgpu.org/wp/wp-content/uploads/2009/09/E1-OpenCL-Architecture.pdf

Page 10, too:

http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf

That should be all you need.

Jawed
 
One question though: how could they cope with simultaneous accesses to the L1, LDS and register file, which should happen quite often?
No. Fermi has a load-store ISA, and it doesn't need to do simultaneous access to any of the three memory pools.
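
To make the load-store point concrete, here's a hypothetical CUDA fragment with (rough, not actual Fermi SASS) comments showing how the shared-memory operand gets its own load instruction before the ALU op, so no instruction ever touches GPRs, LDS and L1 at once:

[code]
// Hypothetical fragment; the commented instruction sequence is a rough
// sketch of a load-store decomposition, not actual Fermi SASS.
__global__ void axpy_shared(float a, const float* x, float* y)
{
    __shared__ float xs[256];
    xs[threadIdx.x] = x[threadIdx.x];  // global -> register, then register -> shared
    __syncthreads();

    // Roughly:
    //   LDS  r0, [xs + tid*4]   ; shared -> register
    //   LD   r1, [y + tid*4]    ; global (via L1) -> register
    //   FFMA r2, a, r0, r1      ; the ALU op reads registers only
    //   ST   [y + tid*4], r2
    y[threadIdx.x] = a * xs[threadIdx.x] + y[threadIdx.x];
}
[/code]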
 
Sorry, I must have missed something; didn't LRB1 have separate reg files and caches?
LRB's register file is too small (32 16-float/512-bit vector registers per hw thread) to hold the context for the entire work group. The context is really stored in the per-core cache.
 
It will be interesting if they go for a third time with a 256-bit GDDR5 bus. ;)
It doesn't seem like GDDR will evolve into something even faster without a much wider bus in the future.
Something like ATI's Xenos eDRAM could be quite refreshing (although quite problematic for lower-end parts :p). And with Xenos-like additional logic, the main GPU die could be smaller or pack more SIMDs.
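
For what it's worth, the bandwidth arithmetic behind that worry (the data rate is assumed, purely illustrative):

[code]
#include <cstdio>

int main()
{
    // GB/s = (bus width in bits / 8) x per-pin data rate in Gbps
    const double rate = 4.8;  // assumed GDDR5 data rate, illustrative
    printf("256-bit @ %.1f Gbps: %.1f GB/s\n", rate, 256 / 8.0 * rate);
    printf("512-bit @ %.1f Gbps: %.1f GB/s\n", rate, 512 / 8.0 * rate);
    return 0;
}
[/code]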
 