AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,439
    Likes Received:
    280
    If you want to call the first shipping product the full node then there's no arguing with that beyond arguing the definition of a node.

32nm was most definitely not an optical shrink. The only full node and optical shrink combo TSMC has shipped in recent years was 65/55 nm. For 55nm, customers used 65nm synthesis libraries and the PD tools shrank the design.
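As a rough illustration of what an optical shrink buys, the nominal node names give the scaling directly (treating node names as linear dimensions is a simplification, but it's the usual back-of-envelope):

```python
# Illustrative optical-shrink arithmetic for TSMC's 65nm -> 55nm half node.
# Node names are nominal, not exact drawn dimensions.
full_node = 65.0   # nm
half_node = 55.0   # nm

linear_scale = half_node / full_node     # ~0.846 per dimension
area_scale = linear_scale ** 2           # ~0.716 of the original area

print(f"linear scale: {linear_scale:.3f}")
print(f"area scale:   {area_scale:.3f}")  # roughly a 28% area saving
```

Same synthesis libraries, same layout, just uniformly scaled down, which is why 65/55 was such a cheap transition compared with a real new node.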
     
  2. Tchock

    Regular

    Joined:
    Mar 4, 2008
    Messages:
    849
    Likes Received:
    2
    Location:
    PVG
Err... 400mm^2 on an X700 part is just madness. That's the second biggest chip ATI has ever made.

It's a 68/5990 I reckon. And the 28nm shrink/Globalfoundries part is a new family.
     
  3. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
My 4x should be 2x :( Too busy, and posting too hurriedly.
     
  4. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55

    Are you referring to the 6800 flops = 4X GTX 480 flops?

    If so, then it should be 6800 flops = 2X GTX 480 flops, which is exactly the same as the 5870.

    Maybe there will be an increase in efficiency though.

    Just try to find out how many transistors (I really laugh when we call them "trannies") the 6800 will have. This is the most important spec.
     
  5. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Larrabee's approach may or may not be scalable into the future, but it certainly showed that it is possible to unify the 3 pools of memory.
     
  6. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    a 6700 would be faster than GTX480 and 5870. So a 6700 would have around 3TFlop @40nm and a 6800 would have 6TFlop at 28nm but not show up until somewhere in 2011.
    I think "increased efficiency" is the keyword of the coming AMD designs.
     
  7. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55
    So the "6770" should be something like a 5870 but with increased efficiency that would make it faster!?

    The 5770 was also something like the 4870 but it turned out a bit slower really, quite possibly due to the memory bus.

This new analogy may indicate that ATI wants to, and can, tip the scales by quite a bit.

    What worries me are the prices that will come with these products. I have the feeling that ATI is slowly going the Nvidia way! :S
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    They're sort of sized like we see L2s in x86 CPUs - but in ATI they're incredibly fragmented.

    Each ALU (XYZWT) has 4 private 256 entry 128-bit register files. This means there are actually 1280 lickle register files in Cypress - all with independent data paths and cycle mappings (4-way stagger is an extra complication - I think causing an effective 8-cycle latency, which is where PV/PS come in). I presume the addressing is ganged, within a SIMD, though.
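A back-of-envelope tally of that fragmentation, assuming the commonly cited Cypress organisation of 20 SIMDs with 16 VLIW-5 (XYZWT) units each (the decomposition is my assumption; the 1280 total and the per-file sizes are from the post):

```python
# Tallying Cypress's fragmented register files.
# Assumed organisation: 20 SIMDs x 16 VLIW-5 (XYZWT) ALUs per SIMD.
simds = 20
alus_per_simd = 16       # VLIW-5 units per SIMD
files_per_alu = 4        # 4 private register files per ALU
entries_per_file = 256   # 256 entries each
entry_bytes = 16         # 128-bit (vec4 fp32) entries

total_files = simds * alus_per_simd * files_per_alu
total_bytes = total_files * entries_per_file * entry_bytes

print(total_files)                 # 1280 separate little register files
print(total_bytes // 1024, "KiB")  # 5120 KiB = 5 MiB of registers in total
```

That 5 MiB is spread over 1280 independent data paths, which is exactly the fragmentation being contrasted with a monolithic x86 L2.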

    NVidia's design is rather different. It's a single address space (i.e. an operand collector accesses every address) with 16 or 32 banks.

    So, NVidia's design is more like a cache. Or if you prefer it's more of a gather/scatter architecture on every cycle.

    I have minor qualms over Larrabee's register file/L1 marriage, too. I know practically nothing about SSE but Larrabee's vector unit seems to me to be a 4-threaded super-wide SSE on steroids - with a much nicer instruction set. So the 4-way register file is really just a scratchpad - which is really what registers were always about, originally, anyway.

    I used to think along these lines. It's quite a tempting view.

    NVidia's architecture is closer to this. If you're going to do that then you need to be able to stand some pretty high latency. NVidia's design (at least G80-GT200 - still unclear on what Fermi's doing) with operand scoreboarding is closer. But NVidia still sees fit to keep them separate. At least for the time being.

    I can't help thinking that the Larrabee approach is where ATI and NVidia will end up. A key feature in Larrabee is a single operand optionally coming directly from L1 (subject to waterfalling, of course).

Fundamentally it's about bandwidth and hiding the latencies of register gather/scatter. Notice that shared memory has lower throughput.

Larrabee's trade is "narrow" hardware threads (single cycle per instruction, arguably each vector unit is actually natively working on a single work item - though it can also be viewed as 16 work items), software managed fibres and no pretence of ever holding a work item's entire context in registers (though it's technically possible, of course).

    So in Larrabee, L1 and registers both fundamentally act as operand collectors/store-queues, implementing a hierarchy of gather/scatter.

    Jawed
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    Need to stop equating FLOPs with performance. I'm not denying that the next ALU configuration is more efficient, but that doesn't mean that the next chip actually has more FLOPs than HD5870. e.g. HD6770 could be 2.2 TFLOPs theoretical peak and be faster than HD5870 on most games.
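The point can be made with a toy calculation. Delivered performance is roughly peak FLOPs times achieved utilisation; the utilisation figures below are pure illustrations, not measurements of any real chip:

```python
# Peak FLOPs don't determine delivered performance; utilisation does.
# The utilisation percentages here are invented for illustration only.
def delivered_tflops(peak_tflops, utilisation):
    return peak_tflops * utilisation

older = delivered_tflops(2.72, 0.60)   # e.g. a 2.72 TFLOP part at 60% efficiency
newer = delivered_tflops(2.20, 0.80)   # a 2.2 TFLOP part at 80% efficiency

print(older, newer)   # 1.632 vs 1.76 -> the lower-peak part wins
```

So a hypothetical 2.2 TFLOP chip with a more efficient ALU configuration can beat a 2.72 TFLOP chip on real workloads.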

    Jawed
     
  10. w0mbat

    Newcomer

    Joined:
    Nov 18, 2006
    Messages:
    234
    Likes Received:
    5
Next week I'm going to meet someone in Kuala Lumpur who knows quite a bit about ATI@TSMC. So if Charlie is right and there has been a new 40nm ATI chip tapeout there, I can confirm it (or not). My "contacts" here in Singapore are not in the know...
     
  11. racca

    Newcomer

    Joined:
    Apr 3, 2010
    Messages:
    51
    Likes Received:
    0
I don't think that's the case. I agree (mostly) with you about the SIMDs, but if you'd agree that one SIMD (along with TMUs, etc.) per se won't have much impact on die size, then neither do the ROPs/MCs and quite possibly the UTDP -- all of which should account for at least 80% of the die size.

If the SIMD count stays at 20 -- say the 20 SIMDs/TMUs + MC + ROP + UTDP etc. occupy almost the same die area as Cypress -- then where does the extra 20% of size go? It's highly unlikely the front end could be more than double that of Cypress with reduced raw shader power.
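The area bookkeeping behind that objection can be made explicit. All figures below are illustrative assumptions (Cypress at ~334 mm^2 is an approximation, and the 80% fixed fraction is the post's own estimate), not floorplan data:

```python
# Rough die-area bookkeeping: if 80% of the die stays the same size
# while the total grows 20%, the remaining blocks must double.
cypress_area = 334.0          # mm^2, approximate Cypress die size (assumption)
fixed_fraction = 0.80         # SIMDs + TMUs + MCs + ROPs + UTDP etc.
new_area = cypress_area * 1.20  # a die ~20% larger

fixed = cypress_area * fixed_fraction   # ~267 mm^2 unchanged
front_end_old = cypress_area - fixed    # ~67 mm^2
front_end_new = new_area - fixed        # ~134 mm^2

print(front_end_new / front_end_old)    # ~2.0 -> the front end would need to double
```

Note the ratio is (1.20 - 0.80) / (1 - 0.80) = 2 regardless of the absolute die size, which is why the argument doesn't hinge on the exact Cypress figure.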
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
In my view, in order for triangle rate to increase, rasterisation rate for small triangles needs to increase.

    My theory is that hardware threads can only accept a single triangle. If that's true then rasterisation rate for small triangles can only increase if the way hardware threads are constructed is also revised. If a hardware thread can accept up to 16 triangles at the rate of 4 triangles per clock (since a thread takes 4 clocks anyway) then rasterisation and barycentric stuff all needs to be at least 4x faster.

    All told that's potentially quite a lot of interdependent new stuff, e.g. hierarchical-Z needs to scale within screen-space tiles, not merely across tiles.

    Then there's also the question of a revised memory architecture. e.g. bigger L2s. L2s that are generalised like in Fermi. etc.

    Also, I reckon it's about time for 8x Z. 4x Z is so 2006.

    --

    I've not seen this patent document before, it's called "UNIFIED TESSELLATION CIRCUIT AND METHOD THEREFOR"

    http://v3.espacenet.com/publication...=A1&FT=D&date=20100304&DB=EPODOC&locale=en_gb

    Might have some clues in it...

    Jawed
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,140
    Likes Received:
    577
Wait ... it's not possible to do indexed accessing of registers, right? Threadblocks which aren't a multiple of 64 threads waste register space, right?
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    Assuming you're querying ATI's architecture:

    The indexing is per lane ID.

    SRs ("global registers" - registers shared across all hardware threads within a SIMD) are shared across hardware threads, but private to lane ID - e.g. hardware thread 3 lane 12 shares five SRs with lane 12 in 7 other hardware threads (if there are 8 hardware threads in total).

    Lane 28 (same XYZWT ALU as work item 12, but different register file) has 5 SRs that are distinct from lane 12's.

    Sure.

    In fact if you are only running one shader on the GPU (compute shader is the normal case) and you have say 34 GPRs allocated per work item, resulting in 7 hardware threads per SIMD, the hardware can only allocate 7 threads * 64 lanes * 34 registers = 15232 GPRs. The total capacity is 16384 GPRs, so wasting 1152 GPRs (18KB).

    Some of that waste will be clawed back in the use of clause temporary registers. e.g. if 8 clause temporaries are defined, then they will consume 2 hardware threads * 64 lanes * 8 registers = 1024 GPRs.

    So in this example the wastage is reduced to 128 GPRs, 2KB.
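The allocation arithmetic above, as a quick sketch (all figures are taken from the worked example in the post):

```python
# GPR-wastage arithmetic for an ATI SIMD with a 16K-GPR register file.
total_gprs = 16384
lanes = 64           # lanes per hardware thread
gprs_per_item = 34   # GPRs allocated per work item

threads = total_gprs // (lanes * gprs_per_item)   # 7 hardware threads fit
allocated = threads * lanes * gprs_per_item       # 15232 GPRs
wasted = total_gprs - allocated                   # 1152 GPRs (18 KiB at 16 B each)

# Clause temporaries claw some of this back: the post's example defines 8,
# consuming 2 hardware threads' worth of register space.
clause_temps = 8
temp_gprs = 2 * lanes * clause_temps              # 1024 GPRs reclaimed

print(threads, allocated, wasted)
print(wasted - temp_gprs, "GPRs still wasted")    # 128 GPRs = 2 KiB
```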

    Jawed
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,140
    Likes Received:
    577
That's a rather strange assumption, since I quoted a statement about NVIDIA. I meant NVIDIA ...

For NVIDIA ... it's not possible to do indexed accessing of registers, right? Threadblocks which aren't a multiple of 64 threads waste register space, right?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
Nvidia has moved further away from unifying register and memory pools. The description of Fermi's ISA indicates a move to a more purely load/store architecture, whereas its immediate predecessor had memory operands.

    Why expose every operand access to possible TLB fills and memory faults, or why have the additional complexity in hardware to do this, and then avoid using it most of the time?

If it weren't for the x86 core, x86 hardware thread context, comparatively minuscule reg file and its reg/memory operand legacy, I wonder if its designers would have skipped over that "feature".

    A memory access is not as cheap as a register file access, for various reasons. It is a much more complex case to get right, and getting it wrong has much bigger consequences for the system in general. The load/store and execution pipelines of even the P55 core are at least somewhat more complex because of this.

    I wouldn't mind accessing that pool of SRAM, perhaps in some kind of linear line access absent the TLB and fault handling part of the pipeline, but those are usually an integral part of the pipeline and not totally removable.
    I would be curious if Nvidia's configurable L1 does somehow convert cache accesses to the shared memory region into something addressed to the physical lines of the cache, though it could just be some kind of creative page mapping, where the cache logic does not bother to keep it coherent.

    I would potentially disagree, if I knew more of the implementation. It's possible that Larrabee would already have store queues as explicit parts of its memory pipeline.
The L1 and registers are just what they are, and whatever hierarchy they implement is what any other fully fleshed-out memory pipeline can provide with proper software usage.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    But you mentioned 64, not 32 (which is the "warp" size in NVidia, although 16 is, at least for some GPUs, the hardware thread size), so I thought you were querying ATI's architecture in comparison with NVidia.

    Patent documents indicate that GPRs can be allocated horizontally (a warp's single or multiple registers can occupy all banks) or vertically (each register is allocated along a bank) or mixed.

    I don't know if there's indexing. I have a vague memory of it being a part of the architecture, but that's all.

Thinking about it more, I would tend to suspect it's not: register allocation in NVidia has always been very tight. Secondly, with Fermi's register spill through the memory hierarchy, they might simply have chosen to use addressing for indexing. Dunno really.

    Any time the domain of execution defines a count of work-items that isn't an exact multiple of the hardware thread size, you'll get register file wastage. Just the same as killed pixels in pixel shading will waste registers.
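The rounding-up that causes that wastage is easy to show. Nothing here is vendor-specific; it's just ceiling arithmetic on the hardware thread width (64 used as the example width, per the discussion above):

```python
import math

# Any domain size that isn't a multiple of the hardware thread width
# leaves some lanes -- and their allocated registers -- idle in the
# final hardware thread.
def wasted_lanes(work_items, thread_width=64):
    threads = math.ceil(work_items / thread_width)
    return threads * thread_width - work_items

print(wasted_lanes(100))   # 28 idle lanes in the final 64-wide thread
print(wasted_lanes(128))   # 0 -- exact multiple, no waste
print(wasted_lanes(65))    # 63 -- worst case, one lane used in the last thread
```

Multiply the idle-lane count by the per-work-item GPR allocation to get the wasted register capacity.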

    Jawed
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
Has to be. All 16 lanes in a thread have to access the same register in their respective banks, with additional banking over the 4 xyzw slots. Independent register fetch from each lane would be quite a bit of overhead. In that sense, it is not very different from NV's scatter/gather every cycle.
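A toy model of that ganged addressing (a sketch, not any real ISA: one shared register index drives 16 private per-lane banks, so there is a single address path but 16 data paths):

```python
# Toy model of a banked register file with ganged addressing:
# every lane owns a private bank, and a single register index
# is broadcast to all banks on each access.
LANES = 16
REGS_PER_LANE = 8

# banks[lane][reg] -- filled with distinguishable dummy values
banks = [[lane * 100 + reg for reg in range(REGS_PER_LANE)]
         for lane in range(LANES)]

def read_register(reg_index):
    # ganged addressing: one index, 16 parallel reads
    return [banks[lane][reg_index] for lane in range(LANES)]

print(read_register(3))   # lane i returns its own private value i*100 + 3
```

Per-lane independent indices would turn every register read into a full crossbar gather, which is the overhead being avoided.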


They have to. The present segregation between private, local and global caches is just too wasteful.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    :oops: I don't understand the distinction you're making.

    This is where you get into a nebulous argument over whether the memory in the operand collector, holding operands for multiple cycles until a warp's worth of operands are all populated, is really the register file :???: In this model the "registers", the constant cache, the shared memory and global memory are all just addressable memories.

    The thing that makes these GPUs different from CPUs is that gather/scatter is essentially a first-class instruction. Or, at least in the future, it is. There's no choice when the whole thing is a SIMD. Historically GPU ALUs have avoided the gather/scatter problem because pixel shading doesn't expose the ALUs to it - the pipeline has been designed to farm out texture-mapping gather and pixel-blending scatter operations.

    Many of these fancy new algorithms (or re-discovered supercomputing principles) push repeatedly on the gather/scatter button at the ALU instruction level.
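For anyone unfamiliar with the terms, gather and scatter as first-class operations look like this (a plain-Python sketch of the semantics, not any GPU's ISA: each lane reads or writes its own independently computed address):

```python
# Gather/scatter semantics: per-lane indexed memory access.
def gather(memory, indices):
    # out[lane] = memory[indices[lane]]
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    # memory[indices[lane]] = values[lane]
    for i, v in zip(indices, values):
        memory[i] = v

mem = [10, 20, 30, 40]
print(gather(mem, [3, 0, 2]))   # [40, 10, 30]
scatter(mem, [1, 3], [99, 77])
print(mem)                      # [10, 99, 30, 77]
```

On a CPU these are loops of scalar loads/stores; the point above is that a SIMD machine has no such fallback, so the memory pipeline has to service all the lanes' addresses as one operation.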

    G80->G92->GT200 saw progressively increasing register capacity and/or increasing work-items per SIMD. Fermi actually reverses things a little, I think. In other words it seems to me NVidia hasn't really settled on anything.

    Obviously this discussion would be easier with Larrabee to play with. But I trust the Intel engineers to the extent that the originally-presented chip wasn't fundamentally broken in this respect. Though I still strongly suspect texturing is a black hole they've been struggling with.

    One could argue that texturing is still so massively important that it steers GPUs towards large RFs and the ALU-gather-scatter centric argument is merely a distraction, and Intel's stumbling block :razz:

    That's all very well. But GPU performance falls off a cliff if the context doesn't fit into the RF (don't know how successfully GF100 tackles this). So, what we're looking for is an architecture that degrades gracefully in the face of an increasing context.

    The question is: can register files either keep growing or at the least retain their current size, in the face of ever more-complex workloads?

    What happens when GPUs have to support true multiple, heavyweight, contexts all providing real time responsiveness? The stuff we take for granted on CPUs?

    NVidia has a gather unit (the operand collector) that essentially hides a load of mess there (and a store queue). I'm presuming the cache is just coherent+bankset aligned accesses to banked shared memory.

    Sorry, I wasn't trying to say that L1/registers replace a conventional memory interface for the ALUs - I'm simply saying that the way Larrabee is designed, gather/scatter is built upon the workings of L1/registers. This comes back to the SIMD architecture and first-class gather/scatter. Gotta wait to see it in action...

    Jawed
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    I agree it's highly likely. Just haven't seen absolute evidence though.

    Actually ATI is quite different. There's no latency-hiding for gather misses, all the resulting latency is always experienced by the ALUs. NVidia's operand collection hides that latency, generally.

    This is really only relevant to constant buffer fetches and LDS accesses. It's possible to make an indexed register operand cause a stall (indexed read after indexed write, I think) or an SR cause a stall. But that's not a gather/scatter problem per se.

    ATI RF is designed, generally, never to induce variable latency. It's the entire basis of the clause-by-clause execution model.

    Jawed
     