The Official NVIDIA G80 Architecture Thread

Considering that G71 has 220 clock cycles of latency hiding, at 650MHz, you'd expect G80 to require in the region of 400+ clock cycles to hide latency at 1350MHz.

Even if G71 has a huge margin for "error", say a factor of 2, that still means G80 would require ~200 clock cycles to hide latency.

If you decide that a typical worst case is vec2 instructions being used to hide latency, then each instruction takes 2 clocks for 16 fragments at a time, or 8 fragments per clock. So that's a total of 1600 fragments required to execute a single instruction over a period of 200 clocks.

If you assign 4 vec4 fp32 registers (64 bytes) to each of those fragments, then you get a minimum register file size of 102400 bytes per cluster.

That's 800KB of register file for the entire GPU. Still a small value compared to R580 or Xenos (both are 1152KB as far as I can tell, though excluding vertex shader register file in the case of R580).

And that's assuming the worst-case bilinear latency cannot be hidden (when only a scalar instruction is available to hide latency) - which is a stretch in my view. In other words I suspect the register file in G80 is actually twice this size.
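Back-of-envelope, that arithmetic works out as follows (a quick sketch assuming 8 clusters, which is what gets you from 102400 bytes per cluster to 800KB):

Code:
#include <stdio.h>

int main(void)
{
    int frags_per_clock = 8;    /* vec2 worst case: 16 fragments per 2 clocks */
    int clocks_to_hide  = 200;  /* the halved, extrapolated figure            */
    int bytes_per_frag  = 64;   /* 4 vec4 fp32 registers                      */
    int clusters        = 8;    /* assumed cluster count                      */

    int frags_in_flight  = frags_per_clock * clocks_to_hide;    /* 1600       */
    int bytes_per_cluster = frags_in_flight * bytes_per_frag;   /* 102400     */
    printf("%d fragments, %d bytes/cluster, %d KB total\n",
           frags_in_flight, bytes_per_cluster,
           bytes_per_cluster * clusters / 1024);                /* 800 KB     */
    return 0;
}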

Jawed
 
Here are three documents:

Simulating Multiported Memories Using Lower Port Count Memories
I haven't studied this in any detail.

I just eyeballed it -- it seems far more complicated than it needs to be, and it seems to suffer higher latency under some circumstances (when operands are in the same bank). From the scheduler's point of view, I'd prefer to know the cost of my instruction, rather than having to go and look at operand/bank management. That said, I don't know how expensive multi-porting is, and how much savings they're achieving here. I've been burned by reading patents too closely in the past, so I may take a pass :)

It might be worthwhile to run a few tests (walking-operand MULs, maybe). For those of you with boards :)
 
If you assume that there's a cache between you and framebuffer memory, then that cache becomes the Parallel Data Cache. By far the best candidate for this is the L2 memory that's associated with each set of ROPs, which are also associated with a memory controller.
That doesn't make any sense. The CUDA docs state that the PDC is fast, 16KB, and per-cluster. The L2 is none of those: it's slow (relatively, since it's off-cluster and in the main clock domain), it should be quite large, and it's shared across all clusters.

Considering that G71 has 220 clock cycles of latency hiding, at 650MHz, you'd expect G80 to require in the region of 400+ clock cycles to hide latency at 1350MHz.
Well, it was 650MHz vector operands and a 1.6GHz memory clock. It's now 1.35GHz scalar operands with a 1.8GHz memory clock. I'd say it's just about impossible to extrapolate latency numbers between the two, since the entire memory hierarchy is very different and you're processing at a very different granularity.

That said, I don't know how expensive multi-porting is, and how much savings they're achieving here.
Take a look at Toni Juan's paper called Data Caches for Superscalar Processors (PDF at the top-right). If you look at the summary diagram on the last page, you see true multi-porting of a cache scales in area linearly with the number of ports, so a 2-ported cache is about twice the size of a single-ported one.

Then again remember there are 700 million transistors in this beast. Maybe they have some insanely large (area-wise) multi-ported, high-speed register file.
 
Rufus -- I'll go look, thanks for the pointer.

I looked a little more at that first patent (I haven't actually sat down to read it), and I can't believe someone got a patent for reducing contention by using separate memory segments, but that's a little beside the point.

I think that this setup is reasonable if the access patterns to your register file are essentially random. Consider the case of a single st.unit in a cluster (multiple st.units are just a matter of data-width). Let's say that each thread is assigned 16 scalar regs (4 vec4 as Jawed was stating earlier -- 64 bytes). Let's say you allow 256 threads. That's 16k. Retrieving any number of operands for a thread is really a matter of selecting the entire 64-byte region and pulling out the required operands. In short, you do not require multiporting of the entire 16k; you need to multiport access to only 64 bytes. If you need four operands you could probably get away with a 16x4 crossbar (each selecting a full fp32 at a time) and call it a day. [I shall merrily gloss over sticky details like what happens when a thread needs more than 16 scalar regs ;-) ]
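A sketch of the addressing I have in mind (numbers as above; the layout itself is hypothetical):

Code:
/* Hypothetical layout: 256 threads x 16 scalar fp32 regs = 16K.
   Any operand fetch stays inside the owning thread's 64-byte region,
   so only that region -- not the whole 16K -- needs multi-port access. */
enum { REGS_PER_THREAD = 16, BYTES_PER_REG = 4 };

unsigned operand_address(unsigned thread_id, unsigned reg_index)
{
    unsigned region = thread_id * REGS_PER_THREAD * BYTES_PER_REG; /* 64-byte region */
    return region + reg_index * BYTES_PER_REG;   /* 16x4 crossbar picks the fp32 */
}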

The kind of setup they describe, where they go to all kinds of lengths to sprinkle state across multiple segments, would seem to be increasingly advantageous when a single thread accesses a very large chunk of register-memory. Like, if you had some big CUDA thread with access to 16k as an utterly random example.... :)
 
That doesn't make any sense. The CUDA docs state that the PDC is fast, 16KB, and per-cluster. The L2 is none of those: it's slow (relatively, since it's off-cluster and in the main clock domain), it should be quite large, and it's shared across all clusters.
I haven't read the CUDA docs, don't you have to be registered to do so?

So, are you saying that sharing of the PDC is restricted to the threads executing within a cluster? Are threads effectively grouped, e.g. by (x,y) coordinates, in order to maintain PDC coherency? Sounds like a precarious situation, either not knowing which combinations of threads can share data with each other, or finding that the set of threads you can share data with is arbitrarily restricted.

If the PDCs are sharable by all extant threads, then they must be proxied somewhere else outside of the cluster.

Well, it was 650MHz vector operands and a 1.6GHz memory clock. It's now 1.35GHz scalar operands with a 1.8GHz memory clock. I'd say it's just about impossible to extrapolate latency numbers between the two, since the entire memory hierarchy is very different and you're processing at a very different granularity.
It may well be impossible, but the fact is that a 1350MHz ALU is going to progress through instructions ~twice as fast as G71. Or G72, or G73 etc. The relative clock rates (core versus memory) as well as latencies associated with DDR or GDDR do vary significantly across the range of G7x GPUs, all based on the same pipeline. So the pipeline will have been designed for some kind of worst-case ratio.

NVidia's motivation for making G71's pipeline support 220 quads in flight isn't trivial - every extra quad in flight is more register file and more FIFO buffering - so there's a good reason why it's so long. That reason doesn't just disappear in G80 - and while the scalar instruction issue capability of G80 means an "average instruction" (say vec4 MAD) will progress at ~1/2 the rate it does in G71 (per fragment), the worst case here is a scalar instruction, which will progress at ~2x G71's instruction rate.

Not to mention that G80 has a baseline 2:1 ALU:TEX ratio - which adds in another factor of 2 compared to the effectively stall-less rate of texturing in G71.

Jawed
 
I haven't read the CUDA docs, don't you have to be registered to do so?
Sorry, by "docs" I meant the slides I linked in my previous post
So, are you saying that sharing of the PDC is restricted to the threads executing within a cluster? Are threads effectively grouped, e.g. by (x,y) coordinates, in order to maintain PDC coherency? Sounds like a precarious situation, either not knowing which combinations of threads can share data with each other, or finding that the set of threads you can share data with is arbitrarily restricted.

If the PDCs are sharable by all extant threads, then they must be proxied somewhere else outside of the cluster.
On slide 10 it says "Parallel Data Cache per cluster 16KB" and the diagram (which, granted, such diagrams are notoriously bad) indicates 16 SPs talking to the PDC. That seems to indicate that yes, the PDC is restricted to the threads executing within a cluster.

The code example (slide 4) comments say "5000 thread blocks", "256 threads per block", and "64 bytes of shared memory." So is the shared memory (I assume this means the PDC) accessible by all 256 threads in a block, or by all 5000 * 256 threads total? If all 256 * 5000 threads can execute on any cluster and access the PDC equally, then why is there the definition of a block?
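For reference, a launch with those slide-4 numbers would presumably look something like this in CUDA (the kernel name and body are made up; only the configuration numbers come from the slide):

Code:
/* Hypothetical kernel -- only the <<<5000, 256, 64>>> configuration
   comes from the slide. */
__global__ void kernel(float *out)
{
    extern __shared__ float shared[];   /* 64 bytes, shared across one block */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    /* ... per-thread work using shared[] ... */
}

/* launched as: 5000 blocks, 256 threads/block, 64 bytes shared per block */
kernel<<<5000, 256, 64>>>(out);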

It may well be impossible, but the fact is that a 1350MHz ALU is going to progress through instructions ~twice as fast as G71. Or G72, or G73 etc. The relative clock rates (core versus memory) as well as latencies associated with DDR or GDDR do vary significantly across the range of G7x GPUs, all based on the same pipeline. So the pipeline will have been designed for some kind of worst-case ratio.

NVidia's motivation for making G71's pipeline support 220 quads in flight isn't trivial - every extra quad in flight is more register file and more FIFO buffering - so there's a good reason why it's so long. That reason doesn't just disappear in G80 - and while the scalar instruction issue capability of G80 means an "average instruction" (say vec4 MAD) will progress at ~1/2 the rate it does in G71 (per fragment) the worst-case here is a scalar instruction which will progress at ~2x G71's instruction rate.

Not to mention that G80 has a baseline 2:1 ALU:TEX ratio - which adds in another factor of 2 compared to the effectively stall-less rate of texturing in G71.
I'm not sure where you got that 220 quads in flight number from. Is that the total number of quads in flight across all 24 pipes, or quads in flight for each pipe? If it's per pipe, that would be an insane number in flight (5,280 total). More reasonable would be that there are 9 quads in flight per pipeline, meaning 36 pixels. Remember that every pixel will normally be doing a Vec4 fetch and calculation.

If G7x needed 36 pixels in flight per pipe, then let's make up a number and say that G8x would require 64 pixels in flight due to higher latency. However, since each SP will only be doing scalar calculations, there will be 2-4 times as many calculation cycles for every equivalent G7x calculation cycle, meaning that only 16-32 pixels would need to be in flight per SP.

Using that number and the example 64 bytes of registers per pixel, that means you'd need 16 * 16 * 64 = 16KB to 32 * 16 * 64 = 32KB of register file per cluster.
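Spelled out (assuming 16 SPs per cluster and the 64 bytes per pixel from above):

Code:
/* pixels in flight per SP x 16 SPs x 64 bytes (4 vec4 fp32) per pixel */
int regfile_bytes_per_cluster(int pixels_per_sp)
{
    return pixels_per_sp * 16 * 64;
}
/* regfile_bytes_per_cluster(16) = 16384 (16KB)
   regfile_bytes_per_cluster(32) = 32768 (32KB) */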
 
I'm not sure where you got that 220 quads in flight number from. Is that the total number of quads in flight across all 24 pipes, or quads in flight for each pipe? If it's per pipe, that would be an insane number in flight (5,280 total). More reasonable would be that there are 9 quads in flight per pipeline, meaning 36 pixels. Remember that every pixel will normally be doing a Vec4 fetch and calculation.
On G7x the pixel shader pipes are grouped into processors called 'quads'; each quad has 4 pipes.
The number Jawed often quotes refers to the number of quads in flight per 'quad processor'.
I'm not saying Jawed's number is correct - in fact it's not..
 
Sorry, by "docs" I meant the slides I linked in my previous post

On slide 10 it says "Parallel Data Cache per cluster 16KB" and the diagram (which, granted, such diagrams are notoriously bad) indicates 16 SPs talking to the PDC. That seems to indicate that yes, the PDC is restricted to the threads executing within a cluster.
Ah, OK. That picture literally places the PDC in the same part of the cluster as we've seen L1 cache in other architectural diagrams of G80. So yes, that would seem to indicate locality.

The next question, then, is whether writes to L1 (in PDC mode) are replicated out to L2/local-memory. Oh well, just have to wait and see for that.

The code example (slide 4) comments say "5000 thread blocks", "256 threads per block", and "64 bytes of shared memory." So is the shared memory (I assume this means the PDC) accessible by all 256 threads in a block, or by all 5000 * 256 threads total? If all 256 * 5000 threads can execute on any cluster and access the PDC equally, then why is there the definition of a block?
256*64 is 16KB.

This implies to me that a 16KB PDC constitutes the entire shared memory owned by a "thread block" (i.e. not the same memory as the register file - I still think one needs to think of the PDC as a mini framebuffer, literally a small tile - this may well link up with the screen-space tiling that G80 uses (reputed to use?)).

A thread (within a thread block) is analogous to a vertex (or primitive) which we know is batched by G80 in 16s. So 16 batches, each of 16 threads can exist in one thread block.

So when execution of thread blocks switches from block A to block B, a "cache swap" is done by flushing A's PDC from L1 into L2 (say) and bringing B's PDC in from L2 to L1. All computations while block B is active then proceed at full speed. Flushes from L1 to L2 may be succeeded by writes into memory, of course. And B's PDC might have started in memory and would have had to have been brought in on the swap - I dare say it'd be trivial for G80 to pre-fetch this into L2, it's no different from pre-fetching a texture. I don't know what the duration of PDC is meant to be...

5000 thread blocks implies, to me, that there are 625 thread blocks supported per cluster.

5000 thread blocks is 78MB.
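The arithmetic behind those numbers (8 clusters assumed):

Code:
/* Assumes a 16KB PDC per cluster and 8 clusters */
int pdc_bytes          = 256 * 64;     /* threads/block x bytes/thread = 16384 */
int blocks_per_cluster = 5000 / 8;     /* = 625                                */
double total_mb = 5000.0 * 16384 / (1024 * 1024);   /* = 78.125 MB             */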

So, in summary, I think PDCs are simply render target tiles that can be read immediately after being written, by any thread in the block. Whether the PDCs are kept between thread block swaps is another question. I don't see why not, since G80 appears to have a cache hierarchy that's dual-use: for textures and colour/z/stencil (which are merely two different kinds of views of memory as far as D3D10 is concerned). (The cache hierarchy prolly also includes direct support for constant buffers.)

I'm not sure where you got that 220 quads in flight number from. Is that the total number of quads in flight across all 24 pipes, or quads in flight for each pipe? If it's per pipe, that would be an insane number in flight (5,280 total).
Insane or not, that's the truth of it ;) If the shader program has fewer registers assigned per fragment, then G71 can actually put more quads into a batch (it wants to keep the register file "full"). Similarly, if the shader has more registers assigned, then the number of quads in flight will fall below the nominal 220 - which is when you start to lose performance, as some clock cycles will have no work to do.

The number of fragments in flight in R580 is somewhat higher: 128 batches per cluster, 48 fragments per batch, 4 clusters... R580 seems to be quite happy if you use 128 registers in a shader program - but that's an extreme, and you'll only get a few hundred fragments in flight if you do that... Performance then depends directly upon how intensively you use branches or texture fetches.

Jawed
 
Considering that G71 has 220 clock cycles of latency hiding, at 650MHz, you'd expect G80 to require in the region of 400+ clock cycles to hide latency at 1350MHz.
Dunno why you keep quoting that number about G71; we discussed how it works endless times and we learnt that there's no a priori fixed latency-hiding capability.
If you decide that a typical worst case is vec2 instructions being used to hide latency, then each instruction takes 2 clocks for 16 fragments at a time, or 8 fragments per clock. So that's a total of 1600 fragments required to execute a single instruction over a period of 200 clocks.
That's perfectly fine: if you don't have enough fragments to cover latency, you can cover it by adding more math. Have you noticed that the hardware can only address 4 textures per GPU clock?
Basically, you can issue a tex fetch per pixel only every 2.5 SP cycles! You need to run through all your pixels in flight almost 3 times before it can return a sampled value for all of them, assuming all your texels are already in cache.
If you assign 4 vec4 fp32 registers (64 bytes) to each of those fragments, then you get a minimum register file size of 102400 bytes per cluster.


That's 800KB of register file for the entire GPU. Still a small value compared to R580 or Xenos (both are 1152KB as far as I can tell, though excluding vertex shader register file in the case of R580).
These numbers are absurdly high; G70 does pretty well with 1/10th of that memory!
 
Dunno why you keep quoting that number about G71; we discussed how it works endless times and we learnt that there's no a priori fixed latency-hiding capability.
While 220 is the nominal value, reduced register usage allows more quads in flight which will increase latency-hiding. If the number of quads falls below this level then the pipeline can no longer fully hide all bilinear texturing latency, which is when performance will fall off.

Only if there's no texturing will performance be unaffected by the reduced number of quads in flight.

That's perfectly fine: if you don't have enough fragments to cover latency, you can cover it by adding more math.
You're thinking of non-dependent instructions. When you have decoupled, asynchronous texturing pipelines such as R5xx or G80, dependent instructions (i.e. those waiting for a texture result) will cause a stall unless you have enough threads in flight.

Have you noticed that the hardware can only address 4 textures per GPU clock?
All the more reason to have a lot of fragments in flight. R580 has the same kind of problem: it can only issue 1 quad TEX per clock while it processes 3 ALU quads per clock (all per cluster).

Jawed
 
These numbers are absurdly high; G70 does pretty well with 1/10th of that memory!

Assuming it's implemented in SRAM, you're absolutely correct, and I'm hoping someone explains how latency hiding actually happens. :) I've been searching back through your posts and remain, sadly, unenlightened. :( I assume the idea is that the more registers you use, the more likely it is that you have math you can interleave. There are also the effects of scalar vs. vector (so each vec instruction winds up being some number of scalars), the difference in clock speeds, the number of TXUs vs. ALUs, etc. 1-2MB of SRAM is 50-100 million transistors. That seems excessive for just a register file. It's difficult to believe that 16k would be sufficient across the entire cluster, though.
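(For reference, the transistor count above assumes standard 6-transistor SRAM cells:)

Code:
/* 6 transistors per SRAM bit cell */
long sram_transistors(long mbytes)
{
    return mbytes * 1024L * 1024L * 8L * 6L;   /* bits x 6T */
}
/* sram_transistors(1) = 50,331,648  (~50M)
   sram_transistors(2) = 100,663,296 (~100M) */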
 
If you decide that a typical worst case is vec2 instructions being used to hide latency, then each instruction takes 2 clocks for 16 fragments at a time, or 8 fragments per clock. So that's a total of 1600 fragments required to execute a single instruction over a period of 200 clocks.

If you assign 4 vec4 fp32 registers (64 bytes) to each of those fragments, then you get a minimum register file size of 102400 bytes per cluster.

Hmm, let's try this instead. You perform a texture lookup, retrieving a Vec4, which you use to multiply by another Vec4. The pixel batch width is 32. You need four scalar fp32 regs (one Vec4), which works out to 512 bytes per Vec4 reg per batch. To perform the MUL, you need two such regs -- that's 1K/batch.

You perform the texture address lookup -- this takes two cycles across the entire batch. The Vec4 MUL (when it finally happens) takes 8 cycles. You've essentially gotten 10 cycles of latency hiding right there for some other thread. To hide 200 cycles, you'd need 20 batches. Round up to 32, and you get 32k cache. [Yes, I'm finding it hard to accept 16k]

Differences between your number and mine: I'm assuming you wind up with a Vec4, not a Vec2; I'm also assuming a very fine-grained register allocation.

Theoretically, as you add registers, you add ALU work. I suppose it's possible that it rises as slowly as log-n base-3; I haven't really thought that through. But it wouldn't surprise me if it rises linearly. So, if you had twice the number of registers, you'd expect to have twice the amount of work.

Even if you had a MAD across three texture fetches, that's 1.5k/batch, 6 cycles for texture fetch, 8 cycles for MAD, and if you're trying to hide 600 cycles, that winds up at 64k.
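Sketching that arithmetic out (all assumptions mine, as above -- 32-pixel batches, 16 SPs):

Code:
/* Cycles of latency each 32-pixel batch can hide, per my assumptions:
   MUL case: 2 (tex address) + 8 (Vec4 MUL) = 10 cycles, 1KB of regs/batch
   MAD case: 6 (3 fetches)   + 8 (MAD)      = 14 cycles, 1.5KB of regs/batch */
int batches_needed(int latency, int cycles_per_batch)
{
    return (latency + cycles_per_batch - 1) / cycles_per_batch;  /* round up */
}
/* batches_needed(200, 10) = 20 -> round up to 32 batches x 1KB  = 32KB
   batches_needed(600, 14) = 43 -> ~43 batches x 1.5KB           = ~64KB */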

128/256k does seem excessive...?

ed: Bob -- thanks, I've bookmarked that post. Oddly similar numbers :)
 
Jawed: I think you're trying to make far too concrete statements based on very speculative information.

All 3 of the parameters - blocks, threads / block, and shared memory / (??) - are clearly user-configurable values. The examples are probably reasonable values for mapping onto current hardware, and you can use them to try to infer things, but we aren't even certain what they mean.

The fact that 256 * 64 = 16k is probably a complete coincidence. I find it hard to believe that you'd state how much memory each thread will need, since that is a function of register usage, which is a function of the compiler. Also, what would "shared" even mean per-thread? Shared memory, by definition, would either be how much of the PDC is shared between the threads in a block, or how much of global memory is shared across all the blocks.

5000 thread blocks implies, to me, that there are 625 thread blocks supported per cluster.
How do you jump to this conclusion? 5000 is clearly an arbitrary number, so what happens if the user asks for 10,000? Anyway, what do you mean by "supported"? The only sane design I can think of has the clusters supporting only 1 block at a time, with the global scheduler (on the diagram in post #1) feeding it blocks 1 at a time when a cluster is idle.

So when execution of thread blocks switches from block A to block B, a "cache swap" is done...
Is there any reason for a CUDA block or a 32-pixel / 16-vertex batch to migrate off of a cluster and then come back? I can't think of any reason not to have a batch/block execute the entire way through a (cuda/vertex/shader) program before looking at the next batch. It shouldn't matter when / in what order every block/batch finishes as long as they all eventually finish. Swapping seems extremely complex considering how many outstanding loads you might have in flight.

Anyway I ramble too much, L'Inq aren't the only people who abuse half-understood numbers, repeating PDC makes me hungry thinking of the Pittsburgh Deli Company, and I should be off to sleep.
 
each streaming unit goes to its own dedicated RAM, pulls its local register values, performs its op, and stores appropriately. That seems simpler to me than some kind of super-wide bus with a flat address space, with the dispatcher pulling all the data and then parceling out the data to each ALU, although the results are much the same....

I think the former is exactly what the G80 is doing. The only case where any temporary register values need to be shared between fragments is the gradient instructions, and those might be done in the special function units, which probably already share some connection amongst themselves to optimize interpolant calculation between adjacent fragments. Otherwise, it makes sense to just partition everything for maximum locality. The less inter-SP traffic the better.

When I first looked at the G80 I myself was like "wha? I don't get it" because I was still thinking in terms of SIMD, in terms of vectors, in terms of extracting ILP. The first problem I came across was swizzling and replication, because I was still assuming things like different SPs running different "channels" of the MUL, and therefore a need for a multiported register file as well as swizzle/copy in HW. Of course, that's not the way it really works, in reality, a vector op runs on a single SP, and there is no need to share register files between SPs. Swizzling then just becomes remapped compiler labels.

It helps to try and forget about vectors all together when analyzing the G80 arch I think. If you start out with a workload example, like a TEX/MUL shader, the first thing one should do is "run the compiler in one's head" and scalarize the shader, and then analyze the scalarized shader. Thus, a MUL on 4 channels becomes 4 instructions on one SP and uses potentially up to 12 scalar registers (8 src, 4 destination, a = b * c)

Thus, one should not think about a given fragment/thread ever spanning more than one SP and register file.
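Concretely, the scalarization of that MUL looks something like this (register naming is just illustrative):

Code:
/* The vec4 source line:   a = b * c;
   scalarized into four independent scalar MULs, all on the same SP,
   one channel per instruction: */
a.x = b.x * c.x;
a.y = b.y * c.y;
a.z = b.z * c.z;
a.w = b.w * c.w;
/* up to 12 scalar registers touched: 8 sources + 4 destinations */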
 
To get a feeling for how vector-based HLSL/GLSL (or the various asm variants) are translated for a scalar-based GPU, a look at the Intel source code for their latest chipsets could be very useful.
 
When I first looked at the G80 I myself was like "wha? I don't get it" because I was still thinking in terms of SIMD, in terms of vectors, in terms of extracting ILP. The first problem I came across was swizzling and replication, because I was still assuming things like different SPs running different "channels" of the MUL, and therefore a need for a multiported register file as well as swizzle/copy in HW. Of course, that's not the way it really works, in reality, a vector op runs on a single SP, and there is no need to share register files between SPs. Swizzling then just becomes remapped compiler labels.
That's sort of what I was saying in the other thread. It's still SIMD, but the parallelism is in the other direction now. All the SPs will work on the same channel each cycle for 16 pixels. I guess it's sort of like the AOS vs. SOA idea.

So really they still are vectors, but instead of 3D or 4D vectors, it's more like vectors in the way the Xenos documentation discusses it (64 pixels or 64 vertices comprise a vector).
 