The Official NVIDIA G80 Architecture Thread

It'll be interesting to see what Nvidia does with the G80 refresh (G85 / G87 / NV55 / whatever) for late 2007.

I'm not talking about an '8800 Ultra'-style speed bump to core/memory clocks, but an actual refresh.
 
I'm wondering if there is any evidence for 16-SIMD in the cluster (as opposed, say, to 2 banks of 8 or 4 banks of 4)? In theory, splitting it into smaller pieces would cost more in scheduling transistors, but it would provide more flexibility for a single-cluster chip (vertex and pixel programs could run concurrently, instead of needing some kind of interrupt-driven time-slicing mechanism), map more naturally onto the smaller set of texture units (4 or 8, take your pick), and match the scheduling factor of R580 and Xenos more closely.

I don't know why the latter two chips schedule batches at 4x SIMD width, but that appears to be the case. If nVidia requires the same relative factor, that'd argue for 8-SIMD. 8-SIMD would also match the pictorial grouping of 8 stream processors in the tech brief, and the number of texture units in the cluster.

That's hardly evidence of anything either, but I'm curious what evidence we might have for 16, other than that it would seem obvious the units in the cluster are all SIMD scheduled/dispatched.

I think I'll hold off on what that might mean (either now, or in future revs) for handling dispatches of vector instructions where the entire SIMD-width is predicated-ignored. I'm already well out on a limb ;-)
 
16 SIMD does indeed make a lot of sense, as you basically change the parallelism from vectors within each pixel to channels across each batch. Dave gave me a hint a while ago that this gen might be going scalar, and this is the first thing that came to mind.

The batches are 32 pixels or 16 vertices if I recall correctly, so basically if you wanted to do a MADD, you'd do Ax * Bx + Cx for 16 pixels in one cycle, then Ay * By + Cy in the next, etc. Under the old scheme (G7x and earlier), you'd do (Ax*Bx+Cx, Ay*By+Cy, Az*Bz+Cz) for the 4 pixels in a quad.
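To make the contrast concrete, here's a minimal C sketch of the two issue orders. The batch and quad sizes are the figures quoted above; the function names and array shapes are purely illustrative, not anything from the tech brief.

#define BATCH 16   /* pixels issued together on the hypothesised 16-wide SIMD */
#define QUAD   4   /* pixels per quad on G7x-style vec4 hardware              */

/* Scalar/channel-serial scheme: one channel of the MADD for the whole
 * batch per step, so x, y, z each take a cycle across 16 pixels.      */
void madd_scalar(float a[][3], float b[][3], float c[][3], float d[][3])
{
    for (int ch = 0; ch < 3; ++ch)            /* one "cycle" per channel      */
        for (int px = 0; px < BATCH; ++px)    /* the 16 lanes run in parallel */
            d[px][ch] = a[px][ch] * b[px][ch] + c[px][ch];
}

/* G7x-style scheme: all channels of the MADD for one quad per cycle.   */
void madd_vec4(float a[][3], float b[][3], float c[][3], float d[][3])
{
    for (int px = 0; px < QUAD; ++px)         /* 4 pixels in the quad         */
        for (int ch = 0; ch < 3; ++ch)        /* x, y, z handled together     */
            d[px][ch] = a[px][ch] * b[px][ch] + c[px][ch];
}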

This might be a reason for the "missing MUL", as it changes how you schedule things. Or maybe the MUL has limitations but still comes in useful. A DP3, for instance, can be done with two MADDs and a MUL by stepping across the channels; with the MUL issuing alongside a MADD, it would only take 2 cycles instead of 3. That would give 64 DP3s per clock.
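A quick back-of-the-envelope check of that DP3 figure, assuming the extra MUL really can co-issue with a MADD (all numbers are just the ones quoted in this thread):

#include <stdio.h>

int main(void)
{
    /* Figures from the discussion above: 8 clusters x 16 scalar SPs.     */
    const int scalar_alus    = 8 * 16;   /* 128                           */
    /* DP3 = MUL + MADD + MADD; if the MUL co-issues with one MADD, only
     * the two MADD slots cost cycles.                                    */
    const int cycles_per_dp3 = 2;        /* instead of 3                  */

    printf("DP3 throughput: %d per clock\n", scalar_alus / cycles_per_dp3);
    return 0;
}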
 
Let's just say that the situation is not altogether different from some of the limitations on ALU1 in the NV4x architecture (and no, I don't mean that it necessarily has to do with texturing), just in the general sense of dependency on *something* :) I'm not sure if I can say more.
 
Fig. 22 in the tech brief would seem to indicate that [A]ddressing doesn't issue at the same time as [Math]. Could well be register file bandwidth...
 
A quick question regarding G80 thread/batch size.

I understand that it is 32 (pixels) or 16 (vertices), but why was 32 pixels chosen?

I understand that for Xenos/C1 the thread size is 64 pixels because it is a function of the 16-way SIMD array size: the ALU latency is 8 clock cycles and threads are changed every 4 clocks to hide it, so it's 16 * 4 = 64. (I believe they could have chosen 8 clocks, but that would mean a batch size of 128... poor for dynamic branching.)

So is G80's batch size 16 (a cluster) * 2 clocks? Does this mean that the ALU latency is only 2 clocks?
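If the same "batch size = SIMD width * clocks per instruction" reasoning is applied to both chips, the arithmetic looks like this. The Xenos numbers are the ones above; the G80 numbers are the guess in the question, not confirmed figures.

#include <stdio.h>

/* Hypothesised rule: batch size = SIMD width * clocks an instruction
 * occupies the array before the scheduler switches threads.           */
static int batch_size(int simd_width, int clocks_per_instruction)
{
    return simd_width * clocks_per_instruction;
}

int main(void)
{
    printf("Xenos pixels:  %d\n", batch_size(16, 4));  /* 64, as above        */
    printf("G80 pixels?:   %d\n", batch_size(16, 2));  /* 32, if ALU = 2 clks */
    printf("G80 vertices?: %d\n", batch_size(16, 1));  /* 16                  */
    return 0;
}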
 
G80 can issue a different instruction every clock.

So, you could have a scalar ADD instruction issued in one batch, say from a vertex shader, and then on the next clock another scalar instruction for another batch (another VS, say - VS and GS batches are 16 objects apiece).

Why pixels are in batches of 32 hasn't been well-explained so far. I hypothesise this is because the rasteriser works on two rows (or two columns) of pixels simultaneously...

(Not a particularly illuminating hypothesis if you consider that pixels need to be batched as quads...)

Jawed
 
It could be that batches of 32 are needed to completely hide texture latency, since you can only handle a finite number of batches simultaneously. Using smaller vertex batches probably allows you to load balance better, but stalling for a cycle during a VTF isn't a big deal, so it's a worthwhile tradeoff.
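A rough way to see that tradeoff: with a fixed cap on batches in flight, bigger batches keep the ALUs busy for more cycles per instruction, so more fetch latency gets covered. This is just a toy model with made-up numbers, not real G80 figures.

#include <stdio.h>

/* Toy model: latency coverable is roughly the ALU work available from the
 * other in-flight batches while one batch waits on its texture fetch.     */
static int latency_hidden(int batches_in_flight, int batch_size,
                          int simd_width, int alu_ops_between_fetches)
{
    int cycles_per_batch = (batch_size / simd_width) * alu_ops_between_fetches;
    return (batches_in_flight - 1) * cycles_per_batch;
}

int main(void)
{
    /* Assume a hypothetical cap of 8 batch slots and 4 ALU ops per fetch.  */
    printf("32-pixel batches: ~%d cycles covered\n", latency_hidden(8, 32, 16, 4));
    printf("16-pixel batches: ~%d cycles covered\n", latency_hidden(8, 16, 16, 4));
    return 0;
}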
 
It's possible that it's a completely separate cache from the texture cache, no? Also, considering it's "as fast as registers" maybe the parallel data cache is nothing more than the register file?
 
It's possible that it's a completely separate cache from the texture cache, no? Also, considering it's "as fast as registers" maybe the parallel data cache is nothing more than the register file?
Umh.. a 16 KB register file per cluster is not unlikely, but I'd expect it to be bigger than that (a shader which uses 4 vec4 regs would let you have only ~100 memory cycles to hide texture latency..), let's say 32 KB :p
IMHO they use their L1 cache, which is normally used to store texels and constants (I'm assuming constants are stored in a cache since they have to support D3D10 constant buffers..)
 
Umh.. a 16 KB register file per cluster is not unlikely, but I'd expect it to be bigger than that (a shader which uses 4 vec4 regs would let you have only ~100 memory cycles to hide texture latency..), let's say 32 KB :p
Thinking more about it.. since it's now possible to hide texture latency via arithmetic ops, having a 16 KB register file per cluster should not be as bad as I thought.
Moreover, storing data in the L1 caches would not be that smart, as these texture caches probably have very long cache lines, while on a scalar architecture like this you probably want to be able to read/write a few bytes at once (4, 8? x16 ofc).
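For what the 16 KB figure implies, here is the rough arithmetic. The register file size and per-pixel register count are just nAo's assumptions from the post above, nothing confirmed.

#include <stdio.h>

int main(void)
{
    const int rf_bytes       = 16 * 1024;  /* assumed register file per cluster */
    const int regs_per_pixel = 4;          /* assumed vec4 registers per pixel  */
    const int bytes_per_reg  = 4 * 4;      /* vec4 of fp32                      */
    const int simd_width     = 16;         /* scalar ALUs per cluster           */

    int pixels_in_flight     = rf_bytes / (regs_per_pixel * bytes_per_reg); /* 256 */
    int clocks_per_scalar_op = pixels_in_flight / simd_width;               /* 16  */

    printf("pixels in flight per cluster: %d\n", pixels_in_flight);
    printf("shader clocks of work per independent scalar op: %d\n",
           clocks_per_scalar_op);
    /* The latency actually hidden then scales with how many independent ALU
     * ops sit between a texture fetch and the first use of its result.      */
    return 0;
}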
 
nAo - your guess is as good as, wait no that's not right: better than mine :). I was thinking maybe the texture cache is so specialized for weird swizzled texture formats and texture access patterns that it doesn't get used for stuff like VTF/constant fetch. Here's another theory - maybe in CUDA mode half of the RF is used as an L1 of sorts, while the other half is reserved for registers?
 