The Official NVIDIA G80 Architecture Thread

It'll be interesting to see what Nvidia does with the G80 refresh (G85 / G87 / NV55 / whatever) for late 2007.

I'm not talking about an '8800 Ultra'-style speed bump to core/memory clocks, but an actual refresh.
 
I'm wondering if there is any evidence for 16-SIMD in the cluster (as opposed, say, to 2 banks of 8 or 4 banks of 4)? In theory, splitting it into smaller pieces would cost more in scheduling transistors, but it would provide more flexibility for a single-cluster chip (vertex and pixel programs could run concurrently, instead of needing some kind of interrupt-driven time-slicing mechanism), map more naturally onto the smaller set of texture units (4 or 8, take your pick), and match the scheduling factor of R580 and Xenos more closely.

I don't know why the latter two chips schedule batches at 4x SIMD width, but that appears to be the case. If nVidia requires the same relative factor, that'd argue for 8-SIMD. 8-SIMD would also match the pictorial grouping of 8 stream processors in the tech brief, and the number of texture units in the cluster.

That's hardly evidence of anything either, but I'm curious what evidence we might have for 16, other than that it would seem obvious the units in the cluster are all SIMD scheduled/dispatched.

I think I'll hold off on what that might mean (either now, or in future revs) for handling dispatches of vector instructions where the entire SIMD-width is predicated-ignored. I'm already well out on a limb ;-)
 
16 SIMD does indeed make a lot of sense, as you basically change the parallelism from vectors within each pixel to channels across each batch. Dave gave me a hint a while ago that this gen might be going scalar, and this is the first thing that came to mind.

The batches are 32 pixels or 16 vertices if I recall correctly, so basically if you wanted to do a MADD, you'd do Ax * Bx + Cx for 16 pixels in one cycle, then Ay * By + Cy in the next, etc. Under the old scheme (G7x and earlier), you'd do (Ax*Bx+Cx, Ay*By+Cy, Az*Bz+Cz) for the 4 pixels in a quad.
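To make the contrast concrete, here's a minimal C sketch of the two issue orders. The batch and quad sizes are the figures quoted above; the function names and array shapes are purely illustrative, not anything from the tech brief.

#define BATCH 16   /* pixels issued together on the hypothesised 16-wide SIMD */
#define QUAD   4   /* pixels per quad on G7x-style vec4 hardware              */

/* Scalar/channel-serial scheme: one channel of the MADD for the whole
 * batch per step, so x, y, z each take a cycle across 16 pixels.      */
void madd_scalar(float a[][3], float b[][3], float c[][3], float d[][3])
{
    for (int ch = 0; ch < 3; ++ch)            /* one "cycle" per channel      */
        for (int px = 0; px < BATCH; ++px)    /* the 16 lanes run in parallel */
            d[px][ch] = a[px][ch] * b[px][ch] + c[px][ch];
}

/* G7x-style scheme: all channels of the MADD for one quad per cycle.   */
void madd_vec4(float a[][3], float b[][3], float c[][3], float d[][3])
{
    for (int px = 0; px < QUAD; ++px)         /* 4 pixels in the quad         */
        for (int ch = 0; ch < 3; ++ch)        /* x, y, z handled together     */
            d[px][ch] = a[px][ch] * b[px][ch] + c[px][ch];
}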

This might be a reason for the "missing MUL", as it changes how you schedule things. Or maybe the MUL has limitations but still comes in useful. A DP3, for instance, can be done with two MADDs and a MUL by stepping across the channels; with the MUL issuing alongside a MADD, it would only take 2 cycles instead of 3. That would give 64 DP3s per clock.
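A quick back-of-the-envelope check of that DP3 figure, assuming the extra MUL really can co-issue with a MADD (all numbers are just the ones quoted in this thread):

#include <stdio.h>

int main(void)
{
    /* Figures from the discussion above: 8 clusters x 16 scalar SPs.     */
    const int scalar_alus    = 8 * 16;   /* 128                           */
    /* DP3 = MUL + MADD + MADD; if the MUL co-issues with one MADD, only
     * the two MADD slots cost cycles.                                    */
    const int cycles_per_dp3 = 2;        /* instead of 3                  */

    printf("DP3 throughput: %d per clock\n", scalar_alus / cycles_per_dp3);
    return 0;
}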
 
Let's just say that the situation is not altogether different from some of the limitations on ALU1 in the NV4x architecture (and no, I don't mean that it necessarily has to do with texturing), just in the general sense of dependency on *something* :) I'm not sure if I can say more.
 
Fig. 22 in the tech brief would seem to indicate that [A]ddressing doesn't issue at the same time as [Math]. Could well be register file bandwidth...
 
A quick question regarding G80 thread/batch size.

I understand that it is 32 (pixels) or 16 (vertices), but why was 32 pixels chosen?

I understand that for Xenos/C1 the thread size is 64 pixels because it is a function of the 16-way SIMD array size: the ALU latency is 8 clock cycles and threads are changed every 4 clocks to hide it, so it's 16 * 4 = 64. (I believe they could have chosen 8 clocks, but that would mean a batch size of 128... poor for dynamic branching.)

So is G80's batch size 16 (a cluster) * 2 clocks? Does this mean that the ALU latency is only 2 clocks?
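If the same "batch size = SIMD width * clocks per instruction" reasoning is applied to both chips, the arithmetic looks like this. The Xenos numbers are the ones above; the G80 numbers are the guess in the question, not confirmed figures.

#include <stdio.h>

/* Hypothesised rule: batch size = SIMD width * clocks an instruction
 * occupies the array before the scheduler switches threads.           */
static int batch_size(int simd_width, int clocks_per_instruction)
{
    return simd_width * clocks_per_instruction;
}

int main(void)
{
    printf("Xenos pixels:  %d\n", batch_size(16, 4));  /* 64, as above        */
    printf("G80 pixels?:   %d\n", batch_size(16, 2));  /* 32, if ALU = 2 clks */
    printf("G80 vertices?: %d\n", batch_size(16, 1));  /* 16                  */
    return 0;
}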
 
G80 can issue a different instruction every clock.

So, you could have a scalar ADD instruction issued in one batch, say from a vertex shader, and then on the next clock another scalar instruction for another batch (another VS, say - VS and GS batches are 16 objects apiece).

Why pixels are in batches of 32 hasn't been well-explained so far. I hypothesise this is because the rasteriser works on two rows (or two columns) of pixels simultaneously...

(Not a particularly illuminating hypothesis if you consider that pixels need to be batched as quads...)

Jawed
 
It could be that batches of 32 are needed to completely hide texture latency, since you can only handle a finite number of batches simultaneously. Using smaller vertex batches probably allows you to load balance better, but stalling for a cycle during a VTF isn't a big deal, so it's a worthwhile tradeoff.
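A rough way to see that tradeoff: with a fixed cap on batches in flight, bigger batches keep the ALUs busy for more cycles per instruction, so more fetch latency gets covered. This is just a toy model with made-up numbers, not real G80 figures.

#include <stdio.h>

/* Toy model: latency coverable is roughly the ALU work available from the
 * other in-flight batches while one batch waits on its texture fetch.     */
static int latency_hidden(int batches_in_flight, int batch_size,
                          int simd_width, int alu_ops_between_fetches)
{
    int cycles_per_batch = (batch_size / simd_width) * alu_ops_between_fetches;
    return (batches_in_flight - 1) * cycles_per_batch;
}

int main(void)
{
    /* Assume a hypothetical cap of 8 batch slots and 4 ALU ops per fetch.  */
    printf("32-pixel batches: ~%d cycles covered\n", latency_hidden(8, 32, 16, 4));
    printf("16-pixel batches: ~%d cycles covered\n", latency_hidden(8, 16, 16, 4));
    return 0;
}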
 
It's possible that it's a completely separate cache from the texture cache, no? Also, considering it's "as fast as registers" maybe the parallel data cache is nothing more than the register file?
 
It's possible that it's a completely separate cache from the texture cache, no? Also, considering it's "as fast as registers" maybe the parallel data cache is nothing more than the register file?
Umh.. a 16 KB register file per cluster is not unlikely, but I'd expect it to be bigger than that (a shader which uses 4 vec4 regs would let you have only ~100 memory cycles to hide texture latency..), let's say 32 KB :p
IMHO they use their L1 cache, which is normally used to store texels and constants (I'm assuming constants are stored in a cache since they have to support D3D10 constant buffers..)
 
Umh.. a 16 KB register file per cluster is not unlikely, but I'd expect it to be bigger than that (a shader which uses 4 vec4 regs would let you have only ~100 memory cycles to hide texture latency..), let's say 32 KB :p
Thinking more about it.. since it's now possible to hide texture latency via arithmetic ops, having a 16 KB register file per cluster should not be as bad as I thought.
Moreover, storing data in the L1 caches would not be that smart, as these texture caches probably have very long cache lines, while on a scalar architecture like this you probably want to be able to read/write a few bytes at once (4, 8? x16 ofc).
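For what the 16 KB figure implies, here is the rough arithmetic. The register file size and per-pixel register count are just nAo's assumptions from the post above, nothing confirmed.

#include <stdio.h>

int main(void)
{
    const int rf_bytes       = 16 * 1024;  /* assumed register file per cluster */
    const int regs_per_pixel = 4;          /* assumed vec4 registers per pixel  */
    const int bytes_per_reg  = 4 * 4;      /* vec4 of fp32                      */
    const int simd_width     = 16;         /* scalar ALUs per cluster           */

    int pixels_in_flight     = rf_bytes / (regs_per_pixel * bytes_per_reg); /* 256 */
    int clocks_per_scalar_op = pixels_in_flight / simd_width;               /* 16  */

    printf("pixels in flight per cluster: %d\n", pixels_in_flight);
    printf("shader clocks of work per independent scalar op: %d\n",
           clocks_per_scalar_op);
    /* The latency actually hidden then scales with how many independent ALU
     * ops sit between a texture fetch and the first use of its result.      */
    return 0;
}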
 
nAo - your guess is as good as, wait no that's not right: better than mine :). I was thinking maybe the texture cache is so specialized for weird swizzled texture formats and texture access patterns that it doesn't get used for stuff like VTF/constant fetch. Here's another theory - maybe in CUDA mode half of the RF is used as an L1 of sorts, while the other half is reserved for registers?
 