The Official NVIDIA G80 Architecture Thread

Xenos uses a "pipeline temp":

http://www.beyond3d.com/forum/showthread.php?p=689500&highlight=adds_prev#post689500

seemingly as a way to get around register bandwidth limitations.

(Note on my posting: vec3+scalar+SF isn't possible, I wasn't thinking straight, sigh.)

And judging from:

Method and Apparatus for Multi-thread Accumulation Buffering in a Computation Engine

variants of this kind of technique are widely used in ATI's GPUs (I haven't actually read this patent, or other patent documents for stuff that's very similar - just judging from the diagrams).

That dates from 2000, and it prolly explains the "less than expected" VS performance that raises comments from time to time. Guessing...

Jawed
 
SPs are 16x1 or 8x2?

Long-time lurker, first-time poster -- hello!

Here's something that's been nagging me since the G80 reviews hit the web..

The B3D G80 architecture diagrams and review describe the SPs in a cluster as a single 16-wide SIMD group. So does the text of all the other reviews I've read. But Nvidia's marketing diagrams show two groups of 8 SPs -- there's a thin line around each set of 8, and the "SP" label appears twice per cluster. I know marketing diagrams aren't known for accurate details, and since nobody else has described it that way I'm probably just reading way too much into a few pixels on a single slide...

But suppose it really is two independent 8-wide SIMD groups. For the 16-wide VS batches, this would mean each group would have to choose a batch to execute and decode/issue an instruction for it every two shader-clock cycles, rather than on every cycle (but for one group instead of two). For pixel threads the scheduler would issue the same instruction to a batch over four shader-clock cycles instead of two.
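To make the cadence concrete, here's a back-of-the-envelope Python sketch of that arithmetic (illustrative only; the 16-vertex and 32-pixel batch sizes are the figures from the reviews):

Code:
# Issue cadence per SIMD group, assuming 16-vertex and 32-pixel batches.
def clocks_per_instruction(batch_size, simd_width):
    # Shader clocks a group spends stepping one instruction across one batch.
    return batch_size // simd_width

for simd_width, groups in [(16, 1), (8, 2)]:
    vs = clocks_per_instruction(16, simd_width)
    ps = clocks_per_instruction(32, simd_width)
    print(f"{simd_width}-wide x {groups}: VS issues every {vs} clock(s), "
          f"PS every {ps} clock(s), per group")
# 16-wide x 1: VS issues every 1 clock(s), PS every 2 clock(s), per group
# 8-wide x 2: VS issues every 2 clock(s), PS every 4 clock(s), per group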

Both 8x2 and 16x1 come out to the same overall issue rate, of course, and will have the same branching performance. But with 8-wide the front-end is twice as wide but running at half speed, which might be better from an area/power efficiency standpoint. Getting the shader clock that high requires custom logic design, right? Would the scheduling/decode logic be harder (e.g. less regular?) to custom-design than the core ALU logic? There are either 8 or 16 times as many ALUs as there are schedulers, so the custom-design effort is leveraged more for ALUs.

This reminds me of the P4's double-pumped ALU design -- if I understand it right the whole instruction pipeline wasn't double-pumped, just the core math units.

I haven't figured out how to write a test that would clearly show whether G80 is 16x1 or 8x2. As far as I can tell it wouldn't necessarily affect performance at all, just efficiency (possibly).

Any suggestions on how to test this?
 
arm_arch: see my posts #122, #42, and geo's (I think) response to the latter in this thread.

I don't think we've gotten an answer to the question yet.
 
It's 8x2, AFAIK. And welcome! :)
There are a lot of reasons I could think of to explain why they did it that way.
Besides those you have already listed, two of them are redundancy (think multiplexers...) and the handheld part, but heh... ;)

Uttar
 
I keep thinking all this "scalar" information could be a hoax, or to put it more mildly a little marketing spin on a grain of technical truth.

What if a "scalar ALU" is really a 4-wide SIMD ALU that performs an instruction not on four color channels of a single fragment at once but on the red color channels of four fragments? You just rotate the SIMD around, have some trouble with synching up the texture ops, but no longer need to worry about co-issue splitting for scalars, 2Ds and 3Ds.

Code:
R G B A
R G B A
R G B A
R G B A
The four rows should represent four fragments. "Old" SIMD style would iterate rows. Marketing-pseudo-scalar would iterate columns.
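A scalar loop makes the two orderings easier to see; this is purely an illustrative Python sketch, not a claim about how the hardware or driver is actually organized:

Code:
# Illustrative only: the same per-channel multiply, iterated two ways.
fragments = [
    {"R": 1.0, "G": 2.0, "B": 3.0, "A": 4.0},  # fragment 0
    {"R": 5.0, "G": 6.0, "B": 7.0, "A": 8.0},  # fragment 1
    {"R": 0.1, "G": 0.2, "B": 0.3, "A": 0.4},  # fragment 2
    {"R": 0.5, "G": 0.6, "B": 0.7, "A": 0.8},  # fragment 3
]

# "Old" SIMD style: each SIMD instruction covers one row, i.e. the
# R, G, B, A of a single fragment.
for frag in fragments:
    row = [frag[ch] * 2.0 for ch in "RGBA"]

# Marketing-pseudo-scalar: each SIMD instruction covers one column, i.e.
# the same channel across four fragments. The ALU is still 4 wide; it is
# just rotated ninety degrees relative to the data.
for ch in "RGBA":
    column = [frag[ch] * 2.0 for frag in fragments]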
G80 uses Vec16-ALUs processing a scalar value of 16 pixels (4 quads) each.

edit: Oops I see that this issue was clarified already.
 
Umh... better to say that it's an 8-wide unit that executes the same instruction for 2 or 4 clock cycles
 
8-wide (2 clocks vertex and 4 clocks pixel) makes much more sense for covering ALU latency than 16-wide (1 and 2 clocks).
 
arm_arch: see my posts #122, #42, and geo's (I think) response to the latter in this thread.

Oops, you're right. Guess I could have saved myself some thinking if I'd read more carefully.

There are a lot of reasons I could think of to explain why they did it that way. Besides those you have already listed, two of them are redundancy (think multiplexers...) and the handheld part, but heh... ;)

Yeah, I came up with a bunch of reasons (probably mostly harebrained, but hey..) but my post was already too long. Having the freedom to adjust the ALU/TEX ratio in derived chips (8x1, 8x3, ...) seems like a good way to hedge your bets when you're designing an architecture multiple years before it's released.

8-wide (2 clocks vertex and 4 clocks pixel) makes much more sense for covering ALU latency than 16-wide (1 and 2 clocks).

Why is that? The 8-wide design would need half as many batches *per group* to cover the same number of latency cycles as the 16-wide design. But it has twice as many groups. So with a fixed number of batches per ALU+TEX cluster I think you can hide the same amount of latency either way. That assumes you've got enough work to actually fill up the system; if it isn't full, the 8-wide design won't do as well unless you can balance the load perfectly between the two groups.
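A quick Python check of that bookkeeping (the 200-cycle latency figure below is an arbitrary placeholder, not a measured G80 number):

Code:
# How many pixel batches per cluster are needed to cover a fixed latency,
# assuming one instruction issued per waiting batch and 32-pixel batches.
latency_clocks = 200  # placeholder latency to hide, in shader clocks
pixel_batch = 32

for simd_width, groups in [(16, 1), (8, 2)]:
    clocks_per_batch_instr = pixel_batch // simd_width
    batches_per_group = -(-latency_clocks // clocks_per_batch_instr)  # ceiling
    print(f"{simd_width}-wide: {batches_per_group} batches/group "
          f"x {groups} group(s) = {batches_per_group * groups} per cluster")
# 16-wide: 100 batches/group x 1 group(s) = 100 per cluster
# 8-wide: 50 batches/group x 2 group(s) = 100 per cluster
# Same total in-flight work either way, provided both groups can be kept fed.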
 
I think at this point, it's fairly clear it's not currently exposed whatsoever. The question, of course, is whether future drivers will expose it...
I recently found an NVIDIA doc (from techday IIRC, but it wasn't published, so I'll assume it's still under NDA, theoretically speaking!) that lists the G80 as having ~345 GFlops, and not ~500 GFlops, btw.

That table (if it is the same one I remember) explicitly counted only MULs, to make G80 stand out even more from G71 and R580. And since a MUL is worth only half a MADD in terms of FLOPs, the 345 GFLOPs are correct - given that it can indeed dual-issue two MULs.
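For what it's worth, one way the two headline figures line up (purely my arithmetic, assuming 128 SPs at the 8800 GTX's 1.35GHz shader clock, with a MADD counted as two flops and the extra MUL as one):

Code:
# Rough reconciliation of the ~345 and ~500 GFLOPs figures, assuming
# 128 SPs at a 1.35 GHz shader clock (8800 GTX).
sps = 128
shader_clock_ghz = 1.35

madd_only = sps * shader_clock_ghz * 2            # 345.6 -> the ~345 figure
madd_plus_mul = sps * shader_clock_ghz * (2 + 1)  # 518.4 -> the ~500 figure
print(madd_only, madd_plus_mul)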
 
That table explicitly says GMULs, as it happens, not GFLOPs.
 
Is there any chance that the MUL has a different assembler code, e.g. MLA?

Is it worthwhile taking something like 3DMk or D3 and doing a detailed analysis of the run times, to see if there's any evidence NVidia's deploying it even though plebs can't get at it?

Jawed
 
Is there any chance that the MUL has a different assembler code, e.g. MLA?
I'm not sure I see them special casing the issue for certain apps (at the moment). Their driver engineering efforts for the shader assembler (and in general) are very likely to be elsewhere just now.

It's not that difficult to see what instructions the driver accepts for hand-written assembly in HLSL or GLSL either, if that's what you meant. There's no MUL variant I can see at any rate, and no tools will emit anything that's correct for that anyway, even if it did exist.

As for profiling shipping applications, stay tuned.....
 
Something I noticed in FEAR XP (on the carpet) and CoJ (clear to see): although G80's IQ has taken a quantum leap forwards, I still see trilinear optimizations (in the form of mipmap boundaries) in these games with the HQ setting in the CP, with fw 97.28
 
I doubt it's related to trilinear optimisation. It could be the mip maps themselves. I noticed that some textures (such as gravel roads) still exhibit some shimmering, even with superior filtering. It even happens on GeForce FX hardware in high quality mode.
 
I doubt it's related to trilinear optimisation. It could be the mip maps themselves. I noticed that some textures (such as gravel roads) still exhibit some shimmering, even with superior filtering. It even happens on GeForce FX hardware in high quality mode.

I don't know, I see it indeed only on certain textures (stages)
 
Is there a way to determine what texture stage you're looking at when a mip-map transition is visible in a retail game?
 
Something I noticed in FEAR XP (on the carpet) and CoJ (clear to see): although G80's IQ has taken a quantum leap forwards, I still see trilinear optimizations (in the form of mipmap boundaries) in these games with the HQ setting in the CP, with fw 97.28

Any screenshot you could help out with? I'm not saying it is, but even the best filtering method cannot cure lackluster content.
 
Err, 97.28? Is it there with 97.02? One gets what one deserves with funky fly-by-night "My buddy Jose said this is some good sh*t here, man" drivers.
 