The Official NVIDIA G80 Architecture Thread

What, the same kind of "not quads" as R580's 12 pipelines per cluster?

Jawed

Which were 4 "old-fashioned pipes" with 3 shaders each, if you wish. Thus a quad.

I have no idea about low-level G80's internals, but from what we have so far, it doesn't scream "quads!" at me.
 
I would expect G80 to work on quads of pixels and don't see any reason to believe it works otherwise. I assume 8 quads are accumulated into a batch/vector and then executed a scalar at a time.
 
Which were 4 "old-fashioned pipes" with 3 shaders each, if you wish. Thus a quad.
I don't think that's an accurate description. Instructions aren't executed in groups of three.

I have no idea about low-level G80's internals, but from what we have so far, it doesn't scream "quads!" at me.
Every GPU for the last 7 years has worked on quads and this will continue. There's no other logical choice for texturing LOD and derivative instructions, especially when they're dependent on pixel shader calculations.

Xenos works on groups of 64 pixels/vertices at a time, but the pixels groups are 16 quads. G80 does the same thing with 8 quads in a group.
 
The missing MUL

I'm wondering if the missing MUL is caused by the type of shader.

I'm guessing that a vertex shader, which never has any use for attribute interpolation (though obviously it does use special functions) simplifies the driver compiler's job enough that it can dual-issue the MUL whenever it wants.

So, what it needs is for someone to write a VS, say for creating a shadow buffer with no colour writes in the corresponding PS, and see if they can expose the missing MUL.

Jawed
 
How many of you believe that it hasn't been unleashed on purpose? I'm raising my hand and I hope it won't cost me a second public apology LOL :D
 
How many of you believe that it hasn't been unleashed on purpose? I'm raising my hand and I hope it won't cost me a second public apology LOL :D

Well, certainly. . .I'd be very disappointed if they accidentally left it out! :LOL: But I suspect you had something more machiavellian in mind! ;)

I think it is usually the case that stability and compatibility are --and ought to be-- the first priority in making drivers. Then once that is achieved, start tweaking for performance. Usually this is only going to get bent as a rule when there are extraordinary competitive pressures. Well, NV doesn't have any extraordinary competitive pressures for 8800 right now. They can afford to "do it right", take their time, etc. So yeah, I'd be completely unsurprised if the missing MUL shows up a little further down the road. . . . Having said that, the reasons why it isn't showing up now might mean it won't show up 100% of the time in all scenarios later on. Tho that's just a guess.
 
And, it is right. So far no-one has been able to write a shader that gets the extra MUL.
I think at this point, it's fairly clear it's not currently exposed whatsoever. The question, of course, is whether future drivers will expose it...
I recently found an NVIDIA doc (from techday iirc, but it wasn't published so I'll assume it's still under NDA, theoretically speaking!) that lists the G80 as having ~345GFlops, and not ~500GFlops, btw.

The correct way to think about it, of course, is that on NV4x, the first MUL was generally used for perspective correction. On G7x, one of the MADDs was also often reserved for perspective correction. For G8x, it never is. As such, considering inefficiencies etc., it might be fairer to compare G80's ~350GFlops with ~125GFlops for the G71's PS, plus ~50GFlops for the G71's VS. Thus, assuming perfect VS-PS distribution for G71's configuration, G80 should be at least twice as fast in terms of arithmetic, and more if the workload was unbalanced for G71. I think that's roughly what we're seeing, TBH.
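For reference, the ~345 and ~500 figures fall out of simple back-of-the-envelope arithmetic. This is only a rough sketch: it assumes the 8800 GTX's 128 stream processors and 1.35 GHz shader clock (neither number is stated above), and counts a MADD as 2 flops and the extra MUL as 1.

```python
# Assumed 8800 GTX figures (not taken from the post above):
# 128 stream processors running at a 1.35 GHz shader clock.
STREAM_PROCESSORS = 128
SHADER_CLOCK_GHZ = 1.35


def peak_gflops(flops_per_sp_per_clock):
    """Peak rate in GFLOPS for a given number of flops per SP per clock."""
    return STREAM_PROCESSORS * SHADER_CLOCK_GHZ * flops_per_sp_per_clock


madd_only = peak_gflops(2)      # MADD = one multiply + one add
madd_plus_mul = peak_gflops(3)  # MADD plus the co-issued MUL

print(f"MADD only:  {madd_only:.1f} GFLOPS")
print(f"MADD + MUL: {madd_plus_mul:.1f} GFLOPS")
```

Under those assumptions the MADD-only figure lines up with the ~345GFlops in the doc, and the MADD+MUL figure with the ~500GFlops marketing number.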

ATI has had a dedicated MUL for perspective correction since R300, if not much before that! The R500 has one too, and you'd expect R600 to have one. It is surprising that NVIDIA has decided to no longer let the MUL be used for generic shader programming, while they used to in the past, and it looked like a good design choice to maximize unit utilization. This is especially true given that the SF unit is used to calculate 1/pos.w! It'll be interesting to see what they decide to do in future designs, of course.


Uttar
 
Well, the drivers released so far simply scream for improvements. I'm starting to question my own abilities with the stupid 3dmark06 story, which currently is a slideshow here. Not that I really care about it in the end, but I've tried every fix other users on the net have reported success with, and had no luck.

The only thing I haven't tried yet is to try out the other PCI-e slot, which is nonsense for one and second I've broken some parts on the back of my case to fit it in already.

But I suspect you had something more machiavellian in mind!

For that one yes definitely; and I'm usually not up to weird conspiracy theories to be honest heh :rolleyes:
 
The only thing I haven't tried yet is to try out the other PCI-e slot, which is nonsense for one and second I've broken some parts on the back of my case to fit it in already.

Ouch. I feel ya. I had to remove a cable management thingie on mine that was directly behind the PCIe slot --had to have that extra 1/2" or no go. We asked them if they'd prepared a chassis compatibility list like both IHVs have done with PSUs for SLI/CrossFire. . . . I still think that would be a good idea. If AMD's new flagship is of similar length, I hope they will consider it.
 
I think at this point, it's fairly clear it's not currently exposed whatsoever. The question, of course, is whether future drivers will expose it...
I recently found an NVIDIA doc (from techday iirc, but it wasn't published so I'll assume it's still under NDA, theoretically speaking!) that lists the G80 as having ~345GFlops, and not ~500GFlops, btw.
OK, well that looks to me, then, as though NVidia's marketing department got a hold of the MUL.

The correct way to think about it, of course, is that on NV4x, the first MUL was generally used for perspective correction. On G7x, one of the MADDs was also often reserved for perspective correction. For G8x, it never is. As such, considering inefficiencies etc., it might be fairer to compare G80's ~350GFlops with ~125GFlops for the G71's PS, plus ~50GFlops for the G71's VS. Thus, assuming perfect VS-PS distribution for G71's configuration, G80 should be at least twice as fast in terms of arithmetic, and more if the workload was unbalanced for G71. I think that's roughly what we're seeing, TBH.
That's a good rationalisation, so in a sense we already do have the MUL, but only because it's performing texture address calculation.

At the same time, though, G71 can get the full 16 FLOPs per clock from its MAD+MAD, so I think characterising its PS as 125GFLOPs is unrealistically conservative. Careful coding can produce rather more, instantaneously. Though if you require all FLOPs to be fp32, then I think some harshness is deserved. Too much hinges on instruction-level parallelism.

Also scalar instruction processing in G80 will easily produce big gains over G71, say an average of 33%.

ATI has had a dedicated MUL for perspective correction since R300, if not much before that! The R500 has one too, and you'd expect R600 to have one.
And ATI's never counted it for GFLOPs marketing. I think asynchronous texturing introduced by R300 is the root of the dedicated MUL - it's always a vec2 operation, isn't it? But actually my GPU history is much too shaky, :LOL:

It is surprising that NVIDIA has decided to no longer let the MUL be used for generic shader programming, while they used to in the past, and it looked like a good design choice to maximize unit utilization. This is especially true given that the SF unit is used to calculate 1/pos.w! It'll be interesting to see what they decide to do in future designs, of course.
I think that's it then, there is no dual-issuable MUL. Score +1 for NVidia marketing.

Jawed
 
Ouch. I feel ya. I had to remove a cable management thingie on mine that was directly behind the PCIe slot --had to have that extra 1/2" or no go. We asked them if they'd prepared a chassis compatibility list like both IHVs have done with PSUs for SLI/CrossFire. . . . I still think that would be a good idea. If AMD's new flagship is of similar length, I hope they will consider it.
Heh, guess I got really lucky! My 8800 GTX just missed one of my hard drives by about 2mm!
 
Missing MUL mystery solved?

It is surprising that NVIDIA has decided to no longer let the MUL be used for generic shader programming, while they used to in the past, and it looked like a good design choice to maximize unit utilization. This is especially true given that the SF unit is used to calculate 1/pos.w! It'll be interesting to see what they decide to do in future designs, of course.
Completely generic dual issue is expensive because you need to worry about routing so much data from the registers. This is probably why NVidia stayed away from this.

BTW, Uttar, have you tried to see if cross-product-like operations are single cycle?
cross(a,b) = madd(a.yzx, b.zxy, -mul(a.zxy, b.yzx))

This type of calculation won't need extra register inputs.
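To make the swizzle pattern concrete, here is a small sketch that checks the MADD/MUL formulation against the textbook cross product. The helper names (`swizzle`, `mul`, `madd`, `neg`) are just made up for illustration; they are not any real shader API.

```python
def swizzle(v, order):
    """Reorder the components of a 3-vector, e.g. swizzle(v, "yzx")."""
    index = {"x": 0, "y": 1, "z": 2}
    return tuple(v[index[c]] for c in order)


def mul(a, b):
    """Component-wise multiply."""
    return tuple(x * y for x, y in zip(a, b))


def madd(a, b, c):
    """Component-wise multiply-add: a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def neg(v):
    """Component-wise negation."""
    return tuple(-x for x in v)


def cross_swizzled(a, b):
    """cross(a,b) = madd(a.yzx, b.zxy, -mul(a.zxy, b.yzx))"""
    product = mul(swizzle(a, "zxy"), swizzle(b, "yzx"))
    return madd(swizzle(a, "yzx"), swizzle(b, "zxy"), neg(product))


def cross_reference(a, b):
    """Textbook cross product, for comparison."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)


a, b = (1, 2, 3), (4, 5, 6)
print(cross_swizzled(a, b), cross_reference(a, b))
```

Both the MUL and the MADD read only swizzles of the same two source vectors, which is the point about not needing extra register inputs (at least on a vector architecture).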

Check this out:
http://graphics.stanford.edu/projects/gpubench/results/
Interestingly, R5xx, NV4x, G7x, and G8x can all do single cycle XPD, but R3xx/R4xx and NV3x can't.

So maybe ATI also has it and is more similar to NVidia than we thought. Maybe both can use the address MUL for shader calculations in very specific circumstances.
 
Heh, guess I got really lucky! My 8800 GTX just missed one of my hard drives by about 2mm!

Blah, I've still got enough space before the HDDs to fit another half an 8800 in. Thermaltake has those plastic clip thingies on the back so you don't have to screw in each card. A bit of plastic from those was in the way of inserting a dual slot GPU.
 
BTW, Uttar, have you tried to see if cross-product-like operations are single cycle?
cross(a,b) = madd(a.yzx, b.zxy, -mul(a.zxy, b.yzx))

This type of calculation won't need extra register inputs.
In a scalar architecture, it takes 4 scalar input registers per vector component.

Check this out:
http://graphics.stanford.edu/projects/gpubench/results/
Interestingly, R5xx, NV4x, G7x, and G8x can all do single cycle XPD, but R3xx/R4xx and NV3x can't.
Given that XPD is a vec3 operation, shouldn't it be either at 133% (if the MUL is usable) or 66% of MAD4?
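Where those two percentages come from, as a rough sketch. It assumes a scalar design issuing one MADD per SP per clock, optionally with one co-issued MUL, and an XPD built from 3 MULs and 3 MADDs. That is one reading of the numbers, not something GPUBench itself states.

```python
from fractions import Fraction

MAD4_CYCLES = 4  # a vec4 MAD is four scalar MADDs, one per clock
XPD_MULS = 3     # one MUL per output component
XPD_MADDS = 3    # one MADD per output component


def xpd_rate_vs_mad4(mul_co_issued):
    """XPD throughput relative to MAD4 under the assumptions above."""
    if mul_co_issued:
        # Each MUL rides along with a MADD in the same clock.
        cycles = max(XPD_MULS, XPD_MADDS)
    else:
        # Every MUL and MADD takes its own clock.
        cycles = XPD_MULS + XPD_MADDS
    return Fraction(MAD4_CYCLES, cycles)


print(float(xpd_rate_vs_mad4(True)))   # MUL usable
print(float(xpd_rate_vs_mad4(False)))  # MUL not usable
```

So 4/3 with the MUL and 2/3 without it. A result of exactly 100% of MAD4 matches neither case.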
 
In a scalar architecture, it takes 4 scalar input registers per vector component.


Given that XPD is a vec3 operation, shouldn't it be either at 133% (if the MUL is usable) or 66% of MAD4?
At first I was thinking "Crap, well there goes my theory", but on second thought I'm not so sure.

For one, XPD is executed at exactly the same rate as a 4 component MADD. If the extra MUL wasn't being used, isn't it running faster than possible? 42 Ginst/s of XPD equals 1.5 MULs per stream processor per clock. The thing is that the program asks for a 4 channel writemask. What do you think that means for XPD? Maybe there's some extra work going on in the 4th channel that we're not accounting for which would make it a full 2 MULs per SP per clock. Also note that DP3 and DP4 are executing at a rate of 1.1 MULs per SP per clock.
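The arithmetic behind the 1.5 and 2 figures, as a quick sketch. It assumes the benchmarked part is an 8800 GTX with 128 SPs at a 1.35 GHz shader clock (my assumption, not something read off the results page), and counts 6 multiplies per 3-channel XPD or 8 if the 4th channel does the same work.

```python
# Assumed hardware: 8800 GTX, 128 SPs at a 1.35 GHz shader clock.
STREAM_PROCESSORS = 128
SHADER_CLOCK_GHZ = 1.35


def muls_per_sp_per_clock(ginst_per_s, muls_per_inst):
    """Multiplies each SP must retire per clock to sustain the given rate."""
    sp_clocks_per_ns = STREAM_PROCESSORS * SHADER_CLOCK_GHZ
    return ginst_per_s * muls_per_inst / sp_clocks_per_ns


XPD_RATE = 42  # Ginst/s, the figure quoted above

print(round(muls_per_sp_per_clock(XPD_RATE, 6), 2))  # 3 channels x 2 multiplies
print(round(muls_per_sp_per_clock(XPD_RATE, 8), 2))  # 4 channels x 2 multiplies
```

Anything above 1.0 here means more multiplies per clock than a single MADD unit per SP could supply on its own.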

I also thought about the same thing with the registers, but maybe the access structure is not quite like we think it is. More likely, maybe the execution of the XPD is rearranged cleverly. The 6 MULs needed for an XPD can be grouped into 3 pairs that only need 3 inputs each. Channel replication already requires more than one write port so that's not an issue here.


ASIDE: This test still does show some difference between R5xx and R4xx that makes it more in-line with NV4x. How does R5xx execute this at full rate? It's also doing at least 5.5 MULs per shader unit per clock.
 
ARB_fragment_program defines XPD as a vec3 operation: "The w component of the result vector is undefined." So that should not take any extra cycles.
Some of the results are really puzzling, though.

I also thought about the same thing with the registers, but maybe the access structure is not quite like we think it is. More likely, maybe the execution of the XPD is rearranged cleverly. The 6 MULs needed for an XPD can be grouped into 3 pairs that only need 3 inputs each.
Wouldn't that be still 5 cycles if there is a limit of 3 input registers?
cx = ay * bz - az * by
cy = az * bx - ax * bz
cz = ax * by - ay * bx

1: cx = az * by, cz = ax * by
2: cx = ay * bz - cx
3: cy = ax * bz
4: cy = az * bx - cy
5: cz = -ay * bx + cz

Is there a better way?
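As a sanity check, the 5-cycle schedule can be replayed in a toy model that also counts how many distinct scalar registers each cycle reads. The op encoding here is made up purely for illustration and says nothing about how G80 actually issues instructions.

```python
# Each op is (dst, product_sign, m1, m2, addend_sign, addend), meaning
# dst = product_sign * m1 * m2 + addend_sign * addend (addend may be None).
SCHEDULE = [
    [("cx", +1, "az", "by", 0, None), ("cz", +1, "ax", "by", 0, None)],
    [("cx", +1, "ay", "bz", -1, "cx")],
    [("cy", +1, "ax", "bz", 0, None)],
    [("cy", +1, "az", "bx", -1, "cy")],
    [("cz", -1, "ay", "bx", +1, "cz")],
]


def run(schedule, a, b):
    """Execute the schedule; return the result and the peak reads per cycle."""
    regs = {"ax": a[0], "ay": a[1], "az": a[2],
            "bx": b[0], "by": b[1], "bz": b[2]}
    max_reads = 0
    for cycle in schedule:
        reads, writes = set(), {}
        for dst, p_sign, m1, m2, a_sign, addend in cycle:
            reads.update((m1, m2))
            value = p_sign * regs[m1] * regs[m2]
            if addend is not None:
                reads.add(addend)
                value += a_sign * regs[addend]
            writes[dst] = value
        # Ops within one cycle all see the register state from before the cycle.
        regs.update(writes)
        max_reads = max(max_reads, len(reads))
    return (regs["cx"], regs["cy"], regs["cz"]), max_reads


result, max_reads = run(SCHEDULE, (1, 2, 3), (4, 5, 6))
print(result, "cycles:", len(SCHEDULE), "max reads per cycle:", max_reads)
```

In this model no cycle reads more than 3 distinct registers, and only the first cycle manages to pair two operations.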

Channel replication already requires more than one write port so that's not an issue here.
I'm not sure what you mean here. Why would you want to replicate channels when writing to registers?
 
ARB_fragment_program defines XPD as a vec3 operation: "The w component of the result vector is undefined." So that should not take any extra cycles.
Some of the results are really puzzling, though.


Wouldn't that be still 5 cycles if there is a limit of 3 input registers?
Ugh, you're right. Can't pair a mul with a mad without exceeding 3 input registers.

What if there was a single accumulator register that was always accessible and doesn't take away from read ports? You wouldn't need the huge multiplexer to access an arbitrary register, so it would be cheap.

1: tmp = az * by, cz = ax * by
2: cx = ay * bz - tmp, tmp = ax * bz
3: cy = az * bx - tmp
4: cz = -ay * bx + cz
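The same four steps written out as straight-line code, just to confirm they produce a correct cross product. `tmp` stands in for the hypothetical accumulator, and Python's tuple assignment models both operations in a cycle reading the old register values.

```python
def cross_with_accumulator(a, b):
    """Replay the 4-cycle schedule; tmp models the hypothetical accumulator."""
    ax, ay, az = a
    bx, by, bz = b

    # 1: tmp = az * by, cz = ax * by      (register reads: az, by, ax)
    tmp, cz = az * by, ax * by
    # 2: cx = ay * bz - tmp, tmp = ax * bz (register reads: ay, bz, ax)
    #    The right-hand side is evaluated first, so cx uses the old tmp.
    cx, tmp = ay * bz - tmp, ax * bz
    # 3: cy = az * bx - tmp               (register reads: az, bx)
    cy = az * bx - tmp
    # 4: cz = -ay * bx + cz               (register reads: ay, bx, cz)
    cz = -ay * bx + cz

    return (cx, cy, cz)


print(cross_with_accumulator((1, 2, 3), (4, 5, 6)))
```

If reads of `tmp` are free, no step touches more than 3 ordinary registers.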

DP3 and DP4 could be accelerated too. Maybe one cycle saved for every two dot products? The results are weird. These things are definitely outside the margin of error.

Xmax said:
I'm not sure what you mean here. Why would you want to replicate channels when writing to registers?
I'm just saying that writing two results in a cycle, like in your step #1, shouldn't be an issue. You mentioned that 4 different inputs (read ports) is too many, and I'm just saying 2 outputs (write ports) should be fine. Don't worry about it, I was just thinking out loud I guess.
 
What if there was a single accumulator register that was always accessible and doesn't take away from read ports? You wouldn't need the huge multiplexer to access an arbitrary register, so it would be cheap.

1: tmp = az * by, cz = ax * by
2: cx = ay * bz - tmp, tmp = ax * bz
3: cy = az * bx - tmp
4: cz = -ay * bx + cz

DP3 and DP4 could be accelerated too. Maybe one cycle saved for every two dot products? The results are weird. These things are definitely outside the margin of error.
That's an interesting idea. I wonder if such an accumulator would be accessible to both the MUL and MADD unit.

From the results it seems like DP3 takes ~2.75 = 11/4 cycles while DP4 is ~3.66 = 11/3. 4 FRC and 3 CMP for every 5 cycles. I'd like to see the actual shaders, maybe there are some hidden costs or optimization opportunities. I would guess the MAD rate is slightly lower because the first MAD uses two interpolants.
 