The Official NVIDIA G80 Architecture Thread

I'd be willing to bet that there's some difficulty fetching enough registers to go wider within the cluster. If you want to tweak something in there, you add more texture addressing units, or decrease the issue width to a quad. If the R600 really winds up being good at either modifying issue width or TLP, then the G80 arch is going to need some revisiting for GPGPU competitiveness.

I'm not really expecting any of that in G81, though. If I wanted to "get crazy", I'd go with just a straightforward scaling. 384->512 and 900->1200 would imply (assuming the bandwidth:computation balance doesn't change) 10ish clusters at 750/1800. Of course, the chip wouldn't be any smaller at 80nm than G80 at 90nm if they did that, and I doubt it would run cooler either! Unless G80 has a couple of clusters in there for yield purposes that they can unlock at 80nm, I don't think I would expect even that.
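For the curious, here's the arithmetic behind that "10ish" figure, a quick sketch assuming stock G80 numbers (8 clusters at a 1350MHz shader clock, 384-bit bus at 900MHz):

# Back-of-envelope scaling check (assumes stock G80: 8 clusters at a 1350MHz
# shader clock, 384-bit bus at 900MHz; hypothetical part: 512-bit at 1200MHz)
bw_ratio = (512 * 1200) / (384 * 900)          # ~1.78x the memory bandwidth
g80_compute = 8 * 1350                         # cluster-MHz, a crude compute proxy
clusters = bw_ratio * g80_compute / 1800       # clusters needed at an 1800MHz shader clock
print(round(bw_ratio, 2), round(clusters, 1))  # 1.78 10.7 -- i.e. "10ish" clusters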

Of course, another possible rabbit is 65nm. Hmm, nah. Kirk hasn't been out vigorously denying 65nm.... ;-)
 
It's not like anyone would bother to revisit the architecture just for GPGPU purposes IMO. That doesn't sell any significant amount of cards.
 
Well, I know that Orton said he expects physics, which is a subset of GPGPU, to shift significant amounts of product by mid-2007. I have no reason to think NVIDIA doesn't see it similarly.
 
The physics will be rather sparsely used in games (lowest common denominator, yadda, yadda...), so I think that won't really be a criterion, even if ATI should do it, say, twice as fast. It'll be about the "common" gaming usage as always: shader perf, texturing, etc. My 2 cents.
 
The physics will be rather sparsely used in games (lowest common denominator, yadda, yadda...), so I think that won't really be a criterion, even if ATI should do it, say, twice as fast. My 2 cents.

Well, I didn't mean it in a competitive-analysis fashion (though possibly Orton did!). Just that certainly ATI, and probably NV, are thinking it's enough of a difference-maker from a sales and marketing point of view to be revisiting architectures over.
 
I'd be willing to bet that there's some difficulty fetching enough registers to go wider within the cluster.
I'd be willing to bet you just implied the R520->R580 transition never existed, and that ATI redesigned the entire chip in less than 3 months!
Of course, another possible rabbit is 65nm.
Hmm :)


Uttar
 
I'd be willing to bet you just implied the R520->R580 transition never existed, and that ATI redesigned the entire chip in less than 3 months!

That move came at the price of wider batches and poorer dynamic-branching scaling, though.
If R600 really turns out to have exceptional dynamic-branching performance, going wider in G81 is going to be ... interesting.

Consider an alternative. "The missing MUL" would require 2 operands over 16 channels to activate. If, instead, the 8x2 shaders became 8x3, the additional 8 shaders per cluster would require a max of 3 ops each. That's actually a lower maximum in the way of bandwidth requirements (not counting the SFU). Given that activation of the MUL isn't going to raise performance nearly as much in reality as on a GFLOP scoresheet, I know which one I would wish for :)
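To put rough numbers on that comparison (operand counts straight from the post, peak per clock, ignoring the SFU):

# Peak operand fetch per clock for the two options discussed above
missing_mul = 2 * 16  # activating the MUL: 2 operands over 16 channels = 32
wider_8x3   = 8 * 3   # 8x2 -> 8x3: 8 extra shaders at a max of 3 operands = 24
print(missing_mul, wider_8x3)  # 32 vs 24: the wider cluster needs less peak fetch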
 
Secondly, ATI's units are Vec3+Scalar, which is obviously less efficient than four purely scalar units. In the worst possible case, it's half as efficient; on average, it's certainly not quite that bad. Furthermore, the R580 also has extra ADD units. They aren't always usable and/or exposed, but they still are far from dormant.

You keep making these bold claims. How do you know this is true? Haven't some instructions on G80 gone from single cycle to 4 cycles? Keep in mind that G80 also has to share its 128 ALUs with vertex shaders.
 
You keep making these bold claims. How do you know this is true? Haven't some instructions on G80 gone from single cycle to 4 cycles? Keep in mind that G80 also has to share its 128 ALUs with vertex shaders.

The worst case scenario for a vec3+scalar is to co-issue two scalar instructions => 50% ALU utilization where G80 would be at 100% utilization. They aren't really guesses - these things are pretty obvious. It doesn't have anything to do with cycles/instruction either. Now what's also obvious is that utilization on a vec3+scalar arrangement may not be bad at all on average.
 
I'd be willing to bet that there's some difficulty fetching enough registers to go wider within the cluster.

Why's that?

I don't believe the register file and ALUs are organized like a junior-high dance: all the registers on one side and all the ALUs on the other. I'd expect them to be paired pretty tightly -- some subset of the register file is local to each unit in the SIMD array. This assumes that each vertex/pixel is permanently assigned to one ALU + register segment so that most accesses are local. That way the total size and bandwidth of the register file scales with ALUs, but the port requirements per segment are constant.
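A minimal sketch of that pairing, in Python with made-up sizes (nothing here is a claim about G80's actual register file):

# Hypothetical segmented register file: each ALU lane owns a private slice,
# and a pixel/vertex stays on one lane, so its register reads are local.
class Lane:
    def __init__(self, threads, regs_per_thread):
        # the register-file segment local to this ALU
        self.segment = [[0.0] * regs_per_thread for _ in range(threads)]

class SimdArray:
    def __init__(self, width, threads, regs_per_thread):
        self.lanes = [Lane(threads, regs_per_thread) for _ in range(width)]

# Doubling width doubles total register capacity and total bandwidth,
# but the port count on each individual segment stays constant.
narrow = SimdArray(width=8,  threads=24, regs_per_thread=8)
wide   = SimdArray(width=16, threads=24, regs_per_thread=8)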
 
The worst case scenario for a vec3+scalar is to co-issue two scalar instructions => 50% ALU utilization where G80 would be at 100% utilization. They aren't really guesses - these things are pretty obvious. It doesn't have anything to do with cycles/instruction either. Now what's also obvious is that utilization on a vec3+scalar arrangement may not be bad at all on average.

The worst case is a scalar instruction that could not pair with any other instruction.
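For concreteness, the two readings side by side (treating the vec3+scalar pair as four lanes):

# Useful lanes per issue, out of 4 (vec3 + scalar)
co_issued_scalars = 2 / 4  # two paired scalar ops: 50%, the case quoted above
unpaired_scalar   = 1 / 4  # a lone scalar op with nothing to pair: 25%, the floor
full_co_issue     = 4 / 4  # vec3 op + scalar op together: 100%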
 
Why's that?

I don't believe the register file and ALUs are organized like a junior-high dance: all the registers on one side and all the ALUs on the other. I'd expect them to be paired pretty tightly

Yeah, so did I, until I saw that patent Jawed threw around.
Regardless, 90->80 doesn't give you a lot of room. Doubling ALU count + register width sounds like a lot of transistors to me, so I was keeping one of them constant....
 
The worst case scenario for a vec3+scalar is to co-issue two scalar instructions => 50% ALU utilization where G80 would be at 100% utilization. They aren't really guesses - these things are pretty obvious. It doesn't have anything to do with cycles/instruction either. Now what's also obvious is that utilization on a vec3+scalar arrangement may not be bad at all on average.
The worst case on G80 is 0% utilisation of the primary ALU:

DP4 r0.w, r1, r2
RSQ r0.w, r0.w
MUL r3, r4, r0.w

the MUL has to wait until the RSQ has completed (which has to wait until the DP4 has completed). So while the SF ALU is working on the RSQ, the primary ALU is idle, which is four clocks.

16 fragments x 4 clocks looks unhealthy in comparison with a vec3+SF GPU: 4 batches, each of 4 fragments x 1 clock (both GPUs considered as having 16-SIMD ALUs). The latter is 4x faster on the MUL. But that's an extreme case, generally G80 wins out significantly.
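A toy timeline makes the stall visible (4 clocks per instruction over the 16-fragment batch, as above; illustrative only):

# Dependent DP4 -> RSQ -> MUL chain: the MUL needs r0.w, which the RSQ produces
schedule = [
    ("DP4", "primary ALU"),  # clocks 0-3
    ("RSQ", "SF ALU"),       # clocks 4-7: the primary ALU has nothing to issue
    ("MUL", "primary ALU"),  # clocks 8-11
]
idle = sum(4 for _, unit in schedule if unit == "SF ALU")
print(idle)  # 4 clocks of idle primary ALU per pass through the chain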

Jawed
 
The worst case on G80 is 0% utilisation of the primary ALU:
Did you actually measure the performance of that code on a G80? You'd be limited to 24 pixels/clock from the ROPs, not from the shader.

If you repeat the same piece of code 100 times (say), and if the compiler couldn't optimize any of it away, you're still looking at 8 clocks to run one iteration per ALU.

Edit: Minor clarification.
Edit2: Fixed scalar vs vector MUL confusion.
 
The worst case on G80 is 0% utilisation of the primary ALU:

DP4 r0.w, r1, r2
RSQ r0.w, r0.w
MUL r3, r4, r0.w

the MUL has to wait until the RSQ has completed (which has to wait until the DP4 has completed). So while the SF ALU is working on the RSQ, the primary ALU is idle, which is four clocks.

16 fragments x 4 clocks looks unhealthy in comparison with a vec3+SF GPU: 4 batches, each of 4 fragments x 1 clock (both GPUs considered as having 16-SIMD ALUs). The latter is 4x faster on the MUL. But that's an extreme case, generally G80 wins out significantly.

Jawed

Why do you think it has 16 shaders tied to the same instruction? Isn't G80 MIMD?

Also, your example is contrived and probably could easily be prevented at final compile time in the driver... not to mention the scheduling could hide it.

Maybe I misunderstood...
 
The worst case on G80 is 0% utilisation of the primary ALU:

DP4 r0.w, r1, r2
RSQ r0.w, r0.w
MUL r3, r4, r0.w

the MUL has to wait until the RSQ has completed (which has to wait until the DP4 has completed). So while the SF ALU is working on the RSQ, the primary ALU is idle, which is four clocks.

I don't think I fully get the example (brain is fried - just got home from work) but couldn't you contrive something similar to get similarly poor utilization of the vec3? Also, if I understand what you're trying to demonstrate, this 4-clock idle time during the RSQ is specific to G80's implementation, not to scalar architectures in general.
 
Normally, I would put this in an R5xx or R600 thread, but everything (which isn't much) that is said about them is already pretty much established by now.
What I didn't know was that Nvidia apparently has "something up the sleeve" for March.

http://www.vr-zone.com/?i=4400

Since they said that immediately after the R600 bit, we can assume it's a response to R600, and not G84/G86.
What's the word on this thing: an overclocked 90nm G80, an 80nm die shrink and overclock of the G80-based design (a la G70->G71), or a real design improvement (this last one seems a bit too soon to be true)? Is this even credible?
I was under the impression that any response to R600 would come no sooner than June through September 2007.
 
Maybe that's when Nvidia will unleash the true power of the G80 with magical drivers? Could also be a shrink to 80nm with GDDR4, seeing as March would put G80 at 4-5 months old.
 