NVIDIA GT200 Rumours & Speculation Thread

Status
Not open for further replies.
If both IHVs have merely increased the number of ALUs per cluster, I have severe doubts that 2008 will be an interesting year. They might increase arithmetic throughput on paper with such a "quick and dirty" trick, yet not overall efficiency.
 
If both IHVs have merely increased the number of ALUs per cluster, I have severe doubts that 2008 will be an interesting year. They might increase arithmetic throughput on paper with such a "quick and dirty" trick, yet not overall efficiency.
Well, such a G100 would still have 50% more processors than G80 (24 vs 16), so it should be quite a bit faster anyway. But I'd prefer the same number of SPs per processor myself.
 
Well, such a G100 would still have 50% more processors than G80 (24 vs 16), so it should be quite a bit faster anyway. But I'd prefer the same number of SPs per processor myself.

Having the same number of SPs per processor would be another scenario that could, under certain circumstances, make sense. Instead of having MADD+MUL, what exactly speaks against having more than 3 FLOPs/SP (whereby the third FLOP is rather rarely used for general shading, afaik)?

I'm not saying it is or will be; I'm merely pointing out (as in the past) that sterile numbers are just that - sterile numbers - and they mean jacksh*t until more details surface about the internals of an architecture and its per-FLOP overall efficiency. A G71/GX2 has on paper more theoretical MADD FLOPs than a G80... errrr, so what?
 
Just a random idea: increase the number of ALUs per Interpolation/SFU unit. The current ratio is very high and it seems simpler to me (and more efficient) to change that rather than try to expose the MUL better...
 
Just a random idea: increase the number of ALUs per Interpolation/SFU unit. The current ratio is very high and it seems simpler to me (and more efficient) to change that rather than try to expose the MUL better...
It would require re-engineering of the operand paths (more bandwidth!) to enable this performance gain, I guess. It's certainly an intriguing idea.

Jawed
 
Just a random idea: increase the number of ALUs per Interpolation/SFU unit. The current ratio is very high and it seems simpler to me (and more efficient) to change that rather than try to expose the MUL better...

I wasn't thinking of "just exposing" the MUL better.
 
It would require re-engineering of the operand paths (more bandwidth!) to enable this performance gain, I guess. It's certainly an intriguing idea.
I'm not sure if you're misunderstanding what I said or if it's the other way around! :) My idea was to change the ratio, which can be accomplished not only by increasing the number of SPs per multiprocessor but also by decreasing the number of interpolators per multiprocessor. Remember that the two interpolators are 4-wide, while the ALU is 8-wide... That doesn't mean you couldn't make the ALU wider instead, but both are possible.

And Ail, I know that's not what you were thinking, but the fact remains that all alternatives to what I'm proposing imply rather substantial complications of the multiprocessor architecture, while this would simply make it leaner instead. It would indirectly increase the proportion of control logic though, so that's a problem up to a certain extent.
 
I'm not sure if you're misunderstanding what I said or if it's the other way around! :) My idea was to change the ratio, which can be accomplished not only by increasing the number of SPs per multiprocessor but also by decreasing the number of interpolators per multiprocessor. Remember that the two interpolators are 4-wide, while the ALU is 8-wide... That doesn't mean you couldn't make the ALU wider instead, but both are possible.
Well, presuming that the "missing MUL" is missing because of a register-file bandwidth shortfall, then an increase in that bandwidth would serve both to enable the MUL and to support an increase in the attribute-interpolation:TMU ratio, which I think is too low currently.

But you're also describing an alternative that I wasn't considering: widening the multiprocessors or adding multiprocessors to each cluster, effectively lowering the MAD:MI ratio (i.e. to 8:4 instead of 8:8). But you'd have to either widen the multiprocessors to 24 MADs (24:12 MAD:MI) or add at least 3 multiprocessors to get a net gain in interpolation:TMU ratio :cry:

Generally I don't like the sound of anything that would change the ALU compiler. If the "missing MUL" is enabled (maybe it already makes the occasional appearance in graphics, but apparently doesn't do so in CUDA) that's already a significant change. At least to CUDA.

I think increasing MAD:MI ratio would be pretty risky, SIN/COS/EXP are already 1/8th rate...

Jawed
 
would serve both to enable the MUL and to support an increase in the attribute-interpolation:TMU ratio, which I think is too low currently.
The interpolation:TMU ratio is too high right now, not too low; you're just not looking at the right numbers AFAICT. You've got 64 TMUs and 128 SPs, the latter at more than 2x the clock of the TMUs, so let's just say 256 SPs. For a 2D texture, you need two cycles to do the necessary interpolation. So you divide 256 by 2, and then by 64, and you realize you've got twice the interpolation power you'd ever need to keep your TMUs at full capacity. Trust me, this is right, and bilinear not being full-speed has nothing to do with interpolation AFAICT.
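The arithmetic above can be sketched numerically. All figures are the post's own assumptions (G80-like: 64 TMUs, 128 SPs at roughly twice the TMU clock, two interpolation cycles per 2D texture lookup); this just mirrors the divide-by-2-then-by-64 reasoning:

```python
# Rough sketch of the interpolation-vs-TMU throughput argument above.
# All numbers are assumptions taken from the post (G80-like figures).
tmus = 64                  # texture units, running at the core clock
sps = 128                  # scalar processors, running at the shader clock
shader_clock_ratio = 2     # shader clock assumed ~2x the core clock
cycles_per_2d_interp = 2   # interpolation cycles needed per 2D texture fetch

# Normalise the SPs to core-clock equivalents, then see how many
# texture coordinates per core clock could be interpolated.
effective_sps = sps * shader_clock_ratio               # 256
interp_per_clock = effective_sps / cycles_per_2d_interp  # 128
headroom = interp_per_clock / tmus

print(headroom)  # 2.0: twice the interpolation power the TMUs can consume
```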

As the average program size increases and as you increase the number of SPs per TMU, it makes sense to reduce the number of interpolators compared to the number of MADs. And remember that by halving it and doubling the ALU:TEX ratio, the Interp:TEX ratio already stays constant!

Generally I don't like the sound of anything that would change the ALU compiler. If the "missing MUL" is enabled (maybe it already makes the occasional appearance in graphics, but apparently doesn't do so in CUDA) that's already a significant change. At least to CUDA.
Errr, I'm proposing NOT exposing the MUL... :) And I don't expect any change in the number of interpolation units per MAD to change the compiler noticeably.

I think increasing MAD:MI ratio would be pretty risky, SIN/COS/EXP are already 1/8th rate...
Not sure how you count that. They're pretty clearly 1/4th rate; this would make them 1/8th.
 
The interpolation:TMU ratio is too high right now, not too low; you're just not looking at the right numbers AFAICT. You've got 64 TMUs and 128 SPs, the latter at more than 2x the clock of the TMUs, so let's just say 256 SPs. For a 2D texture, you need two cycles to do the necessary interpolation. So you divide 256 by 2, and then by 64, and you realize you've got twice the interpolation power you'd ever need to keep your TMUs at full capacity. Trust me, this is right, and bilinear not being full-speed has nothing to do with interpolation AFAICT.
Fair enough, I think it was your theory originally. I never did the math :oops: just looked at those bilinear speeds and thought whoops.

As the average program size increases and as you increase the number of SPs per TMU, it makes sense to reduce the number of interpolators compared to the number of MADs. And remember that by halving it and doubling the ALU:TEX ratio, the Interp:TEX ratio already stays constant!
Agreed with all that.

Errr, I'm proposing NOT exposing the MUL... :) And I don't expect any change in the number of interpolation units per MAD to change the compiler noticeably.
I agree the compiler's algorithm won't change overtly (it already has to cope with varying intensities of TEX instructions).

Not sure how you count that. They're pretty clearly 1/4th rate; this would make them 1/8th.
RCP/RSQRT/LOG are 1/4 rate. SIN/COS/EXP that I'm referring to are the "fast", less-precise versions (__sin etc.). I don't know what speed the regular versions are, presumably they're even slower!

Jawed
 
The regular versions don't use the interpolator, only the MAD pipeline, and they are substantially slower, yeah. All SFU functions are much more precise than their SSE equivalents, although less so than the standard versions. SIN/COS/etc. are also 1/4th AFAICT - or are you referring to something else? Maybe the fact that they also require one MAD to put the input in the desired range? (Although that'd take one cycle, not 4 like the SFU - and it's done in parallel!)
 
UVD was 4.7mm² on 65nm. That would make it about 3.5-4mm² on 55nm. Since native 2D engines were dropped some time ago, I consider this hardly noteworthy. Furthermore, I'm not sure how you come to the conclusion that AMD's PCB costs are higher. Higher than a single RV670 SKU? Of course. Higher compared to an upcoming high-end NV SKU? Certainly not the case.
Granted, UVD is very small, and since there is no dedicated VGA logic anymore, which I just learned, the point about wasted transistors seems to be minuscule.

But I stand by what I've said wrt higher PCB costs. You need - to my current knowledge - routing from the PCIe slot to the bridge, then to both GPUs. Additionally, you need routing for a dual memory interface (even at half width each, I reckon it to be more difficult) and you need two substrates to put the dies on (this may go away soon too). You have to distribute power evenly over a larger PCB area (to both GPUs and memory arrays), which is, afaict, not as trivial as it may seem.

These are certainly good and valid points, but it's questionable whether NV really wants to disable some units on a high-end ASIC, where the margins are very high to begin with.
I've looked at it this way: if you're about to release the fastest graphics card the world has ever seen(TM) - as AMD and Nvidia do every year - then you could face some problems wrt yield if you're into large, monolithic GPUs. Now, if your assumption is true, then it would be perfectly feasible to come out first while keeping an ace up your sleeve, since you can still command a premium price for your product. You get the marketing buzz, you get the money, and you get "improved yields" - just holding back fully functional chips until you've fine-tuned your manufacturing process to the needs of the new GPU. Then, with no additional R&D, you can release a higher speed grade and - guess what - still get your premium price.

At least this was my stream of ideas when writing that.
 
The regular versions don't use the interpolator, only the MAD pipeline, and they are substantially slower, yeah. All SFU functions are much more precise than their SSE equivalents, although less so than the standard versions. SIN/COS/etc. are also 1/4th AFAICT - or are you referring to something else? Maybe the fact that they also require one MAD to put the input in the desired range? (Although that'd take one cycle, not 4 like the SFU - and it's done in parallel!)
I'm referring to the CUDA Programming Guide, 0.8.2, sections 6.1.1.1 and Appendix A.

Are the fast transcendentals of the required precision for SM4? (Struggling to find a comprehensive definition.) Not sure if this is a case of "good enough for SM4" versus "needed for CUDA".

I wonder if R6xx has fast and slow single-precision transcendentals - the patent for SIN/COS indicates 24-bit precision but that's subject to embodiment ...

Jawed
 
But I stand by what I've said wrt higher PCB costs. You need - to my current knowledge - routing from the PCIe slot to the bridge, then to both GPUs. Additionally, you need routing for a dual memory interface (even at half width each, I reckon it to be more difficult) and you need two substrates to put the dies on (this may go away soon too). You have to distribute power evenly over a larger PCB area (to both GPUs and memory arrays), which is, afaict, not as trivial as it may seem.
I dare say I'm expecting GT200 to be 512-bit so I expect the PCB to be expensive because of that...

As for R780 (2xRV770) if it's a single substrate with dual-dies it'll be 512-bit too I expect.

Which would be more expensive? hmm...

I've looked at it this way: if you're about to release the fastest graphics card the world has ever seen(TM) - as AMD and Nvidia do every year - then you could face some problems wrt yield if you're into large, monolithic GPUs. Now, if your assumption is true, then it would be perfectly feasible to come out first while keeping an ace up your sleeve, since you can still command a premium price for your product. You get the marketing buzz, you get the money, and you get "improved yields" - just holding back fully functional chips until you've fine-tuned your manufacturing process to the needs of the new GPU. Then, with no additional R&D, you can release a higher speed grade and - guess what - still get your premium price.
I think one variable that's possibly getting less attention than it should is "professional". NVidia's revenues for professional GPUs appear to be in the region of 1/3 of all products (hope I've got that right), so the effect on margin for the high-end "risk" GPUs at the start of a new product, which command an exponentially-priced premium, is prolly very useful :p

Jawed
 
I'm referring to the CUDA Programming Guide, 0.8.2, sections 6.1.1.1 and Appendix A.
What the hell, that's strange. I'll investigate; because in 3D APIs, all that stuff is 1/4th speed, not 1/8th. For example: http://graphics.stanford.edu/projects/gpubench/results/8800GTX-0003/
Are the fast transcendentals of the required precision for SM4?
The SFU was designed around SM4, obviously.
I wonder if R6xx has fast and slow single-precision transcendentals - the patent for SIN/COS indicates 24-bit precision but that's subject to embodiment ...
Why would it? No architecture ever had something like that AFAIK. On G8x, the "slow" sin/cos/etc. for CUDA is done using the MAD unit. There is zero specialized hardware there...

Oh and it looks like we've been missing the obvious place to look for data on future architecture: the CUDA guide v1.1!
CUDA said:
32-bit integer multiplication takes 16 clock cycles, but __mul24 and __umul24 (see Appendix B) provide signed and unsigned 24-bit integer multiplication in 4 clock cycles. On future architectures however, __mul24 will be slower than 32-bit integer multiplication, so we recommend to provide two kernels, one using __mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application.
That is quite intriguing to say the least and I'm not sure what the implications are - hmmm!
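For reference, the documented semantics of __umul24 quoted above (multiply the low 24 bits of each operand, return the low 32 bits of the up-to-48-bit product) can be emulated like this; the helper name is just for illustration, not a real API:

```python
def emulated_umul24(a: int, b: int) -> int:
    """Emulate CUDA's __umul24: multiply the low 24 bits of each
    operand and return the low 32 bits of the product."""
    return ((a & 0xFFFFFF) * (b & 0xFFFFFF)) & 0xFFFFFFFF

# For operands that already fit in 24 bits, the result matches an
# ordinary 32-bit multiply (mod 2**32):
assert emulated_umul24(1000, 1000) == (1000 * 1000) & 0xFFFFFFFF
# High bits above bit 23 are simply ignored:
assert emulated_umul24(0x1000005, 3) == 15
```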

Regarding professional GPUs at NV, afaik the ratio is a bit smaller now because desktop and laptop revenue has grown so incredibly fast in the last 12-18 months, while professional only increased at a much more normal rate. Not sure what the ratio is now, but I could check - I think it's probably 1/4 or 1/5...
 
What the hell, that's strange. I'll investigate; because in 3D APIs, all that stuff is 1/4th speed, not 1/8th. For example: http://graphics.stanford.edu/projects/gpubench/results/8800GTX-0003/

The SFU was designed around SM4, obviously.
OpenGL/SM3 versus SM4? I presume SM4 is tighter, though I know from:

http://download.microsoft.com/download/f/2/d/f2d5ee2c-b7ba-4cd0-9686-b6508b5479a1/Direct3D10_web.pdf

that precision is not particularly tight. Sigh...

Why would it? No architecture ever had something like that AFAIK. On G8x, the "slow" sin/cos/etc. for CUDA is done using the MAD unit. There is zero specialized hardware there...
Clearly there's potential for a disconnect between the "almost IEEE 754" implementations for GPGPU and the "D3D10 version of almost IEEE 754"... So, did ATI just do it all in the transcendental unit? I suspect so, but...

Oh and it looks like we've been missing the obvious place to look for data on future architecture: the CUDA guide v1.1!

That is quite intriguing to say the least and I'm not sure what the implications are - hmmm!
Hmm, well it could imply G8x integer multiplication is performed on the FP32 MAD, but the new improved FP ALU, perhaps due to adaptations for multi-clocked/ganged double-precision, is now too narrow?

Jawed
 
I dare say I'm expecting GT200 to be 512-bit so I expect the PCB to be expensive because of that...

As for R780 (2xRV770) if it's a single substrate with dual-dies it'll be 512-bit too I expect.

Which would be more expensive? hmm...
Hm - I did not consider this possibility: dual-die, single substrate. If that's the case, then most of my points will indeed be moot.

I think one variable that's possibly getting less attention than it should is "professional". NVidia's revenues for professional GPUs appear to be in the region of 1/3 of all products (hope I've got that right), so the effect on margin for the high-end "risk" GPUs at the start of a new product, which command an exponentially-priced premium, is prolly very useful :p

Jawed
Sorry, I'm not quite sure what to make of this. What kind of effect do you mean?
 
Sorry, I'm not quite sure what to make of this. What kind of effect do you mean?
Professional products increase the average selling price of a chip, and the high-end chips get the biggest boost. This will reduce the pain incurred with a "lower-yielding" huge chip, increasing the effective margin.

So a 450mm² (guess) GT200 might sound "risky", but a lot of the risk will be offset by the 10-20x higher selling price of a percentage of these chips.
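As a toy illustration of that offset - every number here is hypothetical except the 10-20x premium mentioned above:

```python
# Toy average-selling-price calculation. All inputs are made up
# except the 10-20x professional premium discussed in the thread.
chips = 100                       # a batch of good dies
pro_chips = 5                     # assume 5% end up in professional boards
consumer_price = 600              # assumed high-end consumer price ($)
pro_price = consumer_price * 10   # low end of the 10-20x premium

revenue = (chips - pro_chips) * consumer_price + pro_chips * pro_price
asp = revenue / chips
print(asp)  # 870.0: even a 5% professional mix lifts the ASP by 45%
```

The point being that a small fraction of high-premium parts moves the average margin per die a long way, which is exactly why the "risk" of a huge chip is smaller than the consumer price alone suggests.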

Looks like Arun will quantify the effect.

Jawed
 
What the hell, that's strange. I'll investigate; because in 3D APIs, all that stuff is 1/4th speed, not 1/8th. For example: http://graphics.stanford.edu/projects/gpubench/results/8800GTX-0003/
The SFU was designed around SM4, obviously.
Why would it? No architecture ever had something like that AFAIK. On G8x, the "slow" sin/cos/etc. for CUDA is done using the MAD unit. There is zero specialized hardware there...

Oh and it looks like we've been missing the obvious place to look for data on future architecture: the CUDA guide v1.1!
That is quite intriguing to say the least and I'm not sure what the implications are - hmmm!
Hehe, yeah, I saw that too (iirc they mentioned at least one other thing about future architectures in the 1.1 guide). So there are two possibilities: __umul24 got slower, or 32-bit integer mul got faster (probably the latter). But it doesn't really make sense to me; if you can do a 32-bit integer mul in x cycles, it seems weird that you can't do an unsigned 24-bit mul using the same datapath within the same x cycles. You can just use the lower portion of the 32-bit integer mul to do the unsigned 24-bit mul.
 
Hehe, yeah, I saw that too (iirc they mentioned at least one other thing about future architectures in the 1.1 guide). So there are two possibilities: __umul24 got slower, or 32-bit integer mul got faster (probably the latter). But it doesn't really make sense to me; if you can do a 32-bit integer mul in x cycles, it seems weird that you can't do an unsigned 24-bit mul using the same datapath within the same x cycles. You can just use the lower portion of the 32-bit integer mul to do the unsigned 24-bit mul.
Ah, hang on - could this be due to overflow (combined with sign handling)? The 32-bit multiplication can't generate a 24-bit-correct overflow result without doing more work?
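The overflow point can be seen numerically: a full 24x24-bit product needs up to 48 bits, so the low 32 bits returned by an ordinary multiply necessarily discard the top of the product (a quick sketch, not a statement about how the hardware actually handles it):

```python
# Two maximal unsigned 24-bit operands:
a = b = 0xFFFFFF               # 2**24 - 1
full = a * b                   # the true 48-bit product
low32 = full & 0xFFFFFFFF      # what a 32-bit result register keeps

print(hex(full))   # 0xfffffe000001
print(hex(low32))  # 0xfe000001: the top 16 bits of the product are gone
```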

Jawed
 