> If both IHVs have merely increased the number of ALUs per cluster, I have severe doubts that 2008 will be an interesting year. They might increase arithmetic throughput on paper with such a "quick and dirty" trick, yet not overall efficiency.

Well, such a G100 would still have 50% more processors than G80 (24 vs 16), so it should be quite a bit faster anyway. But I'd prefer the same number of SPs per processor myself.
> Just a random idea: increase the number of ALUs per Interpolation/SFU unit. The current ratio is very high and it seems simpler to me (and more efficient) to change that rather than try to expose the MUL better...

It would require re-engineering of the operand paths (more bandwidth!) to enable this performance gain, I guess. It's certainly an intriguing idea.
> It would require re-engineering of the operand paths (more bandwidth!) to enable this performance gain, I guess. It's certainly an intriguing idea.

I'm not sure if you're misunderstanding what I said or if it's the other way around! My idea was to change the ratio, which can be accomplished not only by increasing the number of SPs per multiprocessor but also by decreasing the number of interpolators per multiprocessor. Remember that the two interpolators are 4-wide, while the ALU is 8-wide... That doesn't mean you couldn't make the ALU wider instead, but both are possible.
> I'm not sure if you're misunderstanding what I said or if it's the other way around! My idea was to change the ratio, which can be accomplished not only by increasing the number of SPs per multiprocessor but also by decreasing the number of interpolators per multiprocessor. Remember that the two interpolators are 4-wide, while the ALU is 8-wide... That doesn't mean you couldn't make the ALU wider instead, but both are possible.

Well, presuming that the "missing MUL" is missing because of a register-file bandwidth shortfall, then an increase in that bandwidth would serve both to enable the MUL and to support an increase in the attribute-interpolation:TMU ratio, which I think is too low currently.
> ...would serve both to enable the MUL and to support an increase in the attribute-interpolation:TMU ratio, which I think is too low currently.

The interpolation:TMU ratio is too high right now, not too low; you're just not looking at the right numbers AFAICT. You've got 64 TMUs and 128 SPs, the latter being at >2x the clock of the TMUs, so let's just say 256 SPs. For a 2D texture, you need two cycles to do the necessary interpolation. So you divide 256 by 2, and then by 64, and you realize you've got twice the interpolation power you'd ever need to keep your TMUs at full capacity. Trust me, this is right, and bilinear not being full-speed has nothing to do with interpolation AFAICT.
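To make that arithmetic concrete, here's a quick back-of-the-envelope sketch of the same calculation (the 64 TMUs, 128 SPs, ~2x shader clock and 2 cycles per 2D interpolation are the figures from the post above, not measured values):

Code:
// Back-of-the-envelope version of the interpolation:TMU argument above.
// All figures are the post's assumptions, not measurements.
#include <cstdio>

int main() {
    const float tmus = 64.0f;                // TMUs, running at core clock
    const float sps = 128.0f;                // SPs, at >2x the TMU clock
    const float clock_ratio = 2.0f;          // "let's just say 256 SPs"
    const float cycles_per_2d_interp = 2.0f; // 2 cycles per 2D texture coordinate

    float effective_sps = sps * clock_ratio;                            // 256
    float interps_per_tmu_clock = effective_sps / cycles_per_2d_interp; // 128
    float headroom = interps_per_tmu_clock / tmus;                      // 2.0

    // Prints 2.0x: twice the interpolation power needed to feed the TMUs.
    printf("interpolation headroom: %.1fx\n", headroom);
    return 0;
}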
> Generally I don't like the sound of anything that would change the ALU compiler. If the "missing MUL" is enabled (maybe it already makes the occasional appearance in graphics, but apparently doesn't do so in CUDA) that's already a significant change. At least to CUDA.

Errr, I'm proposing NOT exposing the MUL... And I don't expect any change in the number of interpolation units per MAD to change the compiler noticeably.
> I think increasing the MAD:MI ratio would be pretty risky; SIN/COS/EXP are already 1/8th rate...

Not sure how you count that. They're pretty clearly 1/4th rate; this would make them 1/8th.
> The interpolation:TMU ratio is too high right now, not too low; you're just not looking at the right numbers AFAICT. You've got 64 TMUs and 128 SPs, the latter being at >2x the clock of the TMUs, so let's just say 256 SPs. For a 2D texture, you need two cycles to do the necessary interpolation. So you divide 256 by 2, and then by 64, and you realize you've got twice the interpolation power you'd ever need to keep your TMUs at full capacity. Trust me, this is right, and bilinear not being full-speed has nothing to do with interpolation AFAICT.

Fair enough, I think it was your theory originally. I never did the math, just looked at those bilinear speeds and thought "whoops".
> As the average program size increases and as you increase the number of SPs per TMU, it makes sense to reduce the number of interpolators compared to the number of MADs. And remember that by halving it and doubling the ALU:TEX ratio, the Interp:TEX ratio already stays constant!

Agreed with all that.
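Just to spell the ratio bookkeeping out with made-up per-multiprocessor numbers (purely illustrative, assuming a G80-like baseline of 8 MAD lanes and 8 interpolator lanes against 4 texture units):

Code:
// Illustrative only: halving interpolators-per-MAD by doubling the MADs,
// with the TMU count held fixed, doubles ALU:TEX but leaves Interp:TEX alone.
#include <cstdio>

int main() {
    float mads = 8.0f, interps = 8.0f, tex = 4.0f;  // assumed baseline
    printf("before: ALU:TEX = %.1f, Interp:TEX = %.1f\n",
           mads / tex, interps / tex);

    mads *= 2.0f;  // double the MADs; interpolators and TMUs unchanged
    printf("after:  ALU:TEX = %.1f, Interp:TEX = %.1f\n",
           mads / tex, interps / tex);  // Interp:TEX stays constant
    return 0;
}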
> Errr, I'm proposing NOT exposing the MUL... And I don't expect any change in the number of interpolation units per MAD to change the compiler noticeably.

I agree the compiler's algorithm won't change overtly (it already has to cope with varying intensities of TEX instructions).
> Not sure how you count that. They're pretty clearly 1/4th rate; this would make them 1/8th.

RCP/RSQRT/LOG are 1/4 rate. SIN/COS/EXP that I'm referring to are the "fast", less-precise versions (__sin etc.). I don't know what speed the regular versions are, presumably they're even slower!
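For reference, the split being described maps onto CUDA source like the minimal sketch below. (The fast intrinsics are actually spelled __sinf/__cosf/__expf in the guide rather than __sin; nvcc's --use_fast_math option substitutes one for the other.)

Code:
// Minimal sketch: the "fast", less-precise SFU path vs. the regular,
// more precise version that the thread says runs on the MAD pipeline.
__global__ void transcendentals(const float* in, float* fast_out,
                                float* slow_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast_out[i] = __sinf(in[i]); // fast intrinsic
        slow_out[i] = sinf(in[i]);   // regular version: slower, more precise
    }
}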
> UVD was 4.7mm² on 65nm. That would make it about 3.5-4mm² on 55nm. Since native 2D engines were dropped some time ago, I consider this hardly noteworthy. Furthermore, I'm not sure how you come to the conclusion that AMD's PCB costs are higher. Higher than a single RV670 SKU? Of course. Higher compared to an upcoming high-end NV SKU? Certainly not the case.

Granted, UVD is very small, and since there is no dedicated VGA logic anymore, which I just learned, the issue of wasted transistors seems to be minuscule.
> These are certainly good and valid points, but it's questionable if NV really wants to disable some units on a high-end ASIC, where the margins are very high to begin with.

I've looked at it this way: if you're about to release the fastest graphics card the world has ever seen(TM) - as AMD and Nvidia do every year - then you could face some problems wrt yields if you're into large, monolithic GPUs. Now, if your assumption is true, then it would be perfectly feasible to first come out with an ace up your sleeve, since you can still command a premium price for your product. You get the marketing buzz, you get the money and you get "improved yields" - just keeping fully functional chips back until you've fine-tuned your manufacturing process to the needs of the new GPU. Then, with no additional R&D, you can release a higher speed grade and - guess what - still get your premium price.
> The regular versions don't use the interpolator, only the MAD pipeline, and they are substantially slower, yeah. All SFU functions are much more precise than their SSE equivalents, though less so than the standard equivalents. SIN/COS/etc. are also 1/4th AFAICT - or are you referring to something else? Maybe the fact that they also require one MAD to put the input in the desired range? (Although that'd take one cycle, not 4 like the SFU - and it's done in parallel!)

I'm referring to the CUDA Programming Guide, 0.8.2, sections 6.1.1.1 and Appendix A.
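As a sketch of the range-reduction step mentioned above (an illustration of the idea only, not NVIDIA's actual hardware algorithm): a single MAD can map the angle onto a fraction of the period before the SFU evaluates the function.

Code:
// Hypothetical range reduction for sin/cos: one multiply-add maps x
// (radians) into period-relative form; the SFU would then work on the
// reduced argument. A sketch of the idea, not the real implementation.
__device__ float range_reduce(float x) {
    const float inv_two_pi = 0.15915494f; // 1 / (2*pi)
    float t = x * inv_two_pi + 0.5f;      // the single MAD (the +0.5 bias
                                          // is an arbitrary choice here)
    return t - floorf(t);                 // fractional position in the period
}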
> But I stand by what I've said wrt higher PCB costs. You need - at my current point of knowledge - routing from PCIe to the bridge, then to both GPUs. Additionally, you need routing for a dual memory interface (even at half width I reckon it to be more difficult), and you need two substrates to put the dies on (this may go away soon too). You have to distribute power evenly over a larger PCB area (to both GPUs and memory arrays), which is, AFAICT, not as trivial as it may seem.

I dare say I'm expecting GT200 to be 512-bit, so I expect the PCB to be expensive because of that...
> I've looked at it this way: if you're about to release the fastest graphics card the world has ever seen(TM) - as AMD and Nvidia do every year - then you could face some problems wrt yields if you're into large, monolithic GPUs. Now, if your assumption is true, then it would be perfectly feasible to first come out with an ace up your sleeve, since you can still command a premium price for your product. You get the marketing buzz, you get the money and you get "improved yields" - just keeping fully functional chips back until you've fine-tuned your manufacturing process to the needs of the new GPU. Then, with no additional R&D, you can release a higher speed grade and - guess what - still get your premium price.

I think one variable that's possibly getting less attention than it should is "professional". NVidia's revenues for professional GPUs appear to be in the region of 1/3 of all products (hope I've got that right), so the effect on margin for the high-end "risk" GPUs at the start of a new product cycle, which command an exponentially-priced premium, is probably very useful.
> I'm referring to the CUDA Programming Guide, 0.8.2, sections 6.1.1.1 and Appendix A.

What the hell, that's strange. I'll investigate, because in 3D APIs all that stuff is 1/4th speed, not 1/8th. For example: http://graphics.stanford.edu/projects/gpubench/results/8800GTX-0003/
> Are the fast transcendentals of the required precision for SM4?

The SFU was designed around SM4, obviously.
> I wonder if R6xx has fast and slow single-precision transcendentals - the patent for SIN/COS indicates 24-bit precision but that's subject to embodiment ...

Why would it? No architecture ever had something like that AFAIK. On G8x, the "slow" sin/cos/etc. for CUDA is done using the MAD unit. There is zero specialized hardware there...
CUDA said:
> 32-bit integer multiplication takes 16 clock cycles, but __mul24 and __umul24 (see Appendix B) provide signed and unsigned 24-bit integer multiplication in 4 clock cycles. On future architectures however, __mul24 will be slower than 32-bit integer multiplication, so we recommend to provide two kernels, one using __mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application.

That is quite intriguing to say the least and I'm not sure what the implications are - hmmm!
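The guide's recommendation amounts to something like the sketch below (hypothetical kernels, names made up; the host would pick whichever variant is faster on the detected architecture):

Code:
// Two variants of the same (made-up) kernel, per the guide's advice.
__global__ void mul_kernel_24(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __mul24(a[i], b[i]); // 4 cycles on G8x per the guide
}

__global__ void mul_kernel_32(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];         // 16 cycles on G8x, faster later
}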
> What the hell, that's strange. I'll investigate, because in 3D APIs all that stuff is 1/4th speed, not 1/8th. For example: http://graphics.stanford.edu/projects/gpubench/results/8800GTX-0003/

OpenGL/SM3 versus SM4? I presume SM4 is tighter, though I know from:
> Why would it? No architecture ever had something like that AFAIK. On G8x, the "slow" sin/cos/etc. for CUDA is done using the MAD unit. There is zero specialized hardware there...

Clearly there's potential for a disconnect between the "almost IEEE 754" implementations for GPGPU and the "D3D10 version of almost IEEE 754"... So, did ATI just do it all in the transcendental unit? I suspect so, but...
> Oh and it looks like we've been missing the obvious place to look for data on future architecture: the CUDA guide v1.1!

Hmm, well it could imply G8x integer multiplication is performed on the FP32 MAD, but the new improved FP ALU, perhaps due to adaptations for multi-clocked/ganged double-precision, is now too narrow?
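One way to see why 24-bit multiplies would fall out of the FP32 MAD essentially for free (my reading, nothing from the guide): an FP32 significand is exactly 24 bits wide counting the implicit bit, so the mantissa multiplier is already a 24x24 array. A quick check of that width:

Code:
// FP32 carries a 24-bit significand (23 stored bits + the implicit one),
// which is why integers are exact in float only up to 2^24.
#include <cstdio>

int main() {
    float f = 16777216.0f;      // 2^24: the last exactly-held integer
    printf("%.1f\n", f);        // 16777216.0
    printf("%.1f\n", f + 1.0f); // 2^24 + 1 rounds back to 16777216.0
    return 0;
}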
> I dare say I'm expecting GT200 to be 512-bit, so I expect the PCB to be expensive because of that...
> As for R780 (2x RV770), if it's a single substrate with dual dies, it'll be 512-bit too, I expect.
> Which would be more expensive? Hmm...

Hm - I did not consider this possibility: dual-die, single substrate. If that's the case, then most of my points will indeed be moot.
> I think one variable that's possibly getting less attention than it should is "professional". NVidia's revenues for professional GPUs appear to be in the region of 1/3 of all products (hope I've got that right), so the effect on margin for the high-end "risk" GPUs at the start of a new product cycle, which command an exponentially-priced premium, is probably very useful.

Sorry, I'm not quite sure what to make of this. What kind of effect do you mean?
> Sorry, I'm not quite sure what to make of this. What kind of effect do you mean?

Professional products increase the average selling price of a chip, and the high-end chips get the biggest boost. This will reduce the pain incurred with a "lower-yielding" huge chip, increasing the effective margin.
> Oh and it looks like we've been missing the obvious place to look for data on future architecture: the CUDA guide v1.1!

Hehe, yeah, I saw that too (IIRC they mentioned at least one other thing about future architectures in the 1.1 guide). So there are two possibilities: __umul24 got slower, or 32-bit integer mul got faster (probably the latter). But it doesn't really make sense to me; if you can do a 32-bit integer mul in x cycles, it seems weird that you can't do an unsigned 24-bit mul using the same datapath within the same x cycles. You can just use the lower portion of the 32-bit integer mul to do the unsigned 24-bit mul.
> Hehe, yeah, I saw that too (IIRC they mentioned at least one other thing about future architectures in the 1.1 guide). So there are two possibilities: __umul24 got slower, or 32-bit integer mul got faster (probably the latter). But it doesn't really make sense to me; if you can do a 32-bit integer mul in x cycles, it seems weird that you can't do an unsigned 24-bit mul using the same datapath within the same x cycles. You can just use the lower portion of the 32-bit integer mul to do the unsigned 24-bit mul.

Ah, hang on, could this be due to overflow (combined with sign handling)? The 32-bit multiplication can't generate a 24-bit-correct overflow result without doing more work?
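To pin down what's actually computed (per my reading of the guide, __[u]mul24 multiplies the low 24 bits of the operands and returns the low 32 bits of the 48-bit product), here's a plain-C emulation on top of a 32-bit multiply - hypothetical helper names, but it shows both the "lower portion" argument and where the extra sign handling comes in:

Code:
// Reference emulation of the 24-bit multiply intrinsics via 32-bit muls.
// The full 24x24 product is up to 48 bits; the top 16 are silently dropped.
#include <cstdio>
#include <cstdint>

uint32_t emul_umul24(uint32_t a, uint32_t b) {
    // Unsigned case: mask to 24 bits; uint32 multiplication naturally
    // wraps, i.e. keeps the low 32 bits of the product.
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}

int32_t sign_extend24(uint32_t x) {
    x &= 0xFFFFFFu;                      // keep the low 24 bits
    if (x & 0x800000u) x |= 0xFF000000u; // replicate bit 23 upward
    return (int32_t)x;
}

int32_t emul_mul24(int32_t a, int32_t b) {
    // Signed case: the extra step is sign-extending bit 23 first; the
    // multiply itself still just keeps the low 32 bits of the product.
    uint32_t sa = (uint32_t)sign_extend24((uint32_t)a);
    uint32_t sb = (uint32_t)sign_extend24((uint32_t)b);
    return (int32_t)(sa * sb);
}

int main() {
    printf("%u\n", emul_umul24(0x123456u, 0x654321u));
    printf("%d\n", emul_mul24(-5, 7));   // prints -35
    return 0;
}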