NVIDIA confirms Next-Gen close to 1TFlop in 4Q07

Another question:

Did NV ever publish the MADD+MUL value for the GeForce 8 cards, or is the only public value the 345 GFLOPs MADD figure in the CUDA documents?
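(For reference, a quick back-of-the-envelope in Python, assuming the commonly cited 8800 GTX figures of 128 SPs at a 1.35 GHz shader clock, shows where the 345 GFLOPs number comes from and what the MADD+MUL figure would be if the MUL were counted:)

```python
# Commonly cited 8800 GTX (G80) shader specs: 128 SPs at a 1.35 GHz shader clock
sps = 128
shader_ghz = 1.35

madd_only = sps * 2 * shader_ghz      # MADD = 2 flops/clock -> 345.6 GFLOPs (the CUDA-doc figure)
madd_plus_mul = sps * 3 * shader_ghz  # counting the co-issued MUL too -> 518.4 GFLOPs

print(f"MADD only:  {madd_only:.1f} GFLOPs")
print(f"MADD + MUL: {madd_plus_mul:.1f} GFLOPs")
```

So 345 GFLOPs matches MADD-only throughput exactly, and 518.4 GFLOPs would be the corresponding MADD+MUL figure; whether NVIDIA ever published the latter is exactly the open question here.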
 
Yeah, don't get me wrong - I actually think 192SPs at 2.4GHz+ is very likely. Even if I think it's likely, that doesn't mean I wouldn't be positively surprised anyway though!

When the NV4x architecture got an upgrade and turned into the G7x line, many referenced the upgraded second unit (it was a MUL in the 6 series and became a MADD in the GF7s).

Could this MUL in the G80 be "upgraded" to a MADD unit in the upcoming G92?
Would that (like in the earlier GF6 -> GF7 transition) increase the per-clock efficiency of the GPU or, rather, complicate things even further in the DX10 era?
 
When the NV4x architecture got an upgrade and turned into the G7x line, many referenced the upgraded second unit (it was a MUL in the 6 series and became a MADD in the GF7s).

Could this MUL in the G80 be "upgraded" to a MADD unit in the upcoming G92?
Would that (like in the earlier GF6 -> GF7 transition) increase the per-clock efficiency of the GPU or, rather, complicate things even further in the DX10 era?

http://forum.beyond3d.com/showpost.php?p=979657&postcount=690
 
There are so many combinations, it's hard to know which one is being talked about.

1 Teraflop of MADD-only DP performance would be *crazy*-awesome.
<1 Teraflop of MADD+MUL SP performance would be pretty tame. Especially if the MUL remains as accessible as it currently is :) That would put MADD throughput only slightly above R600's.
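(For scale, a rough sanity check in Python; R600's commonly cited peak is 320 ALUs x 2 flops x 742 MHz, and the 960 figure below is just an illustrative "just under 1 TFLOP" pick, not a confirmed number:)

```python
r600_madd = 320 * 2 * 0.742          # R600 is all-MADD: ~474.9 GFLOPs

rumoured_total = 960.0               # illustrative "just under 1 TFLOP" MADD+MUL figure
madd_share = rumoured_total * 2 / 3  # MADD contributes 2 of every 3 flops -> 640 GFLOPs

print(f"R600 MADD:      {r600_madd:.0f} GFLOPs")
print(f"G9x MADD share: {madd_share:.0f} GFLOPs")
```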

Then you might consider the possibility that the texture unit hardware could be "borrowed" when needed. And/or vice-versa.

And then, after thinking about what it would mean to run int8 filtering on fp32 hardware *efficiently*, one can consider int32 throughput, and then dp....

The rumored flop count is interesting, but I would say that the only thing it demonstrates at this point is that there doesn't appear to be any rush to hit the teraflop number. From a marketing standpoint, that seems odd.
 
Reading the extremetech CUDA article has this little tidbit:
According to Nvidia, the multithreaded 180 shaders found in the Nvidia GeForce 8800 can offer developers "essentially unlimited instruction bandwidth," said Andy Keane, the general manager of the GPU computing group at Nvidia.
G80 obviously doesn't have 180 shaders. 180 is around the 192 people are talking about for G92, but it's a really screwy number (20*9??). Could this just be a typo, or did Andy let something slip?
 
Would it be possible for NV to add 32 SPs per cluster that are MADD-only?

So we'd have 192 MADDs and 96 MULs (6 clusters); at ~2 GHz this would be close to 1 TFLOP (~960 GFLOPs).

IMO, MADD peak performance is a weak point of G80 at the moment.
 
Would it be possible for NV to add 32 SPs per cluster that are MADD-only?
Yes, it is possible in theory to change the ratio of MADD ALUs to SFU units. And by SFU units, I mean the multipurpose SFU/interpolator/MUL units, in case that wasn't clear enough.

One possibility is basically "G80 with 24 ALUs per cluster instead of 16, everything unchanged including SFU". Well, except the ROPs, I guess. How many times did I say that already? Yeah, I probably sound like a broken record, I'll ignore those poor ole ROPs from now on! :)
 
Is the ratio of regular ops to SFU ops really high enough for Nvidia to go wider with the main ALUs while keeping the number of 1/4-speed SFU units the same without creating a serious bottleneck there?
 
Is the ratio of regular ops to SFU ops really high enough for Nvidia to go wider with the main ALUs while keeping the number of 1/4-speed SFU units the same without creating a serious bottleneck there?
Think of the SFUs (excluding the MUL) as doing either one attribute interpolation per cycle or one SFU op every 4 cycles.

I honestly don't think you'll need that many attribute interpolations per cycle in future shaders. As the number of ALU ops goes up, the number of attributes won't go up as fast, and the number of ops directly dependent on attributes should go down. As for SFU... well, right now it's hardly being used that much, I think. That's harder to predict going forward, but I'd tend to believe it's not too dangerous to lower the ratio.

So yeah, in my mind at least, exclusively increasing the number of MADD ALUs would make a lot of sense. Doesn't mean that's what they did, but it's certainly a possibility. Interestingly, if you had 128 MULs and 192 MADDs, then you'd 'only' need a 2 GHz shader core to reach 1 TFLOP. And that sounds much more reasonable clock-wise, certainly.
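(A sketch of that reasoning as a toy issue-rate model, assuming a G80-like cluster of 16 MADD ALUs plus 4 SFUs, each SFU doing one interpolation per cycle or one transcendental every 4 cycles; the op counts are made-up illustrative workloads, not measurements:)

```python
def issue_cycles(madd_ops, sfu_ops, interp_ops, n_madd, n_sfu):
    """Toy model: cycles to drain one batch of work, per unit type."""
    madd_cycles = madd_ops / n_madd                  # 1 MADD per ALU per cycle
    sfu_cycles = (interp_ops + 4 * sfu_ops) / n_sfu  # interp: 1/cycle, SFU op: 1 per 4 cycles
    return max(madd_cycles, sfu_cycles)              # whichever unit saturates last

# Hypothetical ALU-heavy shader: 48 MADDs, 1 transcendental, 4 interpolations
work = dict(madd_ops=48, sfu_ops=1, interp_ops=4)
print(issue_cycles(**work, n_madd=16, n_sfu=4))  # G80-like 16:4 -> 3.0 (MADD-bound)
print(issue_cycles(**work, n_madd=24, n_sfu=4))  # widened 24:4  -> 2.0 (just balanced)

# And the clock-speed point: 192 MADDs + 128 MULs at 2 GHz
print((192 * 2 + 128 * 1) * 2.0)  # 1024.0 GFLOPs, i.e. just past 1 TFLOP
```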
 
I think if anything a doubling of the widths is more likely, whilst keeping clocks in the current ballpark (< +20%).

Non-power-of-2 widths muck up the CUDA threading model.

I wouldn't be surprised if 16-wide SIMDs are as big as they ever build. All future iterations will then add clusters, or add SIMDs to each cluster. The latter plays directly with the ALU:TEX ratio.

Jawed
 
Hmmm, yeah, increasing the width from 8 to 12 (not 16 to 24; remember the double-pumping! And I was talking at the multiprocessor level here, which is what matters for warp size) would indeed be problematic in terms of CUDA optimization backwards compatibility. So that would seem to rule out changing the SFU ratio easily, hmmm.
 
Hmmm, yeah, increasing the width from 8 to 12 (not 16 to 24; remember the double-pumping! And I was talking at the multiprocessor level here, which is what matters for warp size) would indeed be problematic in terms of CUDA optimization backwards compatibility. So that would seem to rule out changing the SFU ratio easily, hmmm.

I'm wondering how much backwards compatibility matters at this point in time. My reading of the CUDA docs was along the lines of "G80 is this shape, but all bets are off as to what the shape will be for future architectures". Given that the general user base of CUDA at this point consists essentially of hackers who don't really expect fire-and-forget usability, would such a change be a big deal if it delivered such an increase in usable FLOPs?
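(To put a number on that worry, a sketch; the 48-thread warp is the hypothetical consequence of a 12-wide multiprocessor keeping G80's 4-clock warp cadence, not anything confirmed:)

```python
import math

def warps_needed(threads_per_block, warp_size):
    """Blocks are padded up to whole warps; leftover lanes just idle."""
    warps = math.ceil(threads_per_block / warp_size)
    wasted = warps * warp_size - threads_per_block
    return warps, wasted

block = 16 * 16                  # a typical CUDA block size tuned on G80
print(warps_needed(block, 32))   # (8, 0)  -- divides exactly into 32-thread warps
print(warps_needed(block, 48))   # (6, 32) -- 32 idle lanes, ~11% wasted issue slots
```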
 
Well, there's the SFU:ALU ratio, and then there's the ALU:TEX ratio.
From a load-balancing POV, I'd think it'd be better to pull the sampling/cache (at least) out of the cluster. Are we expecting clock improvements in the TEX/ROP units? [Or do we think that the tex *filtering* units are already "double-pumped" in G80??]

I'm not sure I like going wider, though. If you're really going to tackle GPGPU, you don't want wider. I would think aiming for a square branch set along with a smaller size would be more likely. 16-wide sp, 4-wide dp has a nice ring to it. One quad of DP, four clocks. It could even fit into the present 8 ALUs (two clocks), if you divide your dp math.... dp SFUs could be interesting....
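(On the "divide your dp math" idea: one known way to get extra precision out of sp hardware is double-single arithmetic, where each dp value is carried as an unevaluated sum of two sp floats. A minimal sketch using NumPy's float32 to stand in for sp ALUs; this is Knuth's TwoSum step, purely an illustration of the technique, not anything NVIDIA has confirmed:)

```python
import numpy as np

def two_sum(a, b):
    """Exact sp addition: returns (s, e) with s + e == a + b, both float32."""
    s = np.float32(a + b)
    v = np.float32(s - a)
    e = np.float32((a - (s - v)) + (b - v))
    return s, e

a, b = np.float32(1.0), np.float32(1e-8)
s, e = two_sum(a, b)
print(s, e)                 # 1.0, ~1e-8: the rounding error is recovered in e
print(float(s) + float(e))  # ~1.00000001 -- more precision than one sp float holds
```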

I would think you decouple your texture units, and just jack the number of math clusters.

I kind of agree with Jawed insofar as raising clock speed would be most un-Nvidia-like. It seems like you've got a few "most likelies" to hit "almost 1T". Here are some (tallied in the sketch after this post):
MUL+MADD: 3 x 16 ALUs x 12 clusters @ 1.7 GHz
MADD: 2 x 16 ALUs x 12 clusters @ 2.5 GHz
2 x 16 ALUs x 16 clusters @ 1.8 GHz

The latter would be tricky from an area perspective, even if you removed the TEX units and grew their count more slowly (or not at all), as there is also the cost of making dp work ['course, that's a swag based on zero experience, so take it for what it's worth :)]. From a reasonably aggressive point of view, I would think the first one would be most likely, although counting that MUL is somewhat disappointing....

Do we know the area breakdown of ALUs + register vs. TEX + cache?

-Dave
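(Tallying the three options above; the leading factor is flops per ALU per clock, 3 for MUL+MADD and 2 for MADD-only:)

```python
def gflops(flops_per_alu_clk, alus, clusters, ghz):
    return flops_per_alu_clk * alus * clusters * ghz

print(gflops(3, 16, 12, 1.7))  # MUL+MADD, 12 clusters @ 1.7 GHz -> 979.2
print(gflops(2, 16, 12, 2.5))  # MADD,     12 clusters @ 2.5 GHz -> 960.0
print(gflops(2, 16, 16, 1.8))  # 16 clusters @ 1.8 GHz           -> 921.6
```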
 
Well, I could see how a dp/int32/int64 divider, possibly with the ability to do square roots/inverse square roots, would be really nice, but I'd imagine that such a circuit would be either quite large or quite slow.

Honestly though, SF in hardware doesn't make much sense to me for higher precisions, since the LUTs needed start getting pretty large. There also starts to be quite a lot of computation involved (e.g. Taylor expansions with a lot of terms relative to sp).

So, I'm hoping that there will be zero transistors spent on DP SFU/interpolation and texture filtering. My bet is that the rsq instruction will work on dp inputs but return a result with just sp accuracy, which can be further refined with Newton-Raphson if necessary. For things like trig, exp, and ln functions, I would try to make sure that the architecture can handle SW implementations of those functions reasonably well; a correctly rounded multiply-accumulate would be really nice...
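(That refinement is cheap in practice. A sketch of the Newton-Raphson step for rsqrt, with plain Python doubles standing in for the dp datapath; the dp-input rsq with an sp-accurate result is the post's speculation, but the iteration itself is textbook:)

```python
import math

def rsqrt_step(x, y):
    """One Newton-Raphson iteration for y ~ 1/sqrt(x); roughly doubles the correct bits."""
    return y * (1.5 - 0.5 * x * y * y)

x = 2.0
y = 0.70710678  # sp-accurate seed (~24 bits), as if from the hypothetical 'rsq' instruction
for i in range(2):
    y = rsqrt_step(x, y)
    print(f"after step {i + 1}: {y:.17f}")
print(f"reference:    {1 / math.sqrt(x):.17f}")
```

One step takes the sp seed to roughly 48 correct bits, and a second step reaches full dp accuracy, which is why spending transistors on dp transcendentals looks unattractive.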
 
So that would seem to rule out changing the SFU ratio easily, hmmm.
Well, the SFUs will prolly continue to come in pairs per SIMD, so you can choose whichever ratio suits your needs, e.g. 4:1 or 8:1. It's just that the granularity of that ratio is rather large. I think the underlying 1:4 SFU:interpolation throughput ratio is all that can be considered fixed.

It seems likely that a higher MAD:SFU ratio is OK, but the risk there prolly comes from integer operations, as they should become more and more important in CUDA-type code, where addressing operations run in-ALU, not in-TMU.

I'm a bit woolly though on which integer operations are restricted to SFU.

Jawed
 
I'm wondering how much backwards compatibility matters at this point in time. My reading of the CUDA docs was along the lines of "G80 is this shape, but all bets are off as to what the shape will be for future architectures".
For top-line performance I suspect G80's future variants will only use multiples of the "base" that G80 sets. If you change the "shape", then all the boundaries between blocks of threads, when you're modelling the input data layout, move to "non-integer" locations. A 24x24 block is half way between a 16x16 block and a 32x32 block.

So far the ALUs in G84 and G86 have the same ratios amongst themselves as in G80. The TMU ratios differ, though. It'll be interesting to see, in future, whether NVidia plays with ratios down at this end (in comparison with the top-line GPU), because "half-way" ratios will really break the coding patterns that work on G8x.

Jawed
 
Well, I could see how a dp/int32/int64 divider, possibly with the ability to do square roots/inverse square roots, would be really nice, but I'd imagine that such a circuit would be either quite large or quite slow.

Yeah, that was the thought that ran through my mind when I said "interesting".
I was thinking along the lines of "make reciprocal work" and emulate the rest.... I see little reason to spend a lot on dp transcendentals....

Also, are we getting int64? Hadn't run across that piece of info.
 