Jawed
I suggested that because G80's current design has a hell of a lot of pipeline arbitration/sequencing. If the SIMDs get twice as wide and you change nothing else, you "halve" the scheduling cost. (Wider SIMDs, I imagine, do cost a bit more - it's not completely free to double their width.)

I'm not sure I like going wider, though.
Generally, no, because of dynamic branching and because it makes thread/block organisation more coarse-grained. If you're really going to tackle GPGPU, you don't want wider.
The alternative is simply to put more SIMDs into a cluster, but the scheduling cost goes up.
As far as CUDA problems go, if you double or quadruple the performance of the GPU, then you might argue that the really fine-grained thread-sizing of the past (G80, when looked at from the point of view of G200) is just more complexity than you need. There was a time when pixel shader pipelines came in 1s and 2s, not quads...
DP is definitely a spanner in the speculative works. Is NVidia aiming for DP performance that's 1/10 of SP or 1/2? The gulf between the two is vast. 1/10th of SP performance would still put CUDA ahead of a quad-core CPU.

I would think aiming for a square branch set along with a smaller size would be more likely. 16-wide SP, 4-wide DP has a nice ring to it. One quad of DP, four clocks. It could even fit into the present 8 ALUs (two clocks), if you divide your DP math... DP SFUs could be interesting...
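To put a number on that hypothetical split: a 4-wide DP unit next to 16 SP lanes at the same clock means a 16-element batch of DP work takes four clocks instead of one, i.e. 1/4 of the SP rate - sitting between the 1/10 and 1/2 extremes above. A quick sketch of that arithmetic (the lane counts are the speculation from this post, not a known design):

```python
# Hypothetical cluster: 16 SP lanes vs 4 DP lanes, same clock, MADD = 2 flops/lane/clock.
sp_flops_per_clock = 16 * 2
dp_flops_per_clock = 4 * 2

ratio = dp_flops_per_clock / sp_flops_per_clock
print(ratio)  # 0.25 -> DP runs at 1/4 the SP rate in this configuration
```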
The TMUs are already effectively decoupled.

I would think you decouple your texture units, and just jack up the number of math clusters.
What's not known is how PDC is used for non-ALU type work in graphics mode (not CUDA mode). I wonder if PDC is used for vertex buffer caching, for example. Is vertex addressing performed by the TMU-TA units? So there might be sizing constraints there for "peak" performance.
To spell out what I mean:

I kind of agree with Jawed insofar as raising the clock speed would be most un-NVidia-like. It seems like you've got a few "most likelies" to hit "almost 1T". Here are some:
MUL+MADD: 3 x 16 ALUs x 12 clusters @ 1.7GHz = 979 GFLOPs
MADD: 2 x 16 ALUs x 12 clusters @ 2.5GHz = 960 GFLOPs
MADD: 2 x 16 ALUs x 16 clusters @ 1.8GHz = 922 GFLOPs
2 x 16-SIMDs x 8 clusters @ 1.8GHz = 922 GFLOPs
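All of these candidates come out of the same peak-throughput formula: flops per ALU per clock (3 for MUL+MADD co-issue, 2 for MADD alone) x ALUs per SIMD x SIMDs per cluster x clusters x clock in GHz. A quick sketch to sanity-check the options above:

```python
def peak_gflops(flops_per_clock, alus_per_simd, simds_per_cluster, clusters, clock_ghz):
    """Peak shader throughput in GFLOPs: total FLOPs issued per clock, times clock rate."""
    return flops_per_clock * alus_per_simd * simds_per_cluster * clusters * clock_ghz

print(peak_gflops(3, 16, 1, 12, 1.7))  # MUL+MADD, 12 clusters -> 979.2, "almost 1T"
print(peak_gflops(2, 16, 1, 12, 2.5))  # MADD only, higher clock -> 960.0
print(peak_gflops(2, 16, 1, 16, 1.8))  # MADD only, 16 clusters -> 921.6
print(peak_gflops(2, 16, 2, 8, 1.8))   # two 16-SIMDs per cluster -> 921.6, the 922 case
```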
SIMDs with "not much extra scheduling hardware" will scale pretty nicely, especially as ALUs are fairly small in area. It's the associated memories that I think NVidia will spend time on, particularly the register file and PDC - both of which strike me as uncomfortably small.
Jawed