NVIDIA confirms Next-Gen close to 1TFlop in 4Q07

I'm not sure I like going wider, though.
I suggested that because G80's current design has a hell of a lot of pipeline arbitration/sequencing. If the SIMDs get twice as wide and you change nothing else, you "halve" the scheduling cost. (Wider SIMDs I imagine do cost a bit more - it's not completely free to double their width.)

If you're really going to tackle gpgpu, you don't want wider.
Generally, no, because of dynamic branching and because it makes thread/block organisation more coarse-grained.

The alternative is simply to put more SIMDs into a cluster, but the scheduling cost goes up.

As far as CUDA problems go, if you double or quadruple the performance of the GPU, then you might argue that the really fine-grained thread-sizing of the past (G80, when looked at from the point of view of G200) is just more complexity than you need. There was a time when pixel shader pipelines came in 1s and 2s, not quads...

I would think aiming for a square branch set along with a smaller size would be more likely. 16-width sp, 4-width dp has a nice ring to it. One quad of DP, four clocks. It could even fit into the present 8 ALUs (two clocks), if you divide your dp math.... dp SFUs could be interesting....
DP is definitely a spanner in the speculative works. Is NVidia aiming for DP performance that's 1/10 SP or 1/2? The gulf between the two is vast. 1/10th SP performance would still put CUDA ahead of a quad-core CPU.

I would think you decouple your texture units, and just jack the number of math clusters.
The TMUs are already effectively decoupled.

What's not known is how PDC is used for non-ALU type work in graphics mode (not CUDA mode). I wonder if PDC is used for vertex buffer caching, for example. Is vertex addressing performed by the TMU-TA units? So there might be sizing constraints there for "peak" performance.

I kind of agree with Jawed insofar as raising clock speed would be most un-nvidia-like. It seems like you've got a few "most likelies" to hit "almost 1T". Here are some:
MUL+MADD: 3 x 16ALUs x 12 clusters @ 1.7Ghz
MADD: 2 x 16ALUs x 12 clusters @ 2.5Ghz
2 x 16ALUs x 16 clusters @ 1.8Ghz
To spell out what I mean:
2x16-SIMDs x 8 clusters @ 1.8GHz = 922GFLOPs
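A quick sanity check of those "most likely" configs, in throwaway Python (all the figures are pure speculation, obviously):

# GFLOPS = ALUs x flops-per-ALU-per-clock x GHz (MAD = 2 flops, MAD+MUL = 3)
def gflops(alus, flops_per_clock, ghz):
    return alus * flops_per_clock * ghz

print(gflops(16 * 12, 3, 1.7))     # MUL+MADD, 12 clusters of 16  -> ~979
print(gflops(16 * 12, 2, 2.5))     # MADD only, 12 clusters of 16 -> 960
print(gflops(16 * 16, 2, 1.8))     # 16 clusters of 16            -> ~922
print(gflops(2 * 16 * 8, 2, 1.8))  # 2x16-wide SIMDs, 8 clusters  -> ~922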

SIMDs with "not much extra scheduling hardware" will scale pretty nicely, especially as ALUs are fairly small in area. It's the associated memories that I think NVidia will spend time on, particularly the register file and PDC - both of which strike me as uncomfortably small.

Jawed
 
To spell out what I mean:
2x16-SIMDs x 8 clusters @ 1.8GHz = 922GFLOPs

right, which is effectively 16 clusters of 16, with less scheduling hardware. Is scheduling really that much area? I would think, as you say:

It's the associated memories that I think NVidia will spend time on, particularly the register file and PDC...

The register file, it seems to me, is the "big" area killer. Is this where nVidia is putting eDRAM to use? Cell size drops fairly significantly, although presumably multi-porting isn't any cheaper (?). I suppose one way around that is to jack the memory speeds up to Cell/4GHz speeds.... If you can reduce the per-ALU area of the register file by two, I bet that's a big win....

-Dave
 
Well the SFUs will prolly continue to come in pairs per SIMD, so you can choose whichever ratio suits your needs, e.g. 4:1 or 8:1. It's just that the granularity of that ratio is rather large. I think the underlying 1:4 SFU:interpolation ratio is all that can be considered fixed.
Yeah, you're restricted to integer ratios basically. If you keep each multiprocessor as having only 8 double-pumped ALUs though, the only thing you could do is halve the ratio. That's a bit extreme, to say the least! Another possibility would be to make each multiprocessor 16-wide, and remove the "PS/CUDA has twice the warp size" hack. According to those David Kirk uni courses, they might even expose that under CUDA, and the way I understood what he said is that it would hurt nothing other than latency tolerance, since you'd support a lower number of concurrent 'blocks of threads'.

Jawed said:
I'm a bit woolly though on which integer operations are restricted to SFU.
Actually, the integer operations cannot be run on the SFU, only on the ALU, afaik. All integer operations are single-cycle, except the INT32 MUL which, as far as I can see, is probably emulated via bitshift instructions, as it takes 4 cycles. R6xx on the other hand runs INT32 MUL on the "big ALU", which presumably supports a 32-bit mantissa as a result (which is much more expensive, but since it's just one ALU out of five, it's probably fairly negligible). That makes its integer throughput similar to G80's, but it can run operations in the four smaller ALUs at the same time, while the G80 must mobilize all of its power to reach that throughput.

dnavas said:
Also, are we getting int64? Hadn't run across that piece of info.
If you have 64b registers, integer operations of any precision whatsoever, and logic ops including bitshifts, then you can support integer operations of any given number of bits easily. Heck, there's nothing preventing you from doing it even without 64b registers, heh. I wouldn't expect G9x to do anything above INT24 (for MUL) and INT32 (for everything else) in a single cycle, however.
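Just to illustrate the composing-wider-ops-from-narrower-ones point, a throwaway Python sketch (masking to mimic fixed-width registers; nothing G9x-specific about it):

MASK32 = 0xFFFFFFFF

def add64(a, b):
    # 64-bit add built from two 32-bit adds plus a carry
    lo = (a & MASK32) + (b & MASK32)
    hi = ((a >> 32) + (b >> 32) + (lo >> 32)) & MASK32
    return (hi << 32) | (lo & MASK32)

def mul32x32_to_64(a, b):
    # 32x32 -> 64-bit multiply from four 16x16 partial products
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    return (al * bl) + ((al * bh + ah * bl) << 16) + ((ah * bh) << 32)

assert add64(0xFFFFFFFFFFFFFFFF, 1) == 0                # wraps like real 64b hardware
assert mul32x32_to_64(0xFFFFFFFF, 3) == 0xFFFFFFFF * 3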

dnavas said:
Is NVidia aiming for DP performance that's 1/10 SP or 1/2
If you double your mantissa width, performance/mm² of FMUL tends to be 1/4th, apparently. Letting a FP32 ALU do that also tends to be 1/4th. I'll admit I have no idea whether FP64 ADD might be 1/2 though, not that it'd really matter I'd presume. And that might be 'fun' in terms of scheduling! :)
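The usual hand-waving model behind that is an array multiplier growing roughly with the square of the mantissa width; as a rough illustration (just the model, not a claim about anyone's actual implementation):

# rough model: FMUL area ~ mantissa_bits^2, so perf/mm^2 ~ 1/mantissa_bits^2
fp32_mantissa, fp64_mantissa = 24, 53   # bits, counting the implicit leading 1
print((fp64_mantissa / fp32_mantissa) ** 2)   # ~4.9x the area, i.e. roughly "1/4th" perf/mm^2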

dnavas said:
The register file, it seems to me, is the "big" area killer. Is this where nVidia is putting eDRAM to use? Cell size drops fairly significantly, although presumably multi-porting isn't any cheaper (?). I suppose one way around that is to jack the memory speeds up to Cell/4GHz speeds.... If you can reduce the per-ALU area of the register file by two, I bet that's a big win....
By being smart, you can have a single-ported register file. I think on the G80, the register file is 16x32KiB of single-ported SRAM. As for eDRAM, I think the problem with that might be that it's too large, since the register file is pretty much per-multiprocessor on G8x. I'd presume the typical bus width you'd have on that might be too large too... Also, assuming my calculations are right, a 512KiB register file should take only 20-30mm² on 90nm. I might be assuming something incorrectly there though, hmm. Gah, I'd kill for a die shot.
 
8192 4-byte registers for each of the 16 SIMDs = 512KB of register file.

16KB of PDC per SIMD = 256KB of PDC.

Assuming they're all done with 6 transistors per bit, that's about 38M transistors, excluding banking for both and porting for the register file.
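For anyone checking the arithmetic:

KiB = 1024
reg_file = 16 * 8192 * 4      # 16 SIMDs x 8192 registers x 4 bytes = 512KiB
pdc      = 16 * 16 * KiB      # 16 SIMDs x 16KiB of PDC             = 256KiB
print((reg_file + pdc) * 8 * 6 / 1e6)   # 6T per bit -> ~37.7M transistors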

Register file bandwidth is not as stratospheric as our original discussions on this subject, many moons ago, seemed to consider (i.e. it seems unable to sustain 3x fp32 operands per pixel per clock). Additionally, from the CUDA documentation, there are bank conflicts associated with PDC access, implying that there is no porting sophistication there at all. e.g. PDC can be reduced to producing 1x fp32 per clock instead of 16x fp32s per clock.
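To illustrate what "no porting sophistication" means in practice, here's a toy model of PDC as 16 single-ported, word-interleaved banks (which is how the CUDA documentation describes shared memory); conflicting accesses from a half-warp simply serialise:

from collections import Counter

def pdc_clocks(word_addresses, banks=16):
    # one 32-bit word per bank per clock; same-bank accesses serialise
    return max(Counter(addr % banks for addr in word_addresses).values())

print(pdc_clocks(range(16)))                    # 16 consecutive words -> 1 clock
print(pdc_clocks([i * 16 for i in range(16)]))  # all 16 hit one bank  -> 16 clocks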

If there's any growth in the size of either, then you have to doubt there's much chance the bandwidths will improve. I dunno, PDC is obviously "cheaper" memory than the register file, but it's peculiar there's relatively little of it. You could argue that this is because it has little benefit for graphics.

So one of the problems here is to try to weigh graphics versus CUDA functionality. Perhaps the introduction of DP is the only concession to CUDA and most of the rest of the changes are merely about sizing-up the GPU for graphics performance, retaining the ALU:TEX ratio of G80, for example. Having said that, it seems unlikely to me that DP-support wouldn't affect the ALU:TEX ratio, unless NVidia goes for a 1/10th SP performance DP implementation.

Jawed
 
Register file bandwidth is not as stratospheric as our original discussions on this subject, many moons ago, seemed to consider (i.e. it seems unable to sustain 3x fp32 operands per pixel per clock). Additionally, from the CUDA documentation, there are bank conflicts associated with PDC access, implying that there is no porting sophistication there at all. e.g. PDC can be reduced to producing 1x fp32 per clock instead of 16x fp32s per clock.
Errr, G80 cannot sustain 3xFP32 operands per pixel per clock? Am I missing something? :(
 
Errr, G80 cannot sustain 3xFP32 operands per pixel per clock? Am I missing something? :(
Sorry, what I meant was that from the point of view of the ALU pipeline you can't always get 3 MAD operands from the register file, because of other concurrent accesses to the register file.

e.g. a texture instruction requires a read from the register file. So even though that texture instruction isn't issued within the ALU pipeline, the read affects the register file.

I should have put it in terms of MAD+SF, which is 4 operands. Obviously if there's no SF instruction, then that's less read-pressure. We were talking the other day of read and write rates against the register file, which are complicated by the timings of these three operations.
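Putting rough numbers on that read pressure (operand counts only, treating the texture instruction as one extra register read, as above):

# fp32 operands read from the register file per ALU per clock
# (the MAD result write-back is on top of this)
reads = {"MAD": 3, "MAD + SF": 4, "MAD + SF + TEX read": 5}
for mix, n in reads.items():
    print(mix, "=", n * 4, "bytes/clock per ALU")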

Jawed
 
If you double your mantissa width, performance/mm² of FMUL tends to be 1/4th, apparently. Letting a FP32 ALU do that also tends to be 1/4th. I'll admit I have no idea whether FP64 ADD might be 1/2 though, not that it'd really matter I'd presume. And that might be 'fun' in terms of scheduling! :)

careful with your quoting -- that was Jawed. :) I'm assuming 1/4 speed (that's where I get a single dp quad/4 sp quads from).

Also, assuming my calculations are right, a 512KiB register file should take only 20-30mm² on 90nm. I might be assuming something incorrectly there though, hmm. Gah, I'd kill for a die shot.

Yeah, I'm shooting in the dark with a bb gun :)

The TMUs are already effectively decoupled.

Err, no, not that I know of. If one cluster doesn't do texturing, and another cluster needs a huge amount of texturing, the TMUs don't get sent to work for the other cluster, afaik.

I'm thinking they go "the other way" -- away from VLIW. Pull the texturing out to a dedicated unit. Pull the SFUs out to a dedicated unit. Texturing requests go in, pixels come out. SFU requests get queued, answers come out. It simplifies the compiler, as the entire chip is load-balanced. It complicates the heck out of routing. Win some, lose some :) Even so, as unit size decreases, it makes more sense to increase the independence of the units. If you had 16 math clusters, you put four ALU clusters with a single TMU unit (of 'n' capability) and a single SFU unit (of 'n' capability).

And then you get to thinking that with a little *more* scheduling logic, you can make a wider SFU/TEX "vector" from several predicated requests from each of the ALU clusters.

And then you get to thinking that, with just a little more, you could decouple thread dispatch/register access from ALU execution, and actually hit slightly better dynamic-branching performance than the SIMD width would indicate.

And then ... I wake up :)

-Dave
 
Err, no, not that I know of. If one cluster doesn't do texturing, and another cluster needs a huge amount of texturing, the TMUs don't get sent to work for the other cluster, afaik.
It seems TMUs are decoupled from ALUs within a multiprocessor though, while this was not the case on G7x.
 
Sorry, what I meant was that from the point of view of the ALU pipeline you can't always get 3 MAD operands from the register file, because of other concurrent accesses to the register file.
Ah yeah, you're thinking in terms of reading operands. Unless I'm missing something, for operands in general it's 4 operands/clock though (3 reads + 1 write for a MAD) :) And except when using the SFU for something other than interpolation, that wouldn't be a bottleneck (once again, as per the current theory, which I'm not 100% confident in, but it's the only half-logical one I've got so far).

Really, it'd make a LOT more sense to me if the register file banks were 1 read + 1 write per clock, rather than 1 read OR 1 write per clock, since that'd simplify things greatly for the TMUs etc... That obviously doesn't square very well with what I was seeing for the SFU/MUL though. Sigh, this can be quite frustrating.

Hmm, actually, this makes me realize I probably should just be testing performance with a 1x1 texture when having just enough ALU ops not to be TEX-limited. That could be interesting. And I also should bother testing on newer drivers eventually...
 
Hmm, actually, this makes me realize I probably should just be testing performance with a 1x1 texture when having just enough ALU ops not to be TEX-limited. That could be interesting. And I also should bother testing on newer drivers eventually...
You could just use a constant (not a literal) in SM4. As long as the compiler doesn't compile your synthetic sequence of instructions away into nothing.

Jawed
 
Ah yeah, you're thinking in terms of reading operands.
Ah, actually I forgot David Kirk talked about register bank conflicts. See slide 25 of Lecture14-15.

But register reads are really free if you think of them as latency hidden accesses, they don't cost any other kinda slots, resource slots. However you can have delays if you have any kind of conflicts, like read after write dependencies if you have an instruction that writes to a register and the subsequent instruction reads from a register, it is not free to actually commit that value to the register and then read it back, so if you don't have any other warp in between to hide that latency you'll have to wait, you'll have a conflict. And I don't know exactly how many cycles that is, but you can figure that out if you need to know. It's better to just avoid it.

The other issue that we have [...] we do have parallel access that different SPs can access all the registers, because the scaling varies from individual mapping of warps to the streaming multiprocessor. So you actually can have bank conflicts because it isn't really feasible for us to build a register file that is, let's see, 16-way 4-ported so that we can dual-issue to all the multiprocessors out of the same register file bank. So it is banked and you can have conflicts from that. I'm telling you this not really because you can do that much about it at your program level, but just so that you know this is one of the things that can cause mysterious slow downs underneath. If you're having a slow down that you don't understand you can look at the compiler output and see what the behaviour is and figure this out. You shouldn't have too much problems with this assuming the compiler does all the right things for you.
I think register file bandwidth might be the underlying reason for two "half-warps" (16-wide, 2 clocks in the ALU pipeline) being coalesced into a "real warp" of 32.

I think it's worth observing that for vertex shading (which uses "half-warps") the code presumably tends towards using vec4 operands, which means that register file bank conflicts are prolly extremely unlikely/easy to work-around.

Jawed
 
Thinking about PDC bandwidth, mostly, makes me think it's unlikely that SPs would double in number per SIMD - I think PDC bandwidth is one of those "fundamental" numbers in G80, and changing it this early might be counterproductive in terms of moving CUDA apps across.

So, erm, more SIMDs per cluster. That's a more finely-grained scaling than SIMD width, e.g. Arun's 3x SIMD, 24 SPs per cluster is workable.

Jawed
 
3 times G80's raw performance seems implausible for NVIDIA on its usual track, unless some problem that occurred in G80 has been resolved in G92. Isn't it possible that the next refresh, G92, will stay SIMD-based rather than spending a high transistor budget on MIMD?
Dream on.


G70--G71
G80--G92
 
3 times G80's raw performance seems implausible for NVIDIA on its usual track, unless some problem that occurred in G80 has been resolved in G92. Isn't it possible that the next refresh, G92, will stay SIMD-based rather than spending a high transistor budget on MIMD?
Dream on.


G70--G71
G80--G92

IMO G80-->G92 seems more like NV40-->G70....
 
I'm expecting NV40 -> G71 myself. It's a full node jump so who knows what other trickery they've managed to introduce during the 65nm transition.
 
I'm expecting NV40 -> G71 myself. It's a full node jump so who knows what other trickery they've managed to introduce during the 65nm transition.

Well.... You are right.... NV40-->G70 was the same architecture with some tweaks but without any significant clockspeed increase (400MHz vs 430MHz).... G80-->G92 is very likely to be the same kind of architectural jump as NV40-->G70, but with much higher clocks (imo at least 750MHz, which is very possible at 65nm)....
 
http://www.pcdvd.com.tw/showpost.php?p=1080094055&postcount=18
http://www.pcdvd.com.tw/showpost.php?p=1080094597&postcount=29

Rumor from a Taiwanese Web site:

65nm, 1000M transistors
Two times the raw performance of G80
Die size: 3/4 of G80's die size
G92's shader clock is much higher than G80's
2.5 or 3 times the factual performance of G80

Due launch date: end of November

So it seems to me that 1.5x TCP scaling + the bug fixed (a MAD instead) + a further shader-clock increase = 3 times the factual performance of G80.
 
http://www.pcdvd.com.tw/showpost.php?p=1080094055&postcount=18
http://www.pcdvd.com.tw/showpost.php?p=1080094597&postcount=29

Rumor from a Taiwanese Web site:

65nm, 1000M transistors
Two times the raw performance of G80
Die size: 3/4 of G80's die size
G92's shader clock is much higher than G80's
2.5 or 3 times the factual performance of G80

Due launch date: end of November

So it seems to me that 1.5x TCP scaling + the bug fixed (a MAD instead) + a further shader-clock increase = 3 times the factual performance of G80.

Is it possible that 2 times the raw performance = 2.5 or 3 times the factual performance? (You mean real-world, don't you?)
 
http://www.pcdvd.com.tw/showpost.php?p=1080094055&postcount=18
http://www.pcdvd.com.tw/showpost.php?p=1080094597&postcount=29

Rumor from a Taiwanese Web site:

65nm, 1000M transistors
Two times the raw performance of G80
Die size: 3/4 of G80's die size
G92's shader clock is much higher than G80's
2.5 or 3 times the factual performance of G80

Due launch date: end of November

So it seems to me that 1.5x TCP scaling + the bug fixed (a MAD instead) + a further shader-clock increase = 3 times the factual performance of G80.

25% smaller die size, 30% increase in transistor count, 28% decrease in process size. Does that add up?
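For what it's worth, the ideal-scaling arithmetic, taking the usually-quoted G80 figures of ~681M transistors and roughly 480mm² at 90nm (treat both as assumptions; note that 1000M over 681M is nearer a 47% increase than 30%):

g80_transistors, g80_area = 681e6, 480.0   # assumed G80 figures at 90nm
g92_transistors = 1000e6                   # rumoured
area_scale = (65 / 90) ** 2                # ~0.52x area per transistor for an ideal shrink
g92_area = g80_area * (g92_transistors / g80_transistors) * area_scale
print(g92_area, g92_area / g80_area)       # ~368mm^2, ~0.77x -> roughly the rumoured 3/4

Ideal scaling never quite happens (pads, analog and SRAM shrink worse than logic), so 3/4 of G80's die at 1000M transistors would be optimistic, but it doesn't look obviously impossible.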
 