NVIDIA GT200 Rumours & Speculation Thread

Ah, hang on, could this be due to overflow (combined with sign handling?)? The 32-bit multiplication can't generate a 24-bit-correct overflow result, without doing more work?

Jawed
Integer overflows apparently just wrap. So I don't see any immediate problems when using a uint24 that is zero-extended to an int32 and operated on like an int32.

edit: Rys forced me to credit him for testing integer overflows :p
 
edit: Rys forced me to credit him for testing integer overflows :p
Lies. You get +/-INF out of the FP ALU on OoB, and out of the blender too, incidentally (FP32 operands all over the place for that). Neatly, the hardware gives you a black pixel for -INF, and a white one for +INF :D
 
Hmm, I'm still confused: if you do the operation in 32-bit mode don't you still have to clear out the bits you don't want, to get back a true 24 bit value? e.g. AND 0x00FFFFFF after the multiplication :???:

That'll slow down the 24-bit operation, making it slower than the 32-bit operation. The 32-bit integer multiply might only take 4 clocks in GT200. There's no indication of the actual performance of 32-bit multiply in the future...

Jawed
 
Hmm, I'm still confused: if you do the operation in 32-bit mode don't you still have to clear out the bits you don't want, to get back a true 24 bit value? e.g. AND 0x00FFFFFF after the multiplication :???:

That'll slow down the 24-bit operation, making it slower than the 32-bit operation. The 32-bit integer multiply might only take 4 clocks in GT200. There's no indication of the actual performance of 32-bit multiply in the future...

Jawed

Yes, I considered zeroing the top bits too. If you wanted to do it all fast you'd have the hardware just set them to zero when you're doing an int24 op, which should be trivial to do IMO. It would cost you maybe an additional gate delay in the int32 case, so if they are really tight on timing there maybe they don't want to do that. If it's single cycle, OK, maybe they are tight on timing; if it's 2 or more cycles, it doesn't seem to me like it should be timing constrained. It might even cost no extra delay if you already have a zeroing/reset capability on all the output registers, which is very likely; you only have to adjust the control logic to zero out the top bits in the int24 case.

Also, I guess if you already have faster int32 performance, would you even care about int24? So they might be doing it the AND way like you said, which would make it slower. But if they did that in hardware, I'm not sure it would cost them less than just zeroing out the top bits...
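
For concreteness, here's a minimal sketch (my own, in CUDA device code) of the software-side version of the masking approach we're discussing; whether the hardware actually does anything equivalent is exactly the open question:

Code:
// Emulating a 24-bit multiply on the 32-bit unit: do the full 32-bit
// multiply, then clear the top 8 bits of the result to get back a
// "true" 24-bit value. The extra AND is the cost being debated; hardware
// could instead just zero those output bits directly.
__device__ unsigned int mul24_via_mul32(unsigned int a, unsigned int b)
{
    return (a * b) & 0x00FFFFFFu;
}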
 
Where are we sitting right now in terms of where we think this is going?

Judging from the latest rumors:

I'm thinking a roughly 50% upgrade from G92: 192sp with a 512-bit bus. On 65nm nothing else makes sense; there aren't many plausible scenarios given that G92 on the same process is already so big. This idea has been floating through chiphell since the inception of those rumors and it does seem the most likely.

Possible interpretations of 384sp could mean a couple things:

1. 192sp x 2 MADD = 384 MADDs. I see this as HIGHLY probable. Perhaps the MUL is excluded, as it's not used for general shading but rather for GPGPU. Who knows, maybe the MUL will get a new name, like the "on-die PhysX engine" or something else marktastic™. It's very possible many new enhancements (through CUDA) will use the MUL (other than just SF), so I don't think it's crazy to look at it this way, as nvidia starting to separate the two. i.e. 384SP, but computational power of >1TFLOP because the MUL is used for CUDA and presumably PhysX.

2. 2x192sp = 384sp. Perhaps an SLI or 45nm GX2 product. You know with a die this big they will shrink it down, as they did with G80, and repackage it along with a GX2 version. With R600/RV670 and G80/G92 as testaments, I believe nvidia will continue the Intel mack-daddy tick/tock R&D.

TANGENT =

On the other hand, I think AMD will pursue their route through power consumption, die size, and a balanced arch that can scale through multiple chips, with the optimal goal of one chip that could span all markets using 1/2/3?/4 cores. The main objective here would be to keep power consumption within each market's limits: a 1-die card <75W, 2-die <150W, 4-die <300W, or as power connectors go: board power, one PCI-E 6-pin, and one six-pin plus one eight-pin. You can already see this pattern emerging with RV670 and its 105W/190W designs, and further perfected with a two-die setup of RV770 supposedly using 135W/250W, or a roughly 20W difference between cards (same as the RV670 designs). I wouldn't be surprised to see R800 use a finished four-core model shooting for 70W/120W/190W?/260W. A 256-bit memory controller is also a great place to start, as with different memory it can scale from low (GDDR2) to high (GDDR5), with acceptable RAM buffers for all (256MB, 512MB, 1GB, 2GB?) to allow for an appropriate buffer in a multi-GPU situation.

/TANGENT

At any rate...

If you figure G92's 334mm2 x 1.5, the very dirty and non-mathematically-proportionate way, you get 501mm2, or roughly what we should expect to hear (a little bit bigger than G80?). We hear greater than R600 in die size, and currently perhaps at 310W power usage (from SSX in Taiwan; I don't know if he's reliable, but I seem to recall that name). It just seems to line up after deducting some die space for things that would not be there for redundancy, or different units using different amounts of transistors/space.

Also, 192 x 3 FLOPs = 576 FLOPs per clock. To reach 1TFLOP would require a shader clock of roughly 1736MHz, just under 1750MHz. To me this sounds very realistic, although I could see how it may require some juice for an already large die. G80 was a huge die of most likely comparable size, and it wasn't incredibly power hungry, so there could be room there to play with. Either way, it's more probable than seeing a die this big running 2000MHz shaders, as in the original rumor.

Regardless, I think the main point we recognize is that the current nvidia architecture is severely bandwidth limited; the 8800GTX could have used even more bandwidth. Even with an increase of only 50% in the shaders, a fairly massive increase in performance could come not only from raw units, but from better overall efficiency using the massive bandwidth of the 512-bit bus. A 512-bit bus (allowing for a 1GB buffer) should help with multi-GPU scenarios as well.

It's also worth noting that a design like this would bode well for a "GTS" part. Using the same formula as the 8800GTS (rev1), it would contain 144 shaders and a 384-bit bus with 768MB of memory... something that looks eerily similar to a part we've seen before, using roughly the same amount of die space. The difference: unlike its brother, it would have increased bandwidth thanks to faster RAM (at least 2400MHz GDDR3/4, compared to G80's 2000MHz).
 
If INT32 is now faster, that obviously implies we've moved to an FP40 unit. One thing I'd find relatively interesting is faster-than-FP64 "single-extended precision" computations (>=43-bit; 32-bit mantissa). It might be an interesting compromise; you could even expose it in non-Tesla products for people who want more precision. Although I'm very skeptical they'll go down that road.

Another theory, of course, is that FP40 has nothing to do with GT200 and that notice is really there for the DX11 chip in 18-24 months. Who knows. Does anyone know if an FP43+ unit would allow them to do FP64 in less than 4 cycles? (AFAICT it wouldn't, but I'm not sure.)

My expectations for GT200 remain 32 ROPs, 96 TMUs and 384 SPs, with each SP only having half the SFUs/interpolators of those in G8x/G9x. BTW, regarding SIN/COS taking '32 cycles', I'm starting to wonder if that might not be caused by the ALU operation to put them in range; perhaps in CUDA, they need more cycles for that to keep precision more acceptable at higher input values? Or maybe in OpenGL/D3D too but nobody noticed because it's done in parallel (although I think I did test that way back then hmm). I wonder if single-cycle INT32 MUL would help... I can't really see how though, heh. Anyone?

Also, I really don't see why people think ALUs would be bandwidth limited. ALUs don't really take much bandwidth by themselves (mostly constant fetches); TMUs and ROPs do. Since ROPs increase linearly with the bus width on NV's architecture, that's not a problem. As for TMUs, 96 TMUs at 600MHz fetching DXT5 textures would only take as much bandwidth as is available on the 8800 GT with 100% utilization (and 0% ROP utilization) and classic 'cache is for bilinear locality only' situations. Certainly once you move away from compressed formats, then you generally are bandwidth starved, although depending on your use case your caching might (or might not) be more efficient.
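
To make the arithmetic behind that claim explicit, here are my own back-of-envelope numbers, assuming DXT5's 1 byte per texel and the 8800 GT's 256-bit/1.8GHz memory; the 96 TMUs and 600MHz are the figures above:

Code:
#include <cstdio>

// Back-of-envelope check: DXT5 stores a 4x4 block in 128 bits, i.e. 1 byte
// per texel, and with "cache is for bilinear locality only" each bilinear
// sample effectively pulls ~1 new texel from memory.
int main()
{
    const double tmus        = 96;
    const double core_clock  = 600e6;   // hypothetical 600MHz core clock
    const double bytes_texel = 1.0;     // DXT5: 128 bits / 16 texels
    const double tex_demand  = tmus * core_clock * bytes_texel;      // bytes/s

    const double gt_bw = (256.0 / 8.0) * 1.8e9;   // 8800 GT: 256-bit, 1.8GHz effective GDDR3

    printf("texture fetch demand: %.1f GB/s\n", tex_demand / 1e9);   // ~57.6
    printf("8800 GT bandwidth:    %.1f GB/s\n", gt_bw / 1e9);        // ~57.6
    return 0;
}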
 
Or maybe in OpenGL/D3D too but nobody noticed because it's done in parallel (although I think I did test that way back then hmm). I wonder if single-cycle INT32 MUL would help... I can't really see how though, heh. Anyone?
I tested them back then (and just now, just to make sure I wasn't seeing things) and they're still 1/4 rate using D3D9 and D3D10 under Vista (G80GL). No analysis of output though, to see if I'm losing precision anywhere or anything like that.

I'll see if I can't do a roundup of FP and INT perf under D3D10 with everything from R600 and G80 onwards in the near future, just to double check basic execution performance and catch up with what's going on with hardware and drivers.
 
Yes, I considered zeroing the top bits too.
:oops: Just noticed that the definition of this function is:

__mul24(x, y) computes the product of the 24 least significant bits of the integer parameters x and y and delivers the 32 least significant bits of the result. If any of the most significant 8 bits of either x or y are set, the result is undefined.

So the "overflow" into the 8 most significant bits of the 32 bit result is returned. Ugh. Dunno what to think now.

Jawed
 
:oops: Just noticed that the definition of this function is:


So the "overflow" into the 8 most significant bits of the 32 bit result is returned. Ugh. Dunno what to think now.

Jawed

Oh, that makes it really easy then. No need for the zeroing anymore. Even less reason to make it slower now. I am completely confused as to why it is slower then. Maybe we just overlooked something...

edit: maybe they just added that line in for the lulz. i know i would find it funny if i did that and saw people like us desperately trying to figure out the logic behind the decision :p
 
BTW, regarding SIN/COS taking '32 cycles', I'm starting to wonder if that might not be caused by the ALU operation to put them in range; perhaps in CUDA, they need more cycles for that to keep precision more acceptable at higher input values?
The fast SIN/COS definition gives two ranges for error bounds, -pi...pi and outside of that. EXP only has one range.

Or maybe in OpenGL/D3D too but nobody noticed because it's done in parallel (although I think I did test that way back then hmm).
If the operand is tested and found to be in the range -pi...pi perhaps it's a 16 cycle operation? Perhaps the CUDA documentation just generalises execution to 32 cycles even though sometimes it comes out as 16 cycles?

I can't help thinking that the graphics versions of these functions are just lower precision and so are able to run in 16 cycles. Is there a difference in required precision of transcendentals in SM3 and SM4?
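
For reference, this is roughly how the two flavours show up from the CUDA side (my sketch; it proves nothing about cycle counts, it just shows where the documented two-range error bound applies):

Code:
// The fast hardware path vs. the accurate library path in CUDA device code.
__global__ void sin_flavours(const float *in, float *fast_out, float *accurate_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Fast intrinsic: the documented error bound is only tight for
        // arguments in -pi..pi, and looser outside that range.
        fast_out[i] = __sinf(in[i]);

        // Accurate library version: does full argument reduction, which is
        // presumably where any extra cycles would go.
        accurate_out[i] = sinf(in[i]);
    }
}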

Jawed
 
edit: maybe they just added that line in for the lulz. i know i would find it funny if i did that and saw people like us desperately trying to figure out the logic behind the decision :p
Well considering the confusion over the throughput of SIN/COS/EXP we're obviously too easy :LOL:

Jawed
 
We hear greater than R600 in die size, and currently perhaps at 310W power usage (from SSX in Taiwan; I don't know if he's reliable, but I seem to recall that name). It just seems to line up after deducting some die space for things that would not be there for redundancy, or different units using different amounts of transistors/space.
WRT power: I'd really like to see a similar setup to G80 with added onboard Hybrid Power capability. That would really be a great thing to add to any high-end card.

WRT die size: According to my (obviously not exact) measurements, transistor density decreases with smaller chips and increases with larger ones. And not by a minor amount, I'd say. Again, according to my e-ruler ;) G92 is at about 2.35M transistors/mm² and G94 is at only 2.15MT/mm². So a larger chip might just be in the 2.5-2.6MT/mm² range. Don't know what to make of this, though.


@Jawed:
Thx for the explanation. But don't pro-grade cards usually appear only some time after the gaming models?
 
Perhaps some of the density differences are related to the amount of connections present on the perimeter of the die.

Larger die have more internal surface area to devote to transistors, while off-chip connections have physical constraints that keep them from scaling as well.
Small chips would have more perimeter per unit of die area than the larger ones, which might explain some of the difference.
 
Thx for the explanation. But don't pro-grade cards usually appear only some time after the gaming models?
A month or so? Presumably they're also good at mopping up stock (as the chip nears the end of its life) and prolly hold their price much better over their lifetime. Maybe worth starting a thread on this subject...

Jawed
 
Also, I really don't see why people think ALUs would be bandwidth limited. ALUs don't really take much bandwidth by themselves (mostly constant fetches); TMUs and ROPs do. Since ROPs increase linearly with the bus width on NV's architecture, that's not a problem. As for TMUs, 96 TMUs at 600MHz fetching DXT5 textures would only take as much bandwidth as is available on the 8800 GT with 100% utilization (and 0% ROP utilization) and classic 'cache is for bilinear locality only' situations. Certainly once you move away from compressed formats, then you generally are bandwidth starved, although depending on your use case your caching might (or might not) be more efficient.

Thanks for this.

I wasn't so much implying ALUs were bandwidth limited, but rather that a small increase in ALUs may not be indicative of performance if bandwidth were dramatically increased and other parts such as the ROPs/TMUs were kept better fed. I imagined we can deduce from comparing the G92 to G94 that SPs are not part of the problem, nor are TMUs, as the G94 has half of G92's. If the problem doesn't lie in bandwidth, why such terrible scaling between those two chips? Wouldn't the only options be ROPs and bandwidth, of which G80 has more of both and performs notably better at the same clock speed and number of SPs compared to G92? Furthermore, using the pconline review comparing the 9800GTX to the 8800GTS 512, otherwise identical, we see a 2-3x (or 7x in CoJ) increase in performance over what the core/shader speed increase alone would provide, ~4%. The 9800GTX's memory clock is 20% higher, and the average performance is ~10% greater. How else to explain this other than bandwidth? Would the small bump in ROP speed help that increase so dramatically?

I agree with 32 ROPs and 96 TMUs, although was not clear on total bandwidth needed to saturate them.

In my mind 192sp, 32 ROPs, and 96 TMUs on a 512-bit bus would be a much better balance for the architecture, and fits with the die size requirements.
 
In my mind, if they go for an 8*64 MC/512-bit bus, it will be mostly because it's the safest bet for getting ~120GB/s of bandwidth; they don't seem to want to touch GDDR4, and to get that kind of bandwidth with GDDR5 on a smaller bus they'd need very highly specced RAM, which won't be that easy to find for its release.

32 ROPs in such a case would be a result of the above necessity, and not IMO because there's any need to have that many ROPs.

In my mind 192sp, 32 ROPs, and 96 TMUs on a 512-bit bus would be a much better balance for the architecture, and fits with the die size requirements.

Which die size requirements if I may ask?
 
If INT32 is now faster, that obviously implies we've moved to an FP40 unit. One thing I'd find relatively interesting is faster-than-FP64 "single-extended precision" computations (>=43-bit; 32-bit mantissa). It might be an interesting compromise; you could even expose it in non-Tesla products for people who want more precision. Although I'm very skeptical they'll go down that road.

I remember something about a 3dlabs chip where the vertex processing used slightly higher than fp32 precision. I really don't remember the details, but they had some zoomed screen shots up of scenes with huge view distances and the things in the far distance did look better than the comparison shots from (I believe) NVidia hardware they had up. So *if* there is an fp40 unit, maybe some kind of fixed function thing requiring greater than fp32 precision got moved to the shader array? Also, to get quarter speed fp64 don't you need a multiplier that has a few bits more than what you need for fp32? Otherwise, maybe it's just a case of good int32 perf. being required for something targetted at more general purpose loads.
 
I imagined we can deduce from comparing the G92 to G94 that SPs are not part of the problem, nor are TMUs, as the G94 has half of G92's.
I don't think we can deduce that; for all we know, the vast majority of the performance increase between G92 and G94 might be the ALUs, or it might be the TMUs. Or it could be both in the exact same ratio or in an arbitrary ratio on average. Unless you toy with the shader clock, you can't really conclude anything here.

Given that G92 seems quite bandwidth limited based on the G92-GTS -> G92-GTX numbers you just gave, then that would obviously tend to imply most of the gain comes from SPs, not TMUs. G92 vs G94 is also a big part of why I started taking 96 TMUs/384 SPs more seriously instead of 128 TMUs/384 SPs. Given that GT200 will be aimed at very high resolutions though, it would be nice if bilinear (at least for INT8) really was full-speed (instead of ~0.8x the expected performance). Since obviously the ratio of bilinear-to-trilinear pixels is higher at these resolutions...

In my mind 192sp, 32 ROPs, and 96 TMUs on a 512-bit bus would be a much better balance for the architecture, and fits with the die size requirements.
In response to this and GDDR3 vs GDDR5, I think people tend to forget what NVIDIA and AMD want to do: they want to sell you a chip for the most money possible. Any dollar that goes to the PCB, the memory chips or the cooler is effectively lost. But if you can improve performance without affecting these parts too much (or even making them cheaper), that tends to be worthwhile as it likely means you can sell your chip for more money.

Better compression technology or even embedded memory are examples of that kind of thing. And so are ALUs; they don't increase the memory costs at all, and only increase the PCB and cooler costs a little bit due to the fact they consume power and dissipate heat. Overall though, a moderately higher ALU ratio does tend to improve gross profit AFAICT. Same for GDDR3; if you can stick to an older and cheaper technology, you have an opportunity to increase the amount of money attributed to your chip instead. And that's obviously very desirable...

psurge said:
So *if* there is an fp40 unit, maybe some kind of fixed function thing requiring greater than fp32 precision got moved to the shader array? Also, to get quarter speed fp64 don't you need a multiplier that has a few bits more than what you need for fp32? Otherwise, maybe it's just a case of good int32 perf. being required for something targetted at more general purpose loads.
Triangle Setup comes to mind (although its programmability wouldn't be exposed), but my expectation for that is they'll reuse a simple technique designed by their handheld team; determine the precision requirements first, then use either FP32 or FP64 depending on that.

Also, to get quarter speed fp64 don't you need a multiplier that has a few bits more than what you need for fp32? Otherwise, maybe it's just a case of good int32 perf. being required for something targetted at more general purpose loads.
I talked about this with farhan, and indeed you need a 27-bit MUL mantissa instead of a 24-bit mantissa for FP64. You also need a 53-bit mantissa for the ADD. Still, going from 27-bit to 32-bit for the MUL is hardly free, but maybe they figured single-cycle INT32 MULs were required after talking to game developers. And given that the overall ALU is more expensive now, obviously the relative cost is now lower. I can't really think of any fixed-function workload that'd benefit from single-cycle INT32 MULs though, hmm...
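
To spell out why 27 bits is the magic number (my own sketch, assuming a 64-bit gcc/clang-style host compiler with unsigned __int128): an FP64 mantissa is 53 bits, and ceil(53/27) = 2, so a 27-bit multiplier covers the 53x53 product with 2x2 = 4 partial products, i.e. quarter speed; with only 24-bit chunks you'd need ceil(53/24) = 3 halves and thus 3x3 = 9 passes.

Code:
#include <cstdio>
#include <cstdint>

int main()
{
    // Two arbitrary 53-bit mantissas (implicit leading 1 set).
    const uint64_t a = (1ull << 52) | 0x000FBA9876543210ull;
    const uint64_t b = (1ull << 52) | 0x0003456789ABCDEFull;

    // Split each into a 27-bit low half and a (<= 26-bit) high half.
    const uint64_t mask = (1ull << 27) - 1;
    uint64_t a_lo = a & mask, a_hi = a >> 27;
    uint64_t b_lo = b & mask, b_hi = b >> 27;

    // Four partial products, each at most 27x27 bits wide, recombined.
    unsigned __int128 p = ((unsigned __int128)(a_hi * b_hi) << 54)
                        + ((unsigned __int128)(a_hi * b_lo) << 27)
                        + ((unsigned __int128)(a_lo * b_hi) << 27)
                        +  (unsigned __int128)(a_lo * b_lo);

    unsigned __int128 ref = (unsigned __int128)a * b;
    printf("four 27-bit partial products match the full 53x53 product: %s\n",
           p == ref ? "yes" : "no");
    return 0;
}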
 
I remember something about a 3dlabs chip where the vertex processing used slightly higher than fp32 precision. I really don't remember the details, but they had some zoomed screen shots up of scenes with huge view distances and the things in the far distance did look better than the comparison shots from (I believe) NVidia hardware they had up. So *if* there is an fp40 unit, maybe some kind of fixed function thing requiring greater than fp32 precision got moved to the shader array? Also, to get quarter speed fp64 don't you need a multiplier that has a few bits more than what you need for fp32? Otherwise, maybe it's just a case of good int32 perf. being required for something targetted at more general purpose loads.
I believe it was 36 bit floating point.
 
Well, it seems we will have to wait for GT200 "a little" longer :( A few weeks ago there were rumours that NVIDIA would show something BIG at Computex in June, but by now we know it is probably another G92 SKU, just in a 55nm process and called D10U.
It's a strange codename because it may suggest that GF10 will be based on G92 ("B" version in 55nm) GPUs again. I hope it's not.
So I should ask: what's going on with NVIDIA? RV770 specs are impressive and it should be much faster than RV670, so why did they decide to push a 55nm G92 against it, where the only difference will probably be higher clocks than the current 65nm G92?
 