NVIDIA GT200 Rumours & Speculation Thread

Something I've been told: each SM has a dedicated double-precision MAD unit, so there are 30 in total. That's a surprise: 1/12th of the single-precision rate, way less than I was expecting, 78 GFLOPs :oops:
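A quick back-of-the-envelope check of that figure, assuming a ~1.3 GHz shader clock (my guess, in line with G80/G92-era parts, not a confirmed spec):

[code]
# Sanity check of the rumoured DP peak; the shader clock is an assumption.
sm_count      = 30        # one dedicated DP MAD unit per SM, per the rumour
flops_per_mad = 2         # a MAD counts as a multiply plus an add
shader_clock  = 1.3e9     # Hz (assumed)

dp_peak = sm_count * flops_per_mad * shader_clock
print(dp_peak / 1e9)      # ~78 GFLOPs
[/code]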

Maybe that's what Rys was referring to.

That's another benefit of VLIW vs "scalar" -> much easier to reconfigure the ALUs to do different things on the fly.

Either way, isn't 78 GFlops kinda weak even compared to CPUs?
 
Maybe that's what Rys was referring to.

That's another benefit of VLIW vs "scalar" -> much easier to reconfigure the ALUs to do different things on the fly.
I think in ATI's case the implementation of dot-product (which naturally requires four ALU lanes to work together) is a key part of this. A lot of the wiring "was already there", I guess.

Either way, isn't 78 GFlops kinda weak even compared to CPUs?
Yeah, though maybe NVidia has a bandwidth advantage.

I don't know what Nehalem is meant to be capable of :oops: But that'll be what's on people's radar when they ponder NVidia for double-precision scientific GPGPU.

Larrabee, if it's 2 TFLOPs single-precision, could be half that in double-precision.

Jawed
 
Just GPGPU effectively.

It will eventually make it into D3D, but it seems like a low priority as there really isn't much (any?) use for it.

Jawed

I'm no EE (nor Dev) so I may be way out in left field on this one, but is it possible that DP math may be more beneficial to physics performance and/or complexity?
 
I think in ATI's case the implementation of dot-product (which naturally requires four ALU lanes to work together) is a key part of this. A lot of the wiring "was already there", I guess.


Yeah, though maybe NVidia has a bandwidth advantage.

I don't know what Nehalem is meant to be capable of :oops: But that'll be what's on people's radar when they ponder NVidia for double-precision scientific GPGPU.

Larrabee, if it's 2 TFLOPs single-precision, could be half that in double-precision.

Jawed

IIRC Sandy Bridge is the next big jump in CPU DP compute power, and it's been pegged at approximately 200 GFLOPs.
 
I'm no EE (nor Dev) so I may be way out in left field on this one, but is it possible that DP math may be more beneficial to physics performance and/or complexity?
What, like the precise impact location of a bullet over a distance of 1 km? Seems unlikely.

Avoidance of rounding errors during multi-body collision? I suspect you'd need an incomputably large number of objects (in real time) interacting for that to be the case. At least for a few years.

Jawed
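For what it's worth, a rough illustration of why single precision is usually enough at game scales; the numbers here are mine, not anyone's in the thread:

[code]
# Spacing between adjacent float32 values near 1000 m is a fraction of a
# millimetre, so a 1 km bullet trajectory isn't precision-limited by SP.
import numpy as np

position = np.float32(1000.0)        # metres from the origin
print(np.spacing(position))          # ~6.1e-05 m, i.e. ~0.06 mm
[/code]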
 
What, like the precise impact location of a bullet over a distance of 1 km? Seems unlikely.

Avoidance of rounding errors during multi-body collision? I suspect you'd need an incomputably large number of objects (in real time) interacting for that to be the case. At least for a few years.

Jawed

LOL, ok, thanks for clearing that one up :p

I'm grasping at straws here, hoping for a way to put all that new compute power to use inside our own home PCs...
 
DP would be useful for SATs and to filter exponential shadow maps on 'long' ranges without employing log-filtering.
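A quick sketch of the SAT (summed-area table) precision problem as I understand it; the image size and values below are arbitrary assumptions, just to show the effect:

[code]
# A summed-area table accumulates values over the whole image, so entries far
# from the origin are huge. Reconstructing a small box sum then subtracts
# nearly-equal large numbers, and float32 throws away the low-order bits.
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((2048, 2048))       # arbitrary test image, values in [0, 1)

sat32 = np.cumsum(np.cumsum(img, axis=0, dtype=np.float32), axis=1, dtype=np.float32)
sat64 = np.cumsum(np.cumsum(img, axis=0, dtype=np.float64), axis=1, dtype=np.float64)

def box_sum(sat, y0, x0, y1, x1):
    # Sum over the inclusive box [y0..y1] x [x0..x1] from four SAT corners.
    return sat[y1, x1] - sat[y0 - 1, x1] - sat[y1, x0 - 1] + sat[y0 - 1, x0 - 1]

ref = img[2000:2004, 2000:2004].sum()                  # exact-ish reference
print(box_sum(sat32, 2000, 2000, 2003, 2003) - ref)    # noticeably wrong
print(box_sum(sat64, 2000, 2000, 2003, 2003) - ref)    # essentially exact (~1e-9)
[/code]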
 
DP would be useful for SATs and to filter exponential shadow maps on 'long' ranges without employing log-filtering.
Interesting.

What sort of DP-ALU:TEX ratio would you need for decent performance?

Presumably the creation of a SAT requires only a very short shader with a couple of DP ops. Filtering would be pretty hairy though, wouldn't it?

Jawed
 
Okay, seems we've had a dearth of performance leaks the last couple of days. Anybody want to call the first leaked review?

If I recall past launches, we usually get one the weekend before the NDA expires.
 
Larrabee is an area efficient design, not a desktop processor ... so I wouldn't expect half speed DP, expect quarter speed DP (at least as far as MAD is concerned). The 8 core Sandy Bridge would probably just get to 200 GFLOPs, assuming half speed DP and 2 AVX pipelines per core, in 2010 on a 32 nm process ... that's not really going to close any gaps.
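Spelling out that Sandy Bridge estimate; the core count, pipe layout and DP rate are as stated above, the clock is my assumption:

[code]
# ~200 GFLOP DP estimate for an 8-core Sandy Bridge with 2 AVX pipes per core.
cores             = 8
dp_lanes_per_pipe = 4       # 256-bit AVX register / 64-bit doubles
avx_pipes         = 2       # separate mul and add pipes ("half speed" DP MAD)
clock             = 3.1e9   # Hz (assumed)

print(cores * dp_lanes_per_pipe * avx_pipes * clock / 1e9)   # ~198 GFLOPs
[/code]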

If it's true Nvidia has dedicated DP engines, I wonder what happened ... did they just judge ATI as not much of a threat and think they could simply do it like this and save on development, or did they find out about ATI's DP support and feel they had to follow, but didn't have enough time to do it efficiently this generation? Regardless, I'm sure their next part will do it just fine without wasting area.
 
Larrabee is an area efficient design, not a desktop processor ... so I wouldn't expect half speed DP, expect quarter speed DP (at least as far as MAD is concerned). The 8 core Sandy Bridge would probably just get to 200 GFLOPs, assuming half speed DP and 2 AVX pipelines per core, in 2010 on a 32 nm process ... that's not really going to close any gaps.
Larrabee was projected to hit 1 TFLOP DP in Intel's slides, though the numbers pointed to an FMAC being needed to hit it.
Most of the discussion points to just one vector unit (per core) with registers wide enough for 16 SP values.

If DP is quarter speed, Larrabee would hit 4 TFLOPs SP, though nothing so far indicates it has the massive operand bandwidth or the number of vector units necessary to hit that.
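One way those slide figures could be read; the core count and clock here are pure assumptions on my part, only the 1 TFLOP DP and 16-wide SP vector come from the discussion above:

[code]
# Hypothetical Larrabee configuration that lands on ~1 TFLOP DP with FMA.
cores     = 32        # assumed
dp_lanes  = 8         # a 16-wide SP vector unit is 8-wide for doubles
fma_flops = 2         # fused multiply-add counts as two flops
clock     = 2.0e9     # Hz (assumed)

dp_peak = cores * dp_lanes * fma_flops * clock
print(dp_peak / 1e9)          # ~1024 GFLOPs DP
print(4 * dp_peak / 1e12)     # ~4.1 TFLOPs SP, if DP really were quarter rate
[/code]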
 
Sorry, with that kind of roundabout reasoning, based on slides from a very high-level presentation on a design which probably wasn't even finalized in any way, shape or form, I'd rather just trust my intuition :) I wouldn't expect half speed DP.

PS. of course I didn't expect 1/12th speed DP for NVIDIA either ...
 
Sorry, with that kind of roundabout reasoning, based on slides from a very high-level presentation on a design which probably wasn't even finalized in any way, shape or form, I'd rather just trust my intuition :) I wouldn't expect half speed DP.

PS. of course I didn't expect 1/12th speed DP for NVIDIA either ...

If IBM can get Cell's DP to half of its SP speed, I'd think Intel would aim for the same. I mean, Larrabee is aimed at that sector. If NVIDIA and AMD weren't taunting Intel with all this GPGPU stuff, I'm sure Intel would be content with just CPUs.

78 GFLOPs is a pretty poor peak for the reported size and power consumption of the board. It couldn't even beat the Cell board.
 
PS. of course I didn't expect 1/12th speed DP for NVIDIA either ...
It might be fairer to think of it as 1/8th of the MAD throughput :p

Now that NVidia has decided to count the MUL, though, I'm inclined not to. After all, we're counting 1/5th MAD rate on ATI, though it's 2/5th for ADD.

Sigh, and I said 249 GFLOPs for RV770 based on 777 MHz, not 750 MHz, so 240 GFLOPs :oops:

Jawed
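For anyone following along, here's how those ratios fall out; the clocks (~1.3 GHz GT200 shader, 750 MHz RV770) and the 800-lane RV770 figure are the rumoured/assumed numbers, not confirmed specs:

[code]
# GT200: 240 SP ALUs vs 30 dedicated DP MAD units.
shader_clock  = 1.3e9                          # Hz (assumed)
gt200_sp_mad  = 240 * 2 * shader_clock / 1e9   # MAD only
gt200_sp_full = 240 * 3 * shader_clock / 1e9   # counting the extra MUL as well
gt200_dp      = 30 * 2 * shader_clock / 1e9
print(gt200_sp_mad / gt200_dp, gt200_sp_full / gt200_dp)   # ~8.0, ~12.0

# RV770: one DP MAD per 5-lane VLIW unit -> 160 DP MADs (1/5 of SP MAD rate).
rv770_clock = 750e6                            # Hz (assumed)
print(160 * 2 * rv770_clock / 1e9)             # 240 GFLOPs
[/code]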
 
Either way, isn't 78 GFlops kinda weak even compared to CPUs?
Depends how you define "weak". A current Core2Quad is 2 (vec2 fp64) * 4 (cores) * 2 (mul+add) * clock flops. That's 48 GFLOPs for a 3 GHz C2Q. So the presumed GTX 280 would still be faster, but OTOH it would be in the same ballpark, so you could certainly consider it weak (and it would look like a loser in the flops/power category).
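The same arithmetic written out, if anyone wants to poke at the assumptions:

[code]
# Peak DP for a 3 GHz Core 2 Quad: SSE2 mul + add issued every cycle per core.
vec_width_dp = 2        # 128-bit SSE register / 64-bit doubles
cores        = 4
ports        = 2        # separate multiply and add units
clock        = 3.0e9    # Hz

print(vec_width_dp * cores * ports * clock / 1e9)   # 48 GFLOPs
[/code]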
 
78 GFLOPs is a pretty poor peak for the reported size and power consumption of the board. It couldn't even beat the Cell board.
Well, are we sure it'd draw full power doing DP? I remember it was said G80's texture units drew a lot of power, so maybe if GT200's workload is DP-focused it won't get close to the expected gaming power draw.

This should make for an interesting page in any review's power-draw stats, if they test multiple scenarios (even just power draw during different 3DMark feature tests) rather than aim for peak draw.
 
Well, are we sure it'd draw full power doing DP? I remember it was said G80's texture units drew a lot of power, so maybe if GT200's workload is DP-focused it won't get close to the expected gaming power draw.
You're probably right (assuming all those SP units are just idling along), but power draw would still probably be quite high. BTW, is this supposed separate DP MAD in addition to the 8 single-precision MADs (thus helping with the single-precision flops too), or does it replace one of them?
 