NVIDIA GT200 Rumours & Speculation Thread

Status
Not open for further replies.
If the hit for DP is as big as that image says, NVIDIA is going to give AMD/ATI a huge opportunity to get their foot in the door ... it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.
 
it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.

What if NVIDIA's architecture is twice as fast? (1/8)X >= (1/4)Y iff X >= 2Y

I'm not saying that's the case, but what if ...
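That break-even arithmetic is easy to check with a toy model (the GFLOPS figures below are made up purely for illustration): a 1/8 DP ratio only matches a 1/4 ratio when the SP rate behind it is twice as high.

```python
def dp_rate(sp_rate, dp_fraction):
    """Derived DP rate given a peak SP rate and a DP/SP throughput ratio."""
    return sp_rate * dp_fraction

# Hypothetical numbers: X = 2Y is exactly the break-even case.
nvidia_sp, amd_sp = 1000.0, 500.0
assert dp_rate(nvidia_sp, 1/8) == dp_rate(amd_sp, 1/4)  # 125 GFLOPS each
```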
 
If the hit for DP is as big as that image says, NVIDIA is going to give AMD/ATI a huge opportunity to get their foot in the door ... it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.

How do you know how fast ATI is with DP?
 
Thx :) I hadn't read this article. So it means that ATI has some advantage over NVIDIA, given that NVIDIA's next-gen GT200 will do DP at only 1/8 the speed of SP.

TBH the winner is more likely to be whichever can give best performance on the problem at hand given a finite amount of developer time. Peak FLOPS are great and all, good for impressing women at parties and for press releases, but mean diddly if you can't achieve them in your app.

Moreover, ATI's RV670 can do it at "only" 1/4 the speed of SP, but its next-gen RV770 may bring even better results.
How do you figure that?
 
Thx :) I hadn't read this article. So it means that ATI has some advantage over NVIDIA, given that NVIDIA's next-gen GT200 will do DP at only 1/8 the speed of SP.
Moreover, ATI's RV670 can do it at "only" 1/4 the speed of SP, but its next-gen RV770 may bring even better results.

Hehe, they must have something like a 64bit internal adder to get 1/4 speed 64bit MULs. Pretty cheap, but helps speed a lot.
 
According to that article, running DP at one quarter SP speed is actually a worst case; best case can be as fast as half SP speed. Real-world performance is therefore somewhere in between the two, i.e. faster than one quarter nearly all the time.

Presumably ADDs are 1/2 speed, MULs are 1/4 speed, so the overall headline figure depends on the instruction mix.
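That mix-dependence can be sketched with a toy throughput model (the 1/2 and 1/4 ratios are the presumed figures from the post above, not confirmed numbers):

```python
def effective_dp_ratio(add_fraction):
    """DP throughput relative to SP, assuming DP ADDs run at 1/2 SP
    speed (2 cycles each) and DP MULs at 1/4 SP speed (4 cycles each).
    add_fraction is the share of DP ops that are ADDs."""
    cycles_per_op = add_fraction * 2 + (1 - add_fraction) * 4
    return 1 / cycles_per_op

effective_dp_ratio(0.0)  # all MULs: 0.25, the worst case
effective_dp_ratio(1.0)  # all ADDs: 0.5, the best case
effective_dp_ratio(0.5)  # an even mix lands at 1/3
```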
 
Wtf, NVIDIA's next-gen is a G92-based card and will be called the GF9800GTX :( The clock is about 750-800MHz, which means it's only a few percent faster than the new GF8800GTS.
Then I ask: where is the rumoured GPU with the 512-bit bus?
It seems ATI will destroy NVIDIA with its R7xx GPUs.

PS. Do you think it's the rumoured 1TFLOP GPU from NVIDIA? If yes, the shader clock should be about 2.4GHz.
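For the PS, a rough peak-FLOPS calculation, assuming a G92-like 128 scalar ALUs and counting either 3 flops per clock (MADD + MUL) or 2 (MADD only); both the ALU count and the flops-per-clock figures are assumptions:

```python
def shader_clock_for(target_gflops, alus=128, flops_per_clock=3):
    """Shader clock (GHz) needed to reach a peak-FLOPS target.
    Defaults of 128 ALUs and 3 flops/clock are G92-like assumptions."""
    return target_gflops / (alus * flops_per_clock)

shader_clock_for(1000)                     # ~2.6 GHz counting the extra MUL
shader_clock_for(1000, flops_per_clock=2)  # ~3.9 GHz counting MADD only
```

So the rumoured 2.4 GHz only gets near 1 TFLOP if the "missing MUL" is counted.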
 
Wtf, NVIDIA's next-gen is a G92-based card and will be called the GF9800GTX :( The clock is about 750-800MHz, which means it's only a few percent faster than the new GF8800GTS.
Then I ask: where is the rumoured GPU with the 512-bit bus?
It seems ATI will destroy NVIDIA with its R7xx GPUs.

PS. Do you think it's the rumoured 1TFLOP GPU from NVIDIA? If yes, the shader clock should be about 2.4GHz.

The rumored chip should be GT200/G100, which this thread is about, not 9800-anything.
 
Hehe, they must have something like a 64bit internal adder to get 1/4 speed 64bit MULs. Pretty cheap, but helps speed a lot.
Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do dp multiply(-adds) at 1/4 too.
 
Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do dp multiply(-adds) at 1/4 too.
I'm curious what you're thinking might be the effect of the scalar core.

Jawed
 
All the wires to connect the amalgam dp multiplier to the register set are already there in the vector core; with scalar cores, once you start coupling multipliers you need to connect the multiplier of one core to the register set of the other. You could of course do DP multiplies by quadruple-pumping the multiplier inside a scalar core, but then your latency is twice as high. (Whatever NVIDIA did for the moment is even slightly worse than that, but I dunno why.)
 
All the wires to connect the amalgam dp multiplier to the register set are already there in the vector core; with scalar cores, once you start coupling multipliers you need to connect the multiplier of one core to the register set of the other. You could of course do DP multiplies by quadruple-pumping the multiplier inside a scalar core, but then your latency is twice as high. (Whatever NVIDIA did for the moment is even slightly worse than that, but I dunno why.)
They're most likely doing FP64 MULs by loopback FP32 MADDs but to do it in only 4 cycles you need to do a single cycle >53bit (by my estimation) ADD portion on the MADD units. I'm not sure what you mean by "coupling multipliers" here. If you can't do that >53bit add in a single cycle then you have to take something like 6-8 cycles at least, which is why the 8x hit doesn't seem too surprising to me. You don't need that single cycle 53bit add for FP32.


Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do dp multiply(-adds) at 1/4 too.
It really doesn't matter that the cores are scalar, at least if they're doing it via loopback which is most likely the case.
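The loopback idea, building one wide multiply out of four narrow partial products, can be sketched on bare integer mantissas (a toy model: the 27-bit split is an illustrative choice, and real FP64 also handles exponents, rounding, and normalization):

```python
def wide_mul(a, b, half_bits=27):
    """Multiply two wide mantissas using four narrow multiplies,
    mimicking four passes through an SP-sized multiplier."""
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits
    # Four partial products, shifted and summed. Accumulating these is
    # where the wider-than-SP adder discussed above comes in.
    return (a_lo * b_lo
            + ((a_lo * b_hi + a_hi * b_lo) << half_bits)
            + ((a_hi * b_hi) << (2 * half_bits)))

# Two 53-bit mantissas with the implicit leading 1 set.
a, b = (1 << 52) | 12345, (1 << 52) | 67890
assert wide_mul(a, b) == a * b
```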
 
Have any of you guys considered the possibility of a g90(not g92) core with the same area as g80 but at 65nm and with 384 scalar shader units(at whatever clock they can squeeze)?
 
Have any of you guys considered the possibility of a g90(not g92) core with the same area as g80 but at 65nm and with 384 scalar shader units(at whatever clock they can squeeze)?

The possibility exists, but judging by the current G92-based 8800 GTS 512MB levels, the amount of heat output and power consumption for such a card would be enormous, far surpassing the 8800 Ultra.
 
The possibility exists, but judging by the current G92-based 8800 GTS 512MB levels, the amount of heat output and power consumption for such a card would be enormous, far surpassing the 8800 Ultra.

Plus, it would be quite a large tweak for something that, most assume, performs worse than the GX2...
The 9800GTX has to be G92-based; it simply wouldn't make sense if it wasn't.
 
I'm not sure what you mean by "coupling multipliers" here.
Essentially what you are suggesting is building a virtual array multiplier out of the single precision multiplier, combining the results from 4 cycles. I'm suggesting building a virtual array multiplier by combining 4 results from 2 multipliers from 2 cycles ... with the advantage being lower latency.

On second thought, it's not half the latency though ... the ratio is (sp latency + 1)/(sp latency + 3) ... still a substantial difference, but a lot less impressive. If you have the multipliers side by side I see no reason to go for the higher latency option, but on the other hand no real reason for NVIDIA to try.
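The (sp latency + 1)/(sp latency + 3) estimate is easy to tabulate (the pipeline depths plugged in are arbitrary examples, since the real ones are unknown):

```python
def latency_ratio(sp_latency):
    """Latency of the two-multiplier, 2-cycle scheme relative to the
    quad-pumped, 4-cycle scheme, per the (sp+1)/(sp+3) estimate."""
    return (sp_latency + 1) / (sp_latency + 3)

latency_ratio(1)  # 0.5: the naive "half the latency" intuition
latency_ratio(9)  # ~0.83: much less impressive for a deep pipeline
```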
 
Essentially what you are suggesting is building a virtual array multiplier out of the single precision multiplier, combining the results from 4 cycles. I'm suggesting building a virtual array multiplier by combining 4 results from 2 multipliers from 2 cycles ... with the advantage being lower latency.

On second thought, it's not half the latency though ... the ratio is (sp latency + 1)/(sp latency + 3) ... still a substantial difference, but a lot less impressive. If you have the multipliers side by side I see no reason to go for the higher latency option, but on the other hand no real reason for NVIDIA to try.

For AMD's case that would probably work, but then they need a rather wide forwarding path from the output of one adder to the input of the other. It may slow things down for the common (FP32) case.

For nvidia, they still need to have that wider add (assuming the 8+ cycle DP is true)...
 