NVIDIA GT200 Rumours & Speculation Thread

MfA · Feb 17, 2008

If the hit for DP is as big as that image says, NVIDIA is going to give AMD/ATI a huge opportunity to get their foot in the door ... it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.

BRiT · Feb 17, 2008

MfA said:
it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.

What if Nvidia's architecture is twice as fast? 1/8 X >= 1/4 Y iff X >= 2Y

I'm not saying that's the case, but what if ...
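To make the inequality above concrete, here is a minimal sketch. The throughput numbers are made up purely for illustration; only the 1/8 and 1/4 ratios come from the thread.

```python
def dp_throughput(sp_throughput, dp_ratio):
    """Peak DP throughput given peak SP throughput and the DP/SP ratio."""
    return sp_throughput * dp_ratio

# Hypothetical peak SP figures (arbitrary units), not real chip numbers.
y_sp = 500.0          # chip with a 1/4 DP ratio
x_sp = 2 * y_sp       # chip with a 1/8 DP ratio, exactly twice as fast in SP

# At X = 2Y the two chips tie on peak DP throughput.
print(dp_throughput(x_sp, 1 / 8), dp_throughput(y_sp, 1 / 4))
```

Anything below X = 2Y leaves the 1/8 chip behind on peak DP, which is the whole point of the "what if".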

Domell · Feb 17, 2008

MfA said:
If the hit for DP is as big as that image says, NVIDIA is going to give AMD/ATI a huge opportunity to get their foot in the door ... it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.

How do you know how fast ATI is with DP?

phantommm · Feb 17, 2008

Domell said:
How do you know how fast ATI is with DP?

http://www.tgdaily.com/content/view/35894/135/

Domell · Feb 17, 2008

phantommm said:
http://www.tgdaily.com/content/view/35894/135/

Thx

I hadn't read this article. So it means ATI has some advantage over NVIDIA if NVIDIA's next-gen GT200 does DP at only 1/8 the speed of SP.
Moreover, ATI's RV670 can do it at "only" 1/4 the speed of SP, but its next-gen RV770 may bring even better results.

nutball · Feb 17, 2008

Domell said:
Thx I hadn't read this article. So it means ATI has some advantage over NVIDIA if NVIDIA's next-gen GT200 does DP at only 1/8 the speed of SP.

TBH the winner is more likely to be whichever can give best performance on the problem at hand given a finite amount of developer time. Peak FLOPS are great and all, good for impressing women at parties and for press releases, but mean diddly if you can't achieve them in your app.

Domell said:
Moreover, ATI's RV670 can do it at "only" 1/4 the speed of SP, but its next-gen RV770 may bring even better results.

How do you figure that?

Farhan · Feb 17, 2008

Domell said:
Thx I hadn't read this article. So it means ATI has some advantage over NVIDIA if NVIDIA's next-gen GT200 does DP at only 1/8 the speed of SP.
Moreover, ATI's RV670 can do it at "only" 1/4 the speed of SP, but its next-gen RV770 may bring even better results.

Hehe, they must have something like a 64bit internal adder to get 1/4 speed 64bit MULs. Pretty cheap, but helps speed a lot.

nicolasb · Feb 18, 2008

phantommm said:
http://www.tgdaily.com/content/view/35894/135/

According to that article, running DP at one quarter SP speed is actually a worst case; best case can be as fast as half SP speed. Real-world performance is therefore somewhere in between the two, i.e. faster than one quarter nearly all the time.

nutball · Feb 18, 2008

nicolasb said:
According to that article, running DP at one quarter SP speed is actually a worst case; best case can be as fast as half SP speed. Real-world performance is therefore somewhere in between the two, i.e. faster than one quarter nearly all the time.

Presumably ADDs are 1/2 speed, MULs are 1/4 speed, so the overall headline figure depends on the instruction mix.
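A quick sketch of how the headline figure moves with the instruction mix, taking the presumed 1/2 (ADD) and 1/4 (MUL) rates above as assumptions rather than confirmed numbers. The function name and the example mixes are purely illustrative.

```python
from fractions import Fraction

def dp_speed_vs_sp(add_fraction,
                   add_ratio=Fraction(1, 2),
                   mul_ratio=Fraction(1, 4)):
    """Effective DP speed relative to SP for a mix of ADDs and MULs.

    add_fraction: share of DP instructions that are ADDs (0..1).
    add_ratio / mul_ratio: assumed DP/SP speed of each instruction type.
    """
    add_fraction = Fraction(add_fraction)
    # Average cost per DP instruction, measured in SP-instruction times.
    cost = add_fraction / add_ratio + (1 - add_fraction) / mul_ratio
    return 1 / cost

for mix in (Fraction(0), Fraction(1, 2), Fraction(1)):
    print(mix, dp_speed_vs_sp(mix))
```

So an even split of ADDs and MULs would land at 1/3 of SP speed under these assumptions, between the 1/4 worst case and the 1/2 best case.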

Domell · Feb 18, 2008

Wtf, NVIDIA's next-gen is a G92-based card and will be called GF9800GTX.

The clock is about 750-800 MHz, which means it's only a few percent faster than the new GF8800GTS.
So I ask: where is the rumoured GPU with the 512-bit bus?
It seems ATI will destroy NVIDIA with its R7xx GPUs.

PS. Do you think it's the rumoured 1 TFLOP GPU from NVIDIA? If yes, the shader clock should be about 2.4 GHz.
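For anyone wanting to check the shader-clock arithmetic, a rough sketch follows. It assumes 128 scalar shader units (as on G92) and 3 flops per unit per clock (counting MAD + MUL); neither figure is stated in the post, so treat both as assumptions.

```python
def required_shader_clock_hz(target_flops, shader_units, flops_per_unit_per_clock):
    """Shader clock needed to hit a peak FLOPS target."""
    return target_flops / (shader_units * flops_per_unit_per_clock)

# Assumed configuration: 128 units, 3 flops per unit per clock.
clock = required_shader_clock_hz(1e12, 128, 3)
print(f"{clock / 1e9:.2f} GHz")

# Peak rate the same assumed configuration would reach at 2.4 GHz.
print(f"{128 * 3 * 2.4e9 / 1e9:.1f} GFLOPS")
```

Under those assumptions a full 1 TFLOP needs a bit more than 2.4 GHz, while 2.4 GHz gets you roughly 0.92 TFLOP.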

Kaotik · Feb 18, 2008

Domell said:
Wtf, NVIDIA's next-gen is a G92-based card and will be called GF9800GTX. The clock is about 750-800 MHz, which means it's only a few percent faster than the new GF8800GTS.
So I ask: where is the rumoured GPU with the 512-bit bus?
It seems ATI will destroy NVIDIA with its R7xx GPUs.

PS. Do you think it's the rumoured 1 TFLOP GPU from NVIDIA? If yes, the shader clock should be about 2.4 GHz.

The rumored chip should be GT200/G100, which this thread is about, not 9800-anything

MfA · Feb 18, 2008

Farhan said:
Hehe, they must have something like a 64bit internal adder to get 1/4 speed 64bit MULs. Pretty cheap, but helps speed a lot.

Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do dp multiply(-adds) at 1/4 too.

Jawed · Feb 18, 2008

MfA said:
Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do dp multiply(-adds) at 1/4 too.

I'm curious what you're thinking might be the effect of the scalar core.

Jawed

MfA · Feb 18, 2008

All the wires to connect the amalgam dp multiplier to the register set are already there in the vector core; with scalar cores, when you start coupling multipliers, you need to connect the multiplier of one core to the register set of the other. You could of course do DP multiplies by quadruple pumping the multiplier inside a scalar core, but then your latency is twice as high. (Whatever NVIDIA did for the moment is even slightly worse than that, but I dunno why.)

Farhan · Feb 18, 2008

MfA said:
All the wires to connect the amalgam dp multiplier to the register set are already there in the vector core; with scalar cores, when you start coupling multipliers, you need to connect the multiplier of one core to the register set of the other. You could of course do DP multiplies by quadruple pumping the multiplier inside a scalar core, but then your latency is twice as high. (Whatever NVIDIA did for the moment is even slightly worse than that, but I dunno why.)

They're most likely doing FP64 MULs by looping back through the FP32 MADDs, but to do it in only 4 cycles the ADD portion of the MADD units has to handle a >53-bit add (by my estimation) in a single cycle. I'm not sure what you mean by "coupling multipliers" here. If you can't do that >53-bit add in a single cycle, then you have to take something like 6-8 cycles at least, which is why the 8x hit doesn't seem too surprising to me. You don't need that single-cycle 53-bit add for FP32.

MfA said:
Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do dp multiply(-adds) at 1/4 too.

It really doesn't matter that the cores are scalar, at least if they're doing it via loopback which is most likely the case.
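A minimal sketch of the loopback idea: build a wide mantissa multiply from four narrower partial products, one per pass, and accumulate them with shifts. The 27-bit split is only an illustration of the "4 cycles" case; an FP32 mantissa multiplier is 24 bits wide, so real hardware would need a widened multiplier or more passes. The function name is made up.

```python
def mul_via_partial_products(a, b, half_bits=27):
    """Multiply two 53-bit mantissas using four narrower multiplies."""
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits

    # One partial product per pass through the multiplier, with the
    # bit position at which it must be added into the accumulator.
    partials = [
        (a_lo * b_lo, 0),
        (a_lo * b_hi, half_bits),
        (a_hi * b_lo, half_bits),
        (a_hi * b_hi, 2 * half_bits),
    ]

    acc = 0
    for product, shift in partials:
        # This accumulate step is the wide add being discussed: the
        # running sum grows well past 53 bits.
        acc += product << shift
    return acc

a = (1 << 53) - 1
b = (1 << 52) + 12345
print(mul_via_partial_products(a, b) == a * b)
```

The multiplies themselves stay narrow; it is the accumulation that needs the wide adder, which is where the cycle count goes if that add can't be done in one cycle.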

compres · Feb 19, 2008

Have any of you guys considered the possibility of a G90 (not G92) core with the same area as G80 but at 65nm and with 384 scalar shader units (at whatever clock they can squeeze)?

INKster · Feb 19, 2008

compres said:
Have any of you guys considered the possibility of a G90 (not G92) core with the same area as G80 but at 65nm and with 384 scalar shader units (at whatever clock they can squeeze)?

The possibility exists, but judging by the current G92-based 8800 GTS 512MB levels, the amount of heat output and power consumption for such a card would be enormous, far surpassing the 8800 Ultra.

LordEC911 · Feb 19, 2008

INKster said:
The possibility exists, but judging by the current G92-based 8800 GTS 512MB levels, the amount of heat output and power consumption for such a card would be enormous, far surpassing the 8800 Ultra.

Plus it would be quite a large tweak for something that, most assume, performs worse than the GX2...
The 9800GTX has to be G92-based; it simply wouldn't make sense if it wasn't.

MfA · Feb 19, 2008

Farhan said:
I'm not sure what you mean by "coupling multipliers" here.

Essentially what you are suggesting is building a virtual array multiplier out of the single precision multiplier, combining the results from 4 cycles. I'm suggesting building a virtual array multiplier by combining 4 results from 2 multipliers from 2 cycles ... with the advantage being lower latency.

On second thought, it's not half the latency though ... the ratio is (sp latency + 1)/(sp latency + 3) ... still a substantial difference, but a lot less impressive. If you have the multipliers side by side I see no reason to go for the higher latency option, but on the other hand no real reason for NVIDIA to try.
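To put numbers on that ratio, a small sketch, reading the formula as: two multipliers over 2 cycles costs one extra cycle on top of SP latency, one multiplier over 4 cycles costs three. The SP latency values below are placeholders, not known figures for any chip.

```python
from fractions import Fraction

def dp_latency_ratio(sp_latency):
    """Latency of the 2-multiplier/2-cycle scheme relative to the
    1-multiplier/4-cycle scheme: (sp latency + 1) / (sp latency + 3)."""
    return Fraction(sp_latency + 1, sp_latency + 3)

# Placeholder SP pipeline latencies in cycles.
for sp_latency in (1, 4, 8, 16):
    ratio = dp_latency_ratio(sp_latency)
    print(sp_latency, ratio, float(ratio))
```

The ratio only gets near 1/2 for a very short pipeline; the deeper the SP pipeline, the closer it creeps to 1 and the smaller the latency win.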

Farhan · Feb 19, 2008

MfA said:
Essentially what you are suggesting is building a virtual array multiplier out of the single precision multiplier, combining the results from 4 cycles. I'm suggesting building a virtual array multiplier by combining 4 results from 2 multipliers from 2 cycles ... with the advantage being lower latency.

On second thought, it's not half the latency though ... the ratio is (sp latency + 1)/(sp latency + 3) ... still a substantial difference, but a lot less impressive. If you have the multipliers side by side I see no reason to go for the higher latency option, but on the other hand no real reason for NVIDIA to try.

For AMD's case that would probably work, but then they need to have a rather wide forwarding path between the output of one adder to the input of the other. It may slow things down for the common (FP32) case.

For nvidia, they still need to have that wider add (assuming the 8+ cycle DP is true)...
