If the hit for DP is as big as that image says, NVIDIA is going to give AMD/ATI a huge opportunity to get their foot in the door ... it doesn't matter how good NVIDIA's architecture is, a 1/4 performance ratio vs. 1/8 seems unlikely to be overcome.
Thanks, I hadn't read this article. So it means that ATI has some advantage over NVIDIA if NVIDIA's next-gen GT200 does DP at only 1/8 the speed of SP.
How do you figure that?
Moreover, ATI's RV670 can do it at "only" 1/4 the speed of SP, but its next-gen RV770 may bring even better results.
According to that article, running DP at one quarter SP speed is actually a worst case; best case can be as fast as half SP speed. Real-world performance is therefore somewhere in between the two, i.e. faster than one quarter nearly all the time.
Wtf, NVIDIA's next-gen is a G92-based card and will be called GF9800GTX. The clock is about 750-800 MHz, which means it's only a few percent faster than the new GF8800GTS.
Then I ask, where is the rumoured GPU with a 512-bit bus?
It seems ATI will destroy NVIDIA with its R7xx GPUs.
PS. Do you think it's the rumoured 1 TFLOP GPU from NVIDIA? If yes, the shader clock should be about 2.4 GHz.
Hehe, they must have something like a 64-bit internal adder to get 1/4 speed 64-bit MULs. Pretty cheap, but helps speed a lot.
Maybe NVIDIA had a more difficult time because of the scalar cores, although I'm sure when they have their next architecture they will do DP multiply(-adds) at 1/4 too.
I'm curious what you're thinking might be the effect of the scalar core.
All the wires to connect the amalgam DP multiplier to the register set are already there in the vector core. With scalar cores, when you start coupling multipliers you need to connect the multiplier of one core to the register set of the other. You could of course do DP multiplies by quadruple-pumping the multiplier inside a scalar core, but then your latency is twice as high. (Whatever NVIDIA did for the moment is even slightly worse than that, but I dunno why.)
They're most likely doing FP64 MULs by looping back FP32 MADDs, but to do it in only 4 cycles you need a single-cycle >53-bit (by my estimation) ADD portion on the MADD units. I'm not sure what you mean by "coupling multipliers" here. If you can't do that >53-bit add in a single cycle, then you have to take something like 6-8 cycles at least, which is why the 8x hit doesn't seem too surprising to me. You don't need that single-cycle 53-bit add for FP32.
It really doesn't matter that the cores are scalar, at least if they're doing it via loopback, which is most likely the case.
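For anyone following the loopback argument, here is a minimal sketch of the general idea: a wide multiply built from four narrow partial products that are then summed with a wide add. The function names and the 27-bit half width are illustrative assumptions (chosen only so that a 53-bit significand fits), not a description of what either vendor's hardware actually does.

```python
def split(x, half_bits):
    """Split a non-negative integer into (high, low) parts of half_bits each."""
    mask = (1 << half_bits) - 1
    return x >> half_bits, x & mask


def wide_multiply(a, b, half_bits=27):
    """Multiply two wide unsigned integers using four narrow multiplies.

    half_bits=27 is a hypothetical width picked so a 53-bit significand
    fits in two halves; real multiplier widths may differ.
    """
    a_hi, a_lo = split(a, half_bits)
    b_hi, b_lo = split(b, half_bits)

    # Four narrow products: four passes through one multiplier (loopback),
    # or two passes through a pair of multipliers working side by side.
    partials = [
        (a_lo * b_lo, 0),
        (a_lo * b_hi, half_bits),
        (a_hi * b_lo, half_bits),
        (a_hi * b_hi, 2 * half_bits),
    ]

    # The accumulation step is where the wide adder comes in: each shifted
    # partial product has to be added into a result wider than one narrow product.
    result = 0
    for product, shift in partials:
        result += product << shift
    return result
```

The point of the sketch is only that the multiplies are narrow while the adds are not, which is why the width of the adder decides how many cycles the whole thing takes.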
Have any of you guys considered the possibility of a G90 (not G92) core with the same area as G80 but at 65nm and with 384 scalar shader units (at whatever clock they can squeeze)?
The possibility exists, but judging by the current G92-based 8800 GTS 512MB levels, the amount of heat output and power consumption for such a card would be enormous, far surpassing the 8800 Ultra.
Essentially what you are suggesting is building a virtual array multiplier out of the single precision multiplier, combining the results from 4 cycles. I'm suggesting building a virtual array multiplier by combining 4 results from 2 multipliers from 2 cycles ... with the advantage being lower latency.
On second thought, it's not half the latency though ... the ratio is (sp latency + 1)/(sp latency + 3) ... still a substantial difference, but a lot less impressive. If you have the multipliers side by side I see no reason to go for the higher latency option, but on the other hand no real reason for NVIDIA to try.
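To put numbers on that ratio, here is a small sketch. It assumes the reading that one multiplier needs three extra issue cycles for the four partial products while a coupled pair needs one extra cycle; the sample SP latencies are hypothetical, not measured values for any GPU.

```python
def dp_latency_ratio(sp_latency):
    """Ratio of coupled-multiplier DP latency to loopback DP latency."""
    coupled = sp_latency + 1   # 2 multipliers: 4 partial products in 2 passes
    loopback = sp_latency + 3  # 1 multiplier: 4 partial products in 4 passes
    return coupled / loopback


# Hypothetical single-precision pipeline latencies, in cycles.
for sp_latency in (4, 8, 16):
    print(sp_latency, round(dp_latency_ratio(sp_latency), 2))
```

With a 4-cycle SP latency the ratio is 5/7, and it moves toward 1 as the pipeline gets deeper, so the latency advantage shrinks on deeply pipelined units.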