NVIDIA GF100 & Friends speculation

Notice the 7 "PolyMorph engines". That explains the Heaven scores…

I disagree with that. If you look at NVIDIA's slides, you'll see the GTX 460 outperforming the Radeon HD 5830, which doesn't happen in Gigabyte's slides.
The catch? Different configurations: the resolution is lower in NVIDIA's slide than in Gigabyte's, and there's no AA in NVIDIA's.
That tells me the problem in the Gigabyte results is running out of memory, or a bottleneck on the 192-bit memory bus.
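To put rough numbers on that bus-width argument, here's a quick back-of-the-envelope calc. The 3.6 GT/s per-pin rate below is just an assumed GDDR5 effective data rate for illustration, not a confirmed GTX 460 spec:

#include <stdio.h>

/* Theoretical peak bandwidth = (bus width in bits / 8) bytes per transfer
   * effective data rate in GT/s. The 3.6 GT/s figure is an assumption. */
static double peak_bw_gbps(int bus_bits, double data_rate_gtps) {
    return bus_bits / 8.0 * data_rate_gtps; /* GB/s */
}

int main(void) {
    printf("192-bit @ 3.6 GT/s: %.1f GB/s\n", peak_bw_gbps(192, 3.6)); /* 86.4 */
    printf("256-bit @ 3.6 GT/s: %.1f GB/s\n", peak_bw_gbps(256, 3.6)); /* 115.2 */
    return 0;
}

At the same per-pin rate, the 192-bit bus gives up a quarter of the bandwidth, which is where an AA/high-resolution bottleneck would show first.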
 
But mightn't they attain significantly higher memory clocks?
I don't know. Does NVIDIA fixing GDDR5 performance per pin have anything to do with going to 32 nm? It seems unlikely to me. NVIDIA still isn't achieving the GDDR5 speeds you see on the 55 nm HD 4870.

Though to be fair, my earlier point about the huge distance between the memory controllers and the I/O pads in GF100, and its effect on power/clocks, could be key here. That should be lessened in GF104, since the die is smaller overall, and it would be better still in a 32 nm chip.

---

Maybe NVidia's saving up a full-spec GF104 with 1GHz+ GDDR5 for later?
 
So... how many GPCs?
That's the big question, isn't it? Ever since I got used to the idea that GF104 has 384 SPs, I've wondered why it's a "4" - clearly that's not standard NVIDIA nomenclature, since a "4" should be roughly 1/4 of the full chip, not more than 1/2! One possible explanation is that it refers to the number of GPCs, of which there would be only one...
 
48 SPs -- does that mean a third warp scheduler per cluster?
Fermi appears to send all even warps to one SIMD and all odd warps to the other (page 10 of the Architecture Whitepaper).

Is that actually what it's doing? Would GF104 be doing the same? Implying 24-wide SIMDs?

Since the register file is common to the SIMDs, the banking preferably "matches up" with the SIMD width.

Dunno, really, which way it goes.
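For what it's worth, here's a toy model of that even/odd dispatch idea - pure illustration, nothing here comes from the whitepaper beyond the parity split itself:

#include <stdio.h>

/* Toy model: warp parity picks the SIMD, per the whitepaper's even/odd
   description. With a hypothetical third scheduler/SIMD the clean 1-bit
   split no longer works and you'd need something like warp_id % 3. */
static int simd_for_warp(int warp_id, int num_simds) {
    return warp_id % num_simds; /* num_simds == 2: even/odd split */
}

int main(void) {
    for (int w = 0; w < 6; ++w)
        printf("warp %2d -> SIMD %d (2-way) / SIMD %d (3-way)\n",
               w, simd_for_warp(w, 2), simd_for_warp(w, 3));
    return 0;
}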
 
From the GF100 die shot I simply can't figure out how a third SIMD "pack" would be implemented within the current multiprocessor configuration. It would break all the nice symmetry in there and bloat the whole thing, so a revamped warp size, achieved by extending the two existing SIMD arrays, is possible.
 
A warp size of 32 seems "important" to NVIDIA; there's sort of a promise to CUDA developers that it will stay at 32. I forgot about that earlier, so now I think it's 3 SIMDs.
 
For CUDA/OpenCL/DirectCompute, I think GF104 may be somewhat of a step backwards. GF100 has 64 kB of on-chip memory / (2 schedulers * 24 warps/scheduler * 32 work-items/warp) ≈ 42 bytes per work-item. If GF104 has 3 schedulers and keeps the L1/local store the same, then GF104 may have 64 kB / (3 schedulers * 24 warps/scheduler * 32 work-items/warp) ≈ 28 bytes per work-item of on-chip memory. This will make it harder to program than GF100.
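Spelling that arithmetic out (the GF104 scheduler count is pure speculation here; the GF100 figures are from the whitepaper):

#include <stdio.h>

/* On-chip bytes per resident work-item = local store size / resident
   work-items. The 3-scheduler GF104 configuration is speculative. */
static double bytes_per_item(int kib, int schedulers, int warps, int warp_size) {
    return kib * 1024.0 / (schedulers * warps * warp_size);
}

int main(void) {
    printf("GF100: %.1f B/item\n", bytes_per_item(64, 2, 24, 32));              /* ~42.7 */
    printf("GF104 (speculative): %.1f B/item\n", bytes_per_item(64, 3, 24, 32)); /* ~28.4 */
    return 0;
}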

I also wonder what they've done with the L2 cache in GF104...
 
So, has anyone got extreme-tessellation benches in Heaven from other sources? I can't remember the difference between the HD 5870 and the GTX 480 being that big before.
 
I just want to see nVidia put out a DX11 card with 8800 GTX-level performance with low enough power requirements that it only needs a single slot. I'm getting the impression that that's not remotely reasonable until at least the refresh of this architecture (presumably next year).
 
Aiming rather low, aren't we?
A GTS 450/GF106 at high clocks should make short work of that.

ATI already did it with the 5750/5770; they're dual-slot just because Eyefinity screams ports - the cards are technically single-slot but bracketed for two. If perf/W gets closer to Evergreen for GF10X, then you might see your card there.

But then again, if you didn't already jump on a 4850 (which is essentially that card, sans DX11), this seems marvelous.
 
Re: GF104 possibly dropping to ~28 bytes of on-chip memory per work-item with 3 schedulers - they could just reduce the warps per scheduler to 16.
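Plugging that into the same formula (assuming the local store stays at 64 kB): 64 kB / (3 schedulers * 16 warps/scheduler * 32 work-items/warp) = 65536 / 1536 ≈ 42.7 bytes per work-item - right back at GF100's figure.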
 