Could that be the reason the "per-flop" efficiency seems to have gone down? In GF104 it appears that only 2 warps can issue at a time, and each warp can then issue up to 2 instructions. So we're talking about warp A issuing a MAD with operands 1, 2 and 3, warp A issuing a second MAD with operands 4, 5 and 6, warp B issuing a MAD with operands 7, 8 and 9, and warp B issuing a store with operand 10.
So the question is, is there enough register file bandwidth to support issue with those 10 operands?
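To make the operand counting concrete, here is a rough sketch of the arithmetic. The per-instruction operand counts (3 for a MAD, 1 for a store) are just the ones used in the scenario above, and `read_ports` is a purely hypothetical parameter; I don't know what the real register file can deliver per cycle.

```python
# Register operands per instruction type, as counted in the scenario above.
OPERANDS = {"MAD": 3, "STORE": 1}


def operands_per_issue(issue_slots):
    """Sum the register operands over every (warp, instruction) pair issued together."""
    return sum(OPERANDS[op] for _warp, op in issue_slots)


def fits(issue_slots, read_ports):
    """True if a register file with `read_ports` operand reads per cycle
    (hypothetical figure, not a known GF104 spec) could feed this issue group."""
    return operands_per_issue(issue_slots) <= read_ports


# Two warps, up to two instructions each: MAD + MAD from warp A, MAD + store from warp B.
scenario = [("A", "MAD"), ("A", "MAD"), ("B", "MAD"), ("B", "STORE")]
total = operands_per_issue(scenario)
print(total)
```

If the register file could only supply fewer operands than `total` in one cycle, the second instruction of one warp would have to wait, which would show up as lower per-flop efficiency.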
Compared to the GTX465, it has more of pretty much everything once clocks are taken into account - memory bandwidth, ROPs, flops, TMUs, even SFUs (contrary to what I believed, SFUs per SM have also increased), and a lot more of the latter two in fact. It does have less DP throughput and less setup/raster capability, but I seriously doubt any of that makes a difference in games (unless tessellation is used, at least). Yet it is still only about as fast as the GTX465.
As a product, though, it is pretty impressive imho - considering the lackluster direct competitor, the HD5830, that isn't really unexpected, but this chip still looks good. Idle power draw is very low (with the help of downvolting the memory at idle, which is a first for desktop graphics cards at least). It can't quite reach the HD5850 (though a slightly faster clocked full chip could), but it isn't really supposed to.
Also, it definitely looks like the MC clock problems are gone. The cards use 1GHz GDDR5 memory and seem to OC to 1.05GHz quite easily (might be a tad worse than what Evergreen cards can do, but it is still over the rated speed).
Chip OC itself is also very good, reaching frequencies not possible since G92b. The GTX465 is sooooo dead, and contrary to what I believed, it in fact looks like a full-chip, higher-clocked GF104 could replace the GTX470 (it looks to me like cards clocked at 800/1600/1100MHz with very slightly higher voltage would be a quite viable product).
I'm impressed, though, with the amount of changes in the SMs compared to GF100. It almost feels like a different generation to me. Nvidia touted how scalable GF100 is, yet what they did with GF104 is almost a new architecture: different dispatch, more ALUs, SFUs and TMUs per SM, and a different DP implementation (I wonder what that looks like internally; the official word is that one 16xALU is DP capable at quarter speed, so is it indeed possible to dispatch two single precision instructions simultaneously?).