NVIDIA GF100 & Friends speculation

In GF104 it appears that only 2 warps can issue at a time. Each warp can then issue up to 2 instructions. So we're talking about warp A issuing a MAD with operands 1, 2 and 3, warp A issuing a MAD with operands 4, 5 and 6, warp B issuing a MAD with operands 7, 8 and 9, and warp B issuing a store with operand 10.
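Just to make that scenario concrete, here is a minimal CUDA sketch (a hypothetical kernel of my own, nothing from NVIDIA) of the kind of instruction pair being described - two independent MADs in the same warp, each needing three register source operands:

Code:
// Hypothetical sketch: two independent MAD/FMA-class instructions per thread, so a
// single warp offers the scheduler a dual-issue candidate. Each instruction reads
// three register operands, so one dual-issued pair from a warp already needs ~6
// operand reads per warp per cycle.
__global__ void paired_mads(float *out, const float *a, const float *b,
                            const float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i] * b[i] + c[i];   // MAD #1
        float y = b[i] * c[i] + a[i];   // MAD #2, independent of #1
        out[i] = x + y;
    }
}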

So the question is, is there enough register file bandwidth to support issue with those 10 operands?
Could that be the reason the "per-flop" efficiency seems to have gone down?
Compared to the GTX465, it has more of pretty much everything once clocks are taken into account - memory bandwidth, ROPs, flops, TMUs, even SFUs (contrary to what I believed, SFUs have also increased per SM), and of the latter two a lot more in fact. It does have less DP throughput and less setup/raster capability, but I seriously doubt any of that makes a difference in games (unless tessellation is used, at least). Yet it is still only about as fast as the GTX465.

As a product, though, it is pretty impressive imho - of course, considering the lackluster direct competitor, the HD5830, that isn't really unexpected, but this chip still looks good. Very low idle power draw (with the help of downvolting the memory at idle, which is a first, at least for desktop graphics cards). It can't quite reach the HD5850 (though a slightly faster clocked full chip could), but it isn't really supposed to.

Also, it definitely looks like the MC clock problems are gone. The cards use 1 GHz GDDR5 memory and seem to OC to 1.05 GHz quite easily (might be a tad worse than what Evergreen cards can do, but still above rated speed).
Chip OC itself is also very good, reaching frequencies not seen since G92b. The GTX465 is sooooo dead, and contrary to what I believed, it now looks like a full-chip, higher-clocked GF104 could in fact replace the GTX470 (it looks to me like very slightly higher-voltage cards clocked at 800/1600/1100 MHz would be a quite viable product).

I'm impressed, though, by the amount of changes in the SMs compared to GF100. It almost feels like a different generation to me. Nvidia touted how scalable GF100 is, yet what they did with GF104 is almost a new architecture: different dispatch, more ALUs, SFUs and TMUs per SM, and a different DP implementation (I wonder what that looks like internally - the official word is that one 16-wide ALU block is DP capable at quarter speed, so is it indeed possible to dispatch two single precision instructions simultaneously?).
 
different DP implementation (I wonder what that looks like internally - the official word is that one 16-wide ALU block is DP capable at quarter speed, so is it indeed possible to dispatch two single precision instructions simultaneously?).
I don't think it's "different". It's just the way they do DP on consumer-grade Fermis, by emulating it with the SP ALUs. It's 1/12 the speed.

Some thoughts about the "superscalar" warp schedulers. Is the second dispatched instruction really from the *same* warp or simply from the next in the queue? As there are many warps in flight I can't understand why NVIDIA would exploit ILP on a warp level.
 
As there are many warps in flight I can't understand why NVIDIA would exploit ILP on a warp level.
I think it's because they can't guarantee that there will always be a large number of warps in flight. So if the number of warps drops, they can still keep the execution units active through ILP.
 
Some thoughts about the "superscalar" warp schedulers. Is the second dispatched instruction really from the *same* warp or simply from the next in the queue? As there are many warps in flight I can't understand why NVIDIA would exploit ILP on a warp level.
I suspect the approach NVidia has taken is because operand collection in this architecture is like ATI's, not like G80's. That's my interpretation.

By that I mean that operand collection is operating with a very short window, only for the next instruction's issue. 2 hot clocks of operand issue in NVidia, 4 in ATI.

In G80 operand collection was providing the functionality of a gather unit, for registers or memory - operands spend some time waiting there until all operands for all ALU lanes are ready to be issued for a single instruction. In Fermi operand collection doesn't need to do this, because that's the load/store unit's problem. All operands for ALU instructions solely come from the register file.

This type of operand collector is essentially a crossbar - but it crossbars in time and space, not just space. Throughput is constant, basically.

I'm not sure about the handling of constants in Fermi architecture. Constants can provoke waterfalling in ATI (as can shared memory reads or writes). Waterfalling stalls ATI. I don't know how Fermi avoids stalls for indexed or bank-conflicting constant fetches.
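For what it's worth, a minimal sketch of the waterfalling case (hypothetical kernel, names mine): a __constant__ read is effectively a broadcast when every thread in a warp uses the same address, but per-thread indices force the fetches to be served one address at a time.

Code:
// Hypothetical example of an indexed constant fetch. If idx[] varies across the
// threads of a warp, the constant reads can no longer be broadcast and get
// serialized - the stall/waterfall case being discussed.
__constant__ float coeff[256];

__global__ void indexed_const(float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeff[idx[i]];   // per-thread index into constant memory
}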

Additionally, I don't know the details of register file banking/porting/bandwidth in Fermi.

Jawed
 
I don't think it's "different". It's just the way they do DP on consumer-grade Fermis, by emulating it with the SP ALUs. It's 1/12 the speed.

Some thoughts about the "superscalar" warp schedulers. Is the second dispatched instruction really from the *same* warp or simply from the next in the queue? As there are many warps in flight I can't understand why NVIDIA would exploit ILP on a warp level.
I've seen rather underwhelming instruction rates with dependencies, so that seemingly indicates ILP extraction.
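A sketch of the kind of test that observation presumably comes from (an assumed setup, not the poster's actual benchmark): a long chain of dependent FMAs gives the scheduler nothing to pair within a warp, so if dual-issue relies on ILP the achieved rate should fall well short of peak.

Code:
// Hypothetical dependency-bound kernel: every FMA consumes the previous result, so
// within one warp there is never a second independent instruction to co-issue.
__global__ void dep_chain(float *out, float a, float b, int iters)
{
    float x = a;
    for (int i = 0; i < iters; ++i)
        x = x * b + a;            // each FMA depends on the one before it
    if (threadIdx.x == 0)
        out[blockIdx.x] = x;      // keep the result live so it isn't optimized away
}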
 
Another interesting bit is that GF104 has a very low pixel fillrate (compared to the HD5850 it's downright pathetic). I didn't think of that previously (Damien's article reminded me), but it can still only output 2 pixels/clock per SM, hence 14 pixels/clock in the GTX460 configuration - whereas the ROPs (well, with the 1GB configuration) could handle 32 pixels/clock! Certainly GF100 suffered from that too, but nowhere near to that extreme a level; it looks a little unbalanced. It probably explains why the performance difference between the two GTX460 versions isn't that big - it has way more ROPs than needed anyway, so what difference there is comes very likely from memory bandwidth alone. I'm wondering if this 14 pixel/clock limitation is also a contributing factor to why the GTX460 isn't faster than the GTX465.
Also, I find it very surprising that the TMUs can now do full-speed FP16 filtering - the tex:alu ratio has increased AND the TMUs are more capable?
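To put rough numbers on the fillrate point (assuming the reference ~675 MHz core clock - my assumption, not from the article): 7 SMs × 2 pixels/clock = 14 pixels/clock, or about 9.5 Gpixels/s out of the SMs, while the 32 ROPs at the same clock could consume about 21.6 Gpixels/s - the back end is more than twice as wide as what the front can feed it.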

Now that we know what GF104 looks like, what do you expect for GF106/GF108? For the former, just the same as GF104 but with only 1 GPC and half the MC/ROPs? Or some more changes to the SMs themselves again? Or still 2 GPCs but only half the SMs per GPC? Still with DP? Or...? At least, no matter what it ends up looking like, it should finally enable Nvidia to retire their oldtimer, G92b - something they still can't do with GF104 (though that one should certainly make GT200b obsolete). And I guess for GF108 1 GPC is pretty much a given, with just 2 SMs? And 64-bit GDDR5?
 
I don't think it's "different". It's just the way they do DP on consumer-grade Fermis, by emulating it with the SP ALUs. It's 1/12 the speed.
I don't think that's the case. GF100 consumer grade cards have 1/8 the DP throughput, and the official wording is that it's artificially limited to 1/4 of what the chip could do, not that it's emulated.
The "one ALU at 1/4 performance" quote (hence 1/12 in total), however, doesn't sound like "real" emulation either - more like a solution similar to Cypress. Except that AMD glues together 4 (slightly enhanced for this purpose) ALUs, whereas Nvidia would loop 4 times over the same (similarly enhanced) ALU. And in contrast to AMD, not all ALUs take part in this, which I guess means there's more hope we'll see it in all the lower-end products, not just GF104, since the enhanced ALUs should be very cheap to begin with and only 1/3 of them are enhanced.
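For what it's worth, the arithmetic does work out under that description: a GF104 SM has 48 SP lanes (3 × 16), and if only one 16-lane block handles DP at quarter rate, that's 16/4 = 4 DP results per clock against 48 SP results per clock - i.e. the quoted 1/12.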
 
Given the efficiencies ATI reaches with their 5-way VLIW, it doesn't seem like the compiler is facing a huge challenge in extracting an ILP of 2 on ALU bound shader code. I also wonder whether the schedulers factor ILP into the decision of which warps to schedule, or whether that's something handled solely at the dispatcher level.
 
http://www.semiaccurate.com/2010/07/12/nvidia-backpedals-gf100gtx480-underfill/

I just noticed that coincidentally on the day of the GTX 460's release Charlie is bringing up Bumpgate in association with the GTX 470 and 480. Specifically that nVidia went back to the problematic Namics 8439-1 again. Is there any independent confirmation of this and whether other design changes were made to mitigate concerns? I haven't read any reports of GTX 470/480 chip failures, although admittedly it's still early days. Presumably any bump problems would show up on the GTX 480M first since laptops see more thermal cycling from being off and on.

Interestingly, the GTX 460 wasn't roped into the story, so presumably that is safe?
 
Well, that really nasty price war that has weakened both companies since the launch of RV770 kind of skews the value proposition of the cards that followed (from both Nvidia and ATI). I don't think either company wants to re-engage in that type of cut-throat price war if they can avoid it.

That's what I was thinking, too, but looking at the numbers it becomes apparent that we're simply stagnating, and the GTX260's price was actually nothing out of the ordinary when looking at it from a historical point of view:

Code:
Date          Price     Videocard                   Performance improvement

April 2002:   ~150€     GeForce 4 MX-440
April 2003:   ~150€     9600 Non-Pro                ~100%
April 2004:   ~150€     5700 Ultra/9600 XT          ~50%
April 2005:   ~150€     6600 GT/9800 Pro            ~100%
April 2006:   ~150€     7600 GT/X1600 XT            ~100%
April 2007:   ~150€     X1950 Pro 512 MiB           ~100%
April 2008:   ~150€     8800 GT 512 MiB             ~125%
April 2009:   ~150€     HD4870/GTX 260 896 MiB      ~60%
April 2010:   ~150€     HD 5770 1024 MiB            ~0%

(Shamelessly ripped from here)


Looking at it from this perspective, we should have something faster than the GTX 260 896MiB for €150 by now, and I don't really see the skewed value proposition.
 
That's what I was thinking, too, but looking at the numbers it becomes apparent that we're simply stagnating, and the GTX260's price was actually nothing out of the ordinary when looking at it from a historical point of view:

Code:
Date          Price     Videocard                   Performance improvement

April 2002:   ~150€     GeForce 4 MX-440
April 2003:   ~150€     9600 Non-Pro                ~100%
April 2004:   ~150€     5700 Ultra/9600 XT          ~50%
April 2005:   ~150€     6600 GT/9800 Pro            ~100%
April 2006:   ~150€     7600 GT/X1600 XT            ~100%
April 2007:   ~150€     X1950 Pro 512 MiB           ~100%
April 2008:   ~150€     8800 GT 512 MiB             ~125%
April 2009:   ~150€     HD4870/GTX 260 896 MiB      ~60%
April 2010:   ~150€     HD 5770 1024 MiB            ~0%

(Shamelessly ripped from here)


Looking at it from this perspective, we should have something faster than the GTX 260 896MiB for €150 by now, and I don't really see the skewed value proposition.

Except that the odd cards out in that example are the 4870 (top-level card from ATI) and the GTX 260 (salvage part of the top-end card from Nvidia).

All the other cards were actually made for the segment between 150 and 250 USD (adjusted for inflation in the case of older cards). The GTX 260 was made for the 400 USD segment (launched at 399 but was soon drastically cut due to the 4870/4850), and the 4870 was made for the 300+ USD segment (intended at 329 USD, but before launch ATI was already trying to undercut Nvidia and launched at 299 USD).

Your example illustrates exactly the point I was making. The price war between ATI and Nvidia when ATI launched RV770 was pretty much unprecedented in the history of GPUs.

It basically took the top cards of both manufacturers' lineups and brought them down to budget price levels (the 4870 hit the sub-100 USD price point); I can't remember how far the original GTX 260 dropped.

I don't think either ATI or Nvidia wants to repeat that.

Regards,
SB
 
http://www.semiaccurate.com/2010/07/12/nvidia-backpedals-gf100gtx480-underfill/

I just noticed that coincidentally on the day of the GTX 460's release Charlie is bringing up Bumpgate in association with the GTX 470 and 480. Specifically that nVidia went back to the problematic Namics 8439-1 again. Is there any independent confirmation of this and whether other design changes were made to mitigate concerns? I haven't read any reports of GTX 470/480 chip failures, although admittedly it's still early days. Presumably any bump problems would show up on the GTX 480M first since laptops see more thermal cycling from being off and on.

Interestingly, the GTX 460 wasn't roped into the story, so presumably that is safe?

Well, GF104 draws about half the power of GF100 with about 70% of the die size (according to rumors). So I don't think it should have the same problems as GF100, and it probably uses the "right" underfill.
 
Given the efficiencies ATI reaches with their 5-way VLIW, it doesn't seem like the compiler is facing a huge challenge in extracting an ILP of 2 on ALU bound shader code. I also wonder whether the schedulers factor ILP into the decision of which warps to schedule, or whether that's something handled solely at the dispatcher level.

Anand's piece seemed to suggest that the ILP extraction happens at runtime, not compile time.
 
Anand's piece seemed to suggest that the ILP extraction happens at runtime, not compile time.

Since it doesn't look like it has out-of-order execution, I think the ILP needs to be extracted by a compiler at some stage (e.g. from source code to cubin, or from cubin to some internal form).
 
Since it doesn't look like it has out-of-order execution, I think the ILP needs to be extracted by a compiler at some stage (e.g. from source code to cubin, or from cubin to some internal form).

Nvidia told me that the ISA is the same and for GF104 the compiler just tries to be more careful about instruction ordering.
 
Since it doesn't look like it has out-of-order execution, I think the ILP needs to be extracted by a compiler at some stage (e.g. from source code to cubin, or from cubin to some internal form).

An in-order CPU can also dual-issue instructions, like the Pentium/Atom.
 
If the compiler is doing "re-ordering" for GF104, that implies a kind of "soft" pair-wise vectorisation, e.g. within a 2-instruction window.

In other words if instructions 37 and 39 can dual-issue, then these instructions need to be paired by the compiler as 37 and 38 (or 38 and 39).

It might be that the window is longer, e.g. 4 instructions, so the compiler wouldn't have to do anything in this case. It would only need to re-order if the instructions were further apart.
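As a source-level illustration of that pairing idea (my own hypothetical example): if the code contains two independent dependency chains, the compiler can interleave them so every instruction sits next to one it doesn't depend on - exactly the adjacency a short pairing window would need.

Code:
__global__ void interleaved_chains(float *out, const float *in, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];
        float b = in[i] + 1.0f;
        // Interleaving the two multiply chains keeps an independent instruction
        // adjacent to every dependent one; emitting chain a completely before
        // chain b would leave nothing to pair within a 2-instruction window.
        for (int s = 0; s < 8; ++s) {
            a = a * k;            // chain a
            b = b * k;            // chain b, independent of chain a
        }
        out[i] = a + b;
    }
}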
 
An in-order CPU can also dual-issue instructions, like the Pentium/Atom.

Sure, but they need some help from the compiler. Otherwise, if the compiler just generates a lot of dependent instructions, an in-order CPU can't pair them.

Jawed said:
It might be that the window is longer, e.g. 4 instructions, so the compiler wouldn't have to do anything in this case. It would only need to re-order if the instructions were further apart.

This is possible, but I think the hardware would be too complex. It may even have no dependency-check logic at all and rely on the internal compiler to do the dependency checking.
 