NVIDIA GF100 & Friends speculation

In Fermi there's varying instruction-throughput and latency, i.e. SP-MAD versus DP-MAD versus RCP versus more complex instructions that run on SFU (which actually appears to be the multifunction interpolator of old, since it also does interpolations).

That complexity could be tackled either by the compiler, by the scheduler/issuer or some combination of the two.

I suppose it's fair to say that soft-vectorisation enjoys the same options - so overall we're none the wiser.
 
It may even have no dependency-check logic and rely on the internal compiler to do the dependency checking.
So the compiler would need to tag instructions according to whether they depend on the previous one? I guess that would be doable, but since Tridam is saying the ISA is the same, it's apparently done in hw. If that's the case, another question is how complex this is - if the logic is simple it might generate false positives, i.e. the hw thinks there's a dependence when there actually isn't.
 
So the compiler would need to tag instructions according to whether they depend on the previous one? I guess that would be doable, but since Tridam is saying the ISA is the same, it's apparently done in hw. If that's the case, another question is how complex this is - if the logic is simple it might generate false positives, i.e. the hw thinks there's a dependence when there actually isn't.

Since NVIDIA has never exposed its internal instruction sets, I think it's still possible to do that in software, as long as the ptx/cubin is the same.
 
If the compiler is doing "re-ordering" for GF104, that implies a kind of "soft" pair-wise vectorisation, e.g. within a 2-instruction window.

In other words if instructions 37 and 39 can dual-issue, then these instructions need to be paired by the compiler as 37 and 38 (or 38 and 39).
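
A minimal sketch of the kind of pairing pass that implies, assuming a toy instruction format with explicit source/destination registers. The format, register names and dependence test are all illustrative, not NVIDIA's actual ISA or compiler:

```python
# Toy 2-wide pairing pass: reorder a straight-line block so that
# independent instructions sit in adjacent issue slots.
# Simplified dependence test: RAW and WAW only.

def depends(a, b):
    """True if instruction b reads or overwrites the result of a."""
    return a["dst"] in b["src"] or a["dst"] == b["dst"]

def pair_schedule(block):
    scheduled = []
    pending = list(block)
    while pending:
        first = pending.pop(0)
        scheduled.append(first)
        # Look ahead for the nearest instruction independent of `first`
        # that can be hoisted past everything between it and `first`.
        for i, cand in enumerate(pending):
            if depends(first, cand):
                continue
            if any(depends(earlier, cand) or depends(cand, earlier)
                   for earlier in pending[:i]):
                continue  # can't be hoisted over its own dependences
            scheduled.append(pending.pop(i))
            break
    return scheduled

block = [
    {"op": "mad", "dst": "r0", "src": ["r1", "r2", "r3"]},  # "37"
    {"op": "add", "dst": "r4", "src": ["r0", "r5"]},         # "38", depends on 37
    {"op": "mul", "dst": "r6", "src": ["r7", "r8"]},         # "39", independent
]
for ins in pair_schedule(block):
    print(ins["op"], ins["dst"])
# 37 and 39 come out adjacent, so they can dual-issue; 38 follows.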
Is there a reason why this is any different than what a compiler would do for a CPU when scheduling for an in-order superscalar processor? It's just instruction scheduling in that case.

It might be that the window is longer, e.g. 4 instructions, so the compiler wouldn't have to do anything in this case. It would only need to re-order if the instructions were further apart.
That would indicate the design is out-of-order. Without a significant revamp of the architecture, it would very strongly indicate neither GF104 nor GF100 has that capability (amongst other indicators).
The need for static scheduling indicates GF104 cannot, and the likelihood of a massive revamp of the issue hardware being small means that GF104's progenitor did not have that trait to pass on to its offspring.
 
Unless NVIDIA decided to bite the bullet and go for a dual-ported register file, which is insane, their claims are absurd. The fact that instructions are independent is a minor problem at best compared to the impossibility of, you know, getting any registers whatsoever to the third ALU. Simply adding more banks won't fix it. Or are we to believe the hardware is now so intelligent that it can handle arbitrary register bank conflicts at runtime, even in the extreme case where half of the registers conflict with the entire other half? What next, a fixed-function Strong AI to handle scheduling?

I can imagine many different solutions, but I'm not sure I care enough to speculate further about this with no data to judge it on and no board. And some of those solutions would assume what we know about GF100 is not even correct in the first place, so that'd be an awful lot of thinking and testing to do. Ah, it's obvious I'll know sooner by following my current course of action than by pondering this further. Sorry if it looks like I'm rambling, I'll shut up now :)
 
Why would the bank-conflict issue be a problem with 48 units and not when there were 32?
The operand collectors are supposed to handle such conflicts as they arise.

In the case where the banking is not changed, the collectors could theoretically be used to opportunistically supply operands to the third SIMD whenever the code mix doesn't consist of FMAs, since the register file was specced to supply the operands needed for full throughput 3-operand instructions in Fermi. The downside to this is a longer spool-up period when operands are being gathered, and an idling third SIMD if the code is full of FMAs.

One thing that might not be hidden in that case is the write-back bandwidth, which does seem to be better served by banking the register file more heavily than before. Possibly buffering at writeback or in the operand collectors could somehow save on bandwidth, but there would be situations where it could pile up and force an SM-wide stall if the number of write ports is smaller than the number of writes.
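
As a toy illustration of why conflicts cost cycles rather than being fatal, here's a sketch of an operand-gather loop with one read port per bank. The bank count and the striped register-to-bank mapping are pure assumptions for illustration:

```python
# Toy operand collector: registers are striped across banks, each bank
# serves one read per cycle, and the collector keeps retrying until all
# source operands of an instruction have been gathered.

NUM_BANKS = 4

def bank(reg):
    return reg % NUM_BANKS  # assumed striped mapping

def cycles_to_collect(src_regs):
    """Cycles needed to read all operands with one port per bank."""
    remaining = list(src_regs)
    cycles = 0
    while remaining:
        used_banks = set()
        still_waiting = []
        for r in remaining:
            if bank(r) in used_banks:
                still_waiting.append(r)  # conflict: retry next cycle
            else:
                used_banks.add(bank(r))  # this bank serves r this cycle
        remaining = still_waiting
        cycles += 1
    return cycles

print(cycles_to_collect([1, 2, 3]))  # 3 distinct banks -> 1 cycle
print(cycles_to_collect([1, 5, 9]))  # all map to bank 1 -> 3 cycles
```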
 
Is there a reason why this is any different than what a compiler would do for a CPU when scheduling for an in-order superscalar processor? It's just instruction scheduling in that case.
Yeah, I can't see any difference. Which chips/compilers do this routinely?

That would indicate the design is out-of-order. Without a significant revamp of the architecture, it would very strongly indicate neither GF104 nor GF100 has that capability (amongst other indicators).
The need for static scheduling indicates GF104 cannot, and the likelihood of a massive revamp of the issue hardware being small means that GF104's progenitor did not have that trait to pass on to its offspring.
The issue rate is pretty poor in the architecture as a whole, yes. But then load/store probably doesn't have the throughput anyway (running at core clock and half-warp wide). Much the same as SFU doesn't have the throughput, either.

So not a big deal.
 
Unless NVIDIA decided to bite the bullet and go for a dual-ported register file, which is insane, their claims are absurd.
How many banks does the register file have?

The fact that instructions are independent is a minor problem at best compared to the impossibility of, you know, getting any registers whatsoever to the third ALU. Simply adding more banks won't fix it.
All along I've been querying register file bandwidth. But even if MAD+MAD+MAD+store isn't possible (with 10 independent operands) a mix with fewer independent operands should be possible.

Or are we to believe the hardware is now so intelligent that it can handle arbitrary register bank conflicts at runtime, even in the extreme case where half of the registers conflict with the entire other half? What next, a fixed-function Strong AI to handle scheduling?
In the prior architecture the compiler's job included bank-allocation shaping (wide or tall) to fit the usage of registers. I can't think of any reason why that isn't being done in Fermi - though no reason to believe it's definitely in there, either.

I can imagine many different solutions, but I'm not sure I care enough to speculate further about this with no data to judge it on and no board. And some of those solutions would assume what we know about GF100 is not even correct in the first place, so that'd be an awful lot of thinking and testing to do. Ah, it's obvious I'll know sooner by following my current course of action than by pondering this further. Sorry if it looks like I'm rambling, I'll shut up now :)
I linked a pile of ixbt's shader benchmarks but I just can't be bothered to dig into them.

Aside from that, with each major new chip NVidia says "near perfect ALU utilisation" then the next chip has "much better ALU utilisation". This is, erm, at least the third time now since G80.

I'm loving the irony of the de-facto explicit 2-way vectorisation that's being done with GF104. Subject, of course, to the vagaries of a register file that wasn't designed for it, probably.

Is this going to be as treacherous as the "missing MUL"? Will anyone really care?

Jawed
 
In the case where the banking is not changed, the collectors could theoretically be used to opportunistically supply operands to the third SIMD whenever the code mix doesn't consist of FMAs, since the register file was specced to supply the operands needed for full throughput 3-operand instructions in Fermi. The downside to this is a longer spool-up period when operands are being gathered, and an idling third SIMD if the code is full of FMAs.
I think that they're acting merely as crossbars, lane-swizzling and slewing operands over the 2 cycles that an instruction needs them.

Like ATI's operand collector: only handling operands for the instruction that's about to be issued - 12 operands over 4 cycles (i.e. 48x 32-bits).

Each MAD SIMD would have an independent operand collector and there'd be one for the SFU SIMD and another for the load/store SIMD.

Of course, that's just a theory.
 
Yeah, I can't see any difference. Which chips/compilers do this routinely?

If a chip is in-order, an optimization such as arranging instructions so that neighboring instructions do not have dependences is routine.
The calculation can go further and take into account instruction latencies so that a consumer is not stalled waiting for a forwarded result, and that can apply even to scalar in-order pipelines.
Compilers like GCC have far more options and do much more than that, though.

They also focus on decode and issue restrictions, which can be important even for OOE chips, and then there's a mass of other optimizations beyond just dependence analysis.
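
For illustration, a sketch of that latency-aware idea: a greedy list scheduler that issues whatever is ready and lets independent work fill a long-latency producer's shadow. The latency numbers are made up:

```python
# Greedy latency-aware list scheduler for a single-issue in-order
# pipeline: each cycle, issue the first instruction whose operands
# have finished; otherwise stall.

LATENCY = {"mul": 4, "add": 2, "load": 8}  # illustrative latencies

def list_schedule(block):
    """block: list of (name, op, deps). Returns (cycle, name) pairs."""
    done_at = {}  # name -> cycle its result becomes available
    pending = list(block)
    schedule, cycle = [], 0
    while pending:
        for i, (name, op, deps) in enumerate(pending):
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                schedule.append((cycle, name))
                done_at[name] = cycle + LATENCY[op]
                pending.pop(i)
                break
        else:
            cycle += 1  # nothing ready: a real machine would stall here
            continue
        cycle += 1
    return schedule

block = [
    ("a", "load", []),
    ("b", "add",  ["a"]),  # consumer of the load
    ("c", "mul",  []),     # independent work
    ("d", "mul",  []),     # independent work
]
print(list_schedule(block))
# [(0, 'a'), (1, 'c'), (2, 'd'), (8, 'b')] - the independent muls are
# issued while the load's latency elapses, so the dependent add isn't
# reached until its operand is ready.
```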
 
If a chip is in-order, an optimization such as arranging instructions so that neighboring instructions do not have dependences is routine.
Moderately expensive both in hardware and in the compiler. 2-way ILP shouldn't be too troublesome though. Easier than 5-way. Though register bandwidth is still a thorny question.

The calculation can go further and take into account instruction latencies so that a consumer is not stalled waiting for a forwarded result, and that can apply even to scalar in-order pipelines.
Compilers like GCC have far more options and do much more than that, though.

They also focus on decode and issue restrictions, which can be important even for OOE chips, and then there's a mass of other optimizations beyond just dependence analysis.
I took all that kind of stuff as-read in the "ultimate case" for GF104, since 4-issue requires at least one issue to LD/ST or SFU.

Maybe they are just peep-hole optimising for the low-hanging fruit?
 
All along I've been querying register file bandwidth. But even if MAD+MAD+MAD+store isn't possible (with 10 independent operands) a mix with fewer independent operands should be possible.
Interestingly, if 3 MADs aren't possible, all the quoted gflops numbers would be wrong...
FWIW, the Intel Gen X (i965 and newer) GMAs can't do MAD at all. The architecture does, however, have a special accumulator register so you can do a MAC operation. Clearly that's, among other things, for the sake of simpler register fetch hardware (the hw has fixed-size instructions, so it saves instruction cache too). The hardware has some special instructions which use more than 2 source registers, but these are in the vector-arithmetic group, not the parallel-arithmetic group - either using implied registers or certain subregisters - but in any case the hw clearly cannot fetch the number of operands needed for a MAD from the GRF.
I think MAD always poses interesting tradeoffs for everyone - it's an odd instruction, since it's the only "normal" ALU instruction using 3 operands, so for simpler instruction decoding / register fetch you don't want it. OTOH it's an important instruction, so you can't just ignore it - plus the ALU has mul and add anyway, so it's free there. So the compromise of not being able to issue MADs on all your execution resources might make sense - we know it's true for AMD, and it should be possible to write a simple test for NVIDIA to figure it out.
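
To make the register-fetch saving concrete, a back-of-envelope count of GRF source reads per instruction under an implied-accumulator scheme like the one described above (the counts simply follow from the instruction forms):

```python
# GRF read ports needed per instruction when the accumulator is implied
# (an implied accumulator costs no general-register read):
#
#   full MAD:            d   = a*b + c   -> 3 GRF source reads at once
#   accumulator scheme:  acc = c         -> 1 GRF read
#                        acc = acc + a*b -> 2 GRF reads (acc is implied)
#
# So the fetch hardware never needs more than 2 GRF reads per
# instruction, at the cost of an extra instruction and the accumulator.

def peak_grf_reads(per_instruction_reads):
    return max(per_instruction_reads)

print(peak_grf_reads([3]))     # single MAD -> 3
print(peak_grf_reads([1, 2]))  # mov-to-acc then MAC -> 2
```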
 
Interestingly, if 3 MADs aren't possible, all the quoted gflops numbers would be wrong...
A few shared operands would make it possible. That's how ATI can do five MADs per cycle, by having some operands shared (or, erm, with literals, now I think about it).
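
A quick back-of-envelope, using the 12-operands-per-instruction figure quoted earlier in the thread for the operand collector (taking that as the register file's supply limit is the assumption here), shows how many of the 15 MAD sources have to be shared or literal:

```python
# VLIW5 worth of MADs versus register file supply, per instruction group
mads           = 5
srcs_per_mad   = 3
needed         = mads * srcs_per_mad       # 15 source operands
register_reads = 12                        # RF supply quoted upthread
shared_or_lits = needed - register_reads   # 3 must be shared/literals
print(needed, register_reads, shared_or_lits)  # 15 12 3
```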

I think MAD always poses interesting tradeoffs for everyone - it's an odd instruction, since it's the only "normal" ALU instruction using 3 operands, so for simpler instruction decoding / register fetch you don't want it.
Yeah, 3 operands are pretty problematic. There's an Intel blogger discussion I stumbled across about the uncertainties of implementing MAD/FMA in AVX. Can't find it now. Basic questions were still up in the air, not that long ago...

Register file bandwidth of MAD is slightly less onerous in a GPU if you have to implement import/export (gather/scatter) and texturing (sending addresses to TMU and receiving results) anyway. In ATI the ALUs can only consume 75% of the read bandwidth of the register file.

OTOH it's an important instruction, so you can't just ignore it - plus the ALU has mul and add anyway, so it's free there.
Well, FMA and MAD are distinct instructions and they can't be used interchangeably. ATI appears to only do FMA in DP-capable GPUs (alongside MAD, but only on XYZW), while Fermi appears to only do FMA.

So the compromise of not being able to issue MADs on all your execution resources might make sense - we know it's true for AMD, and it should be possible to write a simple test for NVIDIA to figure it out.
Even if 3-way MAD is heavily limited by operand bandwidth, freedom to issue SFU and LD/ST alongside 2x MAD is still a benefit.

ATI does LD/ST for a different hardware thread, in the same way as texturing is done for a separate thread.
 
3-operand is more troublesome in a fixed-length ISA than a variable-length one. x86 is already a bloody mess, so 3-operand isn't exactly going to be ugly relative to the rest of the ISA. There are, however, microarchitectural issues at play.
 
A few shared operands would make it possible. That's how ATI can do five MADs per cycle, by having some operands shared (or, erm, with literals, now I think about it).
Oh, I never thought about this - so AMD's gflops are a bit cheated too, just like with the missing mul. Possible in theory, pretty much impossible in practice...
Yeah, 3 operands are pretty problematic. There's an Intel blogger discussion I stumbled across about the uncertainties of implementing MAD/FMA in AVX. Can't find it now. Basic questions were still up in the air, not that long ago...
I thought it was pretty much decided a long time ago Sandy Bridge would use FMA with 3 operands, but no separate destination (so one of the operands is overwritten). I'm not sure exactly how this helps the hardware, since you still need to fetch 3 operands, but apparently it is easier to implement.
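
For illustration, here's the destructive 3-operand form on a toy register machine - it still fetches three values, but the encoding only has to name three registers, since the destination reuses a source specifier (names and values are made up):

```python
# Destructive 3-operand FMA: the destination doubles as a source, so
# only three register specifiers are encoded even though three values
# are read and one is written.

regs = {"r0": 2.0, "r1": 3.0, "r2": 4.0}

def fma3(d, s1, s2):
    # reads d, s1, s2 (3 reads), writes d (overwriting a source)
    regs[d] = regs[d] * regs[s1] + regs[s2]

fma3("r0", "r1", "r2")
print(regs["r0"])  # 2*3 + 4 = 10; the original r0 value is gone
```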
 
Oh, I never thought about this - so AMD's gflops are a bit cheated too, just like with the missing mul. Possible in theory, pretty much impossible in practice...
:LOL: Something like matrix multiplication has no trouble issuing 5 MADs per cycle. The bottleneck is getting data into the registers fast enough, not the MADs. Oh and the compiler, which really struggles with, say, 32-way ILP.
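
A sketch of why matmul exposes that ILP so easily: unrolling the inner loop over five independent accumulators gives five multiply-adds per step, with no result feeding another (sizes and names are illustrative):

```python
# Five independent accumulators expose five independent MADs per step -
# the kind of ILP a VLIW5 compiler can pack into one instruction group.

def matmul_row(a_row, b, n, k):
    """One output row of C = A*B, with the j-loop unrolled by 5."""
    c_row = [0.0] * n
    for p in range(k):
        a = a_row[p]
        for j in range(0, n, 5):
            # five independent MADs: no result feeds another this step
            c_row[j+0] += a * b[p][j+0]
            c_row[j+1] += a * b[p][j+1]
            c_row[j+2] += a * b[p][j+2]
            c_row[j+3] += a * b[p][j+3]
            c_row[j+4] += a * b[p][j+4]
    return c_row

# n must be a multiple of 5 in this sketch
print(matmul_row([1.0, 2.0], [[1.0]*5, [2.0]*5], n=5, k=2))  # [5.0]*5
```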

I thought it was pretty much decided a long time ago Sandy Bridge would use FMA with 3 operands, but no separate destination (so one of the operands is overwritten). I'm not sure exactly how this helps the hardware, since you still need to fetch 3 operands, but apparently it is easier to implement.
Hmm, it wasn't a blog post and I think this is where I read it:

http://software.intel.com/en-us/forums/showthread.php?t=61121

Lots of juicy stuff in there. From the horse's mouth.
 