Thanks for linking these patents, Marco!
I know you pointed me to them earlier, but I'll admit I hadn't read the one on shared registers, and that's a really smart idea IMO. I wonder what their average savings in real-world shaders are...
The throughput of the 2nd MUL (which is performed on the SFU) on G80/G83 is 4 cycles, and 1 cycle on G86.
As indicated in that patent, there are two theoretical limitations:
- If the batch size (32 for PS, 16 for VS) is equal to the warp size (16), then either the ALU or the SFU will idle. If the batch size is twice the warp size, then the scheduler has enough time to co-issue both instructions.
- Only four register operands may be collected at a time, and these need to be shared between the ALU and the SFU.
G80 seems to have serious limitations regarding reusing registers, and I doubt that's just a driver limitation. G86, on the other hand, probably has some new hardware to reuse (cache? keep last used? FIFO?) registers more intelligently so that the two pipelines may be co-issued more efficiently (and easily).
I actually did get 1.5x the MUL rate on G80, with just "r0=r0*r0;r1=r1*r1;", see this thread:
http://forum.beyond3d.com/showpost.php?p=1011976&postcount=36
My conclusion was that the hardware could reuse the same register for the SFU twice, so that it only had to read r1 once there, but that it was not capable of reusing registers in the ALU, so it had to read r0 twice. This implied that three operands per processor had to be collected for the ALU, leaving only one per clock for the SFU/MUL. As such, the SFU/MUL could only issue one instruction every two clock cycles, resulting in 1.5x throughput.
ADD+MUL could also reach 1.5x if you were smart about it iirc (I think I did that, but I'm not sure anymore; it makes sense in theory anyway, as long as the compiler doesn't optimize anything, which is a big if...) but MUL+MADD would never, ever co-issue. Instead, the performance was identical to what you would expect if the MULs co-issued with other MULs, but never with MADDs. This is because the MADD collected 4 operands per clock cycle, no matter if they were all the same or not, so you never have any time left to fetch extra operands for the SFU.
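To make the arithmetic behind those 1.5x and 1.0x figures explicit, here is a rough Python sketch of the operand budget. It is only a toy model of my reasoning above, not a description of the real hardware: the function name `mul_rate` and the `sfu_reads=2` default are assumptions, the latter picked so that one spare operand slot per clock gives one SFU issue every two clocks.

```python
from fractions import Fraction

OPERANDS_PER_CLOCK = 4  # collector limit quoted from the patent


def mul_rate(alu_reads, sfu_reads=2):
    """Relative throughput (1 = ALU alone) under a shared operand budget.

    alu_reads: operand slots the ALU instruction takes each clock.
    sfu_reads: slots one SFU MUL needs in total (assumed, not measured).
    """
    spare = max(OPERANDS_PER_CLOCK - alu_reads, 0)
    # The SFU can issue at most once per clock, and only as fast as
    # the leftover operand slots allow.
    sfu_issue = min(Fraction(spare, sfu_reads), Fraction(1))
    return Fraction(1) + sfu_issue


print(float(mul_rate(3)))  # ALU instruction taking three slots
print(float(mul_rate(4)))  # MADD taking all four slots
```

With three slots going to the ALU, the model gives 1.5x; with a MADD eating all four, the SFU never gets to issue and you are stuck at 1.0x, which is the behaviour I measured.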
Considering the results on G86, my guess is they're implementing something to keep previously used registers around so they don't need to be refetched from the traditional register file so soon. For example, it would be perfectly possible to not write a result to the register file if it is used in the next instruction and never afterwards. Then you can just chain the two instructions and save TWO operand operations (one read, one write...)
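The chaining idea can be sketched by counting register file accesses for a short dependent sequence. This is purely illustrative: `regfile_accesses` and the three-instruction program are made up for the example, and I'm not claiming G86 actually works this way.

```python
def regfile_accesses(program, forwarding):
    """Count register-file (reads, writes) for a list of (dst, srcs) instructions.

    With forwarding, a result that is consumed by the very next instruction
    and never read again is chained directly: its write is skipped and so
    is the read in the next instruction.
    """
    reads = writes = 0
    chained = None  # result held outside the register file for one instruction
    for i, (dst, srcs) in enumerate(program):
        # Sources served by the chained value cost no register-file read.
        reads += sum(1 for s in srcs if s != chained)
        nxt = program[i + 1][1] if i + 1 < len(program) else ()
        later = [s for _, ss in program[i + 2:] for s in ss]
        if forwarding and dst in nxt and dst not in later:
            chained = dst  # skip the write; the next read is skipped too
        else:
            chained = None
            writes += 1
    return reads, writes


# Hypothetical dependent chain: r2 = r0*r1; r3 = r2+r0; r4 = r3*r1
program = [("r2", ("r0", "r1")), ("r3", ("r2", "r0")), ("r4", ("r3", "r1"))]
print(regfile_accesses(program, forwarding=False))
print(regfile_accesses(program, forwarding=True))
```

Each chained pair drops one read and one write, so on this sequence the total goes from nine register file accesses to five, which frees up a lot of operand bandwidth for the second pipeline.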
There are a lot of possibilities however, so I won't try guessing which one it really is. Fact of the matter is, however, that it doesn't surprise me much that the MUL is fully exposed on G86: as it is, for the Pixel Shader on G80, it really does look like only a register file limitation. A smarter compiler could probably reach >= 1.0x MUL in a bunch of cases, but could still never dream of reaching G86's average MUL throughput.
P.S.: Do you know what I would like to see in G9x? MUL *or* ADD in the SFU. MADD might not be desirable for datapath reasons, but the ADD FPU itself should be very cheap. This certainly would make that unit even more interesting...
P.P.S.: Sorry for the long post, just running some benchmarks in the background and I thought it was a good time/thread to spill everything I've speculatively concluded about this so far, sooo...!