Some new NVIDIA patents, likely related to G8x

nAo

The first patent is about grouping and scheduling threads, and about how it's possible to save register space by sharing registers within a group of threads defined as "buddy threads".
At compile time, shaders are split into sub-blocks whose boundaries fall at long-latency operations (such as texturing ops); all registers used within a block that don't need to be saved at the end of that block to retain state later are then allocated once for all the buddy threads and shared.
System and method for grouping execution threads
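
To make the saving concrete, here's a toy calculation of the allocation scheme as I read the patent; the buddy count and register counts are made-up numbers, not anything from the patent:

Code:
/* Hypothetical illustration of the "buddy threads" register sharing:
 * registers live across a block boundary are allocated per thread,
 * while intra-block scratch registers are allocated once for the
 * whole buddy group. All the numbers below are made up. */
#include <stdio.h>

int main(void)
{
    const int buddies      = 4; /* threads sharing one allocation    */
    const int live_regs    = 3; /* live across a texture-op boundary */
    const int scratch_regs = 5; /* dead by the end of each block     */

    int naive  = buddies * (live_regs + scratch_regs);
    int shared = buddies * live_regs + scratch_regs;

    printf("naive: %d regs, shared: %d regs (%.0f%% saved)\n",
           naive, shared, 100.0 * (naive - shared) / naive);
    return 0;
}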

The second patent is obviously G8x-related and yes, the SFU can perform MULs :)
It contains details about thread and instruction scheduling:
System and method for processing thread groups in a SIMD architecture

Marco
 
Thanks for linking these patents Marco! :) I know you pointed me to them earlier, but I'll admit I hadn't read the one on shared registers, and that's a really smart idea IMO. I wonder what their average savings in real-world shaders are...

The throughput of the 2nd MUL (which runs on the SFU) is one instruction per 4 cycles on G80/G83, and one per cycle on G86 :)
As indicated in that patent, there are two theoretical limitations:
- If the batch size (32 for PS, 16 for VS) is equal to the warp size (16), then either the ALU or the SFU will idle. If the batch size is twice the warp size, then the scheduler has enough time to co-issue both instructions (a worked example follows this list).
- Only four register operands may be collected at a time, and these need to be shared between the ALU and the SFU.
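
Here's the first point as a back-of-envelope calculation (the batch/warp figures are the ones above; the "one issue slot per unit per pair of clocks" framing is my reading, not something the patent states outright):

Code:
/* If a batch occupies a unit for batch/warp scheduler clocks per
 * instruction, the scheduler can keep both the ALU and the SFU fed
 * only when that ratio is at least 2. */
#include <stdio.h>

static void check(int batch, int warp)
{
    int clocks = batch / warp; /* clocks one instruction occupies */
    printf("batch %2d, warp %d -> %d clock(s)/instruction: %s\n",
           batch, warp, clocks,
           clocks >= 2 ? "ALU and SFU can co-issue" : "one unit idles");
}

int main(void)
{
    check(16, 16); /* VS-style batch: no co-issue       */
    check(32, 16); /* PS-style batch: co-issue possible */
    return 0;
}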

G80 seems to have serious limitations regarding reusing registers, and I doubt that's just a driver limitation. G86, on the other hand, probably has some new hardware to reuse (cache? keep last used? FIFO?) registers more intelligently so that the two pipelines may be co-issued more efficiently (and easily).

I actually did get 1.5x the MUL rate on G80, with just "r0=r0*r0;r1=r1*r1;", see this thread: http://forum.beyond3d.com/showpost.php?p=1011976&postcount=36

My conclusion was that the hardware could reuse the same register for the SFU twice, so that it only had to read r1 once there, but that it was not capable of reusing registers in the ALU, so it had to read r0 twice. This implied that three operands per processor had to be collected for the ALU, leaving only one per clock for the SFU/MUL. As such, the SFU/MUL could only issue one instruction every two clock cycles, resulting in 1.5x throughput.
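
In other words (this is just my inference above restated as arithmetic, not confirmed hardware behaviour):

Code:
/* Operand budget argument for the observed 1.5x MUL rate. Reads and
 * writes are both counted as 'collected' operands, per the patent. */
#include <stdio.h>

int main(void)
{
    const double budget  = 4.0; /* operand slots collected per clock */
    const double alu_mul = 3.0; /* r0 read twice + result write      */
    const double sfu_mul = 2.0; /* r1 read once (reused) + write     */

    double leftover = budget - alu_mul;   /* 1 slot/clock for the SFU */
    double sfu_rate = leftover / sfu_mul; /* 0.5 SFU MUL per clock    */

    printf("MUL throughput vs. ALU alone: %.1fx\n", 1.0 + sfu_rate);
    return 0;
}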

ADD+MUL could also reach 1.5x if you were smart about it, IIRC (I think I did that, but I'm not sure anymore; it makes sense in theory anyway, as long as the compiler doesn't optimize anything, which is a big if...) but MUL+MADD would never, ever co-issue. Instead, the performance was identical to what you would expect if the MULs co-issued with other MULs, but never with MADDs. This is because the MADD collected 4 operands per clock cycle, whether or not they were all the same, so there was never any time left to fetch extra operands for the SFU.

Considering the results on G86, my guess is they're implementing something to keep recent registers somewhere so they don't need to be refetched from the traditional register file again so soon. For example, it would be perfectly possible to not write results to the register file if they are used by the next instruction and never afterwards. Then you can just chain the two, and save TWO operand operations (one read, one write...)
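
A minimal sketch of that chaining idea, purely hypothetical (I'm not claiming this is how G86 works): if a result is consumed only by the very next instruction, forward it and skip one register-file write plus one read:

Code:
#include <stdbool.h>
#include <stdio.h>

/* Crude model: every MUL costs 2 reads + 1 write unless its producer
 * can be forwarded, which saves one read and one write. */
struct inst { int dst; int src[2]; bool dst_dead_after_next; };

static int rf_ops(const struct inst *prog, int n)
{
    int ops = 0;
    for (int i = 0; i < n; i++) {
        bool fwd = i > 0 && prog[i - 1].dst_dead_after_next &&
                   (prog[i].src[0] == prog[i - 1].dst ||
                    prog[i].src[1] == prog[i - 1].dst);
        ops += 2 + 1 - (fwd ? 2 : 0);
    }
    return ops;
}

int main(void)
{
    /* r2 = r0*r1; r3 = r2*r2 -- r2 is never used again */
    struct inst prog[] = { { 2, {0, 1}, true  },
                           { 3, {2, 2}, false } };
    printf("register-file ops: %d (6 without forwarding)\n",
           rf_ops(prog, 2));
    return 0;
}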

There are a lot of possibilities however, so I won't try guessing which one it really is. Fact of the matter is, however, that it doesn't surprise me much that the MUL is fully exposed on G86: as it is, for the Pixel Shader on G80, it really does look like only a register file limitation. A smarter compiler could probably reach >= 1.0x MUL in a bunch of cases, but could still never dream of reaching G86's average MUL throughput.

P.S.: Do you know what I would like to see in G9x? MUL *or* ADD in the SFU. MADD might not be desirable for datapath reasons, but the ADD FPU itself should be very cheap. This certainly would make that unit even more interesting...

P.P.S.: Sorry for the long post, just running some benchmarks in the background and I thought it was a good time/thread to spill everything I've speculatively concluded about this so far, sooo...! :)
 
Why would a MAD need to collect four operands, Arun?
I'm using 'collect' in the loose sense of the word (results need to be 'collected' too), because that's actually how the patent uses it:
Patent said:
For each cycle of the T clock, each of the operand collection units 216, 218 is able to collect 16 sets of operands.
Patent said:
Thus, each of the accumulators 226, 228 collects 16 sets of execution results before it writes back to the register file 214.
I would thus presume that the G80's register files are single-ported and share the same ports for read and write operations.
 
Arun said:
I'm using 'collect' in the loose sense of the word (results need to be 'collected' too), because that's actually how the patent uses it:
And I was referring to operands. The operand collection units collect 16 sets of operands per T clock, which corresponds to 8 sets per H clock. Each set being 2-3 operands for the MAD unit and 1 for the SFU (2 for MUL is not mentioned).

The accumulators combine two 8-wide results over two H clocks to one 16-wide result per T clock.

So the registers are 16-wide and you can read up to four and write two per T clock.
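
Taking those figures at face value, the raw register-file bandwidth per T clock would be (my arithmetic, not the patent's):

Code:
/* 16-wide registers of 32-bit values, up to 4 reads and 2 writes
 * per T clock, as stated above. */
#include <stdio.h>

int main(void)
{
    const double set_bits = 16 * 32; /* one 16-wide register set */

    printf("reads : %.0f bytes/T clock\n", 4 * set_bits / 8);
    printf("writes: %.0f bytes/T clock\n", 2 * set_bits / 8);
    return 0;
}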

Arun said:
I would thus presume that the G80's register files are single-ported and share the same ports for read and write operations.
I don't see any indication of that in the patent.
 
http://courses.ece.uiuc.edu/ece498/al/lectures/lecture8-9-hardware.ppt

Slide 24:

- Scalar MAD pipe: FMUL, FADD, FMAD; integer ops, conversions; one instruction/clock
- Scalar SFU pipe: RCP, RSQ, LG2, EX2, SIN, COS; one instruction/4 clocks; also supports FMUL, MOV
- TEX pipe: external to SM, shared by all SMs in a TPC
- LD/ST pipe: thread register spill to memory, used for indexable registers; CUDA has both global and local memory access through LD/ST

:)

others:

http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
 
Arun said:
G86, on the other hand, probably has some new hardware to reuse (cache? keep last used? FIFO?) registers more intelligently so that the two pipelines may be co-issued more efficiently (and easily).
Why not just a different configuration of the stated operand fetch hardware, given there's just one cluster to keep working?
 
Xmas said:
And I was referring to operands.
And so am I, but my definition of 'operand' is once again more loose than yours, because that's how I am interpreting the patent's use of the term 'operand'. Here is a further illustration that they consider *results* from instructions as being 'operands':
Patent said:
In step 616, the collected operands are advanced down the execution pipeline and operated on by multiple pipe stages of the issued instruction. Steps 614 and 616 are carried out continuously until all of the operands collected in step 612 have exited the execution pipeline. While steps 614 and 616 are being carried out, the accumulators 226, 228 collect operands exiting the execution pipelines 222, 224 and write back to the register file 216 every other H clock (i.e., half convoy at a time).
You could interpret that as implying that they actually 'remove' the data from the register file when it is read, and then write back all the read operands when the execution pipelines are done. That seems wasteful in the extreme, however, IMO, so unless I'm missing something...
Xmas said:
I don't see any indication of that in the patent.
Indeed, that is only my conclusion based on the above premise: if they can 'collect' 16 operands per clock cycle, and this represents read+write and not just read, then it implies that the register file has a shared read+write port, or that there is another limitation resulting in identical behaviour.

Of course, if you disagree with that premise and believe that 'collecting' 16 operands only refers to reading in that patent, then the patent doesn't imply anything whatsoever on the subject, and my point is moot. The primary quote going against my premise is this, as far as I can tell:
Patent said:
Typically a set of operands associated with an instruction of the MAD type includes two or three operands, and a set of operands associated with an instruction of the SFU type includes one operand.
However, in my mind, this sentence is put in the context of the first phase of collection, which is reading. That doesn't mean it is the only phase of collection for an instruction. And it doesn't mean that 3 operands for MADD + 1 operand for the SFU cannot be a limitation. And clearly, that '1 operand for the SFU' doesn't even consider the case where it is working on a FMUL. Of course, I could be wrong on all this, and your interpretation of the patent might be the correct one.

Rys said:
Why not just a different configuration of the stated operand fetch hardware, given there's just one cluster to keep working?
Sure, but that seems more expensive to me, and probably not much less efficient. Who knows, though...
 
Arun said:
Indeed, that is only my conclusion based on the above premise: if they can 'collect' 16 operands per clock cycle, and this represents read+write and not just read, then it implies that the register file has a shared read+write port, or that there is another limitation resulting in identical behaviour.
Again, I don't see how you come to that conclusion when the requirements are 4 reads and 2 writes per T cycle.

Arun said:
However, in my mind, this sentence is put in the context of the first phase of collection, which is reading. That doesn't mean it is the only phase of collection for an instruction. And it doesn't mean that 3 operands for MADD + 1 operand for the SFU cannot be a limitation. And clearly, that '1 operand for the SFU' doesn't even consider the case where it is working on a FMUL. Of course, I could be wrong on all this, and your interpretation of the patent might be the correct one.
It's the only phase of collection from the register file. Clearly 1 operand for the SFU is a limitation for MUL, plus there could certainly be other limitations like bank conflicts.
 
Xmas said:
Again, I don't see how you come to that conclusion when the requirements are 4 reads and 2 writes per T cycle.
The requirements for peak throughput are indeed 4 reads and 2 writes. However, I'm not aware of anyone who has ever managed to hit this peak throughput on G80, and I certainly haven't. Furthermore, similar register file limitations have plagued G71, G70, NV40, NV35 and, IIRC, NV30 too.

NVIDIA's design philosophy clearly never has been to make sure their architecture could hit its peak throughput in all, let alone most, cases. This certainly *could* have changed for G80, but that doesn't mean it did. As such, I'm not sure what your point is. Or is there a sentence in the patent that says the architecture is capable of 4 reads and 2 writes simultaneously? If so, I must have missed it, heh.

Xmas said:
It's the only phase of collection from the register file.
Indeed, but once again, I don't think that is being pointed out explicitly in the patent:
Patent said:
For each cycle of the T clock, each of the operand collection units 216, 218 is able to collect 16 sets of operands.
If there is a sentence that says it is actually capable of simultaneously fetching 16 sets of operands and writing 8, then once again, I must have missed it and I'd love to be pointed to it.

Xmas said:
Clearly 1 operand for the SFU is a limitation for MUL, plus there could certainly be other limitations like bank conflicts.
Well, if the read operand could be reused for r0*r0 (which is a special case, I will admit) and the writing was separate, then I should be able to hit 2.0x the MUL throughput of the main ALU, whereas I can't get more than 1.5x. Of course, it might not be able to do that, or the driver might not be exposing that, or there might indeed be other limitations such as bank conflicts, although I'm skeptical about the latter. And this also doesn't explain the 1.25x/1.33x cases very nicely...
 
Operands might also be constants (constant buffer). Or in the case of CUDA code, slots in PDC. So, not all operands need to come from the register file.

Jawed
 
Arun said:
The requirements for peak throughput are indeed 4 reads and 2 writes. However, I'm not aware of anyone who has ever managed to hit this peak throughput on G80, and I certainly haven't.
Depends on what you consider peak throughput. MAD + MUL would require 5r/2w. MUL + MUL would be 4r/2w, but also requires the SFU to be able to take two inputs per clock. If MAD + a 4-clock SFU op can be co-issued, it would be something like 3.25r/1.25w.

So what exactly is your suggestion? 2 read, 1 write, 1 shared?

I guess it's also complicated a bit by the latency between reads and writes for the same instruction pair.
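
For reference, here's where those figures come from: per-clock register traffic for a co-issued pair is the MAD's needs plus the SFU's needs divided by the SFU's issue interval (operand counts as stated above):

Code:
#include <stdio.h>

static void pair(const char *name, double mad_r, double mad_w,
                 double sfu_r, double sfu_w, double interval)
{
    printf("%-20s %.2fr / %.2fw per clock\n", name,
           mad_r + sfu_r / interval, mad_w + sfu_w / interval);
}

int main(void)
{
    pair("MAD + MUL",         3, 1, 2, 1, 1); /* 5.00r / 2.00w */
    pair("MUL + MUL",         2, 1, 2, 1, 1); /* 4.00r / 2.00w */
    pair("MAD + 4-clock SFU", 3, 1, 1, 1, 4); /* 3.25r / 1.25w */
    return 0;
}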
 
My suggestion is simply that the register file is a number of single-ported banks that together can read *or* write 4x16x32 bits per clock cycle per effectively 16-wide ALU, due to the double-pumping. The collector needs four clock cycles per instruction to get all the operands it might need and to write the result(s) from a previous instruction.

It seems to me that extending this to, say, 5x16x32 or 6x16x32 would require the warp size to be a multiple of that. I haven't really thought about it enough to be sure, though, so I could just be missing something obvious. Of course, extending it to 4x16x32 R+W instead of R or W wouldn't have that problem...
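
One speculative way those numbers could add up for a co-issued pair over a 32-wide, double-pumped batch (again, just a guess at the accounting, not something the patent spells out):

Code:
#include <stdio.h>

int main(void)
{
    const int sets_per_clock = 4;  /* 16 threads x 32 bits, R or W */
    const int clocks         = 4;  /* per instruction pair         */
    const int halves         = 2;  /* 32-wide batch = two 16-wide  */

    int budget = sets_per_clock * clocks;           /* 16 sets */
    int needed = halves * (3 + 1)   /* MAD: 3r + 1w per half     */
               + halves * (2 + 1);  /* SFU MUL: 2r + 1w per half */

    printf("budget %d sets, MAD+MUL needs %d -> %s\n", budget, needed,
           needed <= budget ? "fits" : "doesn't fit");
    return 0;
}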
 
I was going through some of the previous posts in other threads regarding the SFU and interpolator tasks.

Isn't the best way to do interpolation simply to determine the weights for each vertex once per pixel? It's 4 adds and 2 muls after that for each scalar that you want to interpolate. The idea of interpolating (A/w) and dividing by interpolated (1/w) seems silly to me.

Actually, looking at it this way, I can see why ATI wouldn't see NVidia's patent here as a big deal. You only need two MULs (less than an SF, right?), the two weights are always between 0 and 1 (maybe allowing multiplier savings), and all the values that need to be accessed are outside the usual pixel shader registers.

The last point seems to be the biggest, because data routing and registers seem to be a much bigger contributor to die size than the transistors specifically devoted to math. NVidia's multifunction interpolator is claimed to be around the area equivalent of 8000 full adders, so 128 of them is what, ~15M transistors? Even considering pipelining, separating out the special function unit seems like it would only add a couple of percent to the total die size.
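
The arithmetic behind that estimate, for what it's worth (the ~15 transistors per full-adder-equivalent is just the figure the estimate implies, not a number from any paper):

Code:
#include <stdio.h>

int main(void)
{
    const double units = 128, adders = 8000, t_per_adder = 15;
    printf("~%.1fM transistors\n", units * adders * t_per_adder / 1e6);
    return 0;
}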

Anyway, I'm gabbing on about an old patent. Carry on guys...
 
Isn't the best way to do interpolation simply to determine the weights for each vertex once per pixel?

You mean, like barycentrics?

The idea of interpolating (A/w) and dividing by interpolated (1/w) seems silly to me.
Well, if you don't care about distortions in your images, you can skip the division (and then multiplication) by w. Read up on perspective correction. The weights don't interpolate linearly in screen-space, so you need to correct for that.
 
No no, I mean perspective correct weights. The interpolated value is a linear combination of the values at each vertex. Once you calculate the weights, you use these same values for every attribute.

After reading more on the subject (especially patents) it looks like ATI/NVidia do indeed implement it in the way I'm talking about.

I just find it weird that people describe the operations in such a complicated way. In this paper, for example, equation 10 executed once per pixel and applied to the trivial equation 6 for any attribute seems so much more useful to me than equation 15. nAo was once suggesting to me that triangle setup is a task whose load depends on the number of attributes and pointed me to this paper. I don't see why you'd go through all that for each attribute when you just have to evaluate the edge function (only two components) in section 5.1 once per pixel to get the weights. No 3x3 vector-matrix multiplication per attribute is required.

"Interpolated (A/w) divided by interpolated (1/w)" implies you need to determine A/w for each vertex, then for each pixel you interpolate it linearly and multiply by the reciprocal of interpolated (1/w). I say just figure out weights a and b once per pixel, and then you just do a*(A1-A0)+b*(A2-A0)+A0 for attribute A, a*(B1-B0)+b*(B2-B0)+B0 for attribute B, etc. No per attribute cost for each vertex, and a smaller per pixel cost to boot.

The pair of weights (a,b) is found by having an imaginary attribute of U0 = (0, 0), U1 = (1/w1, 0), U2 = (0, 1/w2). Interpolate that, divide by interpolated (1/w), and you've got your weights, which are good for as many attributes as you want.
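
Putting the whole scheme together as code, in case it helps (variable names and the sample inputs are mine; e0/e1 are the screen-space barycentrics for vertices 1 and 2, rw0..rw2 are the per-vertex 1/w values):

Code:
#include <stdio.h>

struct weights { float a, b; };

/* Perspective-correct weights, computed once per pixel: interpolate
 * the imaginary attribute described above, divide by interpolated 1/w. */
static struct weights pixel_weights(float e0, float e1,
                                    float rw0, float rw1, float rw2)
{
    float rw = (1 - e0 - e1) * rw0 + e0 * rw1 + e1 * rw2;
    struct weights w = { e0 * rw1 / rw, e1 * rw2 / rw };
    return w;
}

/* Per attribute: a*(A1-A0) + b*(A2-A0) + A0 -- 4 adds, 2 muls. */
static float lerp_attr(struct weights w, float A0, float A1, float A2)
{
    return w.a * (A1 - A0) + w.b * (A2 - A0) + A0;
}

int main(void)
{
    struct weights w = pixel_weights(0.25f, 0.25f, 1.0f, 0.5f, 0.25f);
    printf("a=%.3f b=%.3f attr=%.3f\n", w.a, w.b,
           lerp_attr(w, 0.0f, 1.0f, 2.0f));
    return 0;
}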

Anyway, I'm really only complaining about the way people are describing this operation. Mathematically it's equivalent, but for hardware discussions it seems odd to describe it that way. I'm pretty sure all hardware does implement it in the way I mentioned.
 
No no, I mean perspective correct weights. The interpolated value is a linear combination of the values at each vertex. Once you calculate the weights, you use these same values for every attribute.
You are suggesting using barycentric coordinates. There are advantages and disadvantages to using those.
 
Can you explain any further than that?

I know linear interpolation can be reduced to a couple of ADDs per pixel, but that doesn't parallelize very well, and from what I've seen the IHVs agree (hence the interpolator and SF having similar calcs).

Reducing per attribute vertex load seems more important than a small constant pixel load. I wonder if the IHVs actually do differ in opinion here. It may explain differences in setup rate on the consoles, at least. Once upon a time additional attributes meant additional vertex shader time so the longer setup time had no impact. Now that the unified shaders scream through most vertex shaders, setup time matters again.
 