Question about a quote from the B3D G80 article, and nvidia FUD doc

On the topic of MUL through the SF pipeline, I think there must be something missing from the patent diagram Jawed posted. Assuming interpolation is done by evaluating plane equations as Bob hinted, and assuming you're doing perspective-correct interpolation (required by DX9, default in DX10 unless the shader chooses otherwise on a per-vector basis) you need to do for each pixel:

value = (A*x + B*y + C) * interp_w

The final multiply is for perspective correction. If interpolation throughput has been measured at 1 clock, then multiply-by-w must be done as part of the interpolation (in the SF unit) rather than as a second dependent operation in the MAD unit. Before I saw the SF diagram Jawed posted, I'd assumed it was this perspective-correction MUL that was being used for SF-MUL.

(Here's how perspective-correct interpolation works, for anyone not familiar with it:

The vertex shader outputs float4 position and whatever other attributes it wants. Position XYZ and all the other attributes are divided by the pos.w value before being used for clipping, rasterization, and plane equation setup. So evaluating plane equations actually gives you the interpolated value of attr/pos.w. This is multiplied by the interpolated pos.w value (what I called interp_w above) to get the final interpolated attr value. The interpolated pos.w is the reciprocal of the value obtained from evaluating a plane equation built from the per-vertex 1/pos.w.)
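As a toy illustration of the scheme just described (a minimal sketch; nothing here is hardware-specific, and the plane coefficients are made up):

```python
# Toy model of perspective-correct interpolation via plane equations.
# Setup builds one plane from per-vertex attr/pos.w and one from the
# per-vertex 1/pos.w; per pixel we evaluate both, reciprocate, multiply.

def eval_plane(plane, x, y):
    A, B, C = plane
    return A * x + B * y + C

def interpolate(attr_over_w_plane, one_over_w_plane, x, y):
    interp_w = 1.0 / eval_plane(one_over_w_plane, x, y)   # the RCP
    return eval_plane(attr_over_w_plane, x, y) * interp_w # the final MUL

# Example: 1/pos.w constant at 0.5 across the triangle (so w = 2.0),
# attr/pos.w plane evaluating to 3.0 at this pixel -> attr = 6.0.
print(interpolate((0.0, 0.0, 3.0), (0.0, 0.0, 0.5), 10.0, 20.0))  # 6.0
```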
 
If interpolation throughput has been measured at 1 clock, then multiply-by-w must be done as part of the interpolation (in the SF unit) rather than a second dependent operation in the MAD unit...
Not necessarily. While the dependent MUL executes in the MAD units it can already evaluate plane equations for another batch (thread). Every clock cycle you can get perspective correct interpolants, so there's a throughput of 1. You could also think of it as attaching the MAD unit pipeline to the SFU pipeline; it's a longer pipeline that does more work per clock.
 
Well spotted. This is what slide 10 says:

Flow of data for simple per-pixel perspective correct texture lookup and blending:
  • InterpAttr 1/w
  • RCP to form per-pixel w, needs to be ~1 ulp
  • InterpAttr S/w and T/w
  • Multiply S/w and T/w by per-pixel w to form S and T
  • Texture lookup based on S and T
    InterpAttr R/w, G/w, B/w
  • Multiply by per-pixel w to form per-pixel versions of R,G,B
  • Use FMAD’s to blend texture R,G,B with per-pixel interpolated attribute R,G,B...
Continued shading effects using FP operations
As far as I can tell this implies that rcp(1/w) will be produced, per pixel, at the start of the shader, and the result held in a temporary register for all further interpolations.
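A sketch of that implied prologue (the interp() helper and the plane values are hypothetical, just to make the data flow concrete):

```python
# Hypothetical per-pixel prologue per slide 10: interpolate 1/w,
# reciprocate it once, then reuse the per-pixel w for every attribute.

def pixel_prologue(interp):
    w = 1.0 / interp("1/w")   # RCP in the SFU, ~1 ulp, held in a register
    s = interp("S/w") * w     # perspective-correction MULs
    t = interp("T/w") * w
    # ...texture lookup with (s, t), then the same *w for R/w, G/w, B/w...
    return s, t

planes = {"1/w": 0.25, "S/w": 0.125, "T/w": 0.0625}
print(pixel_prologue(planes.get))  # (0.5, 0.25)
```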

armchair_architect, if you'd like the PDFs that this diagram comes from (they're not on the web anymore as far as I can tell) then PM me your email address. I marked up this diagram with the green and red to show the interpolation paths - sadly I missed a few bits here and there...

So, anyway, this has a bigger impact on ALU throughput than I was taking into account. Each interpolation results in a dependent Interpolate→MUL pair. Hmm. At least there's no need to schedule these instructions contiguously, but obviously if they're not scheduled this way, then the interpolation consumes an extra register.

The MUL presumably happens in the main MAD ALU, because the throughput in the SF unit would be too low (1/4 rate).

Jawed
 
I think you should try writing a MUL (or MAD)-only shader that interpolates a different attribute at each instruction. The SFU really will do a MUL at the rate of 1 per clock per thread, largely independent of what happens in the MAD pipe.
 
Not necessarily. While the dependent MUL executes in the MAD units it can already evaluate plane equations for another batch (thread). Every clock cycle you can get perspective correct interpolants, so there's a throughput of 1. You could also think of it as attaching the MAD unit pipeline to the SFU pipeline; it's a longer pipeline that does more work per clock.

Yeah, the way I worded that was just wrong :oops:. I thought I'd seen some analysis somewhere that showed the entire perspective-correct interpolation had the same cost as a MAD, suggesting that the SF+interp unit was able to do a MUL at full speed. But thinking through what it would take to demonstrate that I think I probably just misinterpreted something else. Oh well...
 
I think you should try writing a MUL (or MAD)-only shader that interpolates a different attribute at each instruction. The SFU really will do a MUL at the rate of 1 per clock per thread, largely independent of what happens in the MAD pipe.
I'd love to try all this but I don't have any DX10 card yet. Too expensive for me at the moment (and I don't want to replace my GeForce 7900 GT with a GeForce 8600).

Anyway, are you saying there's one extra MUL unit per quad, or four? And this is not drawn in the diagrams of the SFU?
 
Just a wild thought: Is it possible perspective correction is disabled for 'flat' triangles? That's a trick I used years ago to avoid the expensive division.

I can imagine this could work on hardware as flexible as G80 as well. It's possible to detect in the triangle setup that the gradients of w are small enough to ignore. Scenes with lots of tiny triangles that don't need perspective correction could really benefit from it.

So, anyone with a G8x here who can test this?
 
As far as I know, the current idea was that the SFUs can perform a co-issued MUL, but only at limited precision.

Okay, so are you saying the MUL is performed using the 4 least significant pipes which Jawed talked about? The 'co-issued MUL' wording seems to indicate this.

Thank you
 
Okay, so are you saying the MUL is performed using the 4 least significant pipes which Jawed talked about? The 'co-issued MUL' wording seems to indicate this.
Sorry, I really don't know. It was just a rumor I had picked up, and probably misinterpreted.
 
In theory, each G80 ALU could process a MADD + MUL per clock cycle, possibly totaling 256 stream processors, right?

But then this MUL isn't usable as general shading power, true?


How is that?
 
I'm not sure it would be 256 SPs with the MULs included. More like 128 larger SPs. Secondly I'm not sure you can use just the MULs. You would still have to use the MUL and then pass the result through the MADD to output it. For instance you'd need a multiplication operation followed by some usage of the MADD. So (x*x)*y+z. On top of that the MUL would need to be free and if it's tied to the SF/interpolators it would only be available after the first couple instructions and when no special functions occurred in the code. My understanding is that the MULs are used to feed data into the MADDs under certain circumstances.

I really wish Nvidia had something similar to AMD's Shader Analyzer to see just how it compiles the code.
 
For those lost, here are links:
slides
paper (also via IEEE if you / your school/company are a member).

Now the basic functions are very well laid out in the slide. Either the unit can calculate 4 interpolated values, or it can calculate 1 special function value.

Now how you would get a straight A*B MUL out of this SFU is quite screwy. You have two 24x17 multipliers (they're Booth coded, but that's ignored when talking about the size), so you could have each calculate half of the result and add them together (the adder's there). The problem is this requires some extra shifters and a new way of feeding arguments in. If it's there, I'm not sure why they bothered, considering it's 1/4th speed.

Edit: Also I like how the slides specifically say "For unannounced future product", "Modified shader microarchitecture", "For use in both Vertex and Pixel Shaders". This is back when everyone was arguing that G80 was not unified... why on earth would your VS need an interpolator? :)
 
Rufus: What you're probably missing is that those papers don't take perspective correction into consideration. You need to multiply by (1/pos.w) after all this, and you can't really get away from FP32 there as far as I can see, although I could be wrong.

I actually got 1.5x the normal MUL throughput with a program minimizing the number of register operands on 101.41 recently, and I suspect I could get to 2.0x if the driver (or hardware...) only fetched two operands on the main pipeline's MUL when reusing the same register.

Hard to say if that's the problem, though. Either way, it certainly doesn't look like you could get 2.0x most of the time under normal workloads unless that driver (which is, to date, the only one exposing the MUL at all on G80 AFAIK) is misbehaving badly.

EDIT: And regarding whether that doc points at a unified architecture - in that timeframe, the architecture I had in mind for G80 was that you could run the VS on the PS, but not the other way around, and that when vertex fetches are used you couldn't use the VS pipelines at all. So that coincided quite well with the doc, but of course it was quite far from the truth! ;)
 
Rufus: What you're probably missing is that those papers don't take perspective correction into consideration. You need to multiply by (1/pos.w) after all this, and you can't really get away from FP32 there as far as I can see, although I could be wrong.
I assume you mean multiply by pos.w. Yes, *pos.w would be FP32, but why would you implement that in the SFU? The entire point of the unit is to minimize duplication, so why would you add another 24-bit multiplier that is only used by the interp half and not the special function half? It seems more logical to hand off an uncorrected result and let the SP ALU do the correction, but I could be wrong. Could you test this somehow (not sure how)?

I actually got 1.5x the normal MUL throughput with a program minimizing the number of register operands on 101.41 recently, and I suspect I could get to 2.0x if the driver (or hardware...) only fetched two operands on the main pipeline's MUL when reusing the same register.
Erm, 1.5x?? That would mean the SFU was spitting out 2 MULs / cycle (aka is 1/2 speed instead of 1/4th). Now back to the drawing board of how I thought it worked.
 
I assume you mean multiply by pos.w. Yes *pos.w would be FP32, but why would you implement that in the SFU?
It is multiplied by rcp(1/pos.w), and that reciprocal is calculated by the SFU at the beginning of the program. This can be proved by writing an SFU-limited program where attribute interpolation is requested, and another where it is not; the one without attribute interpolation will be faster, because there is one less RCP.

As for why you'd want to implement that in the SFU: I pondered the same initially, and someone pointed out to me that SFU operations are less frequent than attribute interpolation, at least in current workloads (dominated by the pixel shader, and not by ridiculously long pixel shaders that do tons of things unrelated to the initial attributes), so interpolation is probably much more frequent than special functions.

In theory, you could imagine an architecture where the MUL is available for general shading when you're doing SFU work, or when the unit is completely idle. This would clearly minimize wastage. It's hard to say if that's how it works in G80 though, especially so given that 101.41 doesn't seem to be exposing the functionality in the most logical of ways. If someone told me that the driver exposed it as much as any driver ever will, and that the limitations are HW-related, then I might bother investigating it further...

Erm, 1.5x?? That would mean the SFU was spitting out 2 MULs / cycle (aka is 1/2 speed instead of 1/4th). Now back to the drawing board of how I thought it worked.
It doesn't have to mean that. If the register file allows you to effectively read/write 4 operands/clock (so, 3 read+1 write, for example) then this could become a limitation.
The shader exposing 1.5x throughput is basically this. I'm amazed this doesn't get simplified through special functions, but heh:
...
r0 = r0*r0;
r1 = r1*r1;

r0 = r0*r0;
r1 = r1*r1;

r0 = r0*r0;
r1 = r1*r1;
...
If the hardware is reading 2 registers for the first MUL, and writing one too, then that only leaves one register access for the second co-issued MUL. Assuming the hardware/driver is smart enough to reuse the same register instead of reading it twice there, it would explain the 1.5x throughput. (reading the register+writing it would take 2 clocks...)

Interestingly, "r0 = r0*r2; r1 = r1*r2;" only runs at 1.25x. I did have a program running at 1.33x once, but I cannot quite remember exactly how it works. Also, "r0 = r0*r0+r0; r1 = r1*r1;" runs at 1.25x, possibly indicating that the MUL cannot be co-issued with a MADD (unless the HW could be told to not reread identical registers), and that the MADD and MUL parts of the program are separated by the compiler and that it is co-issuing MULs together.

I'm sure there are other potential explanations, and this one isn't even perfect (although I'd argue that's probably because the compiler isn't perfect either). If CUDA exposed the MUL, which I'm not sure of (Vista x64 distribution, pretty please?), then it would be possible to read or write one of the operands from shared memory, and see if that allows 2.0x - under the graphics APIs, another way to fetch an operand from somewhere else might be via constant buffers, although you obviously cannot write to that.
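A back-of-envelope way to play with that operand-bandwidth theory (the 4-accesses-per-clock budget and the access counts are pure assumptions here, not confirmed G80 figures):

```python
# Toy model: if the register file sustains `budget` operand accesses
# (reads + writes) per clock, how many MULs per clock can co-issue?
# Capped at 2.0: one MUL in the MAD pipe, one in the SFU pipe.

def mul_rate(accesses_per_pair, budget=4):
    return min(2.0, 2.0 * budget / accesses_per_pair)

naive = mul_rate(6)   # each MUL: 2 reads + 1 write -> ~1.33x
reuse1 = mul_rate(5)  # second MUL reuses its register: (2+1) + (1+1) -> 1.6x
reuse2 = mul_rate(4)  # both MULs of the r = r*r form, full reuse -> 2.0x
```

The measured 1.5x sits between these cases, which would be consistent with the compiler only sometimes managing the operand reuse.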
 
Ha, so we've been looking in the wrong place all along and G80 is register bandwidth limited on these "awkward" synthetic shaders?

ARGH.

Jawed
 
Ha, so we've been looking in the wrong place all along and G80 is register bandwidth limited on these "awkward" synthetic shaders?
That's what I'm thinking right now, and it's also my alternative explanation for that "89% shader" :) But as I said, I'm really not sure, so other theories are far from impossible. For all we know, there might be an obscure hardware bug that complicates things further too, heh. And no, I'm not saying there is, but it is worth pointing out that we just don't know anything reliably enough at this point.

Also, it would seem rather strange to me that it would be 4 (read+writes). It would make more sense for it to be 4 reads + 4 writes IMO, because other subsystems (such as the TMUs) also seem to be able to write to the register file. I guess that might have its own path to the register file which couldn't be used by anything else, though. Sigh...

What would really help is if I had a G86 (or if someone else did and could run some tests on it, I guess, although that's not quite as practical), since it apparently exposes a second MUL much more reliably right now. I would expect that to be hardware differences, so it wouldn't tell us exactly how G80 works, but it would be a step in the right direction. Sadly, I was supposed to get a G84 and a G86 to review, but I only got the G84! :) Ah well!
 
That's what I'm thinking right now, and it's also my alternative explanation for that "89% shader" :) But as I said, I'm really not sure, so other theories are far from impossible.
The fact you are unable to run any CUDA tests does get in the way. Reading operands from PDC or constant cache really should isolate register bandwidth.

Although one might argue that all operands are "cached" somewhere close to the ALU pipeline, after they've been fetched from the register file, constant cache or PDC - so what looks like a register bandwidth limitation is actually an "operand-gather" limitation.

Also, it would seem rather strange to me that it would be 4 (read+writes). It would make more sense for it to be 4 reads + 4 writes IMO, because other subsystems (such as the TMUs) also seem to be able to write to the register file. I guess that might have its own path to the register file which couldn't be used by anything else, though. Sigh...
Texture-result writes to the register file are a peculiarity because texture throughput is slow: the TMU clock rate is significantly lower than the ALU pipeline rate. The latter presumably determines the register file access rate - although, to be fair, it could be that the core clock rate determines the register file clock rate (what a ghastly thought). Anyway, the peak rate at which the TMUs can write to the register file would appear to be <0.5x the ALU pipeline's rate.

Hmm, but the TMUs can only produce 4 texture results per core clock (they're TA-limited, effectively). So TMU register file writes are actually <0.25x ALU writes.

Also you have to consider the case of writing the result of a MAD and an SF. Both units produce 1 result per clock. It's just that SF results can't produce a "coherent" write to all result registers per clock, because only 2 pixels out of 8 get an SF result.

So SF's results added to MAD results requires 1.25 register writes per clock. TMU results added in makes that 1.5 register writes per clock, maximum.

What would really help is if I had a G86 (or if someone else did and could run some tests on it, I guess, although that's not quite as practical), since it apparently exposes a second MUL much more reliably right now. I would expect that to be hardware differences, so it wouldn't tell us exactly how G80 works, but it would be a step in the right direction. Sadly, I was supposed to get a G84 and a G86 to review, but I only got the G84! :) Ah well!
G86 has more TAs per cluster than G80, doesn't it?

Jawed
 
D'oh, forgot that TMUs produce 4x scalars per clock, whereas the ALU pipeline produces 1 scalar.

So that means <2.25x scalar register writes per ALU clock.
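Tallying the figures from these two posts (upper bounds, per ALU clock, using the corrected 4-scalars-per-TMU-result number):

```python
# Register-write tally per ALU clock, from the numbers above.
mad_writes = 1.0        # 1 MAD result per clock
sf_writes = 2 / 8       # SF covers only 2 of 8 pixels per clock -> 0.25
tmu_writes = 4 * 0.25   # 4 scalars per TMU result, at <0.25x the ALU rate
peak = mad_writes + sf_writes + tmu_writes
print(peak)  # 2.25 (an upper bound, hence "<2.25x")
```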

Jawed
 