instructions/operations per clk

demalion · May 7, 2004

Why are we discussing the IPC in something that resembles marketingese so much?

Tridam: which parts of your NV40 description are based on testing, and which are based on Kirk's comments? What testing and what comments? What modifiers are supported for the issued units...all the PS 1.4 modifiers? Wouldn't it be important to clarify this information?

The term "instructions" seems like it is being abused to me, especially in a context for comparison of different architectures. Normalization should count as the equivalent of 3 instructions (1 "complex", 2 "simple") for comparison. "Modifiers" should be mentioned as distinct from more general processing ability. Where is the consideration of what can actually be effectively completed in the same clock cycle...the current discussion seems to assume all ops take the same number of clock cycles, but is this the case? What impact does a texture op have and for how many clock cycles will it affect IPC? Is it just 1 clock cycle of IPC impact? What about value handling characteristics...as one example: did the NV3x really have a penalty for using constants, and, if so, did the NV40 get rid of it?

Going by the boost in IPC that seems to occur here, the normalization appears to cost roughly the equivalent of occupying the "complex" ALU. This seems quite validly stated as getting significant "free" partial precision operations, but it doesn't correspond to the impression of "free" being put forth. Did something Kirk said specify exactly what "free" means for this? Is there some other testing result that gives a different indication than I took away from that link, or another interpretation or a correction of what I came up with in it?

What set of modifiers can the NV40 perform in conjunction with general ops?

ATI has claimed, in my understanding, to be able to do independent ADD and MUL in addition to the full ALUs on the R3xx. Was this discounted to only be described as a modifier? Or is the presumption simply that the discussed "NV40 modifiers" are representing the same functionality?
Has anything changed with regard to this functionality on the R420, or what ATI claims (or is there some error in my understanding of what they claimed)?

...

For my current understanding of the above:

NV40:
Very good at maintaining 1 IPC throughput via "complementary" ALUs to somewhat counter failing to "protect" complex ops IPC throughput from being used for texture ops.
Can boost IPC by complete two part coissue flexibility and a set of modifiers, for each "complementary" ALU.
Can boost IPC, even with modifiers, when a "complex" and "simple" op are paired.
Can boost IPC, in partial precision, by performing the equivalent of 3 ops in one clock cycle (tying up, to my understanding so far, the "complex" ALU).

R420:
Very good at maintaining 1 IPC throughput by having a "complete" ALU and protecting its IPC throughput from being used for texture ops.
Can boost IPC by common case (but not complete) two part coissue flexibility and a set of modifiers (AFAIK, PS 1.4 modifiers).
Can boost IPC, as an alternative to modifiers, for general MUL/ADD operations (subset of "simple ops").
This assumes no changes from what was discussed about the R3xx in this respect...is this true?
...

Even if my understanding is accurate, this isn't the complete picture. Other questions that could be usefully answered include things like: where does SIN/COS fit in now?

I think this thread would be a good place to go into detail about any specific info we have with regards to the above.

Luminescent · May 7, 2004

According to Digit-Life's NV40 report, Nvidia removed SIN/COS-specific hardware in favor of 1D table lookups, which they claim work more efficiently and cost fewer transistors.

Here is the excerpt:

Digit-Life said:
Hardware calculation of SIN and COS values was extracted from the new NVIDIA architecture: it was proved that transistors used for these operations were spent in vain. All the same, better results in terms of speed can be achieved when accessing by an elementary table (1D texture), especially considering that ATI doesn't support the mentioned operations.

Xmas · May 7, 2004

If you can read German, this discussion may be interesting to you. And if you can't, you can still have a look at the diagram.

It's guesswork, of course. The position of the SF-unit might be wrong, but it is there to indicate that you can use the output of the SFU as input to the MUL units, meaning you can do a DIV in one clock in SU0 (RCP + MUL). Also, a TEX does not mean you cannot use the SFU/MUL. However the texture coordinates are passed through SU0, so you can only use SFU/MUL to modify the texture coordinates, not for a separate calculation.

My guess is that TEX always uses three components so the machine code doesn't change when you use cube maps/3D textures. However, there are PS3.0 instructions which pass texture LOD as w component, so maybe there's a trace missing in the diagram. A scalar op should be possible in parallel to an ordinary TEX.

SU1 should be able to do the following per clock:
MAD4
DP4
MAD3 + MAD1
DP3 + MAD1
MAD2 + MAD2
MAD2 + DP2
DP2 + DP2
DP2ADD + MAD1
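
One way to read the list above is as a lane budget: SU1 has four MUL/ADD lanes per clock and each op occupies some of them. The following is only a sketch of that reading. The `WIDTH` table and the `fits` helper are hypothetical, and the 3-lane cost for DP2ADD (two products plus one adder for the scalar addend) is an assumption, not something stated in the thread.

```python
# Hypothetical lane-budget model of SU1: four MUL/ADD lanes per clock.
LANES = 4

# Lanes each op is assumed to occupy. DP2ADD = 3 is a guess:
# two products plus one extra adder lane for the scalar addend.
WIDTH = {
    "MAD4": 4, "DP4": 4,
    "MAD3": 3, "DP3": 3, "DP2ADD": 3,
    "MAD2": 2, "DP2": 2,
    "MAD1": 1,
}

def fits(ops):
    """Return True if the ops together need no more than LANES lanes."""
    return sum(WIDTH[op] for op in ops) <= LANES

# The combinations listed in the post.
LISTED = [
    ("MAD4",), ("DP4",),
    ("MAD3", "MAD1"), ("DP3", "MAD1"),
    ("MAD2", "MAD2"), ("MAD2", "DP2"), ("DP2", "DP2"),
    ("DP2ADD", "MAD1"),
]

for combo in LISTED:
    print(" + ".join(combo), "fits" if fits(combo) else "does not fit")
```

Under this model every listed pairing fills the four lanes, while something like MAD3 + MAD2 would not fit. Real co-issue rules may have extra restrictions that a simple lane count does not capture.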

Output modifiers supported are d2/d4/d8/m2/m4/m8 for any instruction (in both SU0 and SU1), and all those plus bx2 on tex instructions, meaning the TMU has an extra output scaling unit.

Luminescent · May 7, 2004

The information given to me by Demirug seems to confirm what you say, Xmas.

Xmas said:
However the texture coordinates are passed through SU0, so you can only use SFU/MUL to modify the texture coordinates, not for a separate calculation.

I'm not sure if you could confirm this, but I heard that in the case of a texture coordinate instruction that does not require both SFU and MUL, the free unit could be used for another instruction.

What about ddx/ddy, what kind of unit would such an instruction call for?

Xmas · May 7, 2004

Luminescent said:
The information given to me by Demirug seems to confirm what you say, Xmas.

That's no surprise

Luminescent said:
Xmas said:

However the texture coordinates are passed through SU0, so you can only use SFU/MUL to modify the texture coordinates, not for a separate calculation.

I'm not sure if you could confirm this, but I heard that texture coordinates instructions that do not require both SFU and MUL, allow the free unit to be used for another instruction.

I have no hardware to test it, but the way I see it you can either do
SF - MUL3 - TEX sequence plus a MUL1 in parallel
MUL3 - TEX plus a SF - MUL1 in parallel
SF - MUL2 plus a MUL2 in parallel
SF - MUL4
And you can drop any instruction from the above mentioned variants.

So the texture coordinates that are passed to the TMU are always taken from the output of three MUL units. The output of the SFU can be used as input for those MUL units, or as input for the spare MUL unit.

Luminescent said:
What about ddx/ddy, what kind of unit would such an instruction call for?

It's basically just a SUB, but it only needs to be done once per quad (or twice?), and needs the registers on a quad level. I assume it's done per pixel for simplicity, in SU1.

Luminescent · May 7, 2004

In regards to SU1, does it have any special function hardware?

Xmas · May 7, 2004

AFAIK, no. It's just four MUL followed by four ADD which are cascaded so you can use them to either add component-wise or calculate a component sum for dot products.
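
As a rough illustration of "four MUL followed by four ADD which are cascaded", here is a small behavioral sketch. The function name `su1_lanes` and its `dot` switch are made up for this example; it models only the arithmetic, not timing or precision.

```python
def su1_lanes(a, b, c=None, dot=False):
    """Four multipliers feeding four adders.

    dot=False: each adder adds its own addend (component-wise MAD).
    dot=True:  the adders are chained to sum the products (dot product).
    """
    products = [x * y for x, y in zip(a, b)]
    if dot:
        total = 0.0
        for p in products:  # cascaded adders accumulate a running sum
            total += p
        return total
    return [p + z for p, z in zip(products, c)]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(su1_lanes(a, b, [1.0, 1.0, 1.0, 1.0]))  # MAD4: a*b + c per component
print(su1_lanes(a, b, dot=True))              # DP4: sum of a*b
```

The same multipliers serve both cases; only the adder wiring differs, which is why a DP4 and a MAD4 would each take a full clock of SU1 under this guess.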

cho · May 7, 2004

I have done some simple pixel shader tests on NV40 with the 60.72 drivers.

http://www.gzeasy.com/sparticle.asp...%E2%CA%D4%5D+GeForce+6800+Ultra&offset=11

ps_2_0

dcl t0

add r0, t0, c0
add r0, c0, r0

mov oC0, r0

results: 3183.477 M-Pixel/s, ~0.5 pixel/cycle per pipeline

ps_2_0

dcl t0

mul r0, t0, c0
add r0, c0, r0

mov oC0, r0

results: 6336.755 M-Pixel/s, ~1 pixel/cycle per pipeline
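
The per-pipeline figures above can be checked with a little arithmetic. This sketch assumes a GeForce 6800 Ultra at a 400 MHz core clock with 16 pixel pipelines; neither number is stated in the post, so treat `CORE_MHZ` and `PIPELINES` as assumptions.

```python
CORE_MHZ = 400   # assumed GeForce 6800 Ultra core clock (not given in the post)
PIPELINES = 16   # assumed number of pixel pipelines (not given in the post)

def pixels_per_clock_per_pipe(mpixels_per_s):
    """Convert a measured fill rate in M-Pixel/s to pixels per clock per pipeline."""
    return mpixels_per_s / (CORE_MHZ * PIPELINES)

add_add = pixels_per_clock_per_pipe(3183.477)  # ADD followed by ADD
mul_add = pixels_per_clock_per_pipe(6336.755)  # MUL followed by ADD

print(f"ADD+ADD: {add_add:.3f} pixel/clock/pipe")
print(f"MUL+ADD: {mul_add:.3f} pixel/clock/pipe")
```

With those assumptions the two shaders come out at roughly 0.5 and 1 pixel per clock per pipeline, matching the rounded figures quoted with the results.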

Dave Baumann · May 7, 2004

Luminescent said:
According to Digit-Life's NV40 report, Nvidia removed SIN/COS-specific hardware in favor of 1D table lookups, which they claim work more efficiently and cost fewer transistors.

Here is the excerpt:

Digit-Life said:

Hardware calculation of SIN and COS values was extracted from the new NVIDIA architecture: it was proved that transistors used for these operations were spent in vain. All the same, better results in terms of speed can be achieved when accessing by an elementary table (1D texture), especially considering that ATI doesn't support the mentioned operations.

Heh. I seem to remember Andy suggesting this as something ATI could do in their optimiser if they wanted to.

LeStoffer · May 7, 2004

Luminescent said:
According to Digit-Life's NV40 report, Nvidia removed SIN/COS-specific hardware in favor of 1D table lookups, which they claim work more efficiently and cost fewer transistors.

Hmm, I seem to remember that SIN/COS has to be native (not macro) in SM 3.0? Maybe that was only for VS 3.0 or maybe I just remember wrong?

Luminescent · May 7, 2004

Cho, your results seem to confirm Xmas' data, indicating that there is only 1 ADD unit per pixel pipe in NV40, confirmed in test 1; test 2 confirms that there are independent MUL and ADD units. It would be interesting to run a test which includes two MUL ops and one ADD, as there are two MUL units per pipeline, one in each ALU.

andypski · May 7, 2004

LeStoffer said:
Hmm, I seem to remember that SIN/COS has to be native (not macro) in SM 3.0? Maybe that was only for VS 3.0 or maybe I just remember wrong?

In the pixel shader (or a vertex shader with arbitrary texture lookup capabilities) it would be a fairly irrelevant distinction - if you have an architecture that supports infinite texture indirections you can easily support SIN/COS through a texture lookup, and provided you meet the accuracy requirements then who would be any the wiser as to whether it was a native instruction or not?

The only issue is that you would need one extra sampler to attach the texture to (so that you don't use one of the samplers required by the spec).
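
To get a feel for the accuracy side of this, here is a minimal sketch of SIN through a wrapped, linearly filtered 1D table. The table size of 256 and the helper name `sin_lookup` are hypothetical, and real texture filtering (texel centers, fixed-point filter weights) is simplified away.

```python
import math

TABLE_SIZE = 256  # hypothetical texture width
TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def sin_lookup(x):
    """Approximate sin(x) with a wrapped, linearly filtered 1D table."""
    u = (x / (2 * math.pi)) % 1.0 * TABLE_SIZE  # texel coordinate with wrap addressing
    i = int(u) % TABLE_SIZE                     # left texel
    j = (i + 1) % TABLE_SIZE                    # right texel, wrapping at the edge
    frac = u - int(u)                           # filter weight between the two texels
    return TABLE[i] + frac * (TABLE[j] - TABLE[i])

# Largest deviation from math.sin over roughly [-7, 7] radians.
max_err = max(abs(sin_lookup(k * 0.001) - math.sin(k * 0.001))
              for k in range(-7000, 7000))
print(f"max abs error: {max_err:.2e}")
```

Whether that error is good enough depends on the accuracy the spec demands, which is exactly the condition attached to the argument above.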

Xmas · May 7, 2004

LRP in two cycles makes sense, as you need a scalar SUB first, then MUL and MAD.

Rolf N · May 7, 2004

Xmas said:
LRP in two cycles makes sense, as you need a scalar SUB first, then MUL and MAD.

Not necessarily. You can rewrite a LERP so that it takes only a SUB and a MAD.

Code:

(1-c)*a+c*b
=
a-c*a+c*b
=
c*(b-a)+a

SUB tmp,b,a;
MAD result,c,tmp,a;
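
A quick numeric check of the rewrite, as a sketch only. The function names are made up; each line of the second function mirrors one of the two instructions above.

```python
def lrp_direct(a, b, c):
    """Textbook form: (1-c)*a + c*b."""
    return (1 - c) * a + c * b

def lrp_sub_mad(a, b, c):
    """Rewritten form: one SUB followed by one MAD."""
    tmp = b - a          # SUB tmp, b, a
    return c * tmp + a   # MAD result, c, tmp, a

for a, b, c in [(2.0, 6.0, 0.25), (0.0, 1.0, 0.5), (-4.0, 4.0, 0.75)]:
    print(lrp_direct(a, b, c), lrp_sub_mad(a, b, c))
```

The two forms agree algebraically; in floating point they can differ in the last bits for arbitrary inputs, since the order of operations changes.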

DemoCoder · May 7, 2004

andypski said:
The only issue is that you would need one extra sampler to attach the texture to (so that you don't use one of the samplers required by the spec).

I suggested a similar trick 2 years ago to handle the NRM macro, which is to have the driver replace it with a cube map normalization if static analysis indicated it wouldn't cause other problems. (Actually, my suggestion was that the normalize() call in GLSL could be analyzed and replaced with a lookup or an ALU sequence depending on need.)

But people shot it down, suggesting that using an extra texture without the developer knowing about it could lead to bizarre conditions when debugging performance. It also cuts down on your fillrate in a different way compared to using 3 instructions, or the Newton-Raphson method.
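
For reference, the two ALU alternatives mentioned here can be sketched as follows. The three-step sequence (DP3, RSQ, MUL) is the usual expansion of a normalize. The Newton-Raphson helper refines a reciprocal square root from a rough seed, where the seed value and step count are arbitrary choices for illustration, not anything a particular driver does.

```python
import math

def rsq_newton(x, y0, steps=2):
    """Refine an estimate y0 of 1/sqrt(x) with Newton-Raphson iterations."""
    y = y0
    for _ in range(steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

def normalize(v, rsq=lambda x: 1.0 / math.sqrt(x)):
    """Normalize a 3-vector as DP3, RSQ, MUL."""
    d = sum(c * c for c in v)   # DP3: squared length
    r = rsq(d)                  # RSQ: reciprocal square root
    return [c * r for c in v]   # MUL: scale each component

print(normalize([3.0, 0.0, 4.0]))
print(rsq_newton(25.0, 0.19))   # rough seed 0.19, exact answer is 0.2
```

Each Newton-Raphson step roughly squares the relative error, so a coarse seed converges in a few iterations, at the cost of extra multiplies instead of an extra texture fetch.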

ram · May 9, 2004

DemoCoder said:
Difference is, R420 can do 2 vec4 + texturing at same time. The ops that are possible are not equivalent.

If you count that way, you have to count the two additional modifier mini-ALUs of NV40 too...

Demirug · May 9, 2004

The German PC Game Hardware (print magazine) asked David Kirk about the two shader units:

Short summary:

SU1: MUL (but not MAD) + some special functions (1/x, 1/sqrt(x)) + tex
SU2: MAD + some special functions + fog

cho · May 9, 2004

ram said:
If you count that way, you have to count the two additional modifier mini-ALUs of NV40 too...

Hmm, do you know where I can find details of the NV40 mini-ALUs?

Tridam · May 9, 2004

demalion said:
Tridam: which parts of your NV40 description are based on testing, and which are based on Kirk's comments? What testing and what comments? What modifiers are supported for the issued units...all the PS 1.4 modifiers? Wouldn't it be important to clarify this information?

Everything is based on testing. I then discussed my results with David Kirk and did more testing.

I presume that most PS 1.4 modifiers are supported, but I didn't test all of them. Of course it's also important to know which modifiers are supported.

demalion said:
The term "instructions" seems like it is being abused to me, especially in a context for comparison of different architectures. Normalization should count as the equivalent of 3 instructions (1 "complex", 2 "simple") for comparison. "Modifiers" should be mentioned as distinct from more general processing ability. Where is the consideration of what can actually be effectively completed in the same clock cycle...the current discussion seems to assume all ops take the same number of clock cycles, but is this the case? What impact does a texture op have and for how many clock cycles will it affect IPC? Is it just 1 clock cycle of IPC impact? What about value handling characteristics...as one example: did the NV3x really have a penalty for using constants, and, if so, did the NV40 get rid of it?

That's very true. What I talked about was just a basic figure. I still haven't done enough tests to know everything, but I have already done a lot of them. The difference between instructions and instruction slots is of course important. However, when talking about a "basic" figure it isn't possible to make this distinction.

For example, my tests showed that RSQ is done in 2 cycles by the first unit. Microsoft says that RSQ should use only 1 instruction slot.

I have not seen texturing have any impact outside the pipeline pass in which it is done.

The impact of registers in NV40 is very hard to determine. Everyone has already talked a lot about temporary registers, I think. But what about constant and interpolated (color and coordinate) registers? I've done some tests on this specific point, but as always time is lacking. It seems that the limitation is per pipeline pass: 2 constant registers and 1 interpolated register per pipeline pass.

demalion said:
ATI has claimed, in my understanding, to be able to do independent ADD and MUL in addition to the full ALUs on the R3xx. Was this discounted to only be described as a modifier? Or is the presumption simply that the discussed "NV40 modifiers" are representing the same functionality?
Has anything changed with regard to this functionality on the R420, or what ATI claims (or is there some error in my understanding of what they claimed)?

I can't see R360 or R420 doing a MUL or an ADD in the mini-ALU. I presume that ATI was talking about this kind of ADD: ADD r0, r0, r0.

Luminescent said:
According to Digit-Life's NV40 report, Nvidia removed SIN/COS-specific hardware in favor of 1D table lookups, which they claim work more efficiently and cost fewer transistors.

That's basically what David Kirk told me.

Xmas said:
It's guesswork, of course. The position of the SF-unit might be wrong, but it is there to indicate that you can use the output of the SFU as input to the MUL units, meaning you can do a DIV in one clock in SU0 (RCP + MUL). Also, a TEX does not mean you cannot use the SFU/MUL. However the texture coordinates are passed through SU0, so you can only use SFU/MUL to modify the texture coordinates, not for a separate calculation.

It makes sense and it is "compatible" with what I saw.

Luminescent said:
Cho, your results seem to confirm Xmas' data, indicating that there is only 1 ADD unit per pixel pipe in NV40, confirmed in test 1; test 2 confirms that there are independent MUL and ADD units. It would be interesting to run a test which includes two MUL ops and one ADD, as there are two MUL units per pipeline, one in each ALU.

You can't do that. It won't run in 1 pipeline pass unless one instruction can be transformed into a modifier or one mul and the add can become a mad.

zeckensack said:
Xmas said:

LRP in two cycles makes sense, as you need a scalar SUB first, then MUL and MAD.

Not necessarily. You can rewrite a LERP so that it takes only a SUB and a MAD.

That's the same. SUB and MAD can't be done by the first ALU. LRP is done in 2 cycles on the NV40. It's done by the second ALU so the first one is available.

cho said:
Hmm, do you know where I can find details of the NV40 mini-ALUs?

You can't. NVIDIA told me that they do not want to give additional details because they want developers to focus on their algorithms and because these details won't be the same for the whole NV4x family.

DemoCoder · May 10, 2004

Tridam said:
For example, my tests showed that RSQ is done in 2 cycles by the first unit. Microsoft says that RSQ should use only 1 instruction slot.

Slot != cycle time. Microsoft says "most instructions should execute in 1 cycle". It doesn't demand it. Slots are just a mechanism for counting the max number of instructions.
