Maximum number of instructions per clock of a NV40 pipeline

tcchiu

NV40 Technology explained.
http://www.3dcenter.org/artikel/nv40_pipeline/index2_e.php

(To clarify first, the terminology "pipeline" in this post is not a hardware quad-pipeline.)

Dual- and co-issue combined, an NV40 pipe can execute up to four instructions – while having a single all-purpose arithmetic unit only.

I don't understand. I think the number should be eight.

With co-issue, each single shader unit can execute four instructions per clock. For example, shader unit 1 should be able to execute the following four instructions in a cycle:
Code:
rsq r0.r, v0.r
mul r1.r, r0.r, v1.r
rsq r0.b, v1.b
mul r1.b, r0.b, v2.b
The r and b channels could be co-issued. Each mul depends on the preceding rsq, but the pair could be dual-issued. That is, shader unit 1 can execute these four instructions per clock.

Since there are two shader units, suppose NV40 could also dual-issue between them, so that the destination of shader unit 1 could be the source of shader unit 2 (if not, why this 'complement' design?). Then there can be another four instructions, and the maximum number of instructions per clock should be eight.

Because NV40 has 16 pipes in total, overall chip-performance is 16T + 128M at maximum.
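
The per-chip figure is just pipes times instructions per pipe. A minimal sketch of that arithmetic follows; reading "M" as math instructions per clock is an assumption, and the function name is made up for illustration.

```python
PIPES = 16  # pipe count taken from the quoted sentence


def peak_math_per_clock(instructions_per_pipe):
    """Chip-wide peak, assuming every pipe hits its per-pipe maximum."""
    return PIPES * instructions_per_pipe


# Four per pipe (the figure quoted from the article) versus
# eight per pipe (the figure argued for in this post).
four_per_pipe = peak_math_per_clock(4)
eight_per_pipe = peak_math_per_clock(8)
```

With four instructions per pipe this gives 64, and with eight it gives 128.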

What did I miss?
 
Three things:
1. I'm not sure that you can co-issue special functions.
2. The second shader unit is not identical to the first.
3. There may be a hard limit due to bandwidth or cache constraints that prevent more than four instructions from being executed each clock.
 
According to 3DGPU, the limit of 4 instructions per clock is a result of the number of data paths available to the SUs (4 data paths).

I'm sure that 2 RCPs cannot be co-issued in parallel, as there is only one RCP unit, but I'm not sure about RCP and RSQ. According to the article, these are separate units, although I believe they are arranged in series, which would prevent them from functioning independently. I guess one unit could modify the output of the other, but this would come to no real benefit.
 
Chalnoth said:
Three things:
1. I'm not sure that you can co-issue special functions.
2. The second shader unit is not identical to the first.
3. There may be a hard limit due to bandwidth or cache constraints that prevent more than four instructions from being executed each clock.

I am not sure about 1 either, but it seems okay according to the figures given in "NV40 Technology explained". I know it may be just a guess, which is why I am looking for more (solid) information here.

About 2: I know that shader unit 2 is not identical to shader unit 1. The four-line example code in my original post cannot be executed in shader unit 2.

Maybe I didn't make myself clear in the original post. The four-instruction sample code should be executed in one cycle by shader unit 1 (I am not sure, which is why I posted this question). What I meant by "there can be another four instructions" (executed by shader unit 2) is not the same four-instruction sample code I gave for shader unit 1. For example, it could be another sequence of four instructions consisting of MAD and DOT, with totally different channels (so they can be co-issued and dual-issued like the four sample instructions for shader unit 1).

About 3: if there were such a "hard limit" (e.g. the read ports of the instruction buffer), the linked article should have mentioned it, shouldn't it?
 
Luminescent said:
I guess one unit could modify the output of the other, but this would come to no real benefit.

The benefit is being able to execute four "dependent" instructions in a clock cycle.

Code:
rcp r0.r, v0.r
mul r1.r, r0.r, v1.r
mul r2.r, r1.r, v2.r
add r3.r, r2.r, v3.r

The first two instructions could be dual-issued in shader unit 1, and the latter two in shader unit 2. If shader unit 2 could modify the result of shader unit 1, then even though the second mul depends on the destination register (r1.r) of the first mul, these four instructions could be executed in one clock cycle, under the assumption that dual-issue happens across the two shader units (I cannot find any statement in "NV40 Technology explained" supporting this).
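
To make "dependent" concrete, here is a small sketch that measures the longest read-after-write chain in a straight-line snippet. It says nothing about what NV40 can actually issue; it only counts how many instructions would have to be chained in one clock. It assumes single-channel operands that can be compared as whole strings (e.g. `r0.r`), and the helper names are hypothetical.

```python
def parse(line):
    """Split 'op dst, src1, src2' into (op, dst, [srcs])."""
    op, rest = line.strip().split(None, 1)
    operands = [s.strip() for s in rest.split(",")]
    return op, operands[0], operands[1:]


def chain_depth(program):
    """Length of the longest read-after-write chain in the program."""
    depth_of = {}  # operand -> chain depth of the instruction that last wrote it
    longest = 0
    for line in program.strip().splitlines():
        _, dst, srcs = parse(line)
        # Inputs that nothing wrote (v0.r, ...) count as depth 0.
        depth = 1 + max((depth_of.get(s, 0) for s in srcs), default=0)
        depth_of[dst] = depth
        longest = max(longest, depth)
    return longest


# The four-instruction chain from this post.
DEPENDENT = """
rcp r0.r, v0.r
mul r1.r, r0.r, v1.r
mul r2.r, r1.r, v2.r
add r3.r, r2.r, v3.r
"""

# The original example: two independent rsq/mul pairs on r and b.
PAIRED = """
rsq r0.r, v0.r
mul r1.r, r0.r, v1.r
rsq r0.b, v1.b
mul r1.b, r0.b, v2.b
"""
```

Under these assumptions `chain_depth(DEPENDENT)` is 4, while `chain_depth(PAIRED)` is 2, so the chain above needs results forwarded through all four operations within a single clock, whereas the original example only needs each mul to follow its rsq.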
 
Dual-issue means having SU1 and SU2 perform different operations. Co-issue means performing two ops in one SU, effectively splitting it. You can only co-issue two MAD/MUL/ADD/DPx, and only use up to four channels. I.e. SU2 can only do 2 instructions per clock.

You can't co-issue special functions. I think the maximum you can reach is 6 instructions/clock, e.g.
RCP + 2 MUL2 in SU1
2 MAD2 in SU2
NRM_PP (two cycles latency)
 
Xmas said:
You can't co-issue special functions. I think the maximum you can reach is 6 instructions/clock, e.g.
RCP + 2 MUL2 in SU1
2 MAD2 in SU2
NRM_PP (two cycles latency)
Perhaps, unless there's an absolute hard limit on the number of instructions.
 
Xmas said:
You can only co-issue two MAD/MUL/ADD/DPx, and only use up to four channels. I.e. SU2 can only do 2 instructions per clock.

What does "only use up to four channels" mean? Doesn't the following code snippet use up to four channels? (I expect SU2 can execute all of them in a clock cycle.)

Code:
mul r0.rg, v1.rg, v2.rg
add r1.rg, r0.rg, v3.rg
mul r0.ba, v4.ba, v5.ba
add r1.ba, r0.ba, v6.ba
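
One possible reading of the rule quoted above, as a sketch only: at most two plain arithmetic ops share a shader unit in a clock, their write masks do not overlap, and together they cover at most four channels. The non-overlap condition is an assumption drawn from "effectively splitting it", DPx is left out because a write mask does not capture the channels a dot product reads, and all names here are hypothetical.

```python
CO_ISSUABLE = {"mad", "mul", "add"}  # DPx omitted, see the note above


def write_mask(dst):
    """Channels written by a destination such as 'r0.rg'."""
    return set(dst.split(".")[1])


def can_co_issue(instrs):
    """instrs: list of (op, dst) pairs proposed for one SU in one clock."""
    if len(instrs) > 2:
        return False  # "You can only co-issue two"
    if any(op not in CO_ISSUABLE for op, _ in instrs):
        return False  # special functions such as rcp are excluded
    masks = [write_mask(dst) for _, dst in instrs]
    total = sum(len(m) for m in masks)
    disjoint = len(set().union(*masks)) == total
    return disjoint and total <= 4


two_muls = [("mul", "r0.rg"), ("mul", "r0.ba")]
four_ops = [("mul", "r0.rg"), ("add", "r1.rg"),
            ("mul", "r0.ba"), ("add", "r1.ba")]
```

Under this reading the two muls alone pass the check, but the full four-instruction snippet does not: it touches only four channels, yet it contains four ops rather than two.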
 
http://techreport.com/etc/2004q2/tamasi/index.x?pg=4

TR: Inside of the pixel pipeline, you've got two of the FP32 pixel shaders in each pixel pipe. Can both of them do parallel vector operations per clock?

Tamasi: Yep. The way to think about it is that you can dual (or more) issue instructions per shader unit, and then you can co-issue between them as well, so, in fact, you can have four, or in some cases more than four, instructions being issued on a single pixel pipeline—two in shader unit one and two in shader unit two—two independent instructions in shader unit one and another two independent instructions in shader unit two. We also have mini-ALUs in each of those shader units, as well, which also can have instructions issued to them. We gave a shader example that actually had up to seven instructions being executed in parallel in one pass.
It appears one NV40 pipe can do 7 instructions per clock under a special circumstance.
 
pat777 said:
http://techreport.com/etc/2004q2/tamasi/index.x?pg=4

TR: Inside of the pixel pipeline, you've got two of the FP32 pixel shaders in each pixel pipe. Can both of them do parallel vector operations per clock?

Tamasi: Yep. The way to think about it is that you can dual (or more) issue instructions per shader unit, and then you can co-issue between them as well, so, in fact, you can have four, or in some cases more than four, instructions being issued on a single pixel pipeline—two in shader unit one and two in shader unit two—two independent instructions in shader unit one and another two independent instructions in shader unit two. We also have mini-ALUs in each of those shader units, as well, which also can have instructions issued to them. We gave a shader example that actually had up to seven instructions being executed in parallel in one pass.
It appears one NV40 pipe can do 7 instructions per clock under a special circumstance.

The number 7 comes from a shader example used by NVIDIA for the press briefings. IIRC some useless instructions were added to this shader to inflate the maximum number of instructions shown to the press. A lot of the instructions were modifiers, and there were a lot of syntax errors (a mix of OpenGL and Direct3D syntax). It always makes me smile when I read this number 7 :D


I think that the maximum is what Xmas said + modifiers

rcp.w
mul.w
mul.xyz
bx2.xyzw
nrm_pp
add.xy
add.zw
bx2.xyzw

-> 8 instructions with modifiers

However, many things can prevent this from happening: register usage (1 interpolated, 2 constants, ?2 temporary write / 4 temporary read?) or scheduling difficulties.
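
A quick tally of the list above, as a sketch: it counts all eight entries and then counts again with the two bx2 entries set aside as modifiers. Treating bx2 as the only modifier in the list is an assumption, and the function name is made up.

```python
SEQUENCE = ["rcp.w", "mul.w", "mul.xyz", "bx2.xyzw",
            "nrm_pp", "add.xy", "add.zw", "bx2.xyzw"]
MODIFIERS = {"bx2"}  # assumed to be the only modifier in the list


def split_counts(seq):
    """Return (count including modifiers, count of plain instructions)."""
    ops = [entry.split(".")[0] for entry in seq]
    modifier_count = sum(op in MODIFIERS for op in ops)
    return len(ops), len(ops) - modifier_count


with_modifiers, without_modifiers = split_counts(SEQUENCE)
```

This gives 8 with modifiers and 6 without, which lines up with the figure of 6 instructions/clock suggested earlier in the thread.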
 