instructions/operations per clk

DemoCoder said:
Tridam said:
For example, my tests showed that RSQ is done in 2 cycles by the first unit. Microsoft says that RSQ should use only 1 instruction slot.

Slot != cycle time. Microsoft says "most instructions should execute in 1 cycle". It doesn't demand it. Slots are just a mechanism for counting the max number of instructions.

Of course, but I don't understand where you want to go with this... If a 1-slot instruction takes 2 cycles to execute, it conflicts with the spec as I understand it. Try running more than 256 of these instructions on an architecture limited to 512 instruction slots and you'll see what I mean ;)

Maybe I'm just wrong, but I think that the max instruction cap corresponds roughly to the max number of pipeline passes. So the hardware won't always be able to execute every shader even if it fits into the max number of instruction slots. That could be the case when an instruction takes more cycles to execute than its number of instruction slots. If I'm right, then there are a lot of other cases.
 
Tridam:

I don't see how that would break spec. Even if each instruction took 1000 clock cycles to execute, so long as there is enough SRAM to hold 512 instys the program should run "fine" (but slow). On the early MIPS CPUs (R2000, et al.) a simple integer multiply took an ungodly number of cycles, yet of course those chips were all valid MIPS implementations. Also, "one insty per cycle" would be quite a vague requirement. One insty per cycle per vector FPU? One insty per cycle per pipe? One insty per cycle per chip? There's no reason why an interface spec like DirectX should concern itself with messy implementation details.
 
Instruction execution latency of 2 cycles != instruction throughput of 2 cycles if the hardware is pipelined.
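To put numbers on the latency/throughput distinction, here is a minimal sketch of an idealized pipeline. The function name `total_cycles` and the model itself (independent instructions, one issued every `issue_interval` cycles, no stalls) are assumptions for illustration, not a description of NV40.

```python
def total_cycles(n_instructions: int, latency: int, issue_interval: int = 1) -> int:
    """Cycles to finish n independent instructions on an idealized pipeline.

    latency        -- cycles from issue to result for one instruction
    issue_interval -- cycles between successive issues (1 = fully pipelined)
    """
    if n_instructions == 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # instruction completes `issue_interval` cycles after the previous one.
    return latency + (n_instructions - 1) * issue_interval


if __name__ == "__main__":
    for n in (1, 10, 400):
        cycles = total_cycles(n, latency=2)
        print(f"{n} instructions: {cycles} cycles, {cycles / n:.3f} cycles/instruction")
```

With a latency of 2 and an issue interval of 1, the average cost per instruction approaches 1 cycle as the count grows. Only when the issue interval itself is 2 does a long run of RSQs cost 2 cycles each.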

Tridam, 2 cycles for NV40 on RSQ is no different than NV30 when running at full precision; the instruction should run in 1 cycle under partial precision mode.
 
Tridam said:
For example, my tests showed that RSQ is done in 2 cycles by the first unit. Microsoft says that RSQ should use only 1 instruction slot.
Cycles (aka latency) or slots? They are different things. The former is just a performance issue while the latter determines an upper bound on how many instructions you can pack into the program (given a limit of, say, 128/256 instruction slots)
 
I know what you're talking about.

What I'm saying is that one of the hardware limitations is the max number of loopbacks / pipeline passes. I'm also saying that because of this, maybe some shaders (using too many pipeline passes) won't run even if they fit into the max instruction limit exposed in DX.

Regarding ATI, we can look at the problem from the other side. X800 can handle 512 pipeline passes, but it can do 1536 instructions. However, ATI can't expose 1536 instructions in DX, as the X800 can only run some 1536-instruction shaders (those that are 33% vec3, 33% scalar, 33% texturing).
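A rough sketch of the X800 argument, assuming each pipeline pass can issue at most one vec3 op, one scalar op and one texture op (that per-pass model and the names `passes_needed` and `MAX_PASSES` are illustrative assumptions, and dependencies are ignored):

```python
MAX_PASSES = 512  # pipeline pass limit quoted above for X800


def passes_needed(vec3_ops: int, scalar_ops: int, tex_ops: int) -> int:
    """Passes needed if every pass can take one op of each kind."""
    return max(vec3_ops, scalar_ops, tex_ops)


def fits(vec3_ops: int, scalar_ops: int, tex_ops: int) -> bool:
    return passes_needed(vec3_ops, scalar_ops, tex_ops) <= MAX_PASSES


if __name__ == "__main__":
    # Perfectly balanced 1536-instruction shader: fits in 512 passes.
    print(fits(512, 512, 512))
    # 1536 vec3 instructions and nothing else: needs 1536 passes.
    print(fits(1536, 0, 0))
```

Under this model the same instruction total either fits or fails depending only on the mix, which is why a single "1536 instructions" cap could not be exposed honestly.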


So basically I think that sometimes, especially on NVIDIA GPUs, the max number of instruction slots can conflict with the max number of pipeline passes.
 
Given the support for loop/rep nested up to 4 times in PS3.0, I highly doubt there will be any problems. If there is a limit to pipeline passes, then loop effectively won't work. Just the first level loop allows 256-loops over 512-instructions = 128k insts.
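The arithmetic behind that figure, as a quick sketch. The constants come from the post; `max_executed` is a hypothetical helper, and treating every nesting level as running the full 256 iterations over the full 512 slots is a deliberately generous upper bound.

```python
MAX_ITERATIONS = 256   # loop iteration count quoted in the post
INSTRUCTION_SLOTS = 512
MAX_NESTING = 4        # loop/rep nesting depth quoted for PS3.0


def max_executed(depth: int) -> int:
    """Upper bound on executed instructions with `depth` nested loops."""
    return (MAX_ITERATIONS ** depth) * INSTRUCTION_SLOTS


if __name__ == "__main__":
    one_level = max_executed(1)
    print(one_level, one_level // 1024)  # 131072 instructions, i.e. 128k
```

Even a single loop level executes far more instructions than any plausible fixed cap on pipeline passes would allow.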

Do you have any evidence of this conflict on NV4x?
 
DemoCoder said:
Given the support for loop/rep nested up to 4 times in PS3.0, I highly doubt there will be any problems. If there is a limit to pipeline passes, then loop effectively won't work. Just the first level loop allows 256-loops over 512-instructions = 128k insts.

Do you have any evidence of this conflict on NV4x?

No.

But I'm nearly sure that in the case of RSQ, the API will see it as a 1-slot instruction but the driver/GPU will change it to a 2-slot instruction. With a shader close to the max instruction slot limit, that can be a problem. However, if the hardware can do more than what it exposes, then maybe we'll never see a problem.

I also remember that Microsoft changed something about the instruction slot count because the first way they were counting instructions was incompatible with some hardware. However, I don't fully remember the story. I think the problem was that DP4 was using 2 "slots (as the driver defines it)"/cycles on some hardware. Microsoft asked developers to count DP4 as a 2-slot instruction (slots as the API defines them).
 
Again, I don't see the point of equating slots and cycles. Slots are a DirectX abstraction for limiting the instruction count. This doesn't necessarily have anything to do with instruction cycles. The assumption that HW has a limited number of passes/stages is very architecture specific and an artifact of the way shaders used to be done with register combiners.

TEX instructions take hundreds of cycles. It is throughput that matters.


For all we know, NVidia uses the "slot limit" to tailor the amount of SRAM needed to hold a shader on-chip and has nothing to do with the total number of passes.

MS SDK/DDK merely says that most instructions should execute with a throughput of 1 cycle. It doesn't guarantee it.
 
DemoCoder said:
Again, I don't see the point of equating slots and cycles. Slots are a DirectX abstraction for limiting the instruction count. This doesn't necessarily have anything to do with instruction cycles. The assumption that HW has a limited number of passes/stages is very architecture specific and an artifact of the way shaders used to be done with register combiners.

TEX instructions take hundreds of cycles. It is throughput that matters.


For all we know, NVidia uses the "slot limit" to tailor the amount of SRAM needed to hold a shader on-chip and has nothing to do with the total number of passes.

MS SDK/DDK merely says that most instructions should execute with a throughput of 1 cycle. It doesn't guarantee it.

I'm not saying that 1 slot = 1 cycle. I'm just saying that there is supposed to be a link between the number of slots used and the number of ALU usages (cycles if there is only 1 ALU).

Anyway, I've run some more tests. I think that what I was saying is true for some architectures (R3x0/R420 and some older ones) but not exact for NV3x/4x. However, I'm sure that the API and the driver/hardware don't see a 1-slot instruction running in 2 cycles (ALU usages) the same way, and that in some circumstances it can cause a problem. Of course, if the exposed max number of instruction slots already accounts for this, we will never see a problem.

A shader with ~400 RSQs (plus some other instructions) can run on NV40.
 
zeckensack said:
Xmas said:
LRP in two cycles makes sense, as you need a scalar SUB first, then MUL and MAD.
Not necessarily. You can rewrite a LERP so that it takes only a SUB and a MAD.
Code:
(1-c)*a+c*b
=
a-c*a+c*b
=
c*(b-a)+a

SUB tmp,b,a;
MAD result,c,tmp,a;
Yes that works too, but still takes two cycles. Depending on the "surrounding" instructions, one or the other might be better, because in your example the SUB is a vector operation.
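A quick numerical check that the two forms agree, written as a sketch with one Python statement per shader instruction. The function names are made up for illustration, and the three-op version follows the SUB, MUL, MAD sequence described in the quote.

```python
import math


def lerp_three_ops(a: float, b: float, c: float) -> float:
    one_minus_c = 1.0 - c    # SUB (scalar)
    t = one_minus_c * a      # MUL
    return c * b + t         # MAD


def lerp_two_ops(a: float, b: float, c: float) -> float:
    tmp = b - a              # SUB (vector in the shader version)
    return c * tmp + a       # MAD


if __name__ == "__main__":
    for a, b, c in [(2.0, 10.0, 0.25), (0.0, 1.0, 0.5), (-3.0, 7.0, 0.9)]:
        assert math.isclose(lerp_three_ops(a, b, c), lerp_two_ops(a, b, c))
    print("both forms agree")
```

The results match up to floating-point rounding; the trade-off is only in which units the SUB occupies.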


btw, anyone know the core clock of a Mobile Radeon 9600 Pro Turbo? I'm going to put it under some heavy IPC testing tonight...
 
That's fine. The shaders I ran all gave me a fraction of about 1.3gp/s fillrate, depending on their length.

I didn't have much time unfortunately, and I guess I have to repeat the test with newer drivers. 6.14.10.6392 is the latest version number I can get from Dell, anyone know how to install newer Catalyst drivers on a laptop?

What really surprised me is that add r0, r0, r0 wasn't free in any case. If there are any modifiers at all, I'd expect at least *2.

add, mul and mad all take one cycle, no surprise here. mul and add are not independent units, so you can do

mul r0, v0, c0
add r0, r0, c1
(equals mad, but without the input register limitation)

but not

mul r0, v0, c0
add r1, r0, c1

(and then use both r0 and r1 further on) in one cycle.

rcp also takes one cycle. lrp takes two cycles, but you can only co-issue one instruction, which doesn't make much sense because if you replace it with sub and mad like zeckensack suggested, you can co-issue two instructions in two cycles.
 
I would expect r0 + r0 to get reduced to a x2 modifier by any compiler worth 2 cents. Algebraic identities and strength reductions should be some of the first optimizations detected. Even the case of r0 + c0 with c0 having been def'ed with a value of 2 should be detected. r0 * 1 and r0 + 0 should also be "free", as should r0 ^ 0, and r0 ^ 2 should be reduced to r0 * r0.
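As a sketch of what such a peephole pass might look like (this is not any real driver's compiler; the tuple encoding, the `simplify` name and the `mov_x2` pseudo-op standing for "move with a x2 modifier" are all assumptions for illustration):

```python
def simplify(op, a, b):
    """Return a cheaper equivalent of `a op b`, or the input unchanged.

    `a` is a register name; `b` is a register name or a known numeric constant.
    """
    if op == "add" and a == b:
        return ("mov_x2", a)      # r0 + r0  ->  r0 with x2 modifier
    if op == "add" and b == 0:
        return ("mov", a)         # r0 + 0   ->  r0
    if op == "mul" and b == 1:
        return ("mov", a)         # r0 * 1   ->  r0
    if op == "mul" and b == 2:
        return ("mov_x2", a)      # r0 * 2   ->  r0 with x2 modifier
    if op == "pow" and b == 0:
        return ("const", 1)       # r0 ^ 0   ->  1
    if op == "pow" and b == 2:
        return ("mul", a, a)      # r0 ^ 2   ->  r0 * r0
    return (op, a, b)


if __name__ == "__main__":
    print(simplify("add", "r0", "r0"))
    print(simplify("pow", "r0", 2))
    print(simplify("mul", "r0", "r1"))
```

A real compiler would also have to track which constant registers hold known values (the def'ed c0 case) before rules like these can fire.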
 
DemoCoder said:
I would expect r0 + r0 to get reduced to a x2 modifier by any compiler worth 2 cents. Algebraic identities and strength reductions should be some of the first optimizations detected. Even the case of r0 + c0 with c0 having been def'ed with a value of 2 should be detected. r0 * 1 and r0 + 0 should also be "free", as should r0 ^ 0, and r0 ^ 2 should be reduced to r0 * r0.

ATI's compiler behaves strangely across driver releases. I mean that sometimes it's smarter in an older release.

In driver x.1 it can detect an optimisation possibility
In driver x.2 it can't
In driver x.3 it can
In driver x.4 it can't
...


NVIDIA's compiler is more "constant".

Code:
ps_2_0

dcl v0

def c0, 0.5, 0.25, 0.3, 0.4
def c1, 0.1, 0.2, 0.3, 0.2
def c2, 2.0, 0.0, 0.0, 0.0

mul r0, c0, v0
mul r0, r0, c2.x
mul r0, r0, c1
mul r0, r0, c2.x

mov oC0, r0

This one is of course running in 1 cycle as expected.
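One plausible reading of the 1-cycle result, and it is only an assumption since no compiled output is shown here, is that the driver folds the four multiplies into a single mul by a precomputed constant. A sketch that checks the folding is at least arithmetically valid (the helper names `unfolded` and `folded` are hypothetical):

```python
import math

C0 = (0.5, 0.25, 0.3, 0.4)
C1 = (0.1, 0.2, 0.3, 0.2)
C2_X = 2.0


def unfolded(v0):
    """The shader as written: four dependent multiplies."""
    r0 = [c * v for c, v in zip(C0, v0)]    # mul r0, c0, v0
    r0 = [r * C2_X for r in r0]             # mul r0, r0, c2.x
    r0 = [r * c for r, c in zip(r0, C1)]    # mul r0, r0, c1
    r0 = [r * C2_X for r in r0]             # mul r0, r0, c2.x
    return r0


# All constants combined ahead of time: c0 * c1 * c2.x * c2.x per component.
FOLDED = [a * b * C2_X * C2_X for a, b in zip(C0, C1)]


def folded(v0):
    """One multiply by the precomputed constant."""
    return [k * v for k, v in zip(FOLDED, v0)]


if __name__ == "__main__":
    v0 = (1.0, 0.5, 0.25, 0.75)
    assert all(math.isclose(x, y) for x, y in zip(unfolded(v0), folded(v0)))
    print("folded and unfolded results agree")
```

The two versions can differ in the last bits because floating-point multiplication is not associative, which is one reason a compiler might be conservative about this kind of folding.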
 