very informative interview on nVIDIA

*has a little sadistic grin on my face knowing something cool has happened* ;)
 
Apart from them dodging the question about Far Cry, the other parts of the interview are simply spectacular.

They claimed each dynamic jump in NV40 costs 2 clocks. How does this agree with the suggestion to use FP16 as often as possible, if the cost is constant regardless of register usage?

And what is the difference between SIMD and MIMD in nature? Why is the pixel shader unit in NV40 still SIMD even though it supports true dynamic branching?
 
Because there are still register limitations; it's just that the limits have increased to the point where, for the average case, FP32 will run at full speed. There are still scenarios where FP16 will run faster, but it's not like it was on the NV3x, where you *had* to use FP16 and the penalty for going over 2 FP32 registers was a 50% reduction in throughput.

Their suggestion is more a rule of thumb. Use only what you need.
 
Actually they claimed each dynamic branch instruction has a latency of 2 cycles....

So IF ENDIF is 4 cycles and IF ELSE ENDIF is 6.
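
To spell out that arithmetic, here is a minimal Python sketch. It assumes the 2-cycle figure applies to each of the IF, ELSE and ENDIF instructions and that the costs simply add up; the names `BRANCH_INSTR_LATENCY` and `branch_overhead` are made up for illustration, and nothing here is measured on hardware.

```python
# Assumption: every dynamic branch instruction (IF, ELSE, ENDIF) costs 2 cycles
# and the costs add linearly. Hypothetical names, not taken from any NVIDIA document.
BRANCH_INSTR_LATENCY = 2


def branch_overhead(has_else: bool) -> int:
    """Cycles spent on the branch instructions alone, excluding the clause bodies."""
    instrs = ["IF", "ELSE", "ENDIF"] if has_else else ["IF", "ENDIF"]
    return BRANCH_INSTR_LATENCY * len(instrs)


print(branch_overhead(has_else=False))  # overhead of IF ... ENDIF
print(branch_overhead(has_else=True))   # overhead of IF ... ELSE ... ENDIF
```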

But the important word here is latency. Does this mean that the right selection of instructions can absorb that latency?

It would be interesting to see some timings with a lot of ops between the IF ELSE ENDIF clauses, to see what the actual timing impact of the instructions is when there are ops to absorb the latency.
 
I don't recall reading an answer to this in any reviews, so I'll ask: how does NV40's theoretical fillrate change when the FP framebuffer is used? Does it remain the same, bandwidth limitations aside? What about the fillrate with FP texture filtering?

ERP, the way the interview response was worded leads me to believe that 2 cycles is the actual execution latency of a conditional, although the hardware could mask this by issuing 1 instruction per clock.
 
DemoCoder said:
Because there are still register limitations; it's just that the limits have increased to the point where, for the average case, FP32 will run at full speed. There are still scenarios where FP16 will run faster, but it's not like it was on the NV3x, where you *had* to use FP16 and the penalty for going over 2 FP32 registers was a 50% reduction in throughput.

Their suggestion is more a rule of thumb. Use only what you need.

Yes, I know NV40 still has the problem in general. My question is how register usage affects the dynamic branching penalty alone. Are there cases where a branch that uses more registers causes a longer pipeline stall?
 
ERP said:
But the important word here is latency. Does this mean that the right selection of instructions can absorb that latency?

Not likely, according to a test I did a few days ago. I used a very simple routine:
if
add
else
add
endif

The result shows that no matter which branch it takes, the whole shader always finishes in 9 cycles, which means the penalty is 8 clocks, 2 clocks higher than what's suggested. I found no sign that the latency can be hidden (I could be wrong of course, it was a hasty test).

Another thing worth mentioning is that static branching does not seem to be free: it has a latency of 1 clock, but that can be hidden by subsequent instructions. This was observed under 61.11, so could it be a driver problem?
 
991060,
I have heard there are problems with the driver compiler. Your shader should take 6 + 1 = 7 cycles.

But yes, it is true, you can't "pipeline" the latency away. Therefore, you should only choose dynamic branches if the expected cost of executing the most common branch clause + 6 cycles is less than the time it takes to execute both clauses (say, using predicates). If you use texture loads inside the branches, it's much harder to predict.
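
As a rough illustration of that rule of thumb, here is a minimal Python sketch. It assumes a fixed 6-cycle overhead for IF/ELSE/ENDIF, clause costs that are known in cycles, and fully coherent execution (every pixel in a batch takes the same path); the function names and the probability parameter `p_then` are hypothetical, and texture loads are not modelled at all.

```python
# Sketch of the cost comparison under stated assumptions; not a model of real NV40 timing.
def dynamic_branch_cost(p_then, then_cycles, else_cycles, overhead=6):
    """Expected cycles with a dynamic branch: only one clause runs, plus branch overhead."""
    return p_then * then_cycles + (1.0 - p_then) * else_cycles + overhead


def predicated_cost(then_cycles, else_cycles):
    """Cycles when both clauses are executed and the result is selected by predicate."""
    return then_cycles + else_cycles


def prefer_dynamic_branch(p_then, then_cycles, else_cycles, overhead=6):
    """True when the expected branching cost beats executing both clauses."""
    return dynamic_branch_cost(p_then, then_cycles, else_cycles, overhead) < predicated_cost(
        then_cycles, else_cycles
    )


# One ADD per clause, as in the test shader above: branching is not worth it.
print(prefer_dynamic_branch(0.5, 1, 1))
# Two long clauses of 20 cycles each: skipping one of them pays for the overhead.
print(prefer_dynamic_branch(0.5, 20, 20))
```

With one ADD per clause the branching version comes out at 6 + 1 = 7 cycles against 2 for executing both, so the break-even point only arrives once the skipped clause is longer than the overhead.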
 
DemoCoder said:
If you use texture loads inside the branches, it's much harder to predict.

AFAIK you can't do that on the NV40 ...

Anyway, I think future driver versions will replace short if/then/endif structures with predication if that will be faster even assuming coherent execution. They'll need that for HLSL output support.
 
Luminescent said:
I don't recall reading an answer to this in any reviews, so I'll ask: how does NV40's theoretical fillrate change when the FP framebuffer is used? Does it remain the same, bandwidth limitations aside? What about the fillrate with FP texture filtering?

I pressed Kirk on this at the editors' day and he was reluctant to confirm that it was halved; however, he wouldn't say it wasn't either. The best I got out of him in the end was that "we always aim to balance fill-rate and bandwidth", which kinda confirms to me that it is halved.

However, I don't necessarily think that pertains to shader rate, merely ROP output fill-rate.
 
Hyp-X said:
DemoCoder said:
If you use texture loads inside the branches, it's much harder to predict.

AFAIK you can't do that on the NV40 ...

I've heard different. Besides, PS3.0 demands it, so either the driver has to hoist the loads out of branches (in the presence of deep nesting, function calls, and loops) or it must support the load inside the branch itself.
 
ERP said:
Actually they claimed each dynamic branch instruction has a latency of 2 cycles....

So IF ENDIF is 4 cycles and IF ELSE ENDIF is 6.

But the important word here is latency. Does this mean that the right selection of instructions can absorb that latency?

It would be interesting to see some timings with a lot of ops between the IF ELSE ENDIF clauses, to see what the actual timing impact of the instructions is when there are ops to absorb the latency.

I can think of two possible interpretations:
a) Their system has branch delay slots (e.g. as on MIPS or the Texas Instruments C30 DSP), where you execute the branch instruction but the system runs the next "N" instructions before the branch takes place, OR
b) it just burns cycles, so only use it if the gain from culling out non-executed instructions is greater than the cost of the dynamic branching <shrug>
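
To make the difference between (a) and (b) concrete, here is a toy Python model. It is purely an assumption-laden sketch: it takes one cycle per useful op and 2 latency cycles per branch instruction, and the function and parameter names are invented for illustration. It says nothing about which interpretation the NV40 actually follows.

```python
# Toy cycle counts for the two interpretations; all names and numbers are hypothetical.
def cycles_with_delay_slots(useful_ops, fillable_ops, branch_instrs, latency=2):
    """Interpretation (a): each branch exposes `latency` slots that useful ops may fill.

    `fillable_ops` is how many of the `useful_ops` can legally be moved into those slots;
    any slot left unfilled is a wasted cycle.
    """
    slots = branch_instrs * latency
    wasted = max(0, slots - fillable_ops)
    return useful_ops + wasted


def cycles_with_stall(useful_ops, branch_instrs, latency=2):
    """Interpretation (b): every branch instruction simply burns `latency` cycles."""
    return useful_ops + branch_instrs * latency


# IF / ELSE / ENDIF (3 branch instructions) around 10 useful ops.
print(cycles_with_delay_slots(10, fillable_ops=6, branch_instrs=3))  # all slots filled
print(cycles_with_delay_slots(10, fillable_ops=0, branch_instrs=3))  # nothing to fill with
print(cycles_with_stall(10, branch_instrs=3))
```

Under (a) the overhead shrinks as more independent ops become available to fill the slots, while under (b) it stays fixed, which is the distinction a timing test with many ops inside the clauses would expose.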
 