very informative interview on nVIDIA

991060 · May 13, 2004

http://www.nvnews.net/articles/geforce_6_interview/

ChrisRay · May 13, 2004

*has a little sadistic grin on my face knowing something cool has happened*

991060 · May 13, 2004

Apart from them dodging the question about Far Cry, the rest of the interview is simply spectacular.

They claimed each dynamic jump in NV40 costs 2 clocks. How does this square with the suggestion to use FP16 as often as possible, if the cost is constant regardless of register usage?

And what is the difference between SIMD and MIMD in nature? Why is the pixel shader unit in NV40 still SIMD even though it supports true dynamic branching?

DemoCoder · May 13, 2004

Because there are still register limitations; it's just that the limits have increased to the point where, for the average case, FP32 will run at full speed. There are still scenarios where FP16 will run faster, but it's not like it was on the NV3x, where you *had* to use FP16 and the penalty for going over 2 FP32 registers was a 50% reduction in throughput.

Their suggestion is more a rule of thumb. Use only what you need.

ERP · May 13, 2004

Actually they claimed each dynamic branch instruction has a latency of 2 cycles....

So IF ENDIF is 4 cycles and IF ELSE ENDIF is 6.

But the important word here is latency, does this mean that the right selection of instructions can absorb that latency?

It would be interesting to see some timings with a lot of ops between the IF ELSE ENDIF clauses, to see what the actual timing impact of the instructions is when there are ops to absorb the latency.

Luminescent · May 13, 2004

I don't recall reading an answer to this in any reviews, so I'll ask: how does NV40's theoretical fillrate change when the FP framebuffer is used? Does it remain the same, bandwidth limitations aside? What about the fillrate with FP texture filtering?

ERP, the way the interview response was worded leads me to believe that 2 cycles is the actual execution latency of a conditional, although the hardware could mask this by issuing 1 instruction per clock.

991060 · May 13, 2004

DemoCoder said:
Because there are still register limitations; it's just that the limits have increased to the point where, for the average case, FP32 will run at full speed. There are still scenarios where FP16 will run faster, but it's not like it was on the NV3x, where you *had* to use FP16 and the penalty for going over 2 FP32 registers was a 50% reduction in throughput.

Their suggestion is more a rule of thumb. Use only what you need.

Yes, I know NV40 still has the problem in general. My question is how register usage affects the dynamic branching penalty alone. Are there cases where a branch that uses more registers causes a longer pipeline stall?

991060 · May 13, 2004

ERP said:
But the important word here is latency, does this mean that the right selection of instructions can absorb that latency?

Not likely, according to the test I did a few days ago. I used a very simple routine:
if
add
else
add
endif

And the result shows that no matter which branch is taken, the whole shader always finishes in 9 cycles, which means the penalty is 8 clocks, 2 clocks higher than what's suggested. And I didn't find any clue that the latency can be hidden (I could be wrong of course, it was a rash test).

Another thing worth mentioning: static branching doesn't seem to be free either. It has a latency of 1 clock, but that can be hidden by subsequent instructions. This was observed under 61.11, so could it be a driver problem?
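
As a rough sanity check on the figures above, here is a minimal Python sketch of the cycle arithmetic. It assumes ERP's reading of the interview (2 cycles for each IF, ELSE, or ENDIF) and 1 cycle for every other op; the function name and the one-cycle-per-op cost are illustrative assumptions, not anything NVIDIA has stated.

```python
# Assumption: each branch instruction costs a flat 2 cycles (the interview
# figure as read by ERP) and every other instruction costs 1 cycle.
BRANCH_LATENCY = 2
BRANCH_OPS = {"if", "else", "endif"}


def branch_overhead(routine):
    """Cycles spent on branch instructions alone for a list of op names."""
    return BRANCH_LATENCY * sum(1 for op in routine if op in BRANCH_OPS)


# The test routine from the post above: one add in each clause.
routine = ["if", "add", "else", "add", "endif"]

overhead = branch_overhead(routine)  # 3 branch instructions * 2 cycles
predicted = overhead + 1             # only one add executes on either path
measured = 9                         # cycle count reported in the post above

print(overhead, predicted, measured - predicted)  # prints: 6 7 2
```

Under these assumptions the routine should take 7 cycles, so the reported 9 leaves 2 cycles that the flat model does not account for.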

Evildeus · May 13, 2004

Yeah it was interesting, a little change from pure PR interviews.

DemoCoder · May 13, 2004

991060,
I have heard there are problems with the driver compiler. Your shader should take 6 + 1 = 7 cycles.

But yes, it is true, you can't "pipeline" the latency away. Therefore, you only choose to use dynamic branches if the expected cost of executing the most common branch clause + 6 cycles is less than the time it takes to execute both clauses (say, using predicates). If you use texture loads inside the branches, it's much harder to predict.
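
The break-even rule described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the 6-cycle overhead is the IF + ELSE + ENDIF figure from this thread, clause costs are given in cycles, every pixel in a batch is assumed to take the same path, and the function names are made up for the example.

```python
# Assumption: fixed overhead of 6 cycles for IF + ELSE + ENDIF, as quoted
# in the thread. Texture loads inside clauses are not modelled.
BRANCH_OVERHEAD = 6


def dynamic_branch_cost(p_if, if_cycles, else_cycles, overhead=BRANCH_OVERHEAD):
    """Expected cycles when only the taken clause runs, plus fixed overhead.

    p_if is the probability that the IF clause is the one taken.
    """
    return p_if * if_cycles + (1.0 - p_if) * else_cycles + overhead


def predicated_cost(if_cycles, else_cycles):
    """Cycles when both clauses always run and a predicate picks the result."""
    return if_cycles + else_cycles


def prefer_dynamic_branch(p_if, if_cycles, else_cycles):
    """True when branching is expected to beat running both clauses."""
    return dynamic_branch_cost(p_if, if_cycles, else_cycles) < predicated_cost(
        if_cycles, else_cycles
    )


# One add per clause, as in the earlier test: 7 cycles vs 2, so predicate.
print(prefer_dynamic_branch(0.5, 1, 1))   # prints: False
# A 40-cycle clause that is skipped 90% of the time: branching pays off.
print(prefer_dynamic_branch(0.1, 40, 2))  # prints: True
```

With short clauses the fixed overhead dominates, which is why a one-instruction if/else is a poor candidate for a dynamic branch.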

Hyp-X · May 13, 2004

DemoCoder said:
If you use texture loads inside the branches, it's much harder to predict.

AFAIK you can't do that on the NV40 ...

Anyway, I think future driver versions will replace short if/then/endif structures with predication if that would be faster even assuming coherent execution. They'll need that for HLSL output support.

Dave Baumann · May 13, 2004

Luminescent said:
I don't recall reading an answer to this in any reviews, so I'll ask: how does NV40's theoretical fillrate change when the FP framebuffer is used? Does it remain the same, bandwidth limitations aside? What about the fillrate with FP texture filtering?

I pressed Kirk for this at the editors' day and he was reluctant to confirm that it was halved; however, he wouldn't say it wasn't either. The best I got out of him in the end was that "we always aim to balance fill-rate and bandwidth", which kinda confirms to me that it is halved.

However, I don't necessarily think that pertains to shader rate, merely ROP output fill-rate.

demonic · May 13, 2004

ChrisRay said:
*has a little sadistic grin on my face knowing something cool has happened*

for the nvidiots or the rest of us?

DemoCoder · May 13, 2004

Hyp-X said:
DemoCoder said:

If you use texture loads inside the branches, it's much harder to predict.

AFAIK you can't do that on the NV40 ...

I've heard different. Besides, PS3.0 demands it, so either the driver has to hoist the loads out of branches (in the presence of deep nesting, function calls, and loops) or it must support the load inside the branch itself.

Xmas · May 13, 2004

MS removed texture loads inside branches from the spec.

Excellent interview.

DemoCoder · May 13, 2004

Where's this specified? In the DDK or SDK? In the latest SDK beta, I can't find any mention of this.

ChrisRay · May 13, 2004

demonic said:
ChrisRay said:

*has a little sadistic grin on my face knowing something cool has happened*

for the nvidiots or the rest of us?

Naw for me.

I can't be particular, but I played a small hand ^^

Luminescent · May 13, 2004

I thought it was cool, Chris

Tridam · May 13, 2004

Xmas said:
MS removed texture loads inside branches from the spec.

Excellent interview.

AFAIK MS didn't remove texture loads, only dependent texture loads.

Simon F · May 13, 2004

ERP said:
Actually they claimed each dynamic branch instruction has a latency of 2 cycles....

So IF ENDIF is 4 cycles and IF ELSE ENDIF is 6.

But the important word here is latency, does this mean that the right selection of instructions can absorb that latency?

It would be interesting to see some timings with a lot of ops between the IF ELSE ENDIF clauses, to see what the actual timing impact of the instructions is when there are ops to absorb the latency.

I can think of two possible interpretations:
a) Their system has branch delay slots (e.g. as on MIPS or Texas Instruments' C30 DSP), where you execute the branch instruction but the system runs the next "N" instructions before the branch takes place, OR
b) it just burns cycles, so only use it if the gain from culling out non-executed instructions is greater than the cost of the dynamic branching <shrug>
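
The two interpretations can be contrasted with a toy cost model in Python. This is purely a sketch: nothing in the interview says which behaviour NV40 has, and the function name, its parameters, and the idea that each useful op fills exactly one slot are assumptions made for illustration.

```python
def effective_branch_cost(latency, useful_ops_available, has_delay_slots):
    """Cycles a branch instruction costs beyond the useful work it overlaps.

    has_delay_slots=True models interpretation (a): instructions issued
    right after the branch still execute, so each useful one hides a cycle.
    has_delay_slots=False models interpretation (b): the cycles are burned.
    """
    if has_delay_slots:
        filled = min(useful_ops_available, latency)
        return latency - filled
    return latency


# Interpretation (a): two independent ops fully absorb a 2-cycle latency.
print(effective_branch_cost(2, 2, has_delay_slots=True))   # prints: 0
# Interpretation (a) with nothing to schedule: the full 2 cycles are lost.
print(effective_branch_cost(2, 0, has_delay_slots=True))   # prints: 2
# Interpretation (b): extra ops never help, the cost stays at 2 cycles.
print(effective_branch_cost(2, 5, has_delay_slots=False))  # prints: 2
```

A timing test with many ops around the branch instructions, as ERP suggested, would tell the two apart: under (a) the measured overhead should shrink as ops are added, under (b) it should stay constant.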
