Very informative interview on NVIDIA

DemoCoder said:
It's b. You can't hide the branch latency.
Depends on how branching is implemented. CPUs have branch prediction algorithms, which in most situations hide latency quite well. But I doubt that GPUs are already that advanced.
 
I felt the same way when I first found out, but it just means you have to tread carefully when using them. It's probably a lot easier than dealing with vertex texture fetch latency!
 
Any word on the possibility of branch prediction in NV40's pixel pipes like sonix666 mentioned? I remember an Anandtech GeForce FX review which mentioned NVIDIA's branch prediction ability in relation to ATI's. Here's what I'm referring to:

Anandtech said:
Whereas in the CPU world you need to be able to predict branches with ~95% accuracy, the requirements are nowhere near as stringent in the GPU world. NVIDIA insists that their branch predictor is significantly more accurate/efficient than ATI's, however it is difficult to back up those claims with hard numbers.
 
any idea why the "endif" also takes 2 cycles? (that's just bizarre). Also was the "if add else add endif" shader running on the VS or the PS?
 
psurge said:
any idea why the "endif" also takes 2 cycles? (that's just bizarre). Also was the "if add else add endif" shader running on the VS or the PS?

Yes this seems to imply a solution that is not at all like a conventional CPU.

I had considered a few solutions that might show behaviour like this; they mostly boil down to the body of the if not being contiguous in memory with the surrounding code, i.e. it effectively requires two jumps for the if, not one.
 
ERP, well the if would be a single "not taken" branch plus an unconditional branch to get past the else body, but...

Edit: ignore above I see what you mean now.

Speculation:

The 2-cycle branch penalty would appear to indicate that a branch (conditional or not) cannot co-issue with any other instructions (1 cycle), and that the execution units are stalled for 1 cycle after the new IP value is known (for the next instruction fetch).

Maybe the endif penalty comes from a pipeline flush - i.e. making sure that any in-flight instruction has written its results back to the register file?
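The pipeline-flush idea can be sketched with a toy model: assume a 2-deep execution pipe, and have "endif" stall until the most recent ALU result has retired. The pipe depth and costs here are illustrative assumptions, not measured NV40 behaviour.

```python
# Toy model of the pipeline-flush hypothesis: "endif" waits until the
# last in-flight ALU instruction has written back its result.
# PIPE_DEPTH is an assumed writeback latency, purely for illustration.
PIPE_DEPTH = 2

def flush_cost(cycles_since_last_alu):
    # If the most recent ALU op issued c cycles ago, it still needs
    # PIPE_DEPTH - c cycles to retire; endif stalls that long.
    return max(0, PIPE_DEPTH - cycles_since_last_alu)

print(flush_cost(0))  # add immediately before endif -> 2-cycle endif
print(flush_cost(2))  # result already retired -> endif is free
```

Under this model the observed 2-cycle endif would only show up when an ALU op sits right before it, which would be testable by padding the if body.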
 
psurge said:
any idea why the "endif" also takes 2 cycles? (that's just bizarre). Also was the "if add else add endif" shader running on the VS or the PS?
It was run in the pixel shader, with the 61.11 driver and the DX9.0c beta2 runtime.
 
For the slow and less technically inclined amongst us, would someone explain what the big revelation is?

LW.
 
sonix666 said:
Depends on how branching is implemented. CPUs have branch prediction algorithms, which in most situations hide latency quite well. But I doubt that GPUs are already that advanced.
While it does depend upon how the branch is implemented, branch prediction is not strictly necessary for hiding branch latency.

That is, if you have some calculations before the branch that do not affect the branch, it is conceivable that the hardware could execute those to take care of branch latency. I don't know if this is possible on the NV4x, though.

As a side note, it may be that the 2-cycle latency is the minimum branching latency on the NV40, and that there is actually more latency there that can be hidden when there are more than a couple of instructions. I think that may be the best explanation of the performance of the "if add else add endif" program, as one would expect it to take about 5 clocks (if (2) add (1) endif (2), or else (2) add (1) endif (2)). A longer program might get closer to the 2-cycle latency (this result seems to imply a 4-cycle latency).
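Working through that arithmetic explicitly (the per-op costs are the assumed 2-cycles-per-flow-control figure from this thread, not measured numbers):

```python
# Each executed path through "if add else add endif", assuming flow
# control costs 2 cycles per op and an ALU add costs 1 cycle.
COST = {"if": 2, "else": 2, "add": 1, "endif": 2}

true_path = ["if", "add", "endif"]      # condition true
false_path = ["else", "add", "endif"]   # condition false

for path in (true_path, false_path):
    print(sum(COST[op] for op in path))  # 5 for either path
```

So both paths come out to ~5 clocks under these assumptions, which is why a measured result noticeably above that would point to extra hidden latency.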
 
psurge said:
any idea why the "endif" also takes 2 cycles? (that's just bizarre). Also was the "if add else add endif" shader running on the VS or the PS?
If all branches, whether conditional or not, are handled by a simple goto statement, then endif could well force a branch. That is, the "endif" signifies returning to the rest of the program.
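That goto-lowering reading can be sketched as a toy program layout. The lowering below (conditional goto into the else body, with each body ending in a goto past the else) is a guess for illustration, not how NVIDIA's compiler actually works:

```python
# Hypothetical lowering of "if add else add endif" to plain gotos.
program = [
    ("if_goto_else", 3),        # 0: condition false -> jump to else body
    ("add", None),              # 1: then body
    ("goto", 5),                # 2: "endif" on the then path
    ("add", None),              # 3: else body
    ("goto", 5),                # 4: "endif" on the else path
    ("rest_of_shader", None),   # 5
]

def branches_executed(cond):
    """Count branch ops executed along one path; each one would cost
    cycles whether or not it is taken."""
    pc, branches = 0, 0
    while program[pc][0] != "rest_of_shader":
        op, target = program[pc]
        if op in ("if_goto_else", "goto"):
            branches += 1
            if op == "goto" or not cond:
                pc = target
                continue
        pc += 1
    return branches

print(branches_executed(True), branches_executed(False))  # 2 2
```

Either path executes two branch ops, and the second one is the "endif" goto - which would explain why endif shows its own 2-cycle cost.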
 