Shader Perf panel of NVIDIA FX Composer 1.0

moichi

Newcomer
A very informative tool for learning about the CineFX architecture.
http://developer.nvidia.com/object/fx_composer_home.html

For example, it seems that NV35/NV36/NV38's combiners only support fp16 multiply-add.
No fp32 support.


Sample source code:

PixelShader = asm
{
ps_2_x
dcl t0
dcl t1
dcl_2d s0
dcl_2d s1
def c0, 0.1, 0.2, 0.3, 0.4

texld r0, t0, s0 // tex unit
texld r1, t1, s1 // tex unit
mad_pp r0, r0, r1, c0 // combiner stage 1 (fp16) or shader core (fp32)
mad_pp r0, r0, r1, c0 // combiner stage 2 (fp16) or shader core (fp32)

mov oC0, r0 // will be removed by the unified compiler
};

Shader Perf Panel results:
****************************************
Target: GeForceFX 5900 (NV35) :: Unified Compiler: v56.58
Cycles: 1 :: # R Registers: 2
GPU Utilization: 100.00%
****************************************
Target: GeForceFX 5700 (NV36) :: Unified Compiler: v56.58
Cycles: 1 :: # R Registers: 2
GPU Utilization: 100.00%
****************************************


If I change "mad_pp" (fp16) to "mad" (fp32):

Shader Perf Panel results:
****************************************
Target: GeForceFX 5900 (NV35) :: Unified Compiler: v56.58
Cycles: 3 :: # R Registers: 2
GPU Utilization: 50.00%
****************************************
Target: GeForceFX 5700 (NV36) :: Unified Compiler: v56.58
Cycles: 3 :: # R Registers: 2
GPU Utilization: 50.00%
****************************************

It seems that the mad instruction is allocated to the shader core, so it needs 3 cycles.
(GPU utilization is 50.00% perhaps because the combiners go unused.)
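For what it's worth, the reported numbers fit a toy model of the pipeline. The unit assignments here (texture fetches plus fp32 ALU ops in the shader core, up to two fp16 _pp ops per cycle in the combiner stages) are just my guesses from the numbers above, not anything documented:

```python
def estimate_cycles(instrs):
    """Toy NV35 cycle estimate. instrs: list of (opcode, is_pp) tuples.

    Guessed model: the shader core issues up to two texld per cycle plus
    one fp32 ALU op per cycle; the two fp16 combiner stages handle up to
    two _pp ALU ops per cycle in parallel with the core.
    """
    tex  = sum(1 for op, pp in instrs if op == "texld")
    fp32 = sum(1 for op, pp in instrs if op != "texld" and not pp)
    fp16 = sum(1 for op, pp in instrs if op != "texld" and pp)
    core_cycles     = max(1, (tex + 1) // 2) + fp32  # texture fetch + fp32 ALU
    combiner_cycles = (fp16 + 1) // 2                # two fp16 stages per cycle
    return max(core_cycles, combiner_cycles, 1)

# the two shaders from the Shader Perf runs (mov oC0 dropped by the compiler)
shader_pp = [("texld", False), ("texld", False), ("mad", True), ("mad", True)]
shader_fp = [("texld", False), ("texld", False), ("mad", False), ("mad", False)]
print(estimate_cycles(shader_pp))  # 1, matching the mad_pp report
print(estimate_cycles(shader_fp))  # 3, matching the fp32 mad report
```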
 
Huh. That's the way I had thought that they worked originally, but I ended up seeing some other benchmarks that seem to refute this.

What happens when you do muls or adds?
 
According to DaveBaumann's information, given a little while back, the fp16 combiners in the NV35 can combine resources for a single fp32 op.
 
It seems that fp32 mad/mul/add instructions can't be allocated to the combiners in FX Composer's emulation driver.
 
First we need to verify whether the GPU utilization number given by FX Composer corresponds to real-world performance.
 
991060 said:
First we need to verify whether the GPU utilization number given by FX Composer corresponds to real-world performance.
Yes, that would definitely be good. A good first step would be to test if there's a performance hit for using too many FP registers.
 
Chalnoth said:
Yes, that would definitely be good. A good first step would be to test if there's a performance hit for using too many FP registers.

I still don't get why

texld r0, t0, s0 // instr 1
texld r1, t1, s1 // instr2
mad_pp r0, r0, r1, c0 // instr3
mad_pp r0, r0, r1, c0 // instr4

can be executed in 1 cycle (according to FX Composer's report). Doesn't instr4 need to wait until instr3 finishes? I see an output-input dependency between them.
 
A slight modification gives me more confusion.

texld r0, t0, s0
texld r1, t1, s1
mad r0, r0, r1, r0
mad r1, r0, r1, r1
mov oC0, r1

needs 4 cycles and 4 R# on NV35, or 3 cycles and 2 R# on NV30.

And

texld r0, t0, s0
texld r1, t1, s1
mad r0, r0, r1, r1
mad r1, r0, r1, r0
mov oC0, r1

is reported to require 3 cycles and 2 R# on all NV3X platforms.
 
991060 said:
doesn't instr4 need to wait until instr3 finishes?

It's throughput, not latency, that matters.
Remember that the GPU is heavily "multithreaded", so instruction dependency isn't necessarily an issue.
 
Hyp-X said:
It's throughput, not latency, that matters.
Remember that the GPU is heavily "multithreaded", so instruction dependency isn't necessarily an issue.

That's true if you have a comparatively long shader to execute, let's say at least dozens of instructions. The key is to keep the ALU as busy as possible, while in this specific case there's not much work to feed the ALU, hence a performance hit.

Well, I may be totally wrong on this, just my two cents above, please feel free to correct me anytime. :D
 
991060 said:
That's true if you have a comparatively long shader to execute, let's say at least dozens of instructions. The key is to keep the ALU as busy as possible, while in this specific case there's not much work to feed the ALU, hence a performance hit.

Well, I may be totally wrong on this, just my two cents above, please feel free to correct me anytime. :D

As Hyp-X said, the ALU has dozens of threads (color fragments) to process. By the time the second instruction in any thread needs to be computed, the first instruction from that thread is long finished, since the ALU had to process the first instruction for (let's say) 100 other threads. It doesn't matter whether there are two instructions or 2000.
 
Doesn't the ALU only work on a single fragment at a time? :oops: That's what I've believed for a long time. :oops: If it works the way you and Hyp-X suggest, I admit there's no such dependency penalty (at least not as severe as I thought).
 
Hyp-X said:
991060 said:
doesn't instr4 need to wait until instr3 finishes?

It's throughput, not latency, that matters.
Remember that the GPU is heavily "multithreaded", so instruction dependency isn't necessarily an issue.

But that is a data dependency issue. You can have OOOe (register renaming does not destroy dependency), MT, whatever, but that only means that while that instruction is stalling (and it is stalling until the result of instruction 3 is ready to be forwarded to instruction 4) the processor is doing something else.

If you had double-pumped ALUs you could issue dependent instructions and have the result produced in a single external clock cycle.
 
991060 said:
Doesn't the ALU only work on a single fragment at a time? :oops: That's what I've believed for a long time. :oops: If it works the way you and Hyp-X suggest, I admit there's no such dependency penalty (at least not as severe as I thought).

The shader core, and the two register combiners can work on three different threads so there's no dependency.

OTOH, the register combiners can execute two non-dependent multiplies as a single instruction so if you have a series of MUL's there's gonna be a performance difference caused by dependency.
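That pairing behavior can be sketched as a tiny scheduler. The co-issue rule here (two MULs share a cycle only when the second doesn't read the first's destination) is my reading of the claim above, not a documented rule:

```python
def mul_cycles(muls):
    """Guessed combiner pairing: two MULs co-issue in one cycle only if the
    second doesn't read the first's destination. muls: (dst, srcA, srcB)."""
    cycles, i = 0, 0
    while i < len(muls):
        if i + 1 < len(muls) and muls[i][0] not in muls[i + 1][1:]:
            i += 2  # independent pair co-issues
        else:
            i += 1  # dependent MUL issues alone
        cycles += 1
    return cycles

dependent   = [("r0", "r0", "r1"), ("r0", "r0", "r1")]  # chain: 2nd reads r0
independent = [("r0", "r2", "r3"), ("r1", "r4", "r5")]  # no shared registers
print(mul_cycles(dependent))    # 2
print(mul_cycles(independent))  # 1
```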
 
Panajev2001a said:
But that is a data dependency issue. You can have OOOe (register renaming does not destroy dependency), MT, whatever, but that only means that while that instruction is stalling (and it is stalling until the result of instruction 3 is ready to be forwarded to instruction 4) the processor is doing something else.

If you had double-pumped ALUs you could issue dependent instructions and have the result produced in a single external clock cycle.

Let's presume, for example, that the register combiners have a latency of 3 cycles.
Then execute the two dependent instructions as follows:

Code:
            | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11| 12| 13|
Combiner 1   [ pixel 0 ] [ pixel 3 ] [ pixel 6 ] ...
                 [ pixel 1 ] [ pixel 4 ] [ pixel 7 ] ...
                     [ pixel 2 ] [ pixel 5 ] [ pixel 8 ] ...
Combiner 2               [ pixel 0 ] [ pixel 3 ] [ pixel 6 ] ...
                             [ pixel 1 ] [ pixel 4 ] [ pixel 7 ] ...
                                 [ pixel 2 ] [ pixel 5 ] [ pixel 8 ] ...

As you can see, a pixel is finished every clock.
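The diagram can also be written out numerically. The latency of 3 and the two dependent combiner stages are just the assumed values from above:

```python
def completion_cycles(num_pixels, latency=3, stages=2):
    """Pixel i enters combiner 1 at cycle i; each of the `stages` dependent
    instructions adds `latency` cycles before its result is available."""
    return [i + stages * latency for i in range(num_pixels)]

times = completion_cycles(6)
print(times)  # [6, 7, 8, 9, 10, 11]: once the pipe fills, one pixel per clock
```

The gap between consecutive completions is 1 cycle, which is the point: the 3-cycle latency only costs you once, at pipeline fill.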
 
Very, very interesting, thanks for the information, Hyp-X.

What about temp register usage? Does each thread have dedicated register file space, or do they share a common one?
 
991060 said:
Very, very interesting, thanks for the information, Hyp-X.

What about temp register usage? Does each thread have dedicated register file space, or do they share a common one?

There's probably a big register file divided amongst the threads.
So more registers -> fewer threads -> more stalls -> lower performance.
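That chain of arrows can be made concrete with a back-of-the-envelope model. All the numbers here (register file size, ALU latency) are made up for illustration; only the shape of the trade-off is the point:

```python
def est_throughput(regfile_regs, regs_per_thread, alu_latency):
    """Guessed model: a shared register file caps the threads in flight;
    full throughput needs at least `alu_latency` threads to hide latency."""
    threads = regfile_regs // max(1, regs_per_thread)
    return min(1.0, threads / alu_latency)

print(est_throughput(256, 2, 8))   # 1.0: plenty of threads, latency hidden
print(est_throughput(256, 64, 8))  # 0.5: only 4 threads fit, pipeline stalls
```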
 
Hyp-X said:
991060 said:
Very, very interesting, thanks for the information, Hyp-X.

What about temp register usage? Does each thread have dedicated register file space, or do they share a common one?

There's probably a big register file divided amongst the threads.
So more registers -> fewer threads -> more stalls -> lower performance.
Actually I think that register space is implicit in the pipeline. Registers travel in the pipeline; the physical registers are the pipeline. Each pipeline stage is at least a register space stage.

I presume that a register stage has a fixed number of registers across the pipeline. If you need more registers, then you'll use two or more pipeline stages to store them -> perf drop.
 
Tridam said:
I presume that a register stage has a fixed number of registers across the pipeline. If you need more registers, then you'll use two or more pipeline stages to store them -> perf drop.

Then how do you execute an operation between 2 or even 3 arbitrary registers?
If they are stored in the same space at different times, it's not easy to make them meet.
Or maybe I'm not getting your concept completely...
 