texture fetch and shader math execution time in special cases

NicoRi

Hi,

when I do successive texture fetches using the same texture coordinate, without doing any other texture fetches in between (that may not seem to make sense, but it does in my context :)), does that mean the value I receive is already in the texture cache? And is the texture fetch thus quicker than a fetch from a random location?

Another question: when I do some mathematical computations in a pixel shader and one operand is always 0, so that the result is 0 as well, are these computations executed quicker than in the case where the operand is != 0 and the computations actually have to be executed to produce the result?

Best Regards
Nico

Hardware: NVIDIA GeForce 7800GTX
 
It's going to depend on the hardware execution model and whether surrounding pixels are going to thrash your cache. Modern GPUs assign several fragments to execute as a group on the shaders. In general, you should be storing repetitive reads in registers as much as you can instead of issuing the fetch. cgc and fxc should optimize this for you, but NV3X->G7X are sensitive to register pressure. NV3X was really bad about this (2 128-bit registers at full speed), NV4X was better (4 128-bit registers at full speed), and G7X better still (4 128-bit registers, but without a huge cliff on spill), but you still see performance drops.
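Something like this, in Cg (a minimal sketch; the sampler and parameter names are illustrative):

float4 main(float2 uv : TEXCOORD0,
            uniform sampler2D tex,
            uniform float4 k0,
            uniform float4 k1) : COLOR
{
    // fetch once into a register instead of issuing tex2D(tex, uv) twice
    float4 c = tex2D(tex, uv);
    return c * k0 + c * k1;   // both uses read the register, not the texture
}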

As for '0'. FXC and cgc will do VERY aggressive expression elimination. Even on hand-off to the driver, they can, and sometimes will, recompile the asm into ISA and reoptimize. I'm not sure of the heuristics, but a constant change on Nvidia hardware has a noticeable overhead for many shaders, so the running guess is that they are quickly trying to reoptimize the shader.
 
I'm fairly sure current NVIDIA hardware has no constant storage on-chip, so constant changes require a shader patch by the driver every time.
 
Nvidia will not publicly confirm this, but it's suggested in various performance talks Nvidia has given that you should avoid resetting constants. Some people get around this by using (abusing) texture coordinates to pass constants, which avoids shader recompilation/specialization.
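To illustrate the trick (my sketch, not from any NVIDIA doc; names are made up): the value rides in a spare interpolator that the app sets per vertex, so changing it never touches the shader itself:

float4 main(float2 uv    : TEXCOORD0,
            float4 fakeK : TEXCOORD1,   // "constant" smuggled in as a
                                        // texcoord, set per-vertex by the app
            uniform sampler2D tex) : COLOR
{
    // use the smuggled value exactly as you would a uniform constant
    return tex2D(tex, uv) * fakeK;
}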
 
Thanks for your answer mhouston! However, I don't understand all of it, so I will go through some parts piece by piece:

In general, you should be storing repetitive reads in registers as much as you can instead of issuing the fetch.

What do you mean by that? In my situation I do, say, 5 successive texture fetches with some math in between (in a loop), where the tex coords of each fetch depend on the previous texture fetch and the calculations. In many cases the tex coords in the final 2-4 loop iterations don't change anymore and the result of the fetches is always 0 (hence my second question).
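Roughly, the loop looks like this (a simplified sketch; names are placeholders):

float4 main(float2 uv : TEXCOORD0,
            uniform sampler2D tex,
            uniform float2 scale) : COLOR
{
    float2 coord = uv;
    float4 val = float4(0, 0, 0, 0);
    for (int i = 0; i < 5; i++) {
        val   = tex2D(tex, coord);        // fetch depends on previous round
        coord = coord + val.xy * scale;   // next coord depends on fetch + math
    }
    // once a fetch returns 0, coord stops changing, so the remaining
    // iterations keep re-reading the same location
    return val;
}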

but NV3X->G7X are sensitive to register pressure.

What does this mean?

As for '0'. FXC and cgc will do VERY aggressive expression elimination. Even on hand-off to the driver, they can, and sometimes will, recompile the asm into ISA and reoptimize.

But the result of the tex fetch is only 0 in the last loop iterations, and only for a certain subset of the pixels, so this cannot be optimized away, hm?

but a constant change on Nvidia hardware has a noticeable overhead for many shaders

What does this mean?

Hope you can clear up my confusion a bit; I'm not that deep into the details of shading hardware...
 
If you do dependent texture fetches, then obviously the value can't be known ahead of time, and since it's dependent, the fetch can't be optimized away. ;-)

If you are looping on dependent reads, and the reads differ on neighboring pixels, you may thrash your cache. If all of the pixels fetch from the same texture location, then all the reads will be from cache. However, a read on Nvidia hardware will cost you 1 cycle/32-bits just to issue the fetch, e.g. a float4 = 4 cycles that can't be hidden.
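To put numbers on that for the five-fetch loop described above: at minimum 5 × 4 = 20 cycles go to issuing the fetches alone, and because each address depends on the previous result, none of that can overlap.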

As for register usage: on G7X, as soon as you use more than 4 128-bit registers (or 8 64-bit ones), you start dropping performance on each use of the extra registers. This used to be a massive performance cliff on NV3X, half as much on NV4X, and a linear drop on G7X. In general, you want to avoid using more than 4 registers on NV4X, i.e. your fp40 assembly should only have 4 128-bit registers (or 8 half4's, etc.).
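For illustration (my sketch, building on the "8 half4's" point above): two half4 temporaries pack into one 128-bit register, so dropping to half precision where it's acceptable doubles the effective register budget:

half4 main(float2 uv : TEXCOORD0,
           uniform sampler2D tex) : COLOR
{
    // each half4 takes half a 128-bit register, so the same number of
    // live values needs half the registers that float4 temporaries would
    half4 a = tex2D(tex, uv);
    half4 b = a * a;
    half4 c = b + a;
    return c;
}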

If you change constants used in a shader, you will incur a significant penalty in the driver for what appears to be a recompile.
 
mhouston said:
This used to be a massive performance cliff on NV3X, half as much on NV4X, and a linear drop on G7X.
This is incorrect: G7x should have the exact same performance characteristics as NV4x in terms of performance related only to register usage, which is linear degradation. That said, G7x did get some additional improvements in other areas that may mask this somewhat.
 
Hrmm, we have a register performance test in GPUBench (slightly modified fp30_reg to fp40) and G7X does show differences in behavior over NV4X. NV40 has more of a stair-step pattern at 4-register intervals (the last time we ran it was quite a while ago, so the compiler could be better now). But, as you have said, there are changes to how register spill is handled, seemingly involving scheduling of fragments and memory system behavior. But the main point was that Nvidia hardware is sensitive to register pressure, so you need to be careful about this when writing applications.
 
Hrmm, we have a register performance test in GPUBench (slightly modified fp30_reg to fp40) and G7X does show differences in behavior over NV4X...
How do you measure/know in advance the number of registers effectively used by the GPU on NVIDIA and ATI hw?
 
If you are looping on dependent reads, and the reads differ on neighboring pixels, you may thrash your cache. If all of the pixels fetch from the same texture location, then all the reads will be from cache.

And what if the reads on neighbouring pixels only slightly differ in location? Is the tex cache still used then?

Also I'm interested in the answer to nAo's question.
 
You'll likely get a cache hit for that scenario, but it can and will depend on prior and future texture address calcs in your shader.

As for determining register count, it's (usually) a case of using a varying number of registers in a shader that won't be optimised away, and graphing performance. The drop off is usually very noticeable.
 
As for determining register count, it's (usually) a case of using a varying number of registers in a shader that won't be optimised away, and graphing performance. The drop off is usually very noticeable.
This is a flawed methodology because you really don't know what is submitted to the hw for real (and it's not just about registers that one believes will not be optimized away).
It might work in some cases, it might not work in others; in the end you don't know what you are really measuring (hw, sw, or a non-trivial combination of these 2 aspects).
 
This is a flawed methodology because you really don't know what is submitted to the hw for real (and it's not just about registers that one believes will not be optimized away).
It might work in some cases, it might not work in others; in the end you don't know what you are really measuring (hw, sw, or a non-trivial combination of these 2 aspects).

Well, on ATI through CTM you can actually see the raw ISA hit the board.

On Nvidia, you can write in fp40 with all the registers dependent. Yes, this will be recompiled, but there are limitations to what the hardware/software can do. The test just does a stack of dependent ops (all MULs, all ADDs, etc.), using a dependent register stack. We keep the instruction count the same, say 200 instructions, and change the number of registers specified, with the math initialized via constants. As the register count increases in the shader, we see drops in performance.
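The shape of the test, sketched at the Cg level (the real test is generated fp40 assembly; names here are illustrative):

float4 main(uniform float4 c0, uniform float4 c1,
            uniform float4 c2, uniform float4 c3) : COLOR
{
    // N = 4 live registers, initialized from shader constants whose
    // values the compiler can't fold away at compile time
    float4 r0 = c0, r1 = c1, r2 = c2, r3 = c3;
    // dependent MUL chain cycling through every register; this block is
    // repeated until the shader reaches the fixed length (~200 instructions)
    r0 = r0 * r1;
    r1 = r1 * r2;
    r2 = r2 * r3;
    r3 = r3 * r0;
    return r0 + r1 + r2 + r3;   // keep all N registers live to the end
}

You regenerate it with N = 2, 4, 6, ... at the same instruction count and graph the throughput.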

cgc actually hints at this somewhat with how aggressive it is about register usage. It will often increase shader length to avoid extra registers. You can also talk to game devs writing for Nvidia. We see this behavior in many GPGPU apps (HMMer, raytracing, many of the Brook tests). Register usage, especially on older chips, is a noticeable hit. As I've said, NV4X and G7X behave much better than NV3X. As for G7X, our results could be affected by changes to the memory system (bandwidth better than doubles from NV4X for floating-point formats).

Other references on register usage:
http://www.anandtech.com/printarticle.html?i=2031
http://download.nvidia.com/developer/GPU_Programming_Guide/GPU_Programming_Guide.pdf
http://developer.download.nvidia.co...arch="nvidia register usage performance drop" (page 35)

Note the last two don't get into specifics, but are careful to say that register usage has performance impacts.
 
I'm well aware of all these issues since I'm a game dev working on Nvidia hw... you should not even trust your shader length; it might not be representative of what you want to measure or what you think you're measuring.
 
Both vendors will optimize the shader you give them, unless you have raw access to the hardware via something like CTM on ATI, or are a "blessed" developer and can see the output shader from the driver.

However, you can write shaders that are all but impossible to optimize.
 
I have seen GPUBench's issue results and there is a question about why Nvidia is able to do 'sub' that fast on G7x. Is that possible? :)
 
Ah, we found that they were aggressively, but correctly, optimizing away the shader. The test has been fixed, but the results have not been updated, as we don't have all the hardware anymore.

We also found a bug in the readback code that was preventing the results from verifying correctly with later drivers. (Using GL_FLOAT as the type and RGBA as the format isn't good enough; you MUST use Nvidia's format enums...)

Testing RCP is kinda the same issue. How do you write a dependent RCP shader, using just RCP, in which you can't simply reduce the shader to (#RCPs % 2) RCPs?
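Concretely (a sketch; the divide compiles to RCP):

float4 main(float4 x0 : TEXCOORD0) : COLOR
{
    float4 x1 = 1.0 / x0;   // RCP
    float4 x2 = 1.0 / x1;   // but RCP(RCP(x)) == x up to precision, so the
                            // pair (and any even-length chain) folds away
    return x2;
}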
 