Question about constant waterfalling

shuipi

Newcomer
From what I understand, it happens when different fragments that are supposed to be running the same instruction stream can no longer do so, because the instruction they're about to execute together needs different constant registers as operands across those fragments. This could happen when, say, you're indexing an array of constant registers in the HLSL source, but the index is not known at compile time and depends on the HPOS or tex coord of the pixel.

What are your opinions?
 
What is the question?

I like to call it constant divergence. The really important question is whether current-generation hardware can hide this latency, i.e. run another vector of threads while the divergent constant fetch happens in the background.

And this is something I haven't bothered to profile yet.
 
This could happen when, say, you're indexing an array of constant registers in the HLSL source, but the index is not known at compile time and depends on the HPOS or tex coord of the pixel.
Yeah.

In GPGPU programming you get a similar problem if you try to do indexed register access or indexed memory fetches or writes. The latter case is often referred to as gather and scatter and GPUs are getting better at sorting and grouping to maximise the efficiency of these operations (i.e. to lower the number of transactions against memory - though there is then a latency trade-off to beware of).

But with constants and registers there's a fundamental problem with the organisation of on-die constant cache and register file memory which simply means the SIMD waits until all objects have got their data. The wait might be until the SIMD-width is full, or it might be until the batch width is full. Varies according to GPU it seems.

So a 16-wide SIMD can slow down to 1/16th throughput if all the cache/register indices end up addressing distinct "lines" in memory.
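To put numbers on that, here is a rough cost model in Python. It assumes the SIMD serializes one fetch pass per distinct constant index in each batch, which is a simplification: real hardware may group by cache line or by batch rather than by SIMD width. The function name `waterfall_passes` and the default width of 16 are made up for illustration.

```python
def waterfall_passes(indices, simd_width=16):
    """Count fetch passes, assuming one pass per distinct index per SIMD batch."""
    passes = 0
    for start in range(0, len(indices), simd_width):
        batch = indices[start:start + simd_width]
        # Threads sharing an index are served together; each distinct
        # index costs one extra serialized pass (model assumption).
        passes += len(set(batch))
    return passes


uniform = [5] * 16           # every thread reads the same constant
divergent = list(range(16))  # every thread reads a different constant

print(waterfall_passes(uniform))    # 1
print(waterfall_passes(divergent))  # 16
```

Under this model the fully divergent batch takes 16 passes instead of 1, which is the 1/16th throughput case.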

I don't know of any GPU that solves the cache/register problem by "hiding latency". They only do so for memory accesses, in some cases.

Jawed
 
But what is the question?

Constant waterfalling can be a significant performance problem and AFAIK affects all GPUs on the market today. If you have fairly divergent indexes into the constant registers in a thread you will probably achieve better performance using texture lookups.
 
To simplify:
- For small constant arrays (for example: array of four quad corners in particle renderer), the waterfalling is not an issue.
- For larger constant arrays (for example: 64 bone matrix array on a single mesh), the waterfalling cost is very much noticeable.
- If you are using a large constant array, but using the same indices a lot in a thread friendly manner (for example: a 256 material ID array in deferred rendering), the constant waterfalling is not usually an issue (depends on your access pattern really).
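The three cases above can be compared with a small Python sketch. It assumes a 16-wide SIMD and counts the worst-case number of distinct indices in any batch; the index patterns are invented stand-ins for each scenario, not measured data.

```python
def max_distinct_per_batch(indices, simd_width=16):
    """Worst-case number of distinct constant indices in any SIMD batch."""
    worst = 0
    for start in range(0, len(indices), simd_width):
        worst = max(worst, len(set(indices[start:start + simd_width])))
    return worst


n = 1024  # number of threads in this toy run (assumption)

# Four quad corners: the array size caps the divergence at 4.
quad_corners = [i % 4 for i in range(n)]

# 64 bone matrices, neighbouring vertices bound to unrelated bones
# (hypothetical scattering pattern).
bones = [(i * 7) % 64 for i in range(n)]

# 256 material IDs, but 64 consecutive pixels share one material
# (hypothetical coherent pattern).
materials = [(i // 64) % 256 for i in range(n)]

print(max_distinct_per_batch(quad_corners))  # 4
print(max_distinct_per_batch(bones))         # 16
print(max_distinct_per_batch(materials))     # 1
```

The array size only sets an upper bound. What matters is how many different indices land in the same batch.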

Solutions to prevent constant waterfalling in vertex shader:
- Use multiple vertex streams. This is most beneficial if the hardware / API supports vfetch with user generated vertex index.
- Point sample textures. Useful for hardware that supports texture sampling in vertex shader (usually a bit slower than using vertex stream index read).
- Use render to vertex buffer (R2VB) to render your constants to a second render stream. Most useful if you have a massive amount of dynamic constants.

Solutions to prevent constant waterfalling in pixel shader:
- Store your constants in a lookup texture and sample it with point sampling.
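As a rough illustration of the lookup-texture idea, the Python sketch below emulates a 1D texture with one float4 constant per texel. It assumes point sampling simply picks the texel whose span contains the coordinate, and that you sample at the texel centre; `texel_center_u` and `point_sample` are hypothetical helper names, not part of any API.

```python
def texel_center_u(index, width):
    """Normalized coordinate of the centre of texel `index` in a 1D texture."""
    return (index + 0.5) / width


def point_sample(texels, u):
    """Emulate point sampling: return the texel containing coordinate u."""
    width = len(texels)
    x = min(int(u * width), width - 1)  # clamp so u == 1.0 stays in range
    return texels[x]


# Each constant is one float4, stored as one RGBA texel (layout assumption).
constants = [
    (1.0, 0.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0),
    (0.0, 0.0, 1.0, 1.0),
    (0.5, 0.5, 0.5, 1.0),
]

index = 2  # per-pixel index that would otherwise address a constant register
u = texel_center_u(index, len(constants))
print(point_sample(constants, u))  # (0.0, 0.0, 1.0, 1.0)
```

Sampling at the texel centre rather than its edge keeps the lookup from drifting into a neighbouring texel through rounding.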
 