JohnH said:
One of the key things here is that as they're just bound they can be addressed through a data cache, so you no longer need to load all constants upfront and you don't need a full set of 4Kx16 4D constant registers.
John.
Yes, I think "cached" may be an appropriate way to think of these.
Since the full set of CBs can be quite large (hundreds or thousands of KB), they're presumably stored in local memory. As they're bound to shaders for execution, it seems that (portions of) the CBs are brought from local memory into their place in the "unified register file".
They'll presumably only be evicted from the register file when all the shaders using them are evicted.
Rys's earlier comment about R600 seems to imply that R600 has a dedicated, and fairly small, buffer cache to hold CBs, rather than allowing them to consume the register file's memory. I suppose this is workable on the basis that if CB data is not in the GPU, a "latency-hiding" thread swap can be performed while the required CB data is read in (as long as there are other shader instructions that can run which don't depend on the missing CB data!).
It also presumably makes thread scheduling much easier for R600. When the command processor creates batches, it knows how many registers per object (fragment, say) to assign, and therefore how many batches it can launch, but it doesn't know how many elements from the set of bound CBs will be used. If the CBs used the register file, their memory consumption would force the command processor to be overly conservative: it would have to assume that the entirety of all bound CBs is to be accessed.
(You could argue that CBs could be "paged" into the register file, relieving the command processor's conservatism.)
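To put rough numbers on that conservatism, here's a minimal Python sketch. The register file capacity, batch size and per-object register count are made-up figures for illustration, not R600 specifications; only the 4096-element CB limit comes from the documentation. The function name `max_batches` is likewise just for this example.

```python
# Hypothetical figures for illustration only; none of these are R600 specs.
REGISTER_FILE_ENTRIES = 32768   # assumed capacity, in 128-bit registers
OBJECTS_PER_BATCH = 64          # assumed batch size (fragments, say)
CB_MAX_ELEMENTS = 4096          # maximum elements per constant buffer


def max_batches(regs_per_object, bound_cbs=0):
    """Batches that fit if every bound CB is reserved in full in the register file."""
    reserved = bound_cbs * CB_MAX_ELEMENTS
    available = max(REGISTER_FILE_ENTRIES - reserved, 0)
    return available // (regs_per_object * OBJECTS_PER_BATCH)


if __name__ == "__main__":
    for cbs in (0, 2, 8):
        batches = max_batches(regs_per_object=8, bound_cbs=cbs)
        print(cbs, "bound CBs ->", batches, "batches")
```

With these made-up numbers, reserving two full CBs cuts the batch count from 64 to 48, and reserving eight leaves no room for batches at all, which is why a separate CB cache (or paging) looks attractive.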
There's an interesting comment in the new document:
page 13 - Texture Buffers said:
A texture buffer is a buffer of shader constants that is read from texture loads (as opposed to a buffer load). Texture loads can have better performance for arbitrarily indexed data than constant buffer reads.
It turns out that there are three choices of "constant" data:
- Constant Buffers - named constant arrays of up to 4096 elements, each of 128 bits
- Texture Buffers - named arrays of 128-bit elements, seemingly as large as a texture (8192x8192), but always accessed on a per-element basis (i.e. no filtering) directly from memory
- Textures - good old fashioned textures, point-sampled or filtered, including 128-bit formats such as four-channel 32-bit integer or FP32
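For a sense of scale, the sizes in that list work out as follows. This is just arithmetic on the limits quoted above; reading JohnH's "4Kx16" as 16 buffers of 4096 elements each is my assumption, as is treating 8192x8192 as the element limit for a texture buffer.

```python
ELEMENT_BYTES = 128 // 8   # each element is 128 bits, i.e. 16 bytes


def footprint_bytes(elements):
    """Memory footprint of an array of 128-bit elements."""
    return elements * ELEMENT_BYTES


cb_bytes = footprint_bytes(4096)          # one full constant buffer
all_cbs_bytes = 16 * cb_bytes             # assumes "4Kx16" means 16 CBs
tb_bytes = footprint_bytes(8192 * 8192)   # texture buffer at the apparent limit

print(cb_bytes // 1024, "KiB per constant buffer")
print(all_cbs_bytes // 1024, "KiB for 16 constant buffers")
print(tb_bytes // (1024 * 1024), "MiB for a maximum-size texture buffer")
```

That gives 64 KiB per CB and 1024 KiB for sixteen of them, which is where the "hundreds or thousands of KB" estimate comes from, against 1024 MiB for a texture buffer at the apparent limit.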
So it seems the latter two are explicitly "local memory" only data structures, relying upon the latency-hiding capability of the GPU in order to maintain performance. It's notable that there's a joint limit of 128 TBs and Textures that can be bound to a shader, which suggests there's a common element in the access path for both.
Conversely, access to CBs is assumed to be effectively instantaneous, just like temporary registers. I guess the comment about TBs having better performance for arbitrarily indexed data is recognition of the fact that CBs are competing for a heavily constrained portion of GPU memory - apparently only enough for 1024 elements in R600. So developers will probably find themselves consigning the larger CBs to TBs, retaining CBs for the core of constant data that's small, or that changes very frequently (e.g. multiple times per frame).
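As a rough illustration of that split, here's a Python sketch of the rule of thumb. The function `suggest_binding` and its thresholds are hypothetical: the 1024-element figure is the apparent R600 cache size mentioned above (unconfirmed), and "more than once per frame" is simply my reading of "changes very frequently".

```python
CB_MAX_ELEMENTS = 4096     # hard limit on constant buffer size
CB_CACHE_ELEMENTS = 1024   # apparent R600 CB cache size; treat as unconfirmed


def suggest_binding(elements, updates_per_frame):
    """Hypothetical rule of thumb for placing constant data in a CB or a TB."""
    if elements > CB_MAX_ELEMENTS:
        return "TB"   # too big to be a constant buffer at all
    if elements <= CB_CACHE_ELEMENTS or updates_per_frame > 1:
        return "CB"   # small, or changes multiple times per frame
    return "TB"       # large and mostly static: leave it to latency hiding


if __name__ == "__main__":
    print(suggest_binding(256, 0))     # small, static data
    print(suggest_binding(4096, 4))    # full-size, updated several times a frame
    print(suggest_binding(4096, 1))    # full-size, updated once a frame
    print(suggest_binding(10000, 4))   # beyond the CB size limit
```

Real placement would of course depend on the access pattern too (the document's point about arbitrarily indexed data), which this sketch ignores.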
It's also notable that a render target or stream out target can be copied to populate a CB. That in itself is a pretty big clue that CBs are designed to live in local memory until they're bound for usage in a shader - whereupon portions will be brought into the GPU as required.
Jawed