Interesting info in "The Direct3D 10 System (SIGGraph 2006)"

Chalnoth - I'm not sure I totally understand it either, but for what it's worth: the diagram in the PDF seems to imply there are still only 32 temp registers that can be accessed without an index. The remaining registers are indexable temps, which I take to be roughly equivalent in spirit to a memory load from scratch-pad RAM.

So I think the idea is basically that you already have quite a lot of registers (== fast on-chip memory) available in a current gen chip (32KB or more, Jawed has the details there), and there's no reason why a single fragment/vertex/whatever should be limited to using only 512 bytes of it (32x4 registers). A really complicated shader program might map most of the register file to "indexable temps" (== scratch space). The trade-off would be that a thread requiring so much temp space would eventually end up hogging a whole shader unit to itself. At that point the shader array begins looking more like a CPU (lots of resources/memory devoted to a single thread).
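As a rough illustration (a hypothetical shader, not from the paper), a dynamically indexed local array is exactly the kind of thing that would land in indexable temps rather than the 32 plain temp registers:

```hlsl
// Hypothetical SM4-style pixel shader: the dynamic index into 'scratch'
// can't be resolved at compile time, so the array must live in indexable
// temp space rather than the 32 directly addressable temp registers.
float4 main(float4 uv : TEXCOORD0, uint idx : INDEX) : SV_Target
{
    float4 scratch[128];              // 2KB of per-thread scratch space
    for (uint i = 0; i < 128; ++i)
        scratch[i] = uv * i;          // filled with per-pixel values
    return scratch[idx % 128];        // dynamic read forces indexable storage
}
```

A thread using that much scratch would, as described above, squeeze out the other threads competing for the same register file.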
 
As I said in an earlier thread, one definite use for the constant arrays is texture palettes. For example, these could be used to enable 128-bit textures on most surfaces without eating up too much RAM.
 
DudeMiester said:
As I said in an earlier thread, one definite use for the constant arrays is texture palettes. For example, these could be used to enable 128-bit textures on most surfaces without eating up too much RAM.

Yes, but then you have to implement texture filtering in the shader.
 
Given that 128-bit textures would almost certainly have no filtering regardless, I don't see this as a huge issue. I think the memory usage is the bigger concern, which this could solve.
 
Chalnoth said:
Well, it looks to me like these are just another type of texture, defined differently for hardware optimization reasons (due to different usage patterns). At 64kb a pop, you could have quite a lot of these stored on the GPU indeed. This sounds rather distinct from the temporary register array to me.

But the 4k temporary values still sound insane to me. That's 64KB (4096 registers x 16 bytes) just for one in-flight pixel. The only way the hardware could have any reasonable number of pixels in flight with this large a temporary buffer would be if the temporary register arrays were stored in external memory.

Chalnoth,

the constants are indeed completely separate from the temporary registers. The constant buffers actually fix the problems that Dx9 caused with constant management within the driver (read: part of Dx9 batch overhead). Similar to GLSL's separation of local and global uniforms, but more flexible in that a constant buffer may be freely bound to any shader (up to a limit of 16 to a single shader), they basically minimise the requirement for all the constant shuffling and copying that Dx9 drivers were forced to do.

On the temps, yes 4K is insane, but the way to look at this is that it implies nothing about the number of registers that the pipeline needs to support internally; it just says that a shader may use up to 4K variables. As I said in my last post, this prevents the HLSL compiler from interfering with the vendor's low-level compiler's ability to trade temp space off against the number of execution threads and the amount of register spill. Of course some may question why place an arbitrary 4K limit in this case...

John.
 
JohnH said:
Chalnoth,

the constants are indeed completely separate from the temporary registers. The constant buffers actually fix the problems that Dx9 caused with constant management within the driver (read: part of Dx9 batch overhead). Similar to GLSL's separation of local and global uniforms, but more flexible in that a constant buffer may be freely bound to any shader (up to a limit of 16 to a single shader), they basically minimise the requirement for all the constant shuffling and copying that Dx9 drivers were forced to do.
Right, I see that, and it's a good move.

On the temps, yes 4K is insane, but the way to look at this is that it implies nothing about the number of registers that the pipeline needs to support internally; it just says that a shader may use up to 4K variables. As I said in my last post, this prevents the HLSL compiler from interfering with the vendor's low-level compiler's ability to trade temp space off against the number of execution threads and the amount of register spill. Of course some may question why place an arbitrary 4K limit in this case...
Well, you have to have a limit somewhere, because if you don't, then the hardware has to have the capability to spill the register file over into video memory. But it seems to me that 4k is a bit excessive. I expect that this number will basically set the minimum size of the register file at 4k total temporary registers in flight, in which case no programmer in their right mind would attempt to use anywhere close to that number of temporaries (with a more realistic upper limit being at around 32).
 
Each constant buffer is effectively an array of constants with max 4096 elements, where each element can hold 128 bits (RGBA as FP32).

Each shader can use any 16 of these arrays that you define, and obviously each can be any size up to 4096. You can define hundreds of separate arrays of these constants - logically, they're loaded into the GPU independently of the shaders that use them. They're not per-pixel (or per-vertex or per-primitive). They're simply "there". They don't, collectively, constitute a scratchpad in the normal sense, since they can only be written by the CPU, not by the GPU.
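For concreteness, a minimal D3D10 HLSL sketch (buffer names and contents are made up) of binding two such constant arrays to one shader:

```hlsl
// Each cbuffer is an independent array of up to 4096 128-bit elements;
// a single shader may reference up to 16 of them. They are written by
// the CPU only, never by the GPU.
cbuffer PerFrame : register(b0)
{
    float4x4 viewProj;        // refreshed once per frame
    float4   lightDir;
};

cbuffer SkinPalette : register(b1)
{
    float4x4 bones[256];      // large array, updated far less often
};

float4 SkinPosition(float4 pos, uint bone)
{
    // Both buffers are read as if they were ordinary constant registers.
    return mul(mul(pos, bones[bone]), viewProj);
}
```

The point of the split is that PerFrame and SkinPalette can be updated on completely independent schedules without re-uploading each other.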

Jawed
 
The # of temporary registers shouldn't even be exposed, as that is an aspect of the assembly. I view the DX HLSL architecture as broken compared to GLSL because the compiler should be provided by the driver runtime, not run as a separate static command line utility with hardcoded profiles. HLSL is biased because it makes generic optimizations without regard to underlying hardware knowledge, spitting out assembler based on profiles, and then expects the driver to clean up the mess by optimizing the assembly. Register allocation should not be done by MS's compiler.

It is well known that there is no one-size-fits-all intermediate representation for writing compilers, yet DX HLSL dictates one (even GCC relented lately and introduced multiple representations). MS would better serve IHVs if they made their compiler very pluggable at every stage, and allowed driver manufacturers to extend it, preferably distributing source to IHVs.
 
DemoCoder said:
The # of temporary registers shouldn't even be exposed, as that is an aspect of the assembly. I view the DX HLSL architecture as broken compared to GLSL because the compiler should be provided by the driver runtime, not run as a separate static command line utility with hardcoded profiles. HLSL is biased because it makes generic optimizations without regard to underlying hardware knowledge, spitting out assembler based on profiles, and then expects the driver to clean up the mess by optimizing the assembly. Register allocation should not be done by MS's compiler.

It is well known that there is no one-size-fits-all intermediate representation for writing compilers, yet DX HLSL dictates one (even GCC relented lately and introduced multiple representations). MS would better serve IHVs if they made their compiler very pluggable at every stage, and allowed driver manufacturers to extend it, preferably distributing source to IHVs.

Indeed. Humus stated that sometimes a Cg shader could be faster than a DX HLSL one because the Cg shader had fewer generic optimisations, which left the driver runtime compiler more opportunity to optimise!
 
DemoCoder said:
The # of temporary registers shouldn't even be exposed, as that is an aspect of the assembly. I view the DX HLSL architecture as broken compared to GLSL because the compiler should be provided by the driver runtime, not run as a separate static command line utility with hardcoded profiles. HLSL is biased because it makes generic optimizations without regard to underlying hardware knowledge, spitting out assembler based on profiles, and then expects the driver to clean up the mess by optimizing the assembly. Register allocation should not be done by MS's compiler.

It is well known that there is no one-size-fits-all intermediate representation for writing compilers, yet DX HLSL dictates one (even GCC relented lately and introduced multiple representations). MS would better serve IHVs if they made their compiler very pluggable at every stage, and allowed driver manufacturers to extend it, preferably distributing source to IHVs.

The binary representation of a DX shader should not be understood as assembler code like the one we feed to our CPUs. It's more like the byte code that we know from Java or .NET. I agree that during the compilation from a high-level language to the byte code some information is lost, but we get some advantages from this step.
Parsing and optimizing the byte code is much faster than the same step with high-level code. If you have only a small number of shaders you may not care, but with a large set it can significantly improve the load time for a level.

After the compiler (which is part of D3DX and not only a command line tool) has generated a shader, it is approved for that model. As a game developer you can now be sure that it will run on every chip that claims support for this model. In the case it doesn't, you know exactly whom you have to blame. A high-level language gives you more freedom, but it gives the driver the right to reject your shader if it wants.

The next point is not technical at all, but there is still big paranoia in the game development scene when it comes to distributing shaders in high-level form. I still remember that there were some harsh words after the first presentations of D3D10 (it was called DX Next in that timeframe). Many people don't want to give their shader sources away.

AFAIK with the SM4 profile the compiler will try fewer clever tricks, to keep more information in the byte code.
 
DemoCoder said:
The # of temporary registers shouldn't even be exposed, as that is an aspect of the assembly. I view the DX HLSL architecture as broken compared to GLSL because the compiler should be provided by the driver runtime, not run as a separate static command line utility with hardcoded profiles. HLSL is biased because it makes generic optimizations without regard to underlying hardware knowledge, spitting out assembler based on profiles, and then expects the driver to clean up the mess by optimizing the assembly. Register allocation should not be done by MS's compiler.

It is well known that there is no one-size-fits-all intermediate representation for writing compilers, yet DX HLSL dictates one (even GCC relented lately and introduced multiple representations). MS would better serve IHVs if they made their compiler very pluggable at every stage, and allowed driver manufacturers to extend it, preferably distributing source to IHVs.

This is true, but the reality for any real-time application may be that 4096 is a high practical limit, and note that the HLSL compiler may not do any register allocation, as this is an absolute limit on the number of "variables" in the program. This should mean that real register allocation is now more in line with it being done by the vendor's low-level compiler.

The other side of this, of course, is that exposing an arbitrary number would also remove the need for ISVs to actually think about what they're doing. For example, a program using 4K 4D vec registers will invariably spill some of those registers to memory; at this point efficiency might dictate that it would be better to look at multi-passing. All algorithm dependent, of course.

John.

John.
 
Chalnoth said:
Right, I see that, and it's a good move.


Well, you have to have a limit somewhere, because if you don't, then the hardware has to have the capability to spill the register file over into video memory. But it seems to me that 4k is a bit excessive. I expect that this number will basically set the minimum size of the register file at 4k total temporary registers in flight, in which case no programmer in their right mind would attempt to use anywhere close to that number of temporaries (with a more realistic upper limit being at around 32).

Amusingly enough, after compiler register allocation I don't think I've seen a "real" shader that uses more than 9 or 10 4D vec registers, but the numbers are slowly increasing as the amount of practical grunt goes up.

John.
 
Jawed said:
Each constant buffer is effectively an array of constants with max 4096 elements, where each element can hold 128 bits (RGBA as FP32).

Each shader can use any 16 of these arrays that you define, and obviously each can be any size upto 4096. You can define hundreds of separate arrays of these constants - logically, they're loaded into the GPU independently of the shaders that use them. They're not per-pixel (or per-vertex or per primitive). They're simply "there". They don't, collectively, constitute a scratchpad in the normal sense, since they can only be written by the CPU not by the GPU.

Jawed

One of the key things here is that as they're just bound they can be addressed through a data cache, so you no longer need to load all constants upfront and you don't need a full set of 4Kx16 4D constant registers.

John.
 
JohnH said:
Amusingly enough, after compiler register allocation I don't think I've seen a "real" shader that uses more than 9 or 10 4D vec registers, but the numbers are slowly increasing as the amount of practical grunt goes up.

John.
Well, obviously you'd only reach such a ridiculous amount by specifically attempting to do work on one or more arrays of values at a time. One quick example would be attempting to perform a vector-matrix-vector multiplication resulting in a single number. If you were to do it in one pass with a single pixel or vertex program in the simple and obvious way, it would require as many temporary values as the dimensionality of the matrix (just for simplicity, assume it's square).
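A sketch of that worst case (hypothetical, with N chosen arbitrarily): computing v^T * M * v naively in one pass keeps the whole intermediate vector t = M*v live at once, so temporary usage scales with the matrix dimension.

```hlsl
#define N 64                          // matrix dimension (arbitrary)

float QuadraticForm(float v[N], float M[N][N])
{
    float t[N];                       // N temporaries live simultaneously
    for (int i = 0; i < N; ++i)
    {
        t[i] = 0;
        for (int j = 0; j < N; ++j)
            t[i] += M[i][j] * v[j];   // t = M * v
    }
    float r = 0;
    for (int k = 0; k < N; ++k)
        r += t[k] * v[k];             // r = dot(v, t) = v^T * M * v
    return r;
}
```

(In this particular case the two loops could be fused so that only one element of t is live at a time, but the "simple and obvious" one-pass version behaves as described.)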

So you could do it, clearly, but writing any such program would be really stupid, because it'd run like crap, and there are much better ways to do it on a GPU. This particular feature seems, to me, to be nothing more than a potential killer of performance.
 
JohnH said:
One of the key things here is that as they're just bound they can be addressed through a data cache, so you no longer need to load all constants upfront and you don't need a full set of 4Kx16 4D constant registers.

John.
Yes, I think "cached" may be an appropriate way to think of these.

Since the set of all CBs seems like it can be quite large (hundreds or thousands of KB), it seems they're stored in local memory. As they're bound to shaders for execution it seems that (portions of) the CBs are brought from local memory into, presumably, their place in the "unified register file".

They'll presumably only be evicted from the register file when all the shaders using them are evicted.

Rys's earlier comment about R600 seems to imply that R600 has a dedicated, and fairly small, buffer cache to hold CBs, rather than allowing them to consume the actual register file's memory. I suppose this is workable on the basis that if CB-data is not in-GPU, a "latency-hiding" thread swap can be performed to enable the required CB-data to be read-in (as long as there are other shader instructions that can run, which aren't dependent on this missing CB-data!).

It also presumably makes it much easier for R600 to thread-schedule - when the command processor creates batches, it knows how many registers per object (fragment, say) to assign (and therefore how many batches it can launch), but it doesn't know how many elements from the set of bound CBs will be used. If the CBs used the register file, their memory consumption would cause the command processor to have to be over-conservative. It would have to assume that the entirety of all bound CBs is to be accessed.

(You could argue that CBs could be "paged" into the register file, relieving the command processor's conservatism.)

There's an interesting comment in the new document:

page 13 - Texture Buffers said:
A texture buffer is a buffer of shader constants that is read from texture loads (as opposed to a buffer load). Texture loads can have better performance for arbitrarily indexed data than constant buffer reads.
as it turns out that there are 3 choices of "constant" data:
  1. Constant Buffers - named constant arrays of up to 4096 elements, each of 128b
  2. Texture Buffers - named arrays of 128b elements, seemingly as large as a texture, 8192x8192, but always accessed on a per-element basis (i.e. no filtering) directly from memory
  3. Textures - good old fashioned textures, point-sampled or filtered, including 128b formats such as integer 32 or FP32
So it seems the latter two are explicitly "local memory" only data structures, relying upon the latency-hiding capability of the GPU in order to maintain performance. It's notable that there's a joint limit of 128 TBs and Textures that can be bound to a shader, so there's a common element in the access path for them both.

Conversely access to CBs is assumed to be effectively instantaneous, just like temporary registers. I guess the comment about TBs having better performance for arbitrarily indexed data is recognition of the fact that CBs are competing for a heavily constrained portion of GPU memory - apparently only enough for 1024 elements in R600. So developers will prolly find themselves consigning the larger CBs to TBs, retaining CBs for the core of constant data that's small, or that changes very frequently (e.g. multiple times per frame).
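The three choices, side by side, as (hypothetical) D3D10 HLSL declarations:

```hlsl
cbuffer MaterialCB                 // 1. constant buffer: effectively
{                                  //    register-speed, but competing for
    float4 materialParams[16];     //    scarce on-GPU constant storage
};

tbuffer InstanceTB                 // 2. texture buffer: read through the
{                                  //    texture load path, better for
    float4 instanceData[4096];     //    arbitrarily indexed data
};

Texture2D<float4> LookupTex;       // 3. texture: point-sampled or filtered,
                                   //    including 128-bit formats
```

So migrating a large CB to a TB is, at the source level, little more than swapping the declaration keyword.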

It's also notable that a render target or stream out target can be copied to populate a CB. That in itself is a pretty big clue that CBs are designed to live in local memory until they're bound for usage in a shader - whereupon portions will be brought into the GPU as required.

Jawed
 
Despite starting this thread, I've been pretty busy for the last couple of days - some interesting points raised.

On the topic of the 4k constant buffers... remember that they are constant - that is, they're the same for an entire shader across a draw call. I suppose some hardware might need to keep multiple copies, but theoretically only 1..16 of those CBs need exist regardless of how many "in flight" vertices/triangles/pixels there are.

Jawed said:
Just found this more detailed document from SIGGraph 2006, also by David Blythe:

http://www.csee.umbc.edu/~olano/s2006c03/ch02.pdf

:D

Jawed
Interesting. That document looks amazingly similar to the actual D3D10 specification in places. I've not checked it out closely, so it might not be... A good find :smile:

Jack
 
JHoxley said:
On the topic of the 4k constant buffers... remember that they are constant - that is, they're the same for an entire shader across a draw call. I suppose some hardware might need to keep multiple copies, but theoretically only 1..16 of those CBs need exist regardless of how many "in flight" vertices/triangles/pixels there are.
You'd presumably have at least one VS, one GS and one PS all concurrently loaded and executing. Each would have as many as 16 CBs bound to it. In a unified architecture I guess all the CBs would find themselves competing for a single patch of memory, whereas a non-unified architecture would keep them separated.

And presumably it gets more exciting in future when there are multiple contexts executing concurrently.

Additionally CBs are designed to be changing all the time. This creates a requirement to "double-buffer" the CBs so that the old CBs are retained as an old shader finishes using them, while a new shader can start immediately with the new CBs.

Jawed
 
I seriously doubt there will be any set limit in hardware as to how many total constant buffers can be allocated at any given time. It'd make more sense to deal with the memory management of constant buffers just like it's done with textures.

As for the reading, though, I would expect that the typical scenario would be where you execute the same instruction on many different pixels in sequence or in parallel. This would make pre-fetching of data from the constant buffers fairly easy, and would also seem to mean that many pixels would make use of the same value from the constant buffer in sequence, making on-chip caching of constant buffer values a trivial exercise.
 
Jawed said:
Additionally CBs are designed to be changing all the time. This creates a requirement to "double-buffer" the CBs so that the old CBs are retained as an old shader finishes using them, while a new shader can start immediately with the new CBs.

Jawed

Yes, but this problem is not much different from what we have today with the constant registers, which change all the time, too. But as you cannot map a constant buffer to system memory and always have to update it completely, it will be much easier to add the necessary commands and data to the DMA command buffer.

I have the strong feeling that at least dynamic constant buffers are not allocated in video RAM.
 