Interesting info in "The Direct3D 10 System (SIGGRAPH 2006)"

JHoxley

Hi all,

I saw this a few days ago, but it now seems to be public on the DirectX Developer Center - "The Direct3D 10 System" (PDF file).

Given you all seem to like numbers, it's the little summary in section 6 (bottom of page 9) that you might find interesting - a comparison of the API overheads between D3D9 and D3D10...

Seems to back up the occasionally mentioned "fact" that D3D10 is 10x faster - Draw calls drop from 1470 to 154 which is 9.5x faster :smile:
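The ratio quoted above is easy to sanity-check. This is only a quick sketch using the two figures from the post; the variable names are made up for illustration and are not from the paper.

```python
# Draw-call overhead figures quoted in the post (units as in the paper's
# section 6 summary; the names below are illustrative only).
d3d9_draw_cost = 1470
d3d10_draw_cost = 154

# Ratio of the D3D9 cost to the D3D10 cost.
speedup = d3d9_draw_cost / d3d10_draw_cost
print(f"D3D10 draw-call overhead is about {speedup:.1f}x lower")
```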

Figured I'd post it here in case you hadn't seen it already and were interested in some more D3D10-related details....

Cheers,
Jack
 
Wow, 4096 temporary registers? I never noticed that before. I wonder how that's going to be managed?
 
JHoxley said:
Hi all,

I saw this a few days ago, but it now seems to be public on the DirectX Developer Center - "The Direct3D 10 System" (PDF file).

Given you all seem to like numbers, it's the little summary in section 6 (bottom of page 9) that you might find interesting - a comparison of the API overheads between D3D9 and D3D10...

Seems to back up the occasionally mentioned "fact" that D3D10 is 10x faster - Draw calls drop from 1470 to 154 which is 9.5x faster :smile:

Figured I'd post it here in case you hadn't seen it already and were interested in some more D3D10-related details....

Cheers,
Jack

Thanks for the document. Although I didn't understand a lot of it, from what I could gather, the changes are pretty significant over D3D9. I'm hoping that D3D9.L has some of the same performance benefits, though from that document it seems that a major part of the optimization, namely the 4*4096 pool of registers and the associated change in granularity of state changes, will not be possible to exploit with current gen hardware. What do you think?
 
Chalnoth said:
Wow, 4096 temporary registers? I never noticed that before. I wonder how that's going to be managed?

Using Constant Buffers, they are basically a series of vec4f which map to the GPU registers one on one. (That's the idea anyway)

4096*4*4 = 64KB of registers (should it really be a direct mapping)
...
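Spelling out the 64KB figure, as a rough sketch: it assumes each register is a 4-component vector of 32-bit (4-byte) values, which is what the vec4f description above suggests. The constant names are illustrative only.

```python
# Assumption: one register = vec4 of 32-bit floats, i.e. 4 components x 4 bytes.
REGISTERS = 4096
COMPONENTS_PER_REGISTER = 4
BYTES_PER_COMPONENT = 4

total_bytes = REGISTERS * COMPONENTS_PER_REGISTER * BYTES_PER_COMPONENT
total_kb = total_bytes // 1024
print(total_bytes, "bytes =", total_kb, "KB")
```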
 
Render to vertex buffer seems to be a feature of D3D10 - on page 4 it seems to be described as being limited to a 128KB output buffer, but able to support scatter (random writes). It also suffers from the constraints of the data formats of Output Merge (ROP).

In comparison Stream Out is more flexible (more data formats), supports a larger amount of data (128MB :oops: ) but is strictly serial (i.e. scatter cannot be performed).

I guess that answers most of my questions in:

http://www.beyond3d.com/forum/showthread.php?t=31140

though the performance of Stream Out remains to be seen...

Jawed
 
Chalnoth said:
Wow, 4096 temporary registers? I never noticed that before. I wonder how that's going to be managed?

Expect HW to have somewhat fewer registers than this, and view DX's exposure of 4096 as an optimisation opportunity for each vendor's low-level compiler with respect to spilling to other storage.

John.
 
I dare say it's the same as current SM3 hardware. Nominally there's support for 32x FP32s per fragment in flight, but the reality is performance drops off as more FP32s (or FP16s) are used (beyond a few, whatever the count actually is), causing the command processor to issue smaller batches on NVidia, or fewer batches on ATI.

Using all 32 registers in a pixel shader will make the SM3 GPUs crawl - though I think ATI's architecture is much less sensitive both because its register file is larger and because out of order threading at least allows the pipeline to deal gracefully with texturing latency.

Jawed
 
Single-pass Rendering to a Cubemap

When an array of render targets is bound to the OM, the target array index is computed for each primitive in the GS. This allows the GS to sort (or replicate) primitives into different array elements. One example of this is rendering an environment to a cube map in a single rendering pass by treating the cube map as an array of 6 2D render targets. As the environment geometry is processed, the GS determines to which cube faces a primitive should be rendered and issues the primitive once for each face. Note that the GS render target array selection mechanism is independent (orthogonal) to the multiple render target outputs of the PS.
This seems to imply that each face of the cube map has its own depth/stencil. Is that correct? Guessing, I suppose that each face is simply a region in the backbuffer.

It also implies that each face of the cube map could result in output to multiple render targets. Is that true? Sounds pretty groovy, I guess that means you could, for example, generate per-pixel motion blur data for reflection cubemaps, all in one pass - and do a sweet final pass that renders the motion blurred reflection correctly from the gamer's perspective in the final scene. Something that was notably absent in PGR3...

Jawed
 
Jawed said:
I dare say it's the same as current SM3 hardware. Nominally there's support for 32x FP32s per fragment in flight, but the reality is performance drops off as more FP32s (or FP16s) are used (beyond a few, whatever the count actually is), causing the command processor to issue smaller batches on NVidia, or fewer batches on ATI.

Using all 32 registers in a pixel shader will make the SM3 GPUs crawl - though I think ATI's architecture is much less sensitive both because its register file is larger and because out of order threading at least allows the pipeline to deal gracefully with texturing latency.

Jawed

As you suggest, current HW tends to trade number of execution threads against number of registers per thread; none spill to other storage. By declaring a large number of temps as opposed to declaring, say, 32, you avoid the HLSL compiler messing with the optimisations available to the low-level compiler when trading off between number of threads, temps and storage spills.

John.
 
And there's 1K of temporaries on R600 for all executing threads.
 
That's less than the pixel shader capacity of R580: 48 fragments x 3 FP32s x 128 threads (batches) x 4 shader units = 73728.

Something's not adding up...

Jawed
 
Jawed said:
That's less than the pixel shader capacity of R580: 48 fragments x 3 FP32s x 128 threads (batches) x 4 shader units = 73728.

Something's not adding up...

Jawed
So the register file just for pixel shading is upwards of 1MiB on R580? :oops:
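For what it's worth, the "upwards of 1MiB" figure follows from the numbers in the quoted post if each "FP32" is taken to mean a 4-component FP32 register (16 bytes). That reading is an assumption, not something stated in the thread. A rough sketch:

```python
# Figures from the quoted post.
FRAGMENTS_PER_BATCH = 48
FP32_REGS_PER_FRAGMENT = 3
BATCHES_PER_SHADER_UNIT = 128
SHADER_UNITS = 4

# Assumption: each register holds 4 components of 32 bits (16 bytes).
BYTES_PER_REGISTER = 4 * 4

registers = (FRAGMENTS_PER_BATCH * FP32_REGS_PER_FRAGMENT
             * BATCHES_PER_SHADER_UNIT * SHADER_UNITS)
size_bytes = registers * BYTES_PER_REGISTER
size_mib = size_bytes / (1024 * 1024)

print(registers, "registers,", size_mib, "MiB")
```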
 
Chalnoth said:
Wow, 4096 temporary registers? I never noticed that before. I wonder how that's going to be managed?
Temp registers are a virtual resource anyway. The number is totally meaningless.

The only "Wow" here is that the Direct3D designers still don't get it.
 
Actually, imo the key decision was to not declare a shader model with an arbitrary smaller number of temps and the ability to spill to other storage, as this would have resulted in the HLSL compiler defeating the vendor's low-level compiler's attempts to optimise active threads vs storage per thread and when to spill.

John.
 
4096 Constants

Hmm, figure 3 seems to imply something different than what we're used to.

Each object (vertex, primitive, fragment) has access to 32 temporary registers (as in SM3).

But D3D10 has the concept of Constant Buffers. Each CB can contain up to 4096 FP32s (or integer 32s or variants). So if you have a "shiny happy bumpy glassy" shader that uses 8 constants and a "boring broken ridged glass" shader that uses 6 constants, these sets of constants are stored in distinct CBs. I don't know what the limit on CBs is...

Hmm, this prolly explains it better:


[Image: b3d53.jpg]


from:

http://download.microsoft.com/download/2/2/b/22bfadd8-01b0-4fc4-942b-6e7b1635b214/Intro_to_Direct3D10.ppt

The PDF that we're discussing talks about how constants could have been implemented as textures (i.e. point sampling, therefore unifying all constant data as texels) - but they decided that the access patterns for constants don't match the access patterns for textures:

Constants are typically accessed at much higher frequencies than textures and often using indices that are uniform across sets of vertices or pixels, whereas textures are accessed at lower frequencies and with different indexes (texture coordinate values). This suggested that there were hardware implementation advantages to keeping constants and textures distinct.
Jawed
 
Rys said:
So the register file just for pixel shading is upwards of 1MiB on R580? :oops:
Yep, something that I've enumerated more than once now :D

Rys said:
And there's 1K of temporaries on R600 for all executing threads.
I think this means that R600 supports 1024 CB elements - i.e. a long way short of that slide I've posted above.

4096 CB elements is 64KB of memory. That slide is suggesting that you can define 16 CBs per shader program (each with 4096 elements). The slide also suggests that you can have more than 16 CBs active on the GPU at one time.

That's just insane, we're talking about C0, C1 ... C65535 being usable by any single shader. And another shader loaded on the GPU having access to another set of 65536 constants, totally separate from the first set. etc.
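The arithmetic behind those numbers, as a rough sketch. It assumes each CB element is a 16-byte vec4, which is what the 64KB-per-CB figure implies; the names are illustrative and not taken from the slide.

```python
# Limits as read off the slide in the post above.
ELEMENTS_PER_CB = 4096
CBS_PER_SHADER = 16

# Assumption: one CB element = 4 components x 4 bytes.
BYTES_PER_ELEMENT = 16

bytes_per_cb = ELEMENTS_PER_CB * BYTES_PER_ELEMENT
constants_per_shader = ELEMENTS_PER_CB * CBS_PER_SHADER
highest_constant_index = constants_per_shader - 1
bytes_per_shader = bytes_per_cb * CBS_PER_SHADER

print(bytes_per_cb // 1024, "KB per CB")
print("C0 .. C%d per shader" % highest_constant_index)
print(bytes_per_shader // (1024 * 1024), "MiB of constants per shader")
```

So a single shader that filled all 16 CBs would be addressing a full 1MiB of constant data.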

So it seems to me that R600 is using a much smaller indexed temporary count (constants) of 1024. If you define more by virtue of a set of shaders concurrently loaded, then I suppose that's going to cause swapping to/from local memory (VRAM).

It seems analogous to R580 only having enough register file for 3 FP32s per fragment, not 32.

Jawed
 
Wow. A lot of this is way over my head.

What does this all mean for the Direct3D 10 API, GPUs and games during, say, the 2007 timeframe?

Small leap, medium leap or large leap beyond current DX9 SM3.0 games?
 
Jawed said:
Yep, something that I've enumerated more than once now :D
I'd forgotten you were certain on 3 temps per fragment. I still go by 2. Time to switch!

Jawed said:
I think this means that R600 supports 1024 CB elements - i.e. a long way short of that slide I've posted above. *snip*
Checking my notes, I think that's it :)
 
Here's a nice worked example, by Jack, of a shader that implements a simple materials system. It supports up to 16 different materials:

http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=316777&reply_id=2521733

(section A simple attribute-range based shader running entirely on the GPU)

and applies the material depending on which face of the object is being pixel shaded (the face being shaded effectively falls into one of up to 16 ranges of face IDs). So there's 33 constants, in effect, being used by the shader (49 if you're being picky).

When you look at it like that, you can see how quickly you might use up all 65536 constants that are available to each shader :p

---

Rys, on the 2 v 3 thing, you know what, I'm now completely uncertain which :devilish:

Jawed
 
Jawed said:
4096 CB elements is 64KB of memory. That slide is suggesting that you can define 16 CBs per shader program (each with 4096 elements). The slide also suggests that you can have more than 16 CBs active on the GPU at one time.

That's just insane, we're talking about C0, C1 ... C65535 being usable by any single shader. And another shader loaded on the GPU having access to another set of 65536 constants, totally separate from the first set. etc.
Well, it looks to me like these are just another type of texture, defined differently for hardware optimization reasons (due to different usage patterns). At 64KB a pop, you could have quite a lot of these stored on the GPU indeed. This sounds rather distinct from the temporary register array to me.

But the 4k temporary values still sound insane to me. That's 64KB just for one in-flight pixel. The only way the hardware could have any reasonable number of pixels in flight with a temporary buffer this large would be if the temporary register arrays were stored in external memory.
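To put a rough number on that, here is a toy model of the threads-versus-registers trade-off discussed earlier in the thread. It assumes a fixed on-chip register file shared evenly by all in-flight fragments with no spilling, and it borrows the R580 estimate posted above (48 x 3 x 128 x 4 registers) as the file size. The function name and the model itself are hypothetical, not a description of any real GPU.

```python
# Register-file size taken from the R580 estimate earlier in the thread.
REGISTER_FILE_REGS = 48 * 3 * 128 * 4  # 73728 registers

def fragments_in_flight(regs_per_fragment, register_file=REGISTER_FILE_REGS):
    """Hypothetical model: how many fragments fit if each one reserves
    regs_per_fragment registers and nothing spills to other storage."""
    return register_file // regs_per_fragment

for regs in (3, 32, 4096):
    print(regs, "registers per fragment ->", fragments_in_flight(regs), "fragments in flight")
```

Under these assumptions, a shader that really used all 4096 temporaries would leave room for only a handful of fragments on chip, which is why spilling to other storage comes up.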
 