DX10 Checklist: What made it into D3D10? What was cut?

stevem said:
Is that a general D3D10 pipeline issue, refrast issue, or IHV implementation specific?
Well, think about it this way: when running a geometry shader, you still want rasterization, blending, and other ROP operations to happen in the same order as the polygons were sent by your app. But since geometry shaders can create a variable amount of geometry, you effectively need to serialize their outputs.

So, the limit on how much you can parallelize geometry shaders is determined almost exclusively by how much on-board memory you're willing to dedicate to storing geometry shader outputs.

For example, if a geometry shader thread outputs 8 vertices with 8 vec4s of data each, you need 1 KB of memory per thread. If all you have is a 32 KB buffer, for example, you can only run 32 GS threads. That's not very many. If you have 8 GS units, you only get 4 threads per unit to cover both the ALU and texturing latency.

If your GS outputs more vertices, or more attributes, you can run even fewer threads (or need more memory to store the results).

To cover texturing latency, you really want several hundred threads running. That can easily push your on-board memory needs to 0.5 MB - 1 MB. That's as large as whole CPU caches!
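A rough back-of-the-envelope sketch of that, using the 1 KB-per-thread figure from the example above (the thread counts are illustrative):

Code:
512 threads x 1 KB/thread = 512 KB = 0.5 MB
256 threads x 4 KB/thread = 1 MB     (4 KB/thread = the D3D10 worst case, see below)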

Several solutions are possible, but all have significant downsides: equip GPUs with a boatload of memory (costs $$$), be able to spill to DRAM (but that has plenty of issues of its own), remove the ordering constraint (possible random screen corruption, unexpected behaviors, etc.), implement some way to sort fragments in the ROP (may require as much, if not more, memory than the current scheme), etc.
 
This may be a dumb question, but what are the use cases for doing read/write on the current render target with complex self-overlapping geometry (and not e.g. screen-sized quads)?

AFAICS a fast solution for the non-overlapping case would already be very useful. I guess this would be something along the lines of "provide the current framebuffer value, but punt on coherency completely".

Another thought - instead of providing HW for serializing shader execution, could render target read/writes be treated more like very limited load/store instructions in a CPU? For example, if one were to provide an atomic CAS instruction (limited to the contents of a pixel's render targets), then it should be possible to implement anything from no coherency at all to shader serialization (via locking) in the shader code itself.

Edit: grammar and sp.
 
Yes, but a debug message that it writes out during startup led me to believe that the D3D10 version has multicore optimizations.
D3D10 is, by default, thread aware/safe now. Now that it has its layers mechanism, I think you have to actually opt out of this feature.

Which brings me on to another cool feature that I've yet to be able to try... From the docs I've got, you should be able to swap the RefRast in/out dynamically at runtime, which would make some debugging a bit easier.

what are the use cases for doing read/write on the current render target
Programmable blending is the primary one... I can't think of any specific effects off the top of my head, but the OM stage is one of the few remaining fixed-function sections that can really make a difference to how the final result is written...

If your GS outputs more vertices, or more attributes, you can run even fewer threads (or need more memory to store the results).

To cover texturing latency, you really want several hundred threads running. That can easily push your on-board memory needs to 0.5 MB - 1 MB. That's as large as whole CPU caches!
The GS is required to allow up to 1024 values to be written out... thus a single invocation of a GS could output 4 KB of data that the GPU must handle.
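To make the limit concrete, here's a minimal HLSL sketch (identifiers are illustrative, not from any SDK sample). The rule is that the declared maxvertexcount times the number of scalars per output vertex may not exceed 1024:

Code:
struct GSOut
{
    float4 pos   : SV_Position;  // 4 scalars
    float4 color : COLOR0;       // 4 scalars -> 8 scalars per vertex
};

// 128 vertices * 8 scalars/vertex = 1024 values, the D3D10 ceiling;
// declaring maxvertexcount(129) with this struct would fail validation.
[maxvertexcount(128)]
void MaxedOutGS(point GSOut input[1], inout PointStream<GSOut> stream)
{
    for (int i = 0; i < 128; ++i)
        stream.Append(input[0]);  // 128 * 8 scalars * 4 bytes = 4 KB emitted
}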

From the developer's point of view, the biggest change IMHO is the banishing of the caps.
That's what people would like to believe, but I'm sceptical. Maybe somewhat fewer paths, but different paths are most likely going to be a reality in the future as well, as long as hardware has different performance characteristics.
I'm with Humus on this one. The fixed-caps stuff is here to stay from everything I've heard, but it isn't some magical solution - at best it solves ~50% of the configuration/compatibility problems.

I expect "performance-related caps" to become a bigger thing in the future - they already exist now (a GfFX5200 does SM2, just piss-poor slow!) so it won't be too painful.

I have read that with DX10 you can now do a cube map in a single pass and whatnot.
Yes, this is part of the more abstract resource views and arrays. The single-pass cube mapping is a great example because it's easy to see the advantage, but it's far from the only use.

I've not tried it yet, but I think I can use the same technology to fold the classic Fresnel-weighted reflect/refract water effect into 2 passes via D3D10. The same thing should take 4 passes with D3D9. It'll be difficult to compare performance - but the general efficiency of the new API, as well as the flexibility in allowing me to be more "direct" about implementing algorithms, should yield big performance wins. That means you have performance budget to invest elsewhere - either simply higher resolutions and MSAA, or more effects in more places.
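For anyone who hasn't seen how the single-pass cube map works, here's a trimmed-down sketch (identifiers are illustrative): a GS replicates each triangle six times and uses SV_RenderTargetArrayIndex to steer each copy to one face of a cube render-target array.

Code:
struct GSIn  { float4 wpos : POSITION; };
struct GSOut { float4 pos : SV_Position; uint face : SV_RenderTargetArrayIndex; };

cbuffer cbPerCube { float4x4 gFaceViewProj[6]; }  // one view-proj per cube face

[maxvertexcount(18)]  // 6 faces * 3 vertices
void CubeMapGS(triangle GSIn input[3], inout TriangleStream<GSOut> stream)
{
    for (uint f = 0; f < 6; ++f)
    {
        for (int v = 0; v < 3; ++v)
        {
            GSOut o;
            o.pos  = mul(input[v].wpos, gFaceViewProj[f]);
            o.face = f;              // selects the render-target array slice
            stream.Append(o);
        }
        stream.RestartStrip();       // one triangle per face
    }
}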


Okay, a couple more features that haven't been mentioned yet:

• Effect framework moves into the core runtime, is leaner and meaner

• All shader authoring is now in HLSL, no more assembly shaders

• Comparison filtering methods in the PS - great for PCF/shadow mapping stuff (quick sketch after this list). I've not tried it extensively, but the bits-n-pieces I've read about D3D9 shadow mapping are that it can rely on various IHV "quirks" and features to get the best performance/quality. Having it mandated by the core runtime and equal across chipsets strikes me as a big win.

• Material systems being run on the GPU. I wrote a mini-article about this and want to push the work a bit harder - I think it's got a lot of potential

• The fixed and very rigorously defined calculation/computation rules should not be underestimated.
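On the comparison-filtering point above, a minimal sketch of what the PS side looks like (resource names are illustrative):

Code:
Texture2D              gShadowMap;   // depth from the light's point of view
SamplerComparisonState gCmpSampler;  // e.g. COMPARISON_MIN_MAG_LINEAR_MIP_POINT, LESS_EQUAL

float ShadowTerm(float2 uv, float receiverDepth)
{
    // The hardware compares receiverDepth against the fetched texels and
    // returns the filtered pass/fail result in [0,1] - PCF in one call.
    return gShadowMap.SampleCmpLevelZero(gCmpSampler, uv, receiverDepth);
}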

Think that's all I've got for now.
Jack
 
As a GS generates vertices, does it output them serially (e.g. the shader might consist of a while loop, and each iteration of the loop outputs a vertex or set of vertices (primitive))?

Or are all the vertices dumped in one lump as the final output?

Streamout plays an optional part here, doesn't it? In that case, data would be output serially, wouldn't it?

Jawed
 
psurge said:
This may be a dumb question, but what are the use cases for doing read/write on the current render target with complex self-overlapping geometry (and not e.g. screen-sized quads)?
Dunno - volumetric fog with non-convex fog volumes and other forms of depth-difference-based blending, light polarization effects, non-HDR-framebuffer-dependent tone mapping, color-key effects? (those are the ones that I can think of straight away; a game developer may be able to give a more comprehensive list).
AFAICS a fast solution for the non-overlapping case would already be very useful. I guess this would be something along the lines of "provide the current framebuffer value, but punt on coherency completely".
For the non-overlapping case, I would expect it to be possible to get reasonable performance by just playing render-to-texture tricks with present-day hardware - at least if you are processing large regions in one go and do not mind loss of anti-aliasing.
Another thought - instead of providing HW for serializing shader execution, could render target read/writes be treated more like very limited load/store instructions in a CPU? For example, if one were to provide an atomic CAS instruction (limited to the contents of a pixel's render targets), then it should be possible to implement anything from no coherency at all to shader serialization (via locking) in the shader code itself.
Not very likely. If you try to use such an instruction for locking, you still get a race condition when two overlapping fragments in the pixel shader pipeline try to lock the same framebuffer pixel.
 
JHoxley said:
D3D10 is, by default, thread aware/safe now. Now that it has its layers mechanism, I think you have to actually opt out of this feature.

I know. For my managed layer I have touched nearly every single method, enumeration, and function in the API. But this multithreading message was from the RefRast and not from the runtime.

JHoxley said:
I'm with Humus on this one. The fixed-caps stuff is here to stay from everything I've heard, but it isn't some magical solution - at best it solves ~50% of the configuration/compatibility problems.

Even in this case, 50% is better than nothing. With DX9 you have to deal with different caps and different performance. D3D10 eliminates at least the caps problem.

JHoxley said:
I expect "performance-related caps" to become a bigger thing in the future - they already exist now (a GfFX5200 does SM2, just piss-poor slow!) so it won't be too painful.

Maybe WinSAT will help with this a little bit.
 
Jawed said:
As a GS generates vertices, does it output them serially (e.g. the shader might consist of a while loop, and each iteration of the loop outputs a vertex or set of vertices (primitive))?

Or are all the vertices dumped in one lump as the final output?

Streamout plays an optional part here, doesn't it? In that case, data would be output serially, wouldn't it?

Jawed
One vertex is emitted at a time so if you want to output 32 vertices it will take at least 32 clocks. Of course the hardware can be processing multiple GS prims in parallel. I'm talking about the output from a single GS primitive.
 
arjan de lumens said:
Not very likely. If you try to use such an instruction for locking, you still get a race condition when two overlapping fragments in the pixel shader pipeline try to lock the same framebuffer pixel.

Yes, that was a brainfart on my part. The HW would need to provide some kind of guarantee anyway. If primitive B follows A but pixels from B finish before those from A are even scheduled, then you're hosed no matter what you do... So altogether a stupid idea.

I don't know what most of your examples actually are :smile:. For the concave fog volume case though, here's my idea: for each front-facing polygon, compute the difference between pixel z and scene z and add it to the current covered z range. For back-facing polygons, same thing, but subtract from the current covered z range. It doesn't really matter which order the additions/subtractions occur in (assuming the accumulator in use has sufficient range and precision)... Now I'm sure a real-world task is much more complicated than that, but if for whatever reason submission order does matter (say you wanted to consider depth intervals back to front), then won't you run into problems from z-sorting errors anyway? Or are those generally speaking hard to spot when mostly transparent things are involved?
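A minimal sketch of that accumulation pass, assuming additive blending (ONE/ONE) into a high-precision target and the scene depth bound as a texture (all names illustrative):

Code:
Texture2D gSceneDepth;  // previously rendered scene depth

float4 FogThicknessPS(float4 pos     : SV_Position,
                      bool   isFront : SV_IsFrontFace) : SV_Target
{
    float sceneZ = gSceneDepth.Load(int3(pos.xy, 0)).r;
    float faceZ  = min(pos.z, sceneZ);  // clip the fog volume against the scene
    // Front faces subtract, back faces add: for a closed volume the sum is the
    // covered z range, and with an additive accumulator order doesn't matter.
    return float4(isFront ? -faceZ : faceZ, 0, 0, 0);
}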
 
Demirug said:
From the developer's point of view, the biggest change IMHO is the banishing of the caps. This will free us from the need to develop different code paths/effects for every GPU family from the different vendors. Unfortunately this will have a price. No old GPU can be used with D3D10; you are forced to stay with D3D9 for these chips.
I would suspect that this actually wouldn't be something entirely new for developers.

This matter of the disappearing CAPs where developers are concerned is strictly a matter of development schedule. I expect most developers will stick with D3D9 for a while, but a couple of projects down the line, the target market will look good enough for D3D10 because the market will be flush with plenty of D3D10 hardware by then (as per history). This is the way it has always been. Your post implies an assumption that the CAPs matter will affect or determine whether D3D10 games will be abundant quickly or slowly when D3D10 makes its debut. I'm sure you know this cannot be the case since you are a developer :)
 
Jawed said:
As a GS generates vertices, does it output them serially (e.g. the shader might consist of a while loop, and each iteration of the loop outputs a vertex or set of vertices (primitive))?

Or are all the vertices dumped in one lump as the final output?
The output of a GS, be it normal output or to SO, is via a stream. You Append() a vertex to the stream and then RestartStrip() when you want to start a new set of (non-triangle-strip) vertices. So yes, serially...
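In HLSL it looks something like this - a point-to-quad expansion, with each Append() emitting one vertex into the stream in order (identifiers are illustrative):

Code:
struct VSOut { float4 pos : SV_Position; };

[maxvertexcount(4)]
void PointToQuadGS(point VSOut input[1], inout TriangleStream<VSOut> stream)
{
    const float2 corner[4] = { float2(-1, -1), float2(-1, 1),
                               float2( 1, -1), float2( 1, 1) };
    for (int i = 0; i < 4; ++i)  // one vertex per iteration, serially
    {
        VSOut v;
        v.pos = input[0].pos + float4(corner[i] * 0.01f, 0.0f, 0.0f);
        stream.Append(v);
    }
    stream.RestartStrip();       // cut the 4-vertex strip
}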

Demirug said:
JHoxley said:
D3D10 is, by default, thread aware/safe now. Now that it has its layers mechanism, I think you have to actually opt out of this feature.
I know. For my managed layer I have touched nearly every single method, enumeration, and function in the API. But this multithreading message was from the RefRast and not from the runtime.
I know you know - I've seen your work ;)

I was more posting it as an aside for the general conversation in this thread than something you particularly may (or may not) need to know...

Demirug said:
Even in this case, 50% is better than nothing. With DX9 you have to deal with different caps and different performance. D3D10 eliminates at least the caps problem.
Yeah, it's definitely a good thing - I don't think anyone will argue with that. I think it's just a case of some people running off with the idea that it automagically solves all compatibility problems.

Demirug said:
Maybe WinSAT will help with this a little bit.
I was looking into that recently; the docs don't seem to offer much granularity though. I forget the GPU-specific measure, but for the system to have an overall "5" rating it must have D3D10 hardware... currently "5" seems to be the highest, which doesn't give much by way of useful info.

Cheers,
Jack
 
JHoxley said:
Yeah, it's definitely a good thing - I don't think anyone will argue with that. I think it's just a case of some people running off with the idea that it automagically solves all compatibility problems.

Aside from the typical driver errors, the compatibility problems should go away, as any D3D10 hardware has to execute every D3D10 application. OK, your customer will still be very unhappy if it runs slow as hell, but at least it will run. How many times have we seen that current games need a patch first to run with a new GPU? I still believe that D3D10 will solve this problem.

The problem of different performance will never be solved on the PC platform.

But I think we both mean the same.

JHoxley said:
I was looking into that recently; the docs don't seem to offer much granularity though. I forget the GPU-specific measure, but for the system to have an overall "5" rating it must have D3D10 hardware... currently "5" seems to be the highest, which doesn't give much by way of useful info.

Cheers,
Jack

IIRC I read that the WinSAT API will give you much finer results than the numbers you can see in the Vista performance dialog.

But as always, a general benchmark is not the best solution for finding the right settings for a game.
 
3dcgi said:
One vertex is emitted at a time so if you want to output 32 vertices it will take at least 32 clocks. Of course the hardware can be processing multiple GS prims in parallel. I'm talking about the output from a single GS primitive.
Thanks.

So, in a unified shader architecture, the post-GS-cache will be consumed (and the resulting threads load-balanced) as the primitives/vertices are generated.

Jawed
 
JHoxley said:
All shader authoring is now in HLSL, no more assembly shaders

How are shaders handled by the driver now? Are they still precompiled to some kind of assembly after authoring like in DX9, or are they compiled at runtime by the driver like GLSL is?
 
Jawed said:
Thanks.

So, in a unified shader architecture, the post-GS-cache will be consumed (and the resulting threads load-balanced) as the primitives/vertices are generated.

Jawed
Only if the GS is single-threaded. If it is multi-threaded (and/or there are multiple GS units) there will be one thread per GS primitive, and you need additional buffering/locking so that the first vertex from thread N+1 is not consumed from the post-GS-cache before the last vertex of thread N (even if it is generated a long time before).
 
NocturnDragon said:
How are shaders handled by the driver now? Are they still precompiled to some kind of assembly after authoring like in DX9, or are they compiled at runtime by the driver like GLSL is?
IIRC, DX10 shaders are compiled from HLSL to some sort of assembly by the DX10 runtime, so the actual driver itself still sees assembly code - you just aren't permitted to supply assembly code from the application side. IIRC it is also possible to obtain a disassembly of the compiler output.
 
arjan de lumens said:
IIRC, DX10 shaders are compiled from HLSL to some sort of assembly by the DX10 runtime, so the actual driver itself still sees assembly code - you just aren't permitted to supply assembly code from the application side. IIRC it is also possible to obtain a disassembly of the compiler output.

The runtime compiles an HLSL shader to a blob (binary large object). The shader-creation methods take such a blob to build a real shader object. You can still compile the shader as part of the production pipeline and distribute the binary version. The difference from D3D9 is that there is no shader assembler available. But there is a disassembler.
 
How are shaders handled by the driver now? Are they still precompiled to some kind of assembly after authoring like in DX9, or are they compiled at runtime by the driver like GLSL is?
Just to add to what's already been said... The runtime still compiles the HLSL code (the HLSL compiler is built into the core runtime, whereas strictly speaking D3D9's is not part of the runtime) and sends ASM on to the driver. When we asked about this (vs the GLSL route), the logic was that it gives the drivers less room to be difficult and implement things differently. I definitely agree with this - it'd be a real pain if not only different IHVs but also different versions of drivers had different compilers :oops:

The advantage of disallowing developers from submitting "raw" ASM and forcing compilation via HLSL is that the compiler can be guaranteed to spit out valid, verified and well-formed ASM. This reduces the amount of validation the driver has to do, as it knows that any incoming ASM is good quality.

fwiw, an example of a ps_4_0 shader in assembly:
Code:
ps_4_0
dcl_input_sgv  v0.x , primitive_id
dcl_output o0.xyzw
dcl_constantbuffer_dynamic  cb0[33]
dcl_temps 2
mov r0.xyzw, l(0, 0, 0, 0x3f800000)
mov r1.x, l(0)
loop 
  ilt r1.y, r1.x, cb0[16].x
  not r1.y, r1.y
  breakc_nz r1.y
  uge r1.y, v0.x, cb0[r1.x + 0].x
  uge r1.z, cb0[r1.x + 0].y, v0.x
  and r1.y, r1.y, r1.z
  movc r0.xyzw, r1.yyyy, cb0[r1.x + 17].xyzw, r0.xyzw
  iadd r1.x, r1.x, l(1)
endloop 
mov o0.xyzw, r0.xyzw

IIRC I read that the WinSAT API will give you much finer results than the numbers you can see in the Vista performance dialog.
You'll be able to grab a WinSATInfoLevel1 structure that contains a 'D3DMetric' and 'GraphicsMetric', but quite what scale the values are isn't exactly clear:
Actual engineering metrics for the D3D sub-system. Display the value with a total of four significant digits. The value is 0.0 if the current assessment is unavailable or not valid

hth
Jack
 
arjan de lumens said:
If you're talking about tile-based renderers, they aren't really that much better off for framebuffer-in-shader reads, even though they reduce the size penalty of the extra overlap checking. If you have N pixel shader invocations on a given pixel (for N overlapping polygons), and each of them reads the framebuffer content that resulted from the previous invocation, you still end up forcing all the invocations on the pixel to execute serially. Given that polygons, when rendered, often exhibit multiple orders of magnitude less temporal separation in a tiler than what you'd see in an immediate-mode renderer, this kind of serial execution is much more likely to appear (and seriously harm performance) in a tiler than in a non-tiler.
Serial execution is very similar to having a long shader in which segments depend on the previous ones - except that the compiler can only optimize inside a given shader and not across combinations of shaders.
That's not an uncommon case, so if that seriously harmed performance you'd likely have low performance in the first place.
 
Demirug said:
If I have understood the concept right, the only caps in the future will be the version number.

Well, the features are fixed, but there are actually optional texture formats and capabilities in D3D10 (even though most have to be supported), so I already see a slight crack in this otherwise strict model. That's why I'm thinking we'll head back toward a more liberal approach a few revisions into the future, probably by developer request (ironically enough). I wouldn't be surprised if developers end up thinking that a common API with coarse caps is better than separate APIs for different generations but with strict functionality. Of course, I might be wrong; this is just my speculation.
 