Your Wish List : Things that didn't make D3D10

Yes, there is some support for tessellation in DX9, but it doesn't go very far, and D3D10 doesn't support it anymore. The R600 tessellator should be able to support the DX9-style tessellation, but as this thread is about D3D10 I didn't include this in my first answer.
 
Why would they remove it? It seems like a good feature to have. Or can you do the same stuff with geometry shaders? Maybe I'm misunderstanding what they are for.

Another question: in the R600 refresh, do you think they will remove that tessellation unit and maybe replace it with something else, render back ends maybe?

ps: found another quote
"Microsoft is pushing hard to make tessellation a requirement of the next DirectX (DirectX 10.1 or DirectX 11 or whatever they end up calling it), so ATI may be a little ahead of the curve here."
 
The DX9 tessellation has a fixed-function style and was never really supported, so it was logical for Microsoft to remove it. You can use the geometry shader to do tessellation (see the sketch below), and I don't know what the R600 can do better with its dedicated unit.

As I am not sure how big this unit is, it may or may not be interesting to remove it.
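
For example, a single level of uniform 1:4 subdivision is only a short geometry shader. A rough sketch (the struct is just a placeholder, and this only splits flat triangles; real adaptive tessellation with displacement needs more work, probably multiple passes):

Code:
// Minimal sketch: one level of uniform 1:4 triangle subdivision in a D3D10 GS.
// VSOut is a placeholder vertex layout; positions are interpolated linearly,
// so this adds triangles but no new surface detail on its own.
struct VSOut
{
    float4 pos : SV_Position;
    float2 uv  : TEXCOORD0;
};

VSOut Midpoint(VSOut a, VSOut b)
{
    VSOut m;
    m.pos = 0.5f * (a.pos + b.pos);
    m.uv  = 0.5f * (a.uv  + b.uv);
    return m;
}

[maxvertexcount(12)]
void SubdivideGS(triangle VSOut v[3], inout TriangleStream<VSOut> stream)
{
    VSOut m01 = Midpoint(v[0], v[1]);
    VSOut m12 = Midpoint(v[1], v[2]);
    VSOut m20 = Midpoint(v[2], v[0]);

    // Three corner triangles plus the centre triangle.
    stream.Append(v[0]); stream.Append(m01); stream.Append(m20); stream.RestartStrip();
    stream.Append(m01); stream.Append(v[1]); stream.Append(m12); stream.RestartStrip();
    stream.Append(m20); stream.Append(m12); stream.Append(v[2]); stream.RestartStrip();
    stream.Append(m01); stream.Append(m12); stream.Append(m20); stream.RestartStrip();
}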
 
For DX10.1 I hope:
- 64-bit FP precision support (that includes Z-buffer and stencil)
- Full (32/64-bit) floating-point texture filtering.
- Implement cubemap arrays
- Minimum required AA caps
- Improved SLI/multi-GPU synchronization/coordination routines.
- "Jumbo" 32/64-bit floating-point texture support for GPGPU computing, with the corresponding sampler. Basically what I want is the possibility to allocate 768MB in a 1D texture, like CUDA. I'm not sure, but isn't the maximum DX10 texture size 4096, and 8192 for DX10.1?

For DX11 I could use:

- A fully programmable per-pixel Blend Shader with full R/W support.

- A fully customizable and programmable AA shader.

- Second-depth Z-buffer support ( for SSS, shadow bias, simple transparency sorting, etc )

- Multiple texture fetches in one call, something like ATI's Fetch4. For example, imagine I want to get the 17x17 neighboring texels around a point in a cubemap... In code:

Code:
void myPS()
{
    // texCUBEMultiFetched is the proposed intrinsic, not existing HLSL
    float multifetchedValues[17][17] = texCUBEMultiFetched(cubeSampler, myVec3, 17, 17);
}
That "texCUBEMultiFetched" will perform a simple cubemap texel fetch. Then it gets the 17x17 surrounding samples in the fetched cube face. A special case must be implemented in case the neighbors uses a different cube face of course. A tex1D/2D version could be useful too.

This could be used for penumbra shadows, PCF, etc...
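
For comparison, this is roughly what the same kind of neighborhood gather looks like today for the plain 2D case, with one fetch per texel (just a sketch; the texture/sampler names are made up):

Code:
// Hedged sketch: gathering a 17x17 texel neighborhood "by hand" in D3D10 HLSL.
// shadowMap, pointSampler and texelSize are assumed to be bound by the app.
Texture2D    shadowMap;
SamplerState pointSampler;
float2       texelSize;   // 1.0 / texture dimensions

float SumNeighborhood(float2 uv)
{
    float sum = 0.0f;
    [loop]
    for (int y = -8; y <= 8; ++y)
    {
        for (int x = -8; x <= 8; ++x)
        {
            // One separate fetch per texel - 289 samples in total.
            sum += shadowMap.SampleLevel(pointSampler, uv + float2(x, y) * texelSize, 0).r;
        }
    }
    return sum / (17.0f * 17.0f);
}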

- Much more advanced texture compression based on wavelets (like JPEG 2000/WMP, but with block decoding). This may be questionable, especially with GDDR prices coming down, but... well, some graphics cards can decode MPEG in HW, so this could be possible.

- A good solution for alpha-blended transparency (like an A-buffer, blah blah), because all the current methods are either lacking or too slow, like depth peeling.

- A simple raycast HLSL instruction would help too and could be the start of raytracing. I heard NVIDIA is working on a demo showing this. Something like creating an acceleration structure when you load a mesh (VB+IB) into VRAM, then doing an optimized ray-triangle test in local space with an HLSL instruction called "raycast" inside the vertex/geometry/pixel shader.
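
Just to sketch what such a "raycast" instruction would have to wrap for each candidate triangle (a standard Möller-Trumbore style test; all the names here are made up):

Code:
// Sketch of the per-triangle test a hypothetical raycast() intrinsic would
// run against an acceleration structure. Returns the hit distance t, or -1 on a miss.
float RayTriangle(float3 orig, float3 dir, float3 v0, float3 v1, float3 v2)
{
    float3 e1  = v1 - v0;
    float3 e2  = v2 - v0;
    float3 p   = cross(dir, e2);
    float  det = dot(e1, p);
    if (abs(det) < 1e-6f)          // ray parallel to the triangle plane
        return -1.0f;

    float  invDet = 1.0f / det;
    float3 tvec   = orig - v0;
    float  u = dot(tvec, p) * invDet;
    if (u < 0.0f || u > 1.0f)
        return -1.0f;

    float3 q = cross(tvec, e1);
    float  v = dot(dir, q) * invDet;
    if (v < 0.0f || u + v > 1.0f)
        return -1.0f;

    float t = dot(e2, q) * invDet;
    return (t > 0.0f) ? t : -1.0f;
}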

Just my 2 cents.
 
Full support for stereo rendering, not just a stereo backbuffer.

You can easily do single pass stereo rendering in DX10 with the GS.
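
Roughly like this, as a minimal sketch (the cbuffer and struct names are just placeholders): render into a two-slice render target array and have the GS emit every triangle once per eye.

Code:
// Sketch of single-pass stereo with the D3D10 GS: each triangle is emitted
// once per eye into a 2-slice render target array via SV_RenderTargetArrayIndex.
cbuffer Stereo { float4x4 eyeViewProj[2]; };

struct GSIn  { float4 worldPos : POSITION; float2 uv : TEXCOORD0; };
struct GSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0;
               uint slice : SV_RenderTargetArrayIndex; };

[maxvertexcount(6)]
void StereoGS(triangle GSIn v[3], inout TriangleStream<GSOut> stream)
{
    [unroll]
    for (uint eye = 0; eye < 2; ++eye)
    {
        [unroll]
        for (uint i = 0; i < 3; ++i)
        {
            GSOut o;
            o.pos   = mul(v[i].worldPos, eyeViewProj[eye]);
            o.uv    = v[i].uv;
            o.slice = eye;
            stream.Append(o);
        }
        stream.RestartStrip();
    }
}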

- 64bits FP precision support ( that includes ZBuffer and stencil )

Why would you want 64bit Zbuffers? Well, DX10 has 64bit depth-stencil surfaces, it's just that 24bits are not used. Are you hoping for like 56bits depth and 8bit stencil?
 
Why would you want 64bit Zbuffers? Well, DX10 has 64bit depth-stencil surfaces, it's just that 24bits are not used. Are you hoping for like 56bits depth and 8bit stencil?
Nope, not really! I want more than 8 bits for stencil... let's say 32-bit Z + 32-bit stencil. I think basically the stencil should be able to work with 32-bit object IDs... 256 IDs are definitely not much, hehe.
Also, I could use a 64-bit Z-buffer (double precision) with no stencil :D (for example for large camera frustums or more accurate shadow buffers).
And now that we're talking about the Z-buffer, I could use that second-depth Z-buffer for SSS and to mitigate shadow biasing problems too!

I think double precision is coming in DX10.1 though.
 
It is true that as you begin using Stencil for things that it wasn't really made for, D32_S32 or even D32_S8S8S8S8 with a multiplexer (or, fully decoupling depth and stencil) might have some uses. I'm not convinced those are very important, but I'm sure if you really wanted to, you could think of some cool stuff there.

In the end, none of this makes sense today. However, within the next 5 years it could suddenly start making a lot of sense, when the programmable shader core replaces the ROPs completely... (No, AMD, you can't claim you were forward-looking by up to 5 years, sorry! ;))
 
I think basically the stencil should be able to work with 32-bit object IDs... 256 IDs are definitely not much, hehe.
Unless you also get the ability to output stencil from the pixel shader (which is likely to come with a performance impact), supporting this number of stencil bits for object IDs would imply you're rendering each of those objects in its own separate call, which isn't good for batch performance.
Btw there is a D32_S8 format in D3D10 (32 bit depth, 8 bit stencil).

Also, I could use a 64-bit Z-buffer (double precision) with no stencil
Why not? However, currently the lack of depth precision often comes from poor utilization of projection matrices more than from the "limited" bit precision of depth buffers. I suppose a space rendering engine (with planets and spaceships etc.) might benefit from 64-bit depth without the hassle of having to partition your depth range.
 
I want more than 8 bits for stencil... let's say 32-bit Z + 32-bit stencil. I think basically the stencil should be able to work with 32-bit object IDs... 256 IDs are definitely not much, hehe. DX10 supports it but I don't think the current HW can actually use it.
Just use an int32 texture and dynamic branching for "early out". I seriously doubt a hardware-implemented stencil buffer would be any faster than that on modern hardware, particularly if you're outputting stencil values from the shader. Hell, I can't even get early-stencil to work properly in many *normal* cases!
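
Something along these lines (a rough sketch with made-up names; a flat colour stands in for the real lighting work):

Code:
// Hedged sketch of the "int texture + dynamic branching" idea: reject pixels
// whose stored object ID doesn't match before doing the expensive work.
Texture2D<uint> idTexture;                      // per-pixel object IDs from a prior pass
cbuffer Light { uint lightId; float4 lightColor; };

float4 LightPS(float4 pos : SV_Position) : SV_Target
{
    uint id = idTexture.Load(int3(pos.xy, 0));

    [branch]
    if (id != lightId)
        discard;                                // "early out" instead of a stencil reject

    // ...the expensive lighting would go here; a flat colour stands in for it.
    return lightColor;
}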

Stencil buffers are useful for a few "tricks" (like stencil routing - particularly when combined with MSAA!), but honestly many of the things that people use them for can be performed just as efficiently with a normal texture nowadays. The same can't quite be said for blending operations (for which you can do similar read-modify-write cycles using stencil if you want to) due to double-buffering issues, but I suspect that will eventually be the case as well.

I'd actually rather have *less* fixed hardware like depth and stencil and more programmable stuff :) Depth still makes sense IMHO due to the commonality of its use and the semi-complex data structure that it implements, but stencil is already becoming questionable.
 
Unless you also get the ability to output stencil from the pixel shader (which is likely to come with a performance impact)
Just use an int32 texture and dynamic branching for "early out". I seriously doubt a hardware-implemented stencil buffer would be any faster than that on modern hardware particularly if you're outputting stencil values from the shader
Ok, what about something like a "blend shader" stage with the ability to read and write? That could be nice!

Basically, what it is is a fourth shader stage that could come after the fragment shader.
It takes the various bits of data generated by the fragment shader plus whatever is in the current fragment of the render target, and then outputs the result to that fragment. It could be a simple pass-through shader or perhaps something more complex...
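
In pseudo-HLSL it might look something like this (purely hypothetical; nothing like this exists in D3D10):

Code:
// Purely hypothetical pseudocode for the proposed "blend shader" stage.
// The point is a small program run per fragment with read access to the
// value currently in the render target.
struct BlendIn
{
    float4 srcColor;   // output of the pixel shader
    float4 dstColor;   // value currently in the render target
};

float4 MyBlendShader(BlendIn b) : SV_Target
{
    // Could be plain alpha blending...
    float4 result = lerp(b.dstColor, b.srcColor, b.srcColor.a);
    // ...or anything else: programmable accumulation, custom transparency tricks, etc.
    return result;
}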

There were some thoughts about this in the OpenGL forums, but it's going to be hard to implement, especially due to speed problems when reading values already "in use".

ps: Edited my prev post to add more crazy ideas :p
 
Can you give a little bit more info on that? I'm very curious about what doesn't work there, and whether you have any idea why! :)
I was using stencil for a while with deferred shading to stencil out light volumes (works nicely with z-buffering). However after some benchmarking I realized that while the stencil was *working*, it wasn't actually making it any faster than just shading the whole screen. I spoke to NVIDIA about it and they jokingly suggested that I rename my app to "Doom3.exe" ;) Basically early-stencil seems to work for exactly the case that Doom3's rendering path uses and pretty much nothing else, even in cases where it is theoretically possible as there are no data dependencies.

With respect to a "blend" stage, it would certainly be useful, but indeed read-modify-write cycles are difficult and somewhat expensive to implement in a programmable manner. The hardware people can probably explain more...

Many things can certainly be done with the current blend modes though, and more if we had bit-wise blending operations for integer textures.
 
I was using stencil for a while with deferred shading to stencil out light volumes (works nicely with z-buffering). However after some benchmarking I realized that while the stencil was *working*, it wasn't actually making it any faster than just shading the whole screen.
I believe what you're telling us, but that makes no sense at all! If you have a lot of volume lights to apply, then the cost of shading your scene should be much higher with a fullscreen pass per light compared to marking the volume areas with stencil and only shading those for each light. Of course this depends on your shader complexity, but overall this should be true (even if you use dynamic branching to reject out-of-range pixels during the shading passes). You're not using insanely tessellated volumes (spheres?) for the volume lights, are you? (On a unified architecture this may take some of the power you wanted for pixel shading.)

With regard to your comment that textures "can" do the same thing as stencil, yes, they probably can (especially on D3D10), but you cannot expect the same level of performance as stencil buffering. The stencil test is part of the pipeline and has dedicated hardware optimizations (like early stencil testing, as you mentioned - it's supposed to work :)), whereas an int texture needs to be written to and fetched like any other texture (both phases require dedicated shader instructions and consume precious color bandwidth).
 
I believe what you're telling us, but that makes no sense at all! If you have a lot of volume lights to apply then the cost of shading your scene should be much higher with a fullscreen pass per light compared to marking the volume areas with stencil and only shading those for each light.
Oh, definitely it should have been faster - that's why I was using it ;) I ended up just projecting the light volume BBs and using a scissor test on the GPU in that implementation, which was fast enough. Note that stencil was *entirely* broken on ATI/OpenGL at the time, and I still don't think early stencil works properly in that demo.

As I mentioned, when I spoke to NVIDIA their response was that making early stencil work is touch and go. Honestly, I don't think it tends to work in many cases other than the exact path of shadow volumes.

You're not using insanely tessellated volumes (spheres?) for the volume lights, are you? (on a unified architecture this may take some of the power you wanted for pixel shading).
Oh of course not - seriously, early stencil reject was just not working... it was happening *after* the shader.

With regard to your comment that textures "can" do the same thing as stencil, yes, they probably can (especially on D3D10) but you cannot expect the same level of performance as stencil buffering.
I dunno, it seems to me that hardware is getting pretty general and most of the specific API functionality is implemented in a general way in the driver anyway. This is especially true when you look at the design and flexibility that you get with something like CTM (in particular) or even CUDA. Maybe not this generation, but I don't see a need for a fixed-function stencil buffer in the long run.
 
I was using stencil for a while with deferred shading to stencil out light volumes (works nicely with z-buffering). However after some benchmarking I realized that while the stencil was *working*, it wasn't actually making it any faster than just shading the whole screen. I spoke to NVIDIA about it and they jokingly suggested that I rename my app to "Doom3.exe" ;) Basically early-stencil seems to work for exactly the case that Doom3's rendering path uses and pretty much nothing else, even in cases where it is theoretically possible as there are no data dependencies.

Nvidia hardware seems to be a lot more sensitive with stencil. On ATI hardware you should not have any trouble with early-out stencil. In fact, you'll probably see better performance that way in many cases than using dynamic branching. On R600 it should be even better as it has Hierarchical-stencil as well, unlike previous generations that could only reject on the EarlyZ stage. I haven't revisited this topic with R600, but my gut feeling is that early-out with stencil should be better than ever.
 
Nvidia hardware seems to be a lot more sensitive with stencil. On ATI hardware you should not have any trouble with early-out stencil. In fact, you'll probably see better performance that way in many cases than using dynamic branching.
Yeah, but that's all entirely beside the point, since ATI has quite possibly the most terrible MRT implementation in OpenGL (which is what this app used) that I've ever worked with :( Because of that simple fact, no ATI hardware could even touch NVIDIA 6s and 7s, let alone 8s.

On R600 it should be even better as it has Hierarchical-stencil as well, unlike previous generations that could only reject on the EarlyZ stage. I haven't revisited this topic with R600, but my gut feeling is that early-out with stencil should be better than ever.
Cool, although like I said I don't care that much about stencil. It can be useful for a few algorithms but IMHO it's a bit of a hold-over from fixed-function days that's only still there in hardware because of shadow volumes, which I also don't care for ;)
 
I spoke to NVIDIA about it and they jokingly suggested that I rename my app to "Doom3.exe" ;) Basically early-stencil seems to work for exactly the case that Doom3's rendering path uses and pretty much nothing else, even in cases where it is theoretically possible as there are no data dependencies.

In DX you need to set the stencil op to D3DSTENCILOP_KEEP to allow early stencil. See http://forum.beyond3d.com/showthread.php?p=286194#post286194

Could it be the same in OGL?
 
Yes it is the same in OGL, but I did that and every other thing they asked, and still no early stencil :(
It's important to clear the stencil buffer every frame, not just once, even if you completely fill it again and again.
 