Tech ARP: Direct3D Improvements in Windows 7

Jawed

http://www.techarp.com/showarticle.aspx?artno=637&pgno=0

Nice summary article which includes a few nuggets I wasn't aware of:

Direct3D 11 can perform multi-element stream output to multiple independent streams, including the rasterizer. This reduces the need for expensive CPU involvement in multipass GPU rendering and computation.
Can't say I understand that though.



Direct3D 11 accelerates techniques that often use depth buffers, which are common in the current generation of games. Direct3D 11 can:
  • Write depth from the shader conservatively, maintaining early-z acceleration.
  • Declare a read-only depth view that can be simultaneously bound as depth and texture, allowing z-comparison, z-rejection, and texture read without copying the depth buffer. Scenarios include soft particles and volumetric fog effects.
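
For the second bullet, the API-side change is basically just a new flag on the depth-stencil view. Here's a minimal sketch of how that might look (the device/resource/view variables and the formats are placeholder assumptions, not from the article); the first bullet is the shader-side SV_DepthGreaterEqual / SV_DepthLessEqual output semantics in SM5:

```cpp
// Sketch: bind the same depth buffer for z-test (read-only DSV) and as a
// texture (SRV) in one pass, e.g. for soft particles. Assumes device,
// context, depthTexture, depthSRV and rtv already exist.
// (The underlying texture would be created with a typeless format so the
// SRV can alias it.)
D3D11_DEPTH_STENCIL_VIEW_DESC dsvDesc = {};
dsvDesc.Format        = DXGI_FORMAT_D24_UNORM_S8_UINT;
dsvDesc.ViewDimension = D3D11_DSV_DIMENSION_TEXTURE2D;
dsvDesc.Flags         = D3D11_DSV_READ_ONLY_DEPTH;   // new in D3D11
ID3D11DepthStencilView* readOnlyDSV = nullptr;
device->CreateDepthStencilView(depthTexture, &dsvDesc, &readOnlyDSV);

// Depth test still runs against readOnlyDSV, while the shader samples the
// same depth data through depthSRV, with no copy of the depth buffer.
context->OMSetRenderTargets(1, &rtv, readOnlyDSV);
context->PSSetShaderResources(0, 1, &depthSRV);
```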
Jawed
 
Can't say I understand that though.

I think it means that you can now have multiple element output AND multiple buffers at the same time, when using Stream Out.
In DX10 you could either have one buffer with multiple elements, or multiple buffers, each receiving one element.
And it seems you can also direct your output to the rasterizer directly. Afaik this wasn't possible in DX10... You either captured the output in a buffer, or you sent it to the rasterizer, but you couldn't output some elements to a buffer, and some to the rasterizer directly.
Sounds like you can do that now.
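
Something like this is what I'd expect the HLSL side to look like, purely as an illustrative sketch (the struct names and semantics are made up): one GS invocation appends to two streams, and one of those streams can still be the one that gets rasterized (which stream that is gets chosen when the GS is created).

```hlsl
struct RastVtx { float4 pos : SV_Position; float2 uv : TEXCOORD0; };
struct SoVtx   { float4 data : DATA0; };

// IIRC, when a GS declares more than one stream, all of them have to be
// PointStream objects.
[maxvertexcount(2)]
void GS(point RastVtx input[1],
        inout PointStream<RastVtx> stream0,   // can be the rasterized stream
        inout PointStream<SoVtx>   stream1)   // captured by stream output only
{
    stream0.Append(input[0]);

    SoVtx v;
    v.data = float4(input[0].uv, 0, 0);       // some per-primitive data
    stream1.Append(v);
}
```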
 
I think it means that you can now have multiple element output AND multiple buffers at the same time, when using Stream Out.
In DX10 you could either have one buffer with multiple elements, or multiple buffers, each receiving one element.
Blimey, didn't realise that; that's pretty restrictive. If you're doing amplification with the D3D10 GS, it sounds like you can only do so with a single stream.

http://download.microsoft.com/download/f/2/d/f2d5ee2c-b7ba-4cd0-9686-b6508b5479a1/Direct3D10_web.pdf

That confirms it.

Though I suppose it's fiddlesome territory if you're using GS to provide two separate degrees of amplification ("intermediate" and "high", say). And then there's the whole "don't amplify with GS" mantra...

And it seems you can also direct your output to the rasterizer directly. Afaik this wasn't possible in DX10... You either captured the output in a buffer, or you sent it to the rasterizer, but you couldn't output some elements to a buffer, and some to the rasterizer directly.
Sounds like you can do that now.
Blythe seems to say that SO accesses a subset of data from GS that can be written to streams in parallel with the GS output that goes to RS. "Stream Output (SO) copies a subset of the vertex information output by the GS to up to 4 1D output buffers in sequential order."

So maybe it's just that "branching" to RS and SO is not concurrently possible in D3D10 when amplifying, even if the SO is a single stream.

Jawed
 

Well I think this extension is mostly for automating some multipass things.
As Blythe says: "A GS program can also simply affix additional attributes to a primitive without generating additional geometry, for example, computing additional uniform-valued attributes for each primitive."
It seems that if you can render multi-element data to multiple streams at a time, you can prepare data for multiple rasterizing passes with just a single VS->GS->SO pass.
Not that I've really worked out how you'd use this functionality in a practical situation at this point, but it sounds like it could be useful.
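
As a rough sketch of the API side (gsBytecode/gsBytecodeSize and the semantics are placeholders I made up), the stream-to-buffer mapping is declared when creating the GS; note the Stream field in the declaration, which the D3D10 version of the struct doesn't have, and the RasterizedStream parameter:

```cpp
// Hypothetical sketch: two SO streams writing to two buffers, with stream 0
// also sent to the rasterizer.
D3D11_SO_DECLARATION_ENTRY soDecl[] =
{
    // Stream, SemanticName, SemanticIndex, StartComponent, ComponentCount, OutputSlot
    { 0, "SV_Position", 0, 0, 4, 0 },   // stream 0 -> SO buffer slot 0
    { 1, "DATA",        0, 0, 4, 1 },   // stream 1 -> SO buffer slot 1
};
UINT strides[] = { 16, 16 };            // bytes per vertex for each slot

ID3D11GeometryShader* gs = nullptr;
device->CreateGeometryShaderWithStreamOutput(
    gsBytecode, gsBytecodeSize,
    soDecl, _countof(soDecl),
    strides, _countof(strides),
    0,                                  // RasterizedStream: stream 0 also goes to the RS
    nullptr, &gs);
```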
 
so the only thing that would make an immediate difference in current games is the driver level multithreading?

if so what kind of improvement are we looking at?

also, what are the numbers regarding stream computing in terms of market size over the last couple of years? it seems that every new "graphics" technology actually has little to offer for actual graphics.
 
I believe the Compute Shader is supposed to reduce the cost of some post-process effects which take a large number of samples from render textures.
 
so the only thing that would make an immediate difference in current games is the driver level multithreading?

Current games don't use the DX11 API, so they can't make use of the DX11 multithreading features either.

But even with the DX11 multithreading, I don't think gains will be spectacular, to be honest. The DX11 multithreading example in the DXSDK isn't very impressive in terms of performance gains.
 
Richard Huddy's suggestions

http://www.xbitlabs.com/news/video/...osting_Features_of_DirectX_11_First__ATI.html

For multi-threading, "Typical benefit is going to be around 20%, but I expect a large variation of that; the variation it can be north of 50%" and post-processing "Typically we see that post-processing costs from 10% to 25% of frame time (depends on post-processing". And then there's HDR compression "If I had a spare day and HDR, then I would conduct HDR effects into Microsoft’s new format because that be twice compact that way".

Jawed
 
I wonder where Huddy gets his figures from.

Display lists seem mostly interesting for the creation and loading of resources and such. Things you don't generally do every frame.
Using display lists for regular frame rendering isn't going to save you THAT much imho, because there shouldn't be too much overhead anyway if you optimize and batch your state-changes and drawing calls nicely. I don't see you gaining 50% from that, unless your code was really poor to begin with.

And +25% performance from post-processing? He contradicts himself there directly.
He says: "Typically we see that post-processing costs from 10% to 25% of frame time (depends on post-processing)".
Okay, so how can you get 25% extra performance from something that takes 10-25% of frame time? Compute Shaders may make post-processing a bit more efficient in some cases, but they aren't going to make post-processing free.
 
I get a "mistranslated English" feeling from that article.

e.g. the simple gaussian blur example in the GDC09 slides cuts bandwidth consumption by 75% (down to 16MB per frame) but that might only amount to a 1 or 2% performance increment I imagine.

Shader Model 5.0 and Compute Shader

But performance is affected jointly by bandwidth and fetch count. Fetch count (which is where gather (fetch4) originally helped in reducing shadow filtering cost) additionally has a latency component. Fetches from memory obviously have the worst latency but fetches from thread local storage also have latency (and sub-register bandwidth).

But I guess we'll just have to wait to measure these things in detail...

Jawed
 
I get a "mistranslated English" feeling from that article.

e.g. the simple gaussian blur example in the GDC09 slides cuts bandwidth consumption by 75% (down to 16MB per frame) but that might only amount to a 1 or 2% performance increment I imagine.

Yea... however you try to interpret it, it doesn't add up.
I mean, if you take the performance percentages as absolutes, they contradict directly, as I said in my previous post.
But if you take them as relative (e.g. "Compute Shaders can make the post-processing stage about 25% faster"), then it contradicts the enthusiasm of Huddy in general.
I mean, if post-processing takes 10-25% of frame-time, and you save 25% on that, you only save 2.5% to 6.25% of total frame-time. Why would Huddy rave about the DX11 gains, when they are so marginal? A slight bump in clockspeed would give you more gains than that.
 
I mean, if post-processing takes 10-25% of frame-time, and you save 25% on that, you only save 2.5% to 6.25% of total frame-time. Why would Huddy rave about the DX11 gains, when they are so marginal? A slight bump in clockspeed would give you more gains than that.
I get the impression that game developers will chase down the last ~5% of performance gains in their engine, so it doesn't seem extraordinary to me.

Jawed
 
I get the impression that game developers will chase down the last ~5% of performance gains in their engine, so it doesn't seem extraordinary to me.

Maybe, but I get the impression that Huddy's talk is aimed more at consumers than at developers. Developers will obviously take what he says with a grain of salt. Aside from that, developers don't need Richard Huddy to explain to them why DX11 is so great, because they've had the DX11 tech preview since November 2008 and already know about all the features that Huddy is talking about. Besides, nVidia, AMD and Microsoft have given various talks about DX11 at developer conferences in the past year.

I think Huddy is just out to create a buzz around DX11 among consumers. Heck, in another article from Huddy regarding DX11, he was even pimping the Xbox 360's tessellator, and DX10.1... ("DX10.1 is the closest thing to DX11 right now...", yea whatever).
 
ya i just read the xbit article. yawwn.

ahh well, i was hoping for some free fps.

what MS should do is support native stereoscopic 3D and force compliance to a standard

it's the next big thing, it has been for the last 15 years lol

anyhow, that would be exciting
 
Current games don't use the DX11 API, so they can't make use of the DX11 multithreading features either.

But if you already do DX10 the port to DX11 is pretty straightforward. There are no particular drawbacks of the transition either so I would imagine that most developers currently on DX10 will go DX11 fairly soon.

But even with the DX11 multithreading, I don't think gains will be spectacular, to be honest. The DX11 multithreading example in the DXSDK isn't very impressive in terms of performance gains.

I'm observing a 40% gain, which I think is a pretty respectable gain.

Display lists seem mostly interesting for the creation and loading of resources and such. Things you don't generally do every frame.

You can't even use display lists for that. All resource creation is done on the main device. The deferred contexts can only do rendering commands and are most certainly intended for things you do every frame. The typical usage case is to split the frame's different rendering passes to different threads.
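
For reference, the deferred context path is pretty small API-wise. A minimal sketch, with the pipeline setup omitted and the device/context/view variables assumed to exist:

```cpp
// Worker thread: record a rendering pass on a deferred context.
ID3D11DeviceContext* deferredCtx = nullptr;
device->CreateDeferredContext(0, &deferredCtx);

deferredCtx->OMSetRenderTargets(1, &rtv, dsv);
// ... set shaders, IA state, etc., then issue the pass's draw calls ...
deferredCtx->Draw(vertexCount, 0);

ID3D11CommandList* cmdList = nullptr;
deferredCtx->FinishCommandList(FALSE, &cmdList);

// Main thread: play the recorded commands back on the immediate context.
immediateCtx->ExecuteCommandList(cmdList, FALSE);
cmdList->Release();
```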

While overall performance should see a nice increase, I don't think benchmark figures are where you'll see the benefit the most. The biggest benefit the way I see it is that you should get a more stable framerate. When you enter that part of the world where the number of draw calls happens to shoot up to 5000, the framerate wouldn't drop to the mid-20s anymore but could stay in the 50-60 range.

And +25% performance from post-processing? He contradicts himself there directly.

The headline was most likely not chosen by Huddy though. The actual quotes seem fine to me.

Okay, so how can you get 25% extra performance from something that takes 10-25% of frame time? Compute Shaders may make post-processing a bit more efficient in some cases, but they aren't going to make post-processing free.

Well, it's certainly possible to optimize 25% of the frame time and get a 25% performance boost. If you eliminate 20% (leaving 5%) the performance gain is 1/(1-0.2)=1.25. If post effects were made completely free you'd see a 33% gain.

Of course, I don't expect that you'll be able to cut 80% out of the posteffects in the average case. Although I agree with Richard that "significant performance win" certainly is possible.

e.g. the simple gaussian blur example in the GDC09 slides cuts bandwidth consumption by 75% (down to 16MB per frame) but that might only amount to a 1 or 2% performance increment I imagine.

Are you talking about the total frame time? Out of the blur effect's time I would expect such a reduction to cut it in half or so. The biggest gain isn't going to come from bandwidth reduction but from reducing the texture fetch count and turning posteffects from mostly fetch bound to ALU bound.
 
But if you already do DX10 the port to DX11 is pretty straightforward. There are no particular drawbacks of the transition either so I would imagine that most developers currently on DX10 will go DX11 fairly soon.

Certainly, I have said the same thing myself; we'll probably see DX11 games being released soon.
It's just that I don't expect developers to actually release DX11-patches for currently released games.

I'm observing a 40% gain, which I think is a pretty respectable gain.

What with?
In my case it was more in the range of 5-10% at best, and actually a performance DECREASE in some cases.

You can't even use display lists for that. All resource creation is done on the main device. The deferred contexts can only do rendering commands and are most certainly intended for things you do every frame. The typical usage case is to split the frame's different rendering passes to different threads.

I think we're arguing over semantics here.
I'll admit that I may have gotten the names wrong, but I think we will both agree that DX11 allows you to do two things:
1) Create resources on other threads (loading textures, compiling shaders etc).
2) Prepare a list of rendering calls on other threads, which can later be executed on the main thread.

I was just saying that 1) may see a large increase in performance, but it is generally not during the actual gameplay itself, but mostly during load time.
I was also saying that in my experience 2) doesn't really give much of a boost at present. As I said, 5-10%, or even a performance decrease in some cases.

Well, it's certainly possible to optimize 25% of the frame time and get a 25% performance boost. If you eliminate 20% (leaving 5%) the performance gain is 1/(1-0.2)=1.25. If post effects were made completely free you'd see a 33% gain.

Yea I took that into account. As you say, assuming such a big boost from compute shaders is highly unrealistic.
Besides, that's the best-case assumption of 25% frame time. If post-processing takes less than 20% of frame time, you wouldn't be able to get a 25% boost even if it were made completely free.

So I guess we differ in opinion regarding Huddy's statements. I don't think what he proposes is realistic, nor that you can talk about 'significant performance gains'.
Even if I were to paint a very positive scenario... say that post-processing does take 25% frame time... and that you can eliminate 50%...
That gives you 1/(1-0.125) = 14% gain.
It's just not such a big deal imho. It would get a game from 30 to 34 fps... And that is a case that is more positive than what I believe in.
I expect the differences to be more from 30 to 32 fps or so, on average. Which is marginal rather than significant.
 
While overall performance should see a nice increase, I don't think benchmark figures are where you'll see the benefit the most. The biggest benefit the way I see it is that you should get a more stable framerate. When you enter that part of the world where the number of draw calls happens to shoot up to 5000, the framerate wouldn't drop to the mid-20s anymore but could stay in the 50-60 range.
Sounds tasty.

Are you talking about the total frame time? Out of the blur effect's time I would expect such a reduction to cut it in half or so. The biggest gain isn't going to come from bandwidth reduction but from reducing the texture fetch count and turning posteffects from mostly fetch bound to ALU bound.
I was thinking purely in terms of being bandwidth bound.

But, separately, the problem with the CS implementation is that it trades TEX fetches (which on ATI should be significantly cached in L1) for LDS fetches. Since, per "pixel", it only uses each sample once, it pays the full bandwidth/latency penalty for LDS fetch, which needs to be preceded by thread group fetches that populate LDS plus a thread group "syncthreads".

It seems to me this is really about the percentage of L1 misses. If L1 hits 100%, then PS is faster than CS because there's no syncthreads and no L1-fetch-into-LDS. Obviously L1 won't hit 100%, so now it's a question of the latency margin caused by those misses...

Against this, the nature of a filtering kernel tends to fight rasterisation order (Z, say) since they're linear space not rasterisation space, which theoretically causes a substantial L1 miss rate. One of those things that doesn't seem to have been benchmarked very effectively as far as I can tell.

It seems to me there's a real risk that the CS implementation won't be significantly faster simply because of the low arithmetic intensity. But a much larger kernel size obviously changes things, both by exacerbating L1's problems with linear space fetches and the thrashing caused by competing pixels. Though it's interesting to note that D3D11's thread local storage is only 32KB, which isn't vastly larger than the cluster's L1 size (8KB seems prolly too small, but I don't remember if L1 size is ever stated anywhere for current hardware).

(CS actually has a back-door optimising effect on ATI: if a shader is not too heavy on register allocation, then the PS version will result in more concurrent pixels fighting over L1. Whereas in the CS version it's not possible to have more than 1024 threads sharing data amongst themselves (16 wavefronts). This optimisation doesn't really apply on NVidia (at least not currently) because NVidia's register files are too small to have such a large number of competing pixels in the PS version, i.e. there's a hard limit of 1024 pixels on current hardware, anyway.)

The main effect of CS could simply be that the original fetch (each thread fetches its corresponding texel and posts it in thread local storage) is highly cache coherent, with the best possible L1 hit rate and thereafter everything is like a guaranteed 100% hit rate (but from thread local storage).

Another parameter here is the size of the thread group. It's no good making a single thread group 1024 in size, because this creates dead time at the start and end of the kernel when LDS is being populated by the new thread group (i.e. until syncthreads passes), or some of the threads in the workgroup (16 wavefronts' worth) have completed. Thread local storage cannot be freed for the next thread group until all threads in the thread group have completed.

So the programmer needs to size the thread group minimally, based on the kernel size. But rasterisation order then rears its ugly head as too-small thread-groups increase the total number of repeated fetches and incoherence (i.e. linear space not rasterisation space fetches). And overlaps are required (apron is a nice term) at the edges of the region being filtered.

It seems the example code uses a thread group size of 1024, processing an entire row (or column) of the source 1024x1024 texture at a time. So no apron and I presume a fair amount of wasted ALU cycles due to thread group start-up and shut-down intervals.
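
To make the discussion concrete, the structure being described looks roughly like this (only a sketch, not the actual sample code; the kernel weights, sizes and names are placeholders):

```hlsl
Texture2D<float4>   Input  : register(t0);
RWTexture2D<float4> Output : register(u0);

#define ROW_WIDTH 1024
#define RADIUS    3                        // 7-wide kernel

// Whole source row cached in thread local storage (16KB of the 32KB limit),
// so no apron is needed.
groupshared float4 gCache[ROW_WIDTH];

static const float gWeights[2 * RADIUS + 1] =
    { 0.006, 0.061, 0.242, 0.383, 0.242, 0.061, 0.006 };   // ~Gaussian, sums to ~1

[numthreads(ROW_WIDTH, 1, 1)]
void HorizontalBlurCS(uint3 gid : SV_GroupID, uint3 tid : SV_GroupThreadID)
{
    // Each thread does exactly one (nicely coherent) fetch from memory...
    gCache[tid.x] = Input[uint2(tid.x, gid.y)];
    GroupMemoryBarrierWithGroupSync();      // the "syncthreads" dead time

    // ...and every kernel tap after that comes from LDS, never from texture.
    float4 sum = 0;
    [unroll]
    for (int i = -RADIUS; i <= RADIUS; ++i)
    {
        int x = clamp((int)tid.x + i, 0, ROW_WIDTH - 1);
        sum += gWeights[i + RADIUS] * gCache[x];
    }
    Output[uint2(tid.x, gid.y)] = sum;
}
```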

I don't know where the break-even point is in a comparison of PS and CS - maybe the 7-wide kernel is just beyond the break-even point?

Jawed
 
What with?
In my case it was more in the range of 5-10% at best, and actually a performance DECREASE in some cases.

With an HD 4890. I don't know if it matters that I'm running on a beta driver right now. The full gain will of course be realized only with a proper DX11 driver. I don't know if there's any particular DX11 support in official releases; at least I wouldn't expect any.

I think we're arguing over semantics here.
I'll admit that I may have gotten the names wrong, but I think we will both agree that DX11 allows you to do two things:
1) Create resources on other threads (loading textures, compiling shaders etc).
2) Prepare a list of rendering calls on other threads, which can later be executed on the main thread.

I was just saying that 1) may see a large increase in performance, but it is generally not during the actual gameplay itself, but mostly during load time.
I was also saying that in my experience 2) doesn't really give much of a boost at present. As I said, 5-10%, or even a performance decrease in some cases.

Well, 1) is something you could do in DX9 already, and I agree that doesn't boost performance much, although it can eliminate hitches. Actually, loading resources during gameplay isn't all that uncommon. Many games stream in their data continuously.
As for 2), given that there are noticeable gains on consoles when doing similar things, I think the gain in DX11 should be in the same ballpark. The DX SDK sample performance seems to support that, at least on my machine.

So I guess we differ in opinion regarding Huddy's statements. I don't think what he proposes is realistic, nor that you can talk about 'significant performance gains'.

All I'm saying is that you shouldn't blame Huddy for the stupid headline. Huddy isn't claiming that a 25% total gain can be had. He claims a "significant win" in the area of posteffects, which are usually "10%-25%" of the frame time. I agree with all that.

Even if I were to paint a very positive scenario... say that post-processing does take 25% frame time... and that you can eliminate 50%...
That gives you 1/(1-0.125) = 14% gain.
It's just not such a big deal imho. It would get a game from 30 to 34 fps... And that is a case that is more positive than what I believe in.
I expect the differences to be more from 30 to 32 fps or so, on average. Which is marginal rather than significant.

As a game developer I have a somewhat different view of what is "significant". 14% is a jumping up and down in joy and happiness kind of gain. :)
Although usually you don't measure the gain as a percentage, since it can be quite misleading; instead you measure the number of milliseconds saved from the frame time. Otherwise, how heavy the rest of the scene is affects the judgement of how good an optimization was or how expensive a new feature is. So if you have 3ms of posteffects and you save, say, 1.5ms, that's a pretty damn good save.

I was thinking purely in terms of being bandwidth bound.

Well, that's a rare case for post-effects. They are usually texture fetch bound.
 