DirectX Developer Day 2020

MS is a bad guy for not supporting a feature that at this very moment only ONE architecture on ONE IHV supports? Really?
No. That's not what I mean. My rant is not directed specifically at MS, NV, AMD, Khronos, or the game industry... probably I expressed myself poorly.
I want both NV and AMD to offer extensions to lift restrictions on compute. I'm fine if compatible hardware is rare, because it will take ages until my work finds its way into a product.
I would even be fine with vendor APIs, sparing Khronos and MS the need to work out compromise solutions after lots of time.
And I see myself as an outsider, but I also feel left behind, because my compute work is different from what we see in current games.

the barrier API in mantle was fundamentally incompatible with nvidia's hardware
I assumed so. VK has added conditional draw. It would be little work for AMD to extend this to support barriers.
The same goes for NV's device-generated command buffers extension for VK. I talked with its developer here on the forum, and he considered adding this. I haven't checked whether they already did an update, but being heard is good and all I want.
I know it will take time - lots of it - so the earlier we request things the better.

I find it extremely unlikely you'll come up with a compelling use case that's practical/feasible for the general market to use.
You may be right, which is why I mentioned the brute-force mentality I see in games.
Personally, I could (probably) put those things to good use. And I look a bit enviously at offerings like OpenCL 2.0 (which also works on NV), though I assume that solution is not hardware-friendly enough for games.
(In my case the workloads in question are not massive - I would accept some performance cost for the gained flexibility, which might result in a speedup in the end.)
There certainly is some risk in assuming 'games don't need this'. Games might find good use for new functionality after it becomes available.

Ok. That said, let me apologize for my offended reaction above. It's another case of heat for no reason, I guess. ;)
 
But ask yourself why that is. Remember, 1.0's purpose was to make it easier for developers to hit acceptable levels of performance without needing to know how the underlying hardware worked.

It stopped serving that purpose when it wasn't able to hit "acceptable levels of performance" on other vendors. "Performance portability claims" need to be evaluated on a constant basis and sometimes they turn out to be false in the end.

For instance, there is nothing in DXR that mandates the underlying acceleration structure be a BVH. It could technically be anything. It's still too early for everyone (IHVs and ISVs) to agree on a generic acceleration structure for RT. In the early stages you need the black box.

I am well aware of this, just as much as you are, and ultimately, as with any "black box", the drivers behind PC APIs can 'pretend' to give the results the user wants underneath it all.

I love you "black box truthers" btw. Yes MS is artificially limiting your flexibility in compute shaders because [some reason I never understood]. I got bad news for you, the underlining hardware just isn't that good (scheduling) or doesn't match up well enough with other architectures (barriers) to be more generic. You think that's real callable shaders you're getting in DXR? hahahahahahahahaha...surprise it's just a really big uber shader created for you in the background. You can do this already yourself. You don't though because usually you care about maximizing your performance. Which of course is the crux of the matter, you want callable shaders and more flexibility in CS? Sure, how much performance you willing to give up for them? MS doesn't include callable shaders in CS because people like you would actually use them instead of VERY sparingly using them in isolated use cases on an isolated group of GPUs (ray tracing). Then you'd wonder why everything is so slow. Must be the "PC APIs"...

I realize that a function call can get 'inlined' by the driver's shader compiler so that there's no overhead from the call, but from what I recall this isn't true for Nvidia hardware from Volta onwards, based on what I heard from one of their engineers. It sounds like they actually use independent thread scheduling to keep track of the different function calls, with a program counter per lane.

Also, why depend on the compiler to get it right when I can be 'explicit' about it with the newer API? If I can do it myself instead of having the driver do it automatically for me, then why not take advantage of this fact? It's just more points in favour of DXR Tier 1.1 if this optimization can be performed and applied more often in the manual case...

I just want to note that it's natural for black boxes to have progressions and evolutions. When we evolve and change an API it doesn't mean the previous versions were bad!!! DXR 1.0 was a good abstraction for the time.

We must have very different experiences then, because OpenGL (even modern versions), in its entirety, is an objectively bad API abstraction on most modern hardware today. The exception is Nvidia hardware, which has traditionally been modeled as a global state machine for a very long time, so they implement Vulkan on top of OpenGL in their drivers, whereas it would be the other way around for most vendors.

Oftentimes, if a black box 'outlives' its necessity, it rightfully gets 'discarded'...
 
Hey, I want callable shaders just as much as you ;), but remember graphics hardware is, like everything else in life, a zero-sum game. To increase flexibility in one area we must lose some efficiency [relative performance] in another. IHVs are very sensitive to this (remember: bar graphs of today's games sell GPUs, not the promises of tomorrow). I don't doubt callable shaders are the future, but I also think you might be underestimating the "cost" of making them fully generic (and performant).

Before you look down on the uber shader strategy, think about the practical benefits of focusing on making them faster today in hardware. What does that gain us? Well, for starters it lets us "emulate" callable shaders for the first time in controlled environments (ray tracing). But just as importantly, it can also bring performance benefits to current games that leverage large uber shaders. Something that helps us step forward, but also increases (some) bar graphs. Was this exactly what you wanted? Clearly it was not. Can we agree, though, that DXR was a huge step forward? These "bridge solutions"/black boxes/evolutions should not be treated with disdain, but celebrated! You'll get your callable shaders one day, but it may take a few evolutions to get there. :p Slowly the restrictions will melt away...

To tie this back to ray tracing, all I'm saying is be careful what you wish for. I don't know about you guys, but I'm not exactly blown away by ray tracing performance at the moment. How about before we really expand the flexibility of DXR, we make what we currently have (which I think can look great!) faster. :mrgreen:
 
It stopped serving that purpose when it wasn't able to hit "acceptable levels of performance" on other vendors. "Performance portability claims" need to be evaluated on a constant basis and sometimes they turn out to be false in the end.

I mean... it's hard to argue against something that you pulled out of thin air. Can you cite specific examples of the DXR 1.0 API you feel are not portable performance-wise? I don't find that to be DXR 1.0's limitation, but rather its awkwardness in integrating with existing engines.

It sounds like they actually use independent thread scheduling to keep track of the different function calls, with a program counter per lane.

This is why information can be dangerous. Yes, Nvidia now includes an extra register per vector lane to keep track of program counters. No, this doesn't make things magically fast. It makes things less slow (you still lose all efficiency during divergence; it just makes scheduling significantly easier during divergence within a warp... diverging is still SLOW with a capital S).
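If you want to see what "less slow, still slow" looks like, here's a toy CUDA timing sketch (heavyLoop, the iteration count and the odd/even split are all invented): odd and even lanes of the same warp take different branches, and the warp pays for both paths back to back no matter how cleverly the lanes get scheduled.

Code:
// Toy sketch of intra-warp divergence: both branches execute serially per warp.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float heavyLoop(float x, int iters)
{
    for (int i = 0; i < iters; ++i)
        x = x * 1.000001f + 0.000001f;   // arbitrary busy work
    return x;
}

__global__ void divergent(float* out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Odd/even lanes of one warp diverge; the warp runs both paths, each with half
    // of its lanes masked off, so the time is roughly the sum of the two branches.
    if (i & 1) out[i] = heavyLoop(1.0f, iters);
    else       out[i] = heavyLoop(2.0f, iters);
}

__global__ void uniform(float* out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = heavyLoop(1.0f, iters);     // every lane takes the same path
}

int main()
{
    const int n = 1 << 20, iters = 20000;
    float* out;  cudaMalloc(&out, n * sizeof(float));
    cudaEvent_t a, b;  cudaEventCreate(&a);  cudaEventCreate(&b);

    uniform<<<n / 256, 256>>>(out, iters);  cudaDeviceSynchronize();   // warm-up

    cudaEventRecord(a);  divergent<<<n / 256, 256>>>(out, iters);  cudaEventRecord(b);
    cudaEventSynchronize(b);
    float msDiv;  cudaEventElapsedTime(&msDiv, a, b);

    cudaEventRecord(a);  uniform<<<n / 256, 256>>>(out, iters);  cudaEventRecord(b);
    cudaEventSynchronize(b);
    float msUni;  cudaEventElapsedTime(&msUni, a, b);

    printf("divergent: %.2f ms, uniform: %.2f ms\n", msDiv, msUni);
    cudaFree(out);  cudaEventDestroy(a);  cudaEventDestroy(b);
    return 0;
}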

If I can do it myself instead of having the driver do it automatically for me, then why not take advantage of this fact?

Why is DX11 sometimes faster than DX12? Because the driver can be smarter than the programmer! The programmer has the advantage of understanding the workload better, but the driver has the advantage of understanding the hardware better. Usually understanding the workload provides better advantages, but not always! Let's pretend BF5 RT was done last year with "DXR 1.1". How could they possibly know if they scheduled their rays optimally on RDNA2? The hardware didn't exist when they added DXR! Perhaps they accidentally hit a "slow path" on RDNA2 that the driver wouldn't have...

We must have very different experiences then because OpenGL

Yeah, OpenGL sucks, what do you want from me? Let me ask you a question: do you think DX11 is "bad"?
 
This is why information can be dangerous. Yes, Nvidia now includes an extra register per vector lane to keep track of program counters. No, this doesn't make things magically fast. It makes things less slow (you still lose all efficiency during divergence; it just makes scheduling significantly easier during divergence within a warp... diverging is still SLOW with a capital S).

Of course you still pay the cost of executing a divergent path during shading, but you can't deny that independent thread scheduling cuts the overhead of doing multiple function calls.

Why is DX11 sometimes faster than DX12? Because the driver can be smarter than the programmer! The programmer has the advantage of understanding the workload better, but the driver has the advantage of understanding the hardware better. Usually understanding the workload provides better advantages, but not always! Let's pretend BF5 RT was done last year with "DXR 1.1". How could they possibly know if they scheduled their rays optimally on RDNA2? The hardware didn't exist when they added DXR! Perhaps they accidentally hit a "slow path" on RDNA2 that the driver wouldn't have...

It's hard to predict where hardware design will be headed further out, so I wouldn't be surprised if the example you brought up didn't perform well; DXR 1.1, on the other hand, is here to define the present and the hardware coming in the near future. I get your point that real-world profiling wouldn't hurt.

Yeah, OpenGL sucks, what do you want from me? Let me ask you a question: do you think DX11 is "bad"?

As far as D3D11 is concerned, its days are pretty much numbered. The fxc compiler and DXBC have been deprecated, so it's only a matter of time before bit rot sets in and stinks up the whole API. D3D11 will be maintained for its legacy software support, not because developers will be expected to develop high-performance graphics code with it in the future.

Eventually D3D11 will just reach the same rotten state that OpenGL is in, so why bother considering outdated tools that will become obsolete? Do you think using unmaintained APIs is somehow less of a liability when releasing your new project than using a more explicit API?

Just on principle alone, DXR 1.1 should be preferred over 1.0, since it's more cross-vendor friendly and it's the newer API, which will be better maintained in comparison.
 
Eventually D3D11 will just reach the same rotten state that OpenGL is in, so why bother considering outdated tools that will become obsolete? Do you think using unmaintained APIs is somehow less of a liability when releasing your new project than using a more explicit API?
I just assumed it's because not all developers are capable of coding something lower level. That's at the very least a reason to keep it around.

Without DX11, graphics would become an increasingly small pool of developers, without a lot of young blood coming in.
 
Of course you still pay the cost of executing a divergent path during shading, but you can't deny that independent thread scheduling cuts the overhead of doing multiple function calls.

I don't think this works the way you think this works. The feature you're describing for Turing was more about making scheduling possible in various circumstances, not "reducing the cost of multiple function calls". In compute shaders, we have "threads", right? Well, they aren't actually "threads" in modern GPU hardware; they are often lanes within a simd/vector unit. Do you know what happens today (pre-Volta/Turing) if you write a compute shader where thread N waits (e.g. on an atomic lock) on thread N-1, where both N and N-1 are part of the same simd/vector unit? It'll deadlock (in hardware the entire vector unit shares the same program counter, so threads within that unit can't advance properly if there are inter-dependencies). It's perfectly valid HLSL code, but it'll still deadlock. And if you send this perfectly valid code to Intel/AMD/Nvidia and claim "BUG, IT'S BROKEN!", do you know what they'll say? "Yeah, it's broken and we're not fixing it." The heroics IHVs need to go through to get that kind of complex scheduling just working within a simd/vector unit are too complicated. In practice, no one supports that functionality correctly even though "technically" they should! Another example of a flexibility vs performance trade-off!

That's all this feature does. It makes it possible (albeit REALLY SLOWLY) to do complex scheduling within a vector unit. Nothing more! If you think this feature was created for callable shaders, I'm sorry to be blunt, but you're simply mistaken. It just makes scheduling within a vector unit feasible (not necessarily fast or desirable)!
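If you'd rather see the failure mode than take my word for it, here's the same pattern written as a CUDA toy instead of HLSL (hypothetical code; the lock/counter names are invented). On pre-Volta parts the whole warp shares one program counter, so the lane that wins the lock can sit masked off forever while its 31 siblings spin; with independent thread scheduling on Volta/Turing it eventually completes, just not quickly.

Code:
// Hypothetical toy: an intra-warp spin lock, i.e. "thread N waits on thread N-1".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warpSpinLock(int* lock, int* counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {      // try to acquire the lock
            atomicAdd(counter, 1);             // stand-in for real critical-section work
            atomicExch(lock, 0);               // release
            done = true;
        }
        // Losing lanes loop back and retry. Pre-Volta the winner may never be
        // scheduled while they spin, because all 32 lanes share one program counter.
    }
}

int main()
{
    int *lock, *counter;
    cudaMallocManaged(&lock, sizeof(int));
    cudaMallocManaged(&counter, sizeof(int));
    *lock = 0;  *counter = 0;

    warpSpinLock<<<1, 32>>>(lock, counter);    // a single warp is enough to hang pre-Volta
    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);        // 32, if independent thread scheduling saved us

    cudaFree(lock);  cudaFree(counter);
    return 0;
}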

Just on principle alone DXR 1.1 should be preferred over 1.0

Stop digging. :D I think it's been established by now that there are many use cases where 1.0 is actually beneficial over 1.1. But really, you're comparing two APIs that in all likelihood will be used together!!! It's almost like saying "well, now that we have compute shaders, who'd ever use these silly limiting things called pixel shaders???". DXR 1.1 was never designed with the intention of making DXR 1.0 obsolete. I don't understand why you continue to double down on this notion that "DXR 1.0 has outlived its purpose". That will be true one day, but that day is not DXR 1.1. They are meant to be complementary!!!

Finally, the correct answer was that DX11 might be the GOAT of graphics APIs (DX9 was pretty good for its time too). Was it perfect? Of course not! Isn't it telling though that five years after DX12 came out we still have AAA games that run better on DX11? :cool: (I should also point out that the old fxc compiler tends to still be better than DXIL in practice :mrgreen:) Beating well designed black boxes is harder than you think. After what we have learned from DX12, are you so sure that developers will always schedule rays more optimally than the driver? I certainly wouldn't bet the farm on it! :p
 
I just assumed it's because not all developers are capable of coding something lower level. That's at the very least a reason to keep it around.

Without DX11, graphics would become an increasingly small pool of developers, without a lot of young blood coming in.

Then consider moving on to WebGPU as a less explicit option once it releases. Better that than planning to stay on a deprecated tool that will accumulate bugs for years to come. Microsoft made it really clear that they will not keep maintaining the fxc compiler.
 
After what we have learned from DX12, are you so sure that developers will always schedule rays more optimally than the driver? I certainly wouldn't bet the farm on it! :p

DX12 fares very well on AMD GPUs the majority of the time. It's only Nvidia GPUs that always have issues.
 
I don't think this works the way you think this works. The feature you're describing for Turing was more about making scheduling possible in various circumstances, not "reducing the cost of multiple function calls". In compute shaders, we have "threads", right? Well, they aren't actually "threads" in modern GPU hardware; they are often lanes within a simd/vector unit. Do you know what happens today (pre-Volta/Turing) if you write a compute shader where thread N waits (e.g. on an atomic lock) on thread N-1, where both N and N-1 are part of the same simd/vector unit? It'll deadlock (in hardware the entire vector unit shares the same program counter, so threads within that unit can't advance properly if there are inter-dependencies). It's perfectly valid HLSL code, but it'll still deadlock. And if you send this perfectly valid code to Intel/AMD/Nvidia and claim "BUG, IT'S BROKEN!", do you know what they'll say? "Yeah, it's broken and we're not fixing it." The heroics IHVs need to go through to get that kind of complex scheduling just working within a simd/vector unit are too complicated. In practice, no one supports that functionality correctly even though "technically" they should! Another example of a flexibility vs performance trade-off!

That's all this feature does. It makes it possible (albeit REALLY SLOWLY) to do complex scheduling within a vector unit. Nothing more! If you think this feature was created for callable shaders, I'm sorry to be blunt, but you're simply mistaken. It just makes scheduling within a vector unit feasible (not necessarily fast or desirable)!

Of course I haven't forgotten about the main goal of independent thread scheduling, and I'm also well aware of the issues with using a per-thread mutex on GPUs, which can cause deadlocks, but didn't Microsoft declare that to be undefined behaviour? I wasn't making the point that independent thread scheduling was specifically made for callable shaders, but that it is convenient to use for that case.

I also knew that callable shaders could be done before independent thread scheduling, but on AMD that involved the scalar unit and register file loading and executing multiple shader programs, which was ugly.

Stop digging. :D I think it's been established by now that there are many use cases where 1.0 is actually beneficial over 1.1. But really, you're comparing two APIs that in all likelihood will be used together!!! It's almost like saying "well, now that we have compute shaders, who'd ever use these silly limiting things called pixel shaders???". DXR 1.1 was never designed with the intention of making DXR 1.0 obsolete. I don't understand why you continue to double down on this notion that "DXR 1.0 has outlived its purpose". That will be true one day, but that day is not DXR 1.1. They are meant to be complementary!!!

Finally, the correct answer was that DX11 might be the GOAT of graphics APIs (DX9 was pretty good for its time too). Was it perfect? Of course not! Isn't it telling though that five years after DX12 came out we still have AAA games that run better on DX11? :cool: (I should also point out that the old fxc compiler tends to still be better than DXIL in practice :mrgreen:) Beating well designed black boxes is harder than you think. After what we have learned from DX12, are you so sure that developers will always schedule rays more optimally than the driver? I certainly wouldn't bet the farm on it! :p

On one IHV, sure, in the past it may clearly have been the case that D3D11 exhibited better performance for them, but for the other IHV the new API was an immediate improvement, since it abstracted their hardware better by introducing concepts such as explicit barriers, multiple engines/queues and the new binding model.

For some IHVs it might be better to have the black box, since they obviously want to keep it badly, if only so they can potentially introduce some more shader-replacement hacks to win more benchmarks. The other IHV just wants to mostly converge on their hardware design to keep this black-box 'maintenance' to a minimum, so they'd rather just educate developers about how their hardware works.

How "well designed" this black box is relative for each IHV out there.I guess we'll agree to disagree since we won't reach a meaningful consensus.
 
On one IHV, sure, in the past it may clearly have been the case that D3D11 exhibited better performance for them, but for the other IHV the new API was an immediate improvement, since it abstracted their hardware better by introducing concepts such as explicit barriers, multiple engines/queues and the new binding model.
That's a gross oversimplification; many developers don't share this sentiment. It was mostly a pain in the ass for a lot of developers.

Stardock (the developer of the first-ever DX12 game, Ashes of the Singularity) recently posted a blog explaining why they abandoned DX12/Vulkan in their newest game, Star Control: Origins, in favor of DX11.

It basically boils down to the extra effort it takes to develop the DX12/Vulkan path: longer loading times and VRAM crashes are common in DX12/Vulkan if you don't manage everything by hand. The performance uplift is also highly dependent on the type of game, and they only achieved a 20% uplift on the DX12 path of their latest games, which they deemed not worth all the hassle of QA and bug testing.

In the end, they advise developers to pick DX12/Vulkan based on features, not performance: ray tracing, AI, utilizing 8 CPU cores or more, compute-shader-based physics, etc.

https://www.gamasutra.com/blogs/Bra...k1u3fwoIL_QUdwlVhagyopZhQre_YywJ1qxtRiH8Rn0Zo

Some observers have noted that Stardock's most recent two releases, Star Control: Origins and Siege of Centauri, were DirectX 11 only.

the biggest difference between the two new graphics stacks and DirectX 11 is that both Vulkan and DirectX 12 support multiple threads sending commands to the GPU simultaneously. GPU multitasking. Hooray. i.e. ID3D12CommandQueue::ExecuteCommandLists (send a bunch of commands and they get handled asynchronously). In DirectX 11, calls to the GPU are handled synchronously. You could end up with a lot of waiting after calling Present().

not all is sunshine and lollipops in DX12 and Vulkan. You are given the power but you also get handed a lot of responsibility. Both stacks give you a lot of rope to hang yourself. DX11 is pretty idiot proof. DX12 and Vulkan, not so much.
Some examples of power and responsibility:
-You manage memory yourself
-You manage various GPU resources yourself
-All the dumb things people do with multiple threads today now apply to the GPU

Here's one: Long load times in DirectX 12 games by default. Is that DX12's fault? No. It's just that many developers will do shader compiling at run-time -- all of them.

In DirectX 11 if I overallocate memory for someone's 2GB GPU, it just throws the rest into main memory for a slow-down. On Vulkan and DX12, if you're not careful, your app crashes.

The power that you get with DirectX 12 and Vulkan translates into an almost effortless 15% performance gain over DirectX 11. When we put our GFX engineers onto it, we can increase that margin to 20% to 120% depending on the game.

Stardock has DirectX 12 and Vulkan versions of Star Control: Origins. The performance gain is about 20% over DirectX 11. The gain is relatively low because, well, it's Star Control. It's not a graphics intensive game (except for certain particle effects on planets which don't benefit much from the new stacks). So we have to weigh the cost of doubling or tripling our QA compatibility budget with a fairly nominal performance gain. And even now, we run into driver bugs on DirectX 12 and Vulkan that result in crashes or other problems that we just don't have the budget to investigate.

Other trends coming to games will essentially require Vulkan and DirectX 12 to be viable. Ray Tracing, OSL, Decoupled Shading, Compute-Shader based physics and AI only become practical on Vulkan and DirectX 12. And they're coming.
 
Of course I haven't forgotten about the main goal of independent thread scheduling, and I'm also well aware of the issues with using a per-thread mutex on GPUs, which can cause deadlocks, but didn't Microsoft declare that to be undefined behaviour? I wasn't making the point that independent thread scheduling was specifically made for callable shaders, but that it is convenient to use for that case.

I also knew that callable shaders could be done before independent thread scheduling, but on AMD that involved the scalar unit and register file loading and executing multiple shader programs, which was ugly.

It's perfectly legal behavior. Do you know how to solve this problem on today's hardware? It has nothing to do with GCN's scalar unit. You solve this problem today by forcing each "thread" in a compute shader to run on the entire simd/vector unit (i.e. treat the vector unit as a scalar unit). Then every compute shader "thread" gets its own hardware program counter. Of course this cuts performance to 1/16 (on Intel and GCN) or 1/32 (on Nvidia and RDNA). In practice this would make your shader useless (for real-time rendering at least). In addition, I suspect it's nearly impossible to automatically detect when you'd need to enable this "feature" when compiling a shader, so literally all your shaders would have to see a huge drop-off in performance. Do you see why it's a "won't fix" resolution by AMD/Nvidia/Intel? :D
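Here's a hypothetical CUDA sketch of that workaround (names and sizes invented): give each "logical thread" an entire warp and let only lane 0 do anything, so every contender effectively owns a program counter. The spin lock now completes on any architecture, and the price is exactly those 31-of-32 idle lanes.

Code:
// Hypothetical toy: one "logical thread" per warp, i.e. the vector unit used as a
// scalar unit. Correct and deadlock-free, with roughly 1/32 of the machine doing work.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void oneLogicalThreadPerWarp(int* lock, int* counter, int numLogical)
{
    int globalLane = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId     = globalLane / 32;
    if ((globalLane % 32) != 0 || warpId >= numLogical) return;   // 31 of 32 lanes idle

    // Every contender is now in a different warp with its own hardware program
    // counter, so the spin lock can make progress on any architecture.
    while (atomicCAS(lock, 0, 1) != 0) { /* spin */ }
    atomicAdd(counter, 1);              // stand-in for real critical-section work
    atomicExch(lock, 0);
}

int main()
{
    const int numLogical = 64;          // 64 "real" threads -> 64 warps -> 2048 lanes
    int *lock, *counter;
    cudaMallocManaged(&lock, sizeof(int));
    cudaMallocManaged(&counter, sizeof(int));
    *lock = 0;  *counter = 0;

    oneLogicalThreadPerWarp<<<(numLogical * 32 + 255) / 256, 256>>>(lock, counter, numLogical);
    cudaDeviceSynchronize();
    printf("counter = %d (expected %d)\n", *counter, numLogical);

    cudaFree(lock);  cudaFree(counter);
    return 0;
}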

Now what does Volta/Turing do for us? Well, it basically lets us automatically enable the "make my shader really slow" feature. That's it! If you use this feature on Turing and schedule each "compute shader thread" as if it's a "real thread", you'll still get the same 1/32 performance. You brushed off the divergence issue in your previous post, but uh... that's still the main problem! So yes, now Nvidia can automatically enable that feature for you without making everyone else's shaders slow. But here's the thing: if you have huge divergence, your shader is still not going to be practical! That's the big problem with callable shaders (and ray tracing for that matter)!!! It's not the scheduling, it's that they often diverge!!! GPUs are REALLY BAD at divergence (even when scheduled well!).

I don't understand why you are so hyped for an extra register per vector lane. Do you think all that separated Pascal from fast callable shaders was an extra register per lane? I'm not even sure this feature is utilized/exposed on the graphics side.

How "well designed" this black box is relative for each IHV out there.I guess we'll agree to disagree since we won't reach a meaningful consensus.

Two questions: 1) Do you think AMD's DX12 path would be faster than AMD's DX11 path as often without async compute? 2) Do you think async compute could have been added to a high-level API like DX11?

Answer Key: 1) No 2) Yes

Bonus Answer Key: The (lack of a) black box has nothing to do with AMD's "success" on DX12

Second Bonus Answer Key: You know what's the best part of DX12 and the removal of black boxes? Robustness. DX12 is significantly less error/bug prone than DX11 in practice due to it being a "clearer" API. That's the point you should be hammering away at. Forget performance, DX12's best selling point is that it can save you a lot of (debugging) time!

Third Bonus Answer Key: None of this disproves my original point. Even if we were to blindly accept that poor ole AMD was held back by DX11's big bad black box, there are still cases today where AMD's DX12 is slower. No matter how you slice it there's ample evidence that drivers can beat developers. And it's not even a rare event! If we accept this truth (which I don't really view as controversial), then we can accept there's a time and place for both DXR 1.0 and 1.1.
 
This is what Microsoft actually says about inline RT uber shaders. They seem to think black boxes are still quite useful.

The motivations for this second parallel raytracing system are both the any-shader-stage property as well as being open to the possibility that for certain scenarios the full dynamic-shader-based raytracing system may be overkill. The tradeoff is that by inlining shading work with the caller, the system has far less opportunity to make performance optimizations on behalf of the app. Still, if the app can constrain the complexity of its raytracing related shading work (while inlining with other non raytracing shaders) this path could be a win versus spawning separate shaders with the fully general path.

It is likely that shoehorning fully dynamic shading via heavy uber-shading through inline raytracing will have performance that depends extra heavily on the degree of coherence across threads. Being careful not to lose too much performance here may be a burden largely if not entirely for the application and its data organization as opposed to the system.

https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html#inline-raytracing

I imagine many PC devs will happily delegate to the driver instead of hand optimizing for every permutation of RT hardware in the market. Inline tracing is a useful tool but it's certainly not the best tool for every job.

This assumes of course that you have a proper DXR 1.0 driver that can handle the optimization.
 
Of course this cuts performance to 1/16 (on Intel and GCN) or 1/32 (on Nvidia and RDNA). In practice this would make your shader useless (for real-time rendering at least).
But it's useful when it's not about bulk work but about work generation, which is what my request is about.
Leaving aside the context of RT or other massively parallel tasks, let's look at how we handle program flow on a coarse level instead. We have two options:
1. Execute shaders, download results to the CPU, make decisions about the necessary future work depending on them, dispatch new shaders to execute that work. (This is very slow.)
2. Prebuild a command buffer that contains every potential branch of control flow - all indirect dispatches and barriers that could happen - upload it only once, and the GPU can do its thing on its own. (This is what a low-level API gives me, and I get a speedup of two over the previous approach, which is a huge win.)

The problem with option 2 is that a fixed program flow might be too limited for all cases.
For example, I could not execute a solver in a loop and repeat until some error criterion is met. (Mantle could do this.)
But for me the main problem is that the vast majority of indirect dispatches end up doing nothing. Their dispatch parameters end up zero, but the barriers are still executed and take time. (Profiled two years ago on AMD using Vulkan.)

Also - and ironically - looking at option 2 we see the same limitation you discuss here within a shader running on parallel cores, namely that we execute branches of code for nothing.
How can it be that we see this same limitation on the outer, coarsest level of handling overall program flow? Obviously there is something wrong.
And this is why it makes me mad to see the introduction of HW RT, mesh shaders and other awesomeness, while such obvious shortcomings go unaddressed for many years.

IIRC, OpenCL 2.0 has some form of shaders that only use one thread. Maybe the purpose is work generation - I do not remember. But it's clear that the inefficiency of a single-thread program would not matter if it helps solve the problems mentioned. We would execute only a tiny number of such work-generating shaders.

...just to put my rant into context: I'm not requesting callable shaders and then using them for everything. A 'work shader' that can generate a command buffer directly on the GPU would already do.
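Since this thread is HLSL/VK territory, the closest thing I can sketch is in CUDA, using dynamic parallelism as a stand-in for what I'm asking for (the kernel names and the "keep the even items" rule are invented). One generator thread reads the count produced by the previous pass and launches the next pass with exactly the right size, or launches nothing at all, instead of the host pre-recording every indirect dispatch and barrier that might be needed.

Code:
// Hypothetical sketch using CUDA dynamic parallelism as a stand-in for a GPU-side
// "work shader". Build with: nvcc -rdc=true -arch=sm_70 work_shader.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

__global__ void solvePass(const int* items, int count, int* outItems, int* outCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    if (items[i] % 2 == 0) {                       // toy rule: keep even items for the next pass
        int slot = atomicAdd(outCount, 1);
        outItems[slot] = items[i] / 2;
    }
}

__global__ void workGenerator(const int* items, const int* count, int* outItems, int* outCount)
{
    // One thread acts as the "work shader": it reads the counter written by the
    // previous pass and issues the next dispatch with exactly the right size,
    // or skips it entirely - no zero-sized dispatch, no barrier paid for nothing.
    int n = *count;
    if (n == 0) return;
    *outCount = 0;
    solvePass<<<(n + 255) / 256, 256>>>(items, n, outItems, outCount);
}

int main()
{
    const int n = 1024;
    int *itemsA, *itemsB, *countA, *countB;
    cudaMallocManaged(&itemsA, n * sizeof(int));
    cudaMallocManaged(&itemsB, n * sizeof(int));
    cudaMallocManaged(&countA, sizeof(int));
    cudaMallocManaged(&countB, sizeof(int));
    for (int i = 0; i < n; ++i) itemsA[i] = i;
    *countA = n;  *countB = 0;

    workGenerator<<<1, 1>>>(itemsA, countA, itemsB, countB);
    cudaDeviceSynchronize();                       // the parent grid completes only after its child
    printf("next pass produced %d items\n", *countB);

    cudaFree(itemsA);  cudaFree(itemsB);  cudaFree(countA);  cudaFree(countB);
    return 0;
}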
 
It's perfectly legal behavior.

I'm not sure if we're talking about the same thing, but attempting global synchronization across different warps/waves is undefined behaviour in any shading language. It could either cause race conditions or, in the case of a mutex, a deadlock.

Sure, keeping a single lane active is one solution to this, or you could just separate the thread groups into different dispatches.

As for the second half of your post, I'm pretty certain we're not going to see any progress so let's end it there.
 
Looking at some slides for DX12 Ultimate, one of the bullet points was:
* Hardware supporting DX12 Ultimate must support ALL features.

Now back in the DX11 days, with its feature level 1, feature level 2, etc., I complained that I should just be able to buy a DX11 card and know it would support all DX11 features, only to be told that this was bad and that having cards that don't support certain features was a good thing. So the question is: do B3D readers disagree with Microsoft?
 
A DX12 Ultimate logo/card means a GPU that will support the upcoming D3D_FEATURE_LEVEL_12_2, which will include the new geometry pipeline, DXR Tier 1.1 and sampler feedback. This is simple. There are also secondary features with single cap-bits, only meant for UMA/mobile architectures or multi-GPU support. Pretending that every card will support every single cap-bit is nonsense. Feature levels are meant to gather the cap-bits for the most significant features according to existing and upcoming hardware. You can still query single cap-bits if you know that some architectures not supporting a certain feature level still support some of its features. That's the same as DirectX 10.1/11.x, and still far better than Direct3D 9 and previous versions, where everything was a cap-bits hell, and still better than OpenGL (which has at least had core profiles since 3.0) and Vulkan with their extension hell.
 
What it does do is give a nice baseline feature set.
They could've gone with DX12.5 or 13, but I think this is probably better for marketing at what is the start of a new gen for consoles.

It would be nice to extend DX12U to other parts. For example: to be DX12U compliant you need an SSD (DX Velocity), a minimum 6c/12t CPU, etc.

Then, if a studio wants to, they can target it knowing they don't need to support non-RT cards, HDDs, etc.
Even years from now that will leave a lot of users out, but a studio may determine that the savings in production of the game are worth it.

Let's be clear: DX12U means >= RDNA2 for AMD; that's a lot of users who won't be supported even two years from now. Factor in laptops also.
 