DX12, DX12U, DXR, API bias, and API evolution ..

So, in response to several posts (mostly by @Lurkmass ) suggesting that the current DXR API is biased towards NVIDIA, and that AMD and Intel should fork themselves out of it, or that Microsoft should allow IHVs to add proprietary extensions to DXR, I've created this thread to continue the discussion.

It's worth noting that AMD already tried this before: when they got fed up with DX11, they forked off a new API, "Mantle", to pressure Microsoft into releasing DX12, which Microsoft did indeed. Microsoft even made the API similar to Mantle, causing overhead and problems for non-AMD hardware, as pointed out by Lurkmass himself too:

The D3D12 binding model causes some grief on Nvidia HW. Microsoft forgot to include STATIC descriptors in Root Signature 1.0, which then got fixed with RS 1.1, but no developers use RS 1.1, so in the end Nvidia likely have app profiles or game-specific hacks in their drivers.
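
For context, Root Signature 1.1 is where those static flags live. A minimal sketch of declaring a descriptor range with the static-data hint (device creation and the rest of the root signature are assumed elsewhere; this is illustrative, not taken from any particular engine):

// Root Signature 1.1: each descriptor range can carry a flag telling the driver
// how static its descriptors/data are -- the hint that was missing in RS 1.0.
D3D12_DESCRIPTOR_RANGE1 range = {};
range.RangeType                         = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
range.NumDescriptors                    = 8;
range.BaseShaderRegister                = 0;
range.RegisterSpace                     = 0;
range.Flags                             = D3D12_DESCRIPTOR_RANGE_FLAG_DATA_STATIC;
range.OffsetInDescriptorsFromTableStart = D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND;

D3D12_ROOT_PARAMETER1 param = {};
param.ParameterType                       = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
param.DescriptorTable.NumDescriptorRanges = 1;
param.DescriptorTable.pDescriptorRanges   = &range;
param.ShaderVisibility                    = D3D12_SHADER_VISIBILITY_ALL;

D3D12_VERSIONED_ROOT_SIGNATURE_DESC desc = {};
desc.Version                = D3D_ROOT_SIGNATURE_VERSION_1_1;
desc.Desc_1_1.NumParameters = 1;
desc.Desc_1_1.pParameters   = &param;

ID3DBlob* blob  = nullptr;
ID3DBlob* error = nullptr;
D3D12SerializeVersionedRootSignature(&desc, &blob, &error);
// ... then ID3D12Device::CreateRootSignature with the serialized blob.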

Is the current DXR situation similar to the DX12 situation from before?
 
Is the current DXR situation similar to the DX12 situation from before?
I don't see the parallel between forking out of DXR and creating Mantle.
Mantle was a fundamentally new type of API for PC gfx, spearheaded by AMD and game devs, and it literally became Vulkan after AMD gave it to Khronos to build upon.
Forking out of DXR would probably just happen via each manufacturer's already-used custom APIs built to work with DirectX among other APIs (AGS for AMD, OneAPI for Intel, NVAPI for NVIDIA) via extensions (yes, DirectX extensions are a thing too, even when they're not official).
 
Mantle was a fundamentally new type of API for PC gfx, spearheaded by AMD and game devs, and it literally became Vulkan after AMD gave it to Khronos to build upon.

Vulkan might borrow the basic API design concepts of Mantle, but it was significantly dumbed down to fit the common denominator. If I remember correctly, Mantle offered arbitrary levels of indirection (nested table pointers) for resource descriptors, something that Vulkan still lacks (I might be wrong, but I don't think even the new descriptor buffer extension supports indirection?).
 
A developer told me some time ago, in person, that life was easier with DX9, and used the success of a game called "They Are Billions" -iirc- to illustrate that. It was easier to write DX9 games.

He then told me that after DX11, and especially with DX12, it is now much more complicated.

Still, I prefer the fact that DX12 is low level compared to DX11 and earlier. Intel just don't bother much with previous APIs because of that.

Performance wise, games ran better for me in DX11 mode when I had the GTX 1080. Still, in RE2 Remake, one of my favourite games ever, I preferred to use DX12 because of a weird bug during the Claire campaign that I only suffered in DX11: sometimes (rarely) the screen went pitch dark after the computer screen scene at the very beginning, for no apparent reason, which was a game breaker and impossible to fix. That never happened in DX12 mode.
 

API design is hard. Besides, there are a lot of different types of hardware, each with their own quirks. Both Vulkan and DX12 aim to deliver a low-level interface which at the same time is abstract enough to work well for different hardware — truly a Sisyphean task. It is unfortunate that we end up with overly complicated, verbose and hard to use APIs, but I don't think there was any other way to get there.

Sometimes I wonder whether it wouldn't be better to first negotiate a certain level of hardware capabilities in the industry before going full steam on low-level APIs. Could have resulted in simpler abstractions. That's something Apple did with Metal and I think they had the right idea. Their API is adaptive: if you don't want to deal with the low-level details, you can have automatic memory and hazards management as well as a simple slot-based DX9-style binding API. If you care about performance and want to customise things, well, you can opt in to full manual memory/synchronization handling and you can build your resource binding model yourself: Metal gives you memory buffers and resource pointers that you can use as you please — it's just C++, structs and pointers (e.g. you can build linked lists of textures interleaved with arbitrary data, material graphs and anything else, really). I wonder why neither Microsoft nor Khronos went with this kind of approach. We know for sure it works on recent AMD and Intel GPUs (as Metal 3 is implemented on their hardware), but I suppose that there is at least one important vendor (or popular hardware) where this wouldn't be feasible...
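
To illustrate what I mean by "just C++, structs and pointers", here is a rough Metal Shading Language sketch (MSL is a C++-based dialect; the struct layout and names are made up, and I'm glossing over how the argument buffer gets encoded on the CPU side):

#include <metal_stdlib>
using namespace metal;

// Hypothetical material node living in device memory: a texture interleaved with
// arbitrary data, chained to the next node through a plain pointer.
struct MaterialNode {
    texture2d<float>           albedo;
    float2                     uvScale;
    device const MaterialNode* next;     // linked list of materials
};

struct FragmentIn {
    float4 position [[position]];
    float2 uv;
};

fragment float4 shadeAll(FragmentIn in                    [[stage_in]],
                         device const MaterialNode* head  [[buffer(0)]])
{
    constexpr sampler smp(address::repeat, filter::linear);
    float4 color = float4(0.0);
    // Walk the list and accumulate -- the shader just follows pointers.
    for (device const MaterialNode* n = head; n != nullptr; n = n->next)
        color += n->albedo.sample(smp, in.uv * n->uvScale);
    return color;
}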
 
Some related points i've picked up, but don't nail me on them:

DXR 1.0, with its separate ray generation and hit shaders, is preferred for NV and highly recommended for Intel.

DXR 1.1, meaning the inline tracing feature, where we trace the ray (RayQuery) directly in a compute shader and then process the result in the same shader, is often faster for AMD.
But inline tracing prevents any reordering optimizations the HW might do (traversal reordering), or definitely does (hit point sorting for coherent shading).
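
To make the two paths a bit more concrete, on the host side the difference looks roughly like this (hand-waved sketch; state objects, root signatures and the shader table layout are assumed to be set up elsewhere, and the variable names are placeholders):

// DXR 1.0: rays start in a dedicated ray generation shader, launched via DispatchRays
// against a ray tracing state object and shader binding table.
D3D12_DISPATCH_RAYS_DESC rays = {};
rays.RayGenerationShaderRecord.StartAddress = sbt->GetGPUVirtualAddress();
rays.RayGenerationShaderRecord.SizeInBytes  = recordSize;
rays.MissShaderTable.StartAddress  = sbt->GetGPUVirtualAddress() + missOffset;
rays.MissShaderTable.SizeInBytes   = recordSize;
rays.MissShaderTable.StrideInBytes = recordSize;
rays.HitGroupTable.StartAddress    = sbt->GetGPUVirtualAddress() + hitOffset;
rays.HitGroupTable.SizeInBytes     = recordSize;
rays.HitGroupTable.StrideInBytes   = recordSize;
rays.Width = width; rays.Height = height; rays.Depth = 1;

cmdList->SetPipelineState1(rtStateObject);   // ID3D12GraphicsCommandList4
cmdList->DispatchRays(&rays);

// DXR 1.1 (inline): the same work becomes a plain compute dispatch; the shader
// uses a RayQuery object against the TLAS, which is bound like any other SRV.
cmdList->SetPipelineState(computePso);
cmdList->SetComputeRootShaderResourceView(0, tlas->GetGPUVirtualAddress());
cmdList->Dispatch((width + 7) / 8, (height + 7) / 8, 1);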

Likely i got some details wrong, but i guess that's the topic aimed to be discussed.
Saw these benchmark results, which actually show a perf. difference: https://tellusim.com/rt-perf/
Sadly Intel is missing. They have newer blog posts which include it, but those miss the RQ (inline) vs RT (1.0) comparison.

Afaik, Apple's new RT API only supports queries, which differs from DXR. Connecting the dots, i assume a correlation between RQ and inline tracing, but i might be wrong.


My personal opinion:

When DXR was announced i was surprised by its complexity regarding ray generation and hit shaders, but mostly by the recursion support. I think this was a bit too ambitious for a start and indeed makes IHV adoption difficult.

For games i would have expected to give a big array of rays, trace it as a batch, and get hit points back in another big array, with the arrays living in VRAM.
This would still enable traversal reordering, ray binning to improve ray coherence at launch, and sorting returned hit points by material if requested.
The downside would be the need for big VRAM arrays instead of keeping rays on chip, which might hurt performance to some degree. Recursion would be a matter of the software implementation.
It would be interesting to look at Apple's RT API to see if they're closer to such a design.
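
Just to sketch what i mean, such a batched interface could look roughly like this (purely hypothetical, GpuBuffer and all the names are made up, none of this exists in any real API):

// Rays and hits live in big VRAM buffers instead of staying on chip.
struct Ray { float origin[3]; float tMin; float dir[3]; float tMax; };
struct Hit { float t; unsigned instanceId; unsigned primitiveId; float barycentrics[2]; };

// The implementation would be free to bin/reorder rays for traversal coherence,
// and to sort the returned hits by material before writing them back.
void TraceBatch(const GpuBuffer& rayBuffer,              // rayCount rays, consumed by the GPU
                GpuBuffer&       hitBuffer,              // rayCount hits, produced by the GPU
                const GpuBuffer& accelerationStructure,
                unsigned         rayCount,
                bool             sortHitsByMaterial);

// "Recursion" (bounces) becomes a software loop: consume the hits, generate a new
// ray batch, and call TraceBatch again.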

However, as you might guess, i do not really care about those details. Either is fine. DXR has much larger issues than that. ;)
 
Afaik, Apple's new RT API only supports queries, which differs from DXR. Connecting the dots, i assume a correlation between RQ and inline tracing, but i might be wrong.

They added inline tracing in 2021; it's just that, confusingly, they call this the "intersection query API", while the older method, where the intersector calls an intersection function, is called the "intersector API". You can do inline RT with Metal in any kind of shader program as far as I remember, but please take my words with a grain of salt since I only looked at the APIs superficially.

For games i would have expected to give a big array of rays, trace it as a batch, and get hit points back in another big array, with the arrays living in VRAM.
This would still enable traversal reordering, ray binning to improve ray coherence at launch, and sorting returned hit points by material if requested.
The downside would be the need for big VRAM arrays instead of keeping rays on chip, which might hurt performance to some degree. Recursion would be a matter of the software implementation.
It would be interesting to look at Apple's RT API to see if they're closer to such a design.

I think this API is what Apple started with some years ago — as a separate collection of RT-related compute shaders in their "Performance Shaders" library. But this is pretty much deprecated now. The more flexible RT API where you can launch rays from any shader and at any time is definitely easier to work with, but implementing all this in an efficient way might be a nightmare. I don't envy the hardware/driver people that have to deal with that :)
 
DX12 GPUs represent almost 94% of the GPU market right now -next are DirectX 8 GPUs with almost 4% usage, 🥲 wth?-. Too bad it didn't translate into more DX12 usage from developers. Curiously enough, this past November, the mythical creature, the GTX 1060 got surpassed as the most used GPU in Steam by the GTX 1650. Tbh I didn't check but I wonder whether the GTX 1650 is more powerful than the GTX 1060 or not.

 
I think this API is what Apple started with some years ago — as a separate collection of RT-related compute shaders in their "Performance Shaders" library. But this is pretty much deprecated now. The more flexible RT API where you can launch rays from any shader and at any time is definitely easier to work with, but implementing all this in an efficient way might be a nightmare. I don't envy the hardware/driver people that have to deal with that :)
Hehe, looks like Apple approached RT step by step, failing and improving as they go,
instead of constructing a huge pile of complex perfection behind closed doors and then proudly presenting the bulky, hard-to-change result, just to find out it's not really what people need.

I wish those API guys would have just a little bit of Cerny in their mindset... <: )
Invading Dice and Remedy with leather jackets was not enough, and probably much too late anyway.
 
Tbh I didn't check but I wonder whether the GTX 1650 is more powerful than the GTX 1060 or not.
Not really, but it does not matter. What matters is that the king still wears no rays on his clothes.
But adoption isn't bad either. Roughly summing up, i get approx 30% RT GPUs from that.
Maybe 50% is enough to make RT-only games, assuming the others who never upgrade don't buy many games either. Still some years to go...
 
Is the current DXR situation similar to the DX12 situation from before?
I never used DX11, but i remember the whining about draw call costs with OpenGL. People complained and tried all kinds of hacks to get improvements. The API evolved multiple times, some legacy stuff was deprecated, but the problem was never really solved.

Now it is solved, but only with Vulkan. It added support for prerecorded command buffers, which, combined with indirect draw calls, allows proper GPU-driven rendering. So, in the extreme case, a whole graphics engine in one draw call.
It also added support for generating multiple command buffers on multiple CPU threads in parallel, while the old OpenGL only gave the gfx context to one thread, so the others could not help with turning API commands into actual GPU commands, preparing pipelines, etc.
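
Roughly, the pieces look like this in Vulkan (sketch only; pipeline, render pass and buffer setup are omitted and the variable names are invented):

// Each CPU thread records into a command buffer allocated from its own
// VkCommandPool; one indirect call then submits draw parameters that the GPU
// itself wrote (e.g. in a culling compute pass).
VkCommandBufferBeginInfo begin = {};
begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
vkBeginCommandBuffer(threadCmdBuf, &begin);

// ... bind pipeline, descriptor sets, vertex/index buffers, begin the render pass ...

// Up to drawCount draws in one API call, all parameters read from a GPU buffer,
// so the CPU cost no longer scales with the number of objects.
vkCmdDrawIndexedIndirect(threadCmdBuf, indirectBuffer, 0 /*offset*/,
                         drawCount, sizeof(VkDrawIndexedIndirectCommand));

vkEndCommandBuffer(threadCmdBuf);
// The per-thread command buffers are then submitted together from the main thread
// (or executed as secondaries inside a primary command buffer).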

Just like DX12, Vulkan originated from Mantle. We could discuss some details, e.g. it is said that the DX12 way of barriers suits AMD HW better, while Vulkan follows NV HW more closely. I'm not sure about this. The difference seems to be that DX12 has split memory barriers. In a first step, you say there is going to be a sync point on resources X and Y, then you may dispatch unrelated work, then comes the second part of the barrier, which enforces the sync right before you actually need to continue work on the synchronized data. Makes perfect sense, because the driver could optimize by flushing caches earlier, so the work is already done when the barrier is enforced.

But Vulkan lacks this split feature; there is just a one-piece barrier. Still, the driver could eventually analyze the command buffer to figure out the potential optimization.
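
For reference, the DX12 split form is expressed with the BEGIN_ONLY / END_ONLY barrier flags, roughly like this (resource, state and command list names are made up):

// Begin half: announce the upcoming transition early so the driver can start
// flushing/transitioning while unrelated work runs.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Flags                  = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
barrier.Transition.pResource   = texture;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
cmdList->ResourceBarrier(1, &barrier);

// ... record unrelated work here; the driver may overlap the transition with it ...

// End half: the transition must be complete before the resource is used again.
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
cmdList->ResourceBarrier(1, &barrier);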

So i'm not sure if this barrier example really indicates any vendor preferences. But it's the only interesting 'NV vs. AMD' related example i came across.

We could also discuss how many devs have issues with the complexity of low level APIs, which i can confirm to be true. I only worked out the bare minimum of Vulkan rendering so i can display a GUI and simple visualizations, but the complexity is at least 10 times that of the old OpenGL. There is more uncertainty, more to read and learn, more to forget after that.
It's a burden. But the speedup is worth it. I got a speedup of two for my compute stuff, so i don't take complaints about complexity as an indication of an actual problem. Rather it seems the devs just whine about the necessary extra work to confirm to each other that they are not alone. Considering the number of engine devs is shrinking, that's not really a problem increasing costs, nor can it be the reason many AAA studios switch over to UE.

Some struggle with low level, others achieve great speedups and are happy. Mostly it's both, in that order. That's expected, not a problem.
I remember you seemingly dislike low level APIs, as you often point out devs complaining, or that early ports of engines gave no benefits, etc.
But no. Low level is the way to go if we want to push the HW. That's the goal, and it works much better than it did with high level APIs.
If you still disagree, is what you want to discuss actually adding RT support to DX11 and OpenGL? Imo this should indeed be done.


Now, i often say DXR is not low level, and should have been added to DX11 instead, due to lack of flexibility.
But that's more sarcasm and rant than meant seriously. I do not complain about how DXR works, i only complain about what it is missing: access to the BVH.
Otherwise it does not look bad to me. And since we have both options regarding inline tracing / ray query or abstract shader stages, i do not see the preference towards NV as @Lurkmass may have meant. But he has to clarify this himself, if he's still willing. I can't comment on his quoted details about static descriptors.

I only think that API complexity makes it hard for IHVs to work out the drivers. And i also think that the convenience exposed to devs, e.g. depicting recursion as a standard pattern, is neither meaningful nor helpful for games.
So the complexity is not really worth it, and surely a reason why we may need a new API earlier than expected, because iterating on the current one may at some point no longer work out well.
An API should be about simple concepts first, but not about ease of use first. But i don't complain about this. My own request is not about the traceRay function at all, and necessary extensions are tangent to what already exists.

That's just said to give some feedback. I think it's not really clear yet what you want to discuss.
 
Not really, but it does not matter. What matters is that the king still wears no rays on his clothes.
the GTX 1060 got surpassed as the most used GPU in Steam by the GTX 1650
Actually no. The true king is the 3060. You just need to combine the percentages for the 3060 laptop with the 3060 desktop, as the 3060 numbers are segregated this way.

The 1060/1650 numbers are not segregated the same way. The 1060 entry includes 1060/6GB, 1060/3GB and their respective laptop versions. The 1650 entry also includes the desktop variants GDDR5/GDDR6 and the laptop variants combined.

Only the 3060 is separated into a desktop entry (4.6%), and a laptop entry (3.4%), combined they are 8%, which is higher than anything else.
 
I remember you seemingly dislike low level APIs, as you often point out devs complaining, or that early ports of engines gave no benefits,
I don't dislike them, I dislike their results .. DX12 in particular, which has produced mostly troubled ports to this day. On the contrary, Vulkan is having a blast with its latest entries, as evident in games like Rainbow Six Siege, Ghost Recon Breakpoint and several others.

Now, i often say DXR is not low level, and should have been added to DX11 instead,
Crytek already added HW RT acceleration to DX11 through a VulkanRT hookup. But that's not what I am asking at all. DXR can't work without the DX12 model anyway.

My own request is not about the traceRay function at all, and necessary extensions are tangent to what already exists.
I understand that perfectly well.

And since we have both options regarding inline tracing / ray query or abstract shader stages, i do not see the preference towards NV as @Lurkmass may have meant. But he has to clarify this himself, if he's still willing. I can't comment on his quoted details about static descriptors.
Yeah, I am waiting on him to clarify his position. I guess both of us are.
 
Nevermind, they're in the same table; I remembered them having separated laptop and desktop numbers altogether.
 
Percentages don't work like that unless both groups have identical number of entries.
Steam counts users, it's not interested in desktop vs laptop numbers. In this case, 4.6% of Steam users are using the desktop 3060, and 3.4% are using the laptop. Combined they stand at 8%. It's also how Steam counts the 1060 and the 1650 percentages.
 
Actually no. The true king is the 3060. You just need to combine the percentages for the 3060 laptop with the 3060 desktop, as the 3060 numbers are segregated this way.

The 1060/1650 numbers are not segregated the same way. The 1060 entry includes 1060/6GB, 1060/3GB and their respective laptop versions. The 1650 entry also includes the desktop variants GDDR5/GDDR6 and the laptop variants combined.

Only the 3060 is separated into a desktop entry (4.6%), and a laptop entry (3.4%), combined they are 8%, which is higher than anything else.
Hmmmm interesting, that kinda explains the whole story. That being said, what I find most surprising is that DirectX 8 and below GPUs have increased their usage percentage from 3.82% in July 2022 to 6.75% in November 2022. WTH? My only theory for that is that there are fewer GPUs on Steam :rolleyes: or users with DX8 and below GPUs were less active in the summer.
 
I don't dislike them, I dislike their results .. DX12 in particular, which has produced mostly troubled ports to this day. On the contrary, Vulkan is having a blast with its latest entries, as evident in games like Rainbow Six Siege, Ghost Recon Breakpoint and several others.
This basically confirms good low level results. But only in general, as there is no big difference between DX12 and VK. VK is even more complicated due to its support for mobile, but the overall functionality is more or less the same.
I'm more worried about API priorities from IHVs. AMD's RT is significantly slower in VK, for example.
Crytek already added HW RT acceleration to DX11 through a VulkanRT hookup.
Relying on API interop isn't really great, as you lose fine-grained control over async execution.
Otherwise i'd use Cuda and OpenCL 2 to generate work on the GPU, and could stop my whining that gfx APIs don't support this.

Imo, the biggest failure of the tech industry is that we still have no cross-platform / cross-vendor solution to make the power of GPUs accessible to general programmers.
Every computer system has the HW, but only experts can actually use it in practice. The cost of maintaining compute code is way too high. Existing APIs are too specific, short-lived and low level for general use.
 
BTW, the recent Marvel's Spider-Man games on PC do an insane 100K descriptor updates per frame(!), with the CopyDescriptors API being the clear offender in terms of CPU overhead on NV, which is consistent with the API performance guide recommendation to not excessively create/copy descriptors every frame ...

On hardware capable of loading descriptors from memory, drivers can apply a workaround and implement the descriptor copies with a trivial CPU memcpy, which helps reduce the CPU overhead of copying descriptors ...
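
In code terms, that guidance mostly boils down to batching the copies instead of issuing them one by one; a rough sketch (the DescriptorUpdate struct and the handle/size arrays are hypothetical and assumed to be gathered elsewhere):

// Anti-pattern: one tiny copy per descriptor, many thousands of times per frame.
for (const DescriptorUpdate& u : updatesThisFrame)
    device->CopyDescriptorsSimple(1, u.dest, u.src,
                                  D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

// Preferred: gather the ranges up front and pay the call overhead once.
device->CopyDescriptors((UINT)destStarts.size(), destStarts.data(), destSizes.data(),
                        (UINT)srcStarts.size(),  srcStarts.data(),  srcSizes.data(),
                        D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);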
 
In the future i think we will need an API for CPU + GPU (APUs), because they enjoy a lot of benefits like no need for deep copies and low-cost CPU-to-GPU communication, things a discrete GPU could never take advantage of.
 