Direct3D feature levels discussion

Lurkmass · Dec 18, 2024

If graphics programmers want major API redesigns then it's their onus to push those very limits of graphics programming even if it makes IHVs uncomfortable!

Barriers were recently refactored to be more explicit. Barrier APIs are more complex these days because some hardware in particular requires a more fine grain resource state tracking model with one of the lower common denominator being AMD there ...

If the current resource binding model and "buffer zoo" concept is a sore thumb then we absolutely need more recent examples like Ghosts of Tsushima to push the boundaries of bindless design and show the consequences of what happens to laggards such as Nvidia when emulating more powerful bindless API designs like shader resource tables on inferior designs. Another big reason why bindless hasn't taken off is also down to the fact that major ISVs such as Epic Games with Unreal Engine have been dragging their feet for nearly a decade on implementing this feature ...

Work Graphs could be really helpful for expressing dependencies between different rendering passes if you want to avoid the complexity of using barriers in some cases and it's mesh nodes extension has potential for simplifying PSO management ...

The way I see it, better API designs aren't going to sprout into existence by just simply giving in to waiting out the stalemate. Developers need to do a better job to scare IHVs into submission to extract whatever concessions they want out of them. Largely confining the ISV bullying to just Intel or AMD to a lesser extent won't drive much impetus for API changes because of the lack of pressute to motivate the biggest IHV to do better in some of these aspects ...

Scott_Arm · Dec 18, 2024

@Lurkmass I'm looking forward to Sebastian Aaltonen's upcoming blog post. I've been watching his commentary about the buffer "zoo". My understanding from his tweets is that if a hypothetical DX13 dropped support for pascal, they could actually make a very lightweight api. The newer nvidia hardware from Turing onward should have the capability he expects.

Lurkmass · Dec 18, 2024

Scott_Arm said:
@Lurkmass I'm looking forward to Sebastian Aaltonen's upcoming blog post. I've been watching his commentary about the buffer "zoo". My understanding from his tweets is that if a hypothetical DX13 dropped support for pascal, they could actually make a very lightweight api. The newer nvidia hardware from Turing onward should have the capability he expects.

Ultimately, it's up to the graphics programmers to make the decision on whether or not to stand in line with IHVs because IHVs can't design hardware around hypothetical patterns that don't exist thus I'm skeptical about selling the idea we can improve API design massively which is in of itself subjective ...

Whilst Turing did implement a faster hardware path for bindless constant buffer views, the buffer zoo still very much exists over there with their HW designs as there's still a performance penalty in comparison to using bound constant buffer views ...

trinibwoy · Dec 19, 2024

Lurkmass said:
Whilst Turing did implement a faster hardware path for bindless constant buffer views, the buffer zoo still very much exists over there with their HW designs as there's still a performance penalty in comparison to using bound constant buffer views ...

Why wouldn’t Nvidia voluntarily adopt the more elegant and performant design without being coerced by ISVs? Are the required hardware changes complicated and/or expensive?

techuse · Dec 19, 2024

trinibwoy said:
Why wouldn’t Nvidia voluntarily adopt the more elegant and performant design without being coerced by ISVs? Are the required hardware changes complicated and/or expensive?

I would imagine the driver side changes would also be incredibly complex and time consuming.

trinibwoy · Dec 19, 2024

techuse said:
I would imagine the driver side changes would also be incredibly complex and time consuming.

I was thinking there must be some benefit to the “buffer zoo” otherwise why would any ISV willingly go down that path in a new game.

This sounds like a classic performance vs flexibility tradeoff. ISVs should just choose flexibility every time even if it makes Nvidia look bad. I agree with Lurkmass.

techuse · Dec 19, 2024

trinibwoy said:
I was thinking there must be some benefit to the “buffer zoo” otherwise why would any ISV willingly go down that path in a new game.

This sounds like a classic performance vs flexibility tradeoff. ISVs should just choose flexibility every time even if it makes Nvidia look bad. I agree with Lurkmass.

The buffer zoo might not be a deliberate choice so much as a result of architectural decisions that were made in the past.

I don't know if the performance vs flexibility choice is that clear cut.

Scott_Arm · Dec 19, 2024

Lurkmass said:
Ultimately, it's up to the graphics programmers to make the decision on whether or not to stand in line with IHVs because IHVs can't design hardware around hypothetical patterns that don't exist thus I'm skeptical about selling the idea we can improve API design massively which is in of itself subjective ...

Whilst Turing did implement a faster hardware path for bindless constant buffer views, the buffer zoo still very much exists over there with their HW designs as there's still a performance penalty in comparison to using bound constant buffer views ...

This is the post I was referring to. Outside my pay grade.

https://twitter.com/x/status/1868226166307377431

Looking forward to the full blog post he's working on.

DegustatoR · Dec 19, 2024

trinibwoy said:
Why wouldn’t Nvidia voluntarily adopt the more elegant and performant design without being coerced by ISVs? Are the required hardware changes complicated and/or expensive?

Cause it's elegant only from a graphics programmer perspective and isn't elegant at all from h/w complexity and costs perspective?
I think that we're well past the moment where spending transistors on programmer side QOL was a good idea.

trinibwoy · Dec 19, 2024

DegustatoR said:
Cause it's elegant only from a graphics programmer perspective and isn't elegant at all from h/w complexity and costs perspective?
I think that we're well past the moment where spending transistors on programmer side QOL was a good idea.

My guess is that it’s more elegant from a hardware perspective too but it’s slower. Slower because without the explicit API hints the hardware can’t optimize as well for specific use cases.

It seems like the right thing to do long term although we’re constantly hearing from developers that they want more freedom to program generic hardware and APIs but the results so far haven’t been great. If DirectX 12 was bad I shudder to think what people will do with generic pointers to GPU memory. The horror.

Lurkmass · Dec 20, 2024

trinibwoy said:
Why wouldn’t Nvidia voluntarily adopt the more elegant and performant design without being coerced by ISVs? Are the required hardware changes complicated and/or expensive?

Perhaps but if an API design has consistently sucked for the past several years with little to no signs of improvement in that area what other means do ISVs have to force the issue besides making good on their threat to 'misuse' current APIs to the detriment of that specific hardware vendor ? If we suppose that IHVs are rational actors that mostly look out for their own self interests and that developers were able to make AMD and Intel cave in with ray tracing and GPU-driven rendering (ExecuteIndirect) respectively then it's not out of the realm of possibility that they can change Nvidia's stance to work towards getting rid of the buffer zoo ...

Historically, the best way to get the attention of an IHV is to make benchmarks so that ISVs can dictate hardware design changes for themselves ...

Scott_Arm said:
This is the post I was referring to. Outside my pay grade.

https://twitter.com/x/status/1868226166307377431

Looking forward to the full blog post he's working on.

The buffer zoo on NV hardware still hasn't been entirely eliminated when we look at sebbbi's perftest application. Divergent indexed access to constant buffers (cbuffer{float4} load linear) is almost ~30x slower compared to a randomly accessed typed buffer on Ampere and not using vertex buffers just means that you're wasting fixed function vertex fetch hardware that comes for free over there ...

You can pretend that the buffer zoo doesn't exist on NV HW but may not be ideal for performance ...

trinibwoy · Dec 20, 2024

Lurkmass said:
Perhaps but if an API design has consistently sucked for the past several years with little to no signs of improvement in that area what other means do ISVs have to force the issue besides making good on their threat to 'misuse' current APIs to the detriment of that specific hardware vendor ? If we suppose that IHVs are rational actors that mostly look out for their own self interests and that developers were able to make AMD and Intel cave in with ray tracing and GPU-driven rendering (ExecuteIndirect) respectively then it's not out of the realm of possibility that they can change Nvidia's stance to work towards getting rid of the buffer zoo ...

Historically, the best way to get the attention of an IHV is to make benchmarks so that ISVs can dictate hardware design changes for themselves ...

It’s a valid strategy but one with acceptable consequences when you’re sabotaging products with only 0-10% market share.

What’s not clear from this discussion is whether there are still any benefits to specialized memory paths. Is AMDs streamlined buffer implementation just as fast as Nvidia’s custom paths or did they trade some speed for flexibility? If there’s no downside it seems reasonable that Nvidia will follow suit eventually.

Lurkmass · Dec 20, 2024

trinibwoy said:
It’s a valid strategy but one with acceptable consequences when you’re sabotaging products with only 0-10% market share.

What’s not clear from this discussion is whether there are still any benefits to specialized memory paths. Is AMDs streamlined buffer implementation just as fast as Nvidia’s custom paths or did they trade some speed for flexibility? If there’s no downside it seems reasonable that Nvidia will follow suit eventually.

What design is faster vs slower between is up for interpretation ...

Special memory spaces can potentially be faster if the program in mind shows the optimal memory access patterns for it but it can also be much slower than global/generic/general memory spaces when pathological memory access patterns are observed ...

AMD not having any special memory spaces means that the programmer doesn't have to think about applying memory access pattern optimizations in regards to the differing types of buffers. I guess some graphics programmers prefer the sigh of relief for a convenient programming paradigm whereby users don't have to worry about losing out on performance either due to not making use out this hardware path or using it in the wrong way because it doesn't exist!

DegustatoR · Jan 7, 2025

Enabling Neural Rendering in DirectX: Cooperative Vector Support Coming Soon - DirectX Developer Blog

Neural Rendering: A New Paradigm in 3D Graphics Programming In the constantly advancing landscape of 3D graphics, neural rendering technology represents a significant evolution. Neural rendering broadly defines the suite of techniques that leverage AI/ML to dramatically transform traditional...

devblogs.microsoft.com

So that's what WMMA feature has morphed into I suppose.

DavidGraham · Jan 7, 2025

Holy molly, DirectX will now have access to the tensor cores!

Cooperative vectors will unlock the power of Tensor Cores with neural shading in NVIDIA’s new RTX 50-series hardware. Neural shaders can be used to visualize game assets with AI, better organize geometry for improved path tracing performance and tools to create game characters with photo-realistic visuals. Learn more about NVIDIA’s plans for neural shaders and DirectX here.

DirectX will soon support Cooperative Vectors, which will unlock the power of Tensor Cores on NVIDIA GeForce RTX hardware and enable game developers to fully accelerate neural shaders on Windows

NVIDIA RTX Neural Rendering Introduces Next Era of AI-Powered Graphics Innovation | NVIDIA Technical Blog

NVIDIA today unveiled next-generation hardware for gamers, creators, and developers—the GeForce RTX 50 Series desktop and laptop GPUs. Alongside these GPUs, NVIDIA introduced NVIDIA RTX Kit…

developer.nvidia.com

raytracingfan · Jan 7, 2025

Nvidia won. Everyone who claimed tensor cores couldn't be used for real-time graphics has now been proven wrong, even more so than DLSS has proven them wrong. It's only a matter of time until we see games which won't even run on graphics cards without neural accelerators, just as we are already starting to see games that won't run on cards without RT hardware.

DegustatoR · Jan 7, 2025

raytracingfan said:
It's only a matter of time until we see games which won't even run on graphics cards without neural accelerators

If the next console gen will have them (and PS5Pro already does) then that's pretty much inevitable.

iroboto · Jan 7, 2025

raytracingfan said:
Nvidia won. Everyone who claimed tensor cores couldn't be used for real-time graphics has now been proven wrong, even more so than DLSS has proven them wrong. It's only a matter of time until we see games which won't even run on graphics cards without neural accelerators, just as we are already starting to see games that won't run on cards without RT hardware.

I would say that the industry won. Nvidia is just leading the industry.
If nvidia didn't go this route and we only looked at increasing raw power, the fears of the GPU costs would have probably been realized.

This is going to be the way forward at least until some other form of silicon can get better efficiency than AI algorithms.

Cappuccino · Jan 7, 2025

IQandHDR said:
I kinda want support for them to go away too.
They are pre-RTX.
Time to rip of the bandaid

HUB will be fun to watch when that happens...the crying will be entertaining

This comes across as odd and vindictive. It also won’t mean much as games will probably still come out with DX12 versions for quite a while (PC gamers often forget that consoles still exist and are roughly half the market, with one of them being a straight up DX12 machine).

DX13 should probably drop support for anything pre-mesh shader I agree but even if it came out tomorrow it would still probably take 5 years to have games come out that don’t support DX12 at all.

Ext3h · Jan 7, 2025

Lurkmass said:
What design is faster vs slower between is up for interpretation ...

Special memory spaces can potentially be faster if the program in mind shows the optimal memory access patterns for it but it can also be much slower than global/generic/general memory spaces when pathological memory access patterns are observed ...

AMD not having any special memory spaces means that the programmer doesn't have to think about applying memory access pattern optimizations in regards to the differing types of buffers. I guess some graphics programmers prefer the sigh of relief for a convenient programming paradigm whereby users don't have to worry about losing out on performance either due to not making use out this hardware path or using it in the wrong way because it doesn't exist!

Been pondering about that for a while, and I guess I finally understood how AMDs and NVidia implementations differ in detail.

It appears that the grid context that NVidia is loading onto the SM does actually contain a small table of address ranges and associated instructions for the L0, L1 (possibly also L2, but that one has also explicitly exposed properties) cache that transform reads into any address within that range to a burst reads with a programmable base alignment, size and a bias for cache retention. The later avoids a lot of otherwise punishing L1 cache trashing for traversal of linked structures especially. (And obviously hurts when you use the wrong "buffer type" for such a virtually random access structure.)

That appears to be quite the smart alternative to the concept of temporal/non-temporal prefetches in the X86 world, where different instructions also have a (hidden) different bias for cache retention (e.g. avoiding L1 cache cluttering for vectorized gather loads) while other instructions are biased towards a smaller/larger prefetch window.

On RNDA, we do have programmable L1 instruction prefetch window size, one explicit 64bit and one explicit 128bit data prefetch instruction, but that's about it. Apart from that, the whole cache and prefetch system appears to be a "one size fits all" common ground.

Direct3D feature levels discussion

Lurkmass

Scott_Arm

Lurkmass

trinibwoy

Meh

techuse

trinibwoy

Meh

techuse

Scott_Arm

DegustatoR

trinibwoy

Meh

Lurkmass

trinibwoy

Meh

Lurkmass

DegustatoR

Enabling Neural Rendering in DirectX: Cooperative Vector Support Coming Soon - DirectX Developer Blog

DavidGraham

NVIDIA RTX Neural Rendering Introduces Next Era of AI-Powered Graphics Innovation | NVIDIA Technical Blog

raytracingfan

DegustatoR

iroboto

Daft Funk

Cappuccino

Ext3h