Direct3D feature levels discussion

So AutoSR is an AI spatial upscaler? Like an FSR1 + AI kind of solution? I wonder what the quality difference between FSR1 and AutoSR would be?
 
So AutoSR is an AI spatial upscaler? Like an FSR1 + AI kind of solution?
Seems so, but it's integrated at the OS+driver level, meaning that games don't have to do anything to support it.

I do wonder how this will work in practice though:
Auto SR is deeply integrated into the OS to modify the gaming process, ensuring smooth coordination from the driver to the display output. It begins by automatically adjusting the desktop resolution downwards for you, causing games to render at a lower resolution and speeding up frame rates without requiring user intervention. This optimization impacts all on-screen elements during gameplay, including open applications. Yet, these changes are virtually unnoticeable to players in full-screen or windowed borderless modes—where Auto SR shines. Once you exit the game, the desktop swiftly returns to exactly the way you had it.
So no alt-tabbing then?
 
Seems so, but it's integrated at the OS+driver level, meaning that games don't have to do anything to support it.

I do wonder how this will work in practice though:

So no alt-tabbing then?
Why wouldn't the other apps work scaled up if you alt+tab?
 
So all this is so I don't have to set a lower resolution in the game options?
And what if I don't want to run my game at a lower resolution? Maybe I want to run my game at a higher resolution and downsample it?
 
So all this is so I don't have to set a lower resolution in the game options?
And what if I don't want to run my game at a lower resolution? Maybe I want to run my game at a higher resolution and downsample it?
Then you need to rely on whatever your gfx IHV offers. This is for iGPUs on SoCs with NPUs, and it doesn't take anything away from anyone, nor is it forced on.
 
Spicy update for Work Graphs v1.008

Graphics nodes with ordered rasterization

A possible use for this ordering capability is GPU-driven rendering of transparencies with varying materials. The work graph could be a set of standalone mesh nodes that are all entrypoints, acting as a sort of material palette. The app can call DispatchGraph() with input data from CPU or GPU that represents the sorted order to render geometry + materials, computed earlier, for instance from an earlier work graph. Ordering should only be requested if actually needed, as the extra constraint may limit performance that could be achieved otherwise.
I guess this could be used to render non-virtualized transparent geometry (needs predictable primitive rendering order) on top of a scene which may consist of virtualized opaque geometry (no ordering necessary) in a separate forward pass ...
If a work graph has any nodes that drive graphics, e.g. mesh nodes, and those nodes are graph entrypoints, then invocations of these nodes directly from work graph input (DispatchGraph() input) retire rasterization results to any given render target location (e.g. rendertarget/depth sample location) in the order of these inputs. That is, rasterization resulting from graph input N retires before rasterization resulting from input N+1 etc., even if the invoked graphics nodes are different. Even though rasterization results retire in this order, pixel shader invocations producing them may not be in order - this is just like rasterization behavior outside work graphs.
@Bold Basically their way of adding the Xbox extension for ExecuteIndirect on PC ...
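To make that concrete, here's a rough HLSL sketch of what one entry in such a material palette might look like as a standalone mesh node, going by the spec's examples. The record layout, the vertex buffer and the placeholder geometry fetch are all made up; only the node plumbing follows the spec:

struct DrawRecord
{
    uint3 dispatchGrid : SV_DispatchGrid; // per-record grid size, read from DispatchGraph() input
    uint  firstVertex;                    // hypothetical: where this draw's geometry starts
};

StructuredBuffer<float3> VertexPositions; // hypothetical pre-transformed positions

struct PSInput { float4 pos : SV_POSITION; };

[Shader("node")]
[NodeLaunch("mesh")]
[NodeIsProgramEntry]                 // fed directly by DispatchGraph() input
[NodeId("MaterialA")]                // one such node per material in the palette
[NodeMaxDispatchGrid(65535, 1, 1)]
[NumThreads(96, 1, 1)]
[OutputTopology("triangle")]
void MaterialA(
    DispatchNodeInputRecord<DrawRecord> draw,
    uint tid : SV_GroupIndex,
    out indices uint3 tris[32],
    out vertices PSInput verts[96])
{
    SetMeshOutputCounts(96, 32);     // 96 verts / 32 tris per group, like a regular mesh shader
    if (tid < 32)
        tris[tid] = uint3(3 * tid, 3 * tid + 1, 3 * tid + 2);
    verts[tid].pos = float4(VertexPositions[draw.Get().firstVertex + tid], 1.0);
}

Note the ordering itself isn't expressed in the HLSL at all; per the quoted text it's a property requested for the graph, applied to the order of the DispatchGraph() input records.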
 
Work graphs, for whatever reason, have really piqued my interest. I keep watching and reading all the content even though I've never used D3D in my life.
 
Work graphs, for whatever reason, have really piqued my interest. I keep watching and reading all the content even though I've never used D3D in my life.

There were some interesting points in that presentation. If I’m understanding correctly you can actually build producer/consumer graphs using execute indirect today but you have to take care of all of the memory allocation, dispatch logic and synchronization which is probably too much for an average dev. Work graphs are moving a lot of that responsibility to the drivers and making it easier to write producer/consumer algorithms with some reasonable constraints.
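For context, a rough sketch of the shader half of that do-it-yourself route, as I understand it (buffer layout and names are invented): the producer appends records and a tiny follow-up pass folds the count into indirect dispatch arguments, while the CPU side has to preallocate the record buffer conservatively and issue a barrier before ExecuteIndirect launches the consumer:

// Hand-rolled producer stage of a producer/consumer chain.
struct WorkRecord { uint primitiveId; };

RWStructuredBuffer<WorkRecord> gRecords : register(u0); // CPU-preallocated, worst case
RWByteAddressBuffer            gCount   : register(u1); // running record count
RWByteAddressBuffer            gArgs    : register(u2); // 12 bytes: uint3 dispatch args

[numthreads(64, 1, 1)]
void ProducerCS(uint dtid : SV_DispatchThreadID)
{
    bool produces = (dtid % 3) == 0;        // stand-in for a real culling/visibility test
    if (produces)
    {
        uint slot;
        gCount.InterlockedAdd(0, 1, slot);  // manual "allocation" out of the big buffer
        gRecords[slot].primitiveId = dtid;
    }
}

[numthreads(1, 1, 1)]
void WriteArgsCS()   // runs after a UAV barrier on gCount
{
    uint count = gCount.Load(0);
    gArgs.Store3(0, uint3((count + 63) / 64, 1, 1)); // consumer uses 64-thread groups
}

Every producer-to-consumer edge costs an argument buffer, a barrier and a worst-case allocation like this; work graphs move that bookkeeping into the runtime and driver.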

I had originally understood work graphs to be unlocking new capabilities but it’s not clear that’s the case (aside from the mesh shader integration which is cool). Am I reading this wrong? Are work graphs essentially easy mode execute indirect?

I find it interesting given all the recent demand for low level access to GPUs. Work graphs seem to be going in the opposite direction.
 
There were some interesting points in that presentation. If I’m understanding correctly you can actually build producer/consumer graphs using execute indirect today but you have to take care of all of the memory allocation, dispatch logic and synchronization which is probably too much for an average dev. Work graphs are moving a lot of that responsibility to the drivers and making it easier to write producer/consumer algorithms with some reasonable constraints.

I had originally understood work graphs to be unlocking new capabilities but it’s not clear that’s the case (aside from the mesh shader integration which is cool). Am I reading this wrong? Are work graphs essentially easy mode execute indirect?

I find it interesting given all the recent demand for low level access to GPUs. Work graphs seem to be going in the opposite direction.
While work graphs do have a more implicit API design, there are genuinely new capabilities that come with them that aren't found in the prior ExecuteIndirect API. If you want to do recursive compaction/expansion of complex hierarchical data structures, as seen in Nanite's persistent hierarchical LoD culling or in procedural rendering algorithms, you want this done in a "barrierless" fashion, as is the case with persistent threads. You don't want barriers between these chains of indirect dispatches, since that idles the GPU. When Epic Games were implementing persistent threads on consoles, they weren't just seeing savings from empty-draw compaction; they were also seeing savings from the fact that they didn't need to issue a barrier while processing a new DAG level!

Work graphs allow us to schedule this work more efficiently (no barriers!) and let us avoid the dangers (potential deadlocks) of persistent threads or other clever synchronization techniques, by giving forward progress guarantees (no deadlocks!) for our entire graph dispatch ...

Other functionality you may have alluded to is GPU generated commands to do PSO swapping which comes with graphics nodes and limited support for self-node recursion ...
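As a concrete illustration of both points, a hedged sketch of a self-recursive work graph node expanding a hierarchy level by level with no barriers in between. The hierarchy layout (gChildren/gChildCount, 8 children per node) and the omitted visibility test are hypothetical; the node plumbing follows the spec:

struct CullRecord
{
    uint3 dispatchGrid : SV_DispatchGrid;
    uint  nodeIndex;                       // hypothetical LoD/BVH node id
};

StructuredBuffer<uint> gChildren   : register(t0); // hypothetical: 8 child slots per node
StructuredBuffer<uint> gChildCount : register(t1);

[Shader("node")]
[NodeLaunch("broadcasting")]
[NodeMaxDispatchGrid(1024, 1, 1)]
[NodeMaxRecursionDepth(16)]                // bounded self-recursion
[NumThreads(8, 1, 1)]
void HierarchyCull(
    DispatchNodeInputRecord<CullRecord> input,
    [MaxRecords(8)] [NodeId("HierarchyCull")] NodeOutput<CullRecord> recurse,
    uint tid : SV_GroupIndex)
{
    uint parent   = input.Get().nodeIndex;
    bool hasChild = tid < gChildCount[parent] &&
                    GetRemainingRecursionLevels() > 0;
    // (a real implementation would also frustum/occlusion test the child here)

    // Records go straight back into this node via the built-in allocator;
    // the scheduler starts the next DAG level as records arrive - no barrier.
    ThreadNodeOutputRecords<CullRecord> rec =
        recurse.GetThreadNodeOutputRecords(hasChild ? 1 : 0);
    if (hasChild)
    {
        rec.Get().dispatchGrid = uint3(1, 1, 1);
        rec.Get().nodeIndex    = gChildren[parent * 8 + tid];
    }
    rec.OutputComplete();
}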
 
When Epic Games were implementing persistent threads on consoles, they weren't just seeing savings from empty-draw compaction; they were also seeing savings from the fact that they didn't need to issue a barrier while processing a new DAG level!

Is Epic using execute indirect at all in their implementation or is it all custom persistent thread kung fu?
 
Is Epic using execute indirect at all in their implementation or is it all custom persistent thread kung fu?
Specifically for their persistent hierarchical LoD culling, they do a cooperative dispatch where the threads in a wave get reused (hence the name "persistent threads") to grab jobs from a global work queue that's guarded by an atomic lock until it's emptied ...
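Something like this toy HLSL loop, to illustrate (queue layout, seeding and the 2-children-per-node hierarchy are all invented; the interesting part is the spin on shared atomics):

globallycoherent RWStructuredBuffer<uint> gQueue : register(u0); // node ids; CPU seeds the roots
RWByteAddressBuffer gCtrl : register(u1); // [0]=written, [4]=claimed, [8]=finished

StructuredBuffer<uint> gChildren   : register(t0);
StructuredBuffer<uint> gChildCount : register(t1);

[numthreads(64, 1, 1)]
void PersistentCullCS()
{
    for (;;)
    {
        uint myJob;
        gCtrl.InterlockedAdd(4, 1, myJob);         // claim the next queue slot

        for (;;)                                   // spin until that slot is populated
        {
            uint written, finished;
            gCtrl.InterlockedAdd(0, 0, written);   // atomic reads via add-zero
            gCtrl.InterlockedAdd(8, 0, finished);
            if (myJob < written) break;            // our job has been produced
            if (finished == written) return;       // queue drained: retire the thread
            // <- the busy-wait: threads parked here MUST eventually be
            //    rescheduled, or the whole dispatch deadlocks
        }

        uint node = gQueue[myJob];
        for (uint i = 0; i < gChildCount[node]; ++i)   // "process": emit next DAG level
        {
            uint slot;
            gCtrl.InterlockedAdd(0, 1, slot);
            gQueue[slot] = gChildren[node * 2 + i];
        }
        gCtrl.InterlockedAdd(8, 1);                // mark our job done
    }
}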

On PC they can't do persistent threads, because there's no promise that threads busy-waiting on a global atomic spinlock won't deadlock: GPUs have varying conditions for forward progress guarantees. On older AMD GPUs, if threads are busy waiting/stalled on any s_waitcnt instructions then they will NEVER be rescheduled (deadlock), even if they are actively occupied on a SIMD, due to the oldest-first scheduling policy. On consoles, Epic Games carefully designed their algorithm to avoid this pitfall, and they don't have to worry about driver-update shenanigans (whether or not an updated compiler will emit s_waitcnt while threads busy wait), so they're able to exploit the property of forward progress ... (persistent threads w/ no deadlocks!)

I surmise that the reason AMD implemented the MES unit was that they introduced a new scheduling policy mode (FWD_PROGRESS) for the purpose of guaranteeing forward progress in more cases during a cooperative dispatch. I suspect that when work graphs trigger a cooperative dispatch, the driver will toggle on the new scheduling policy to bypass/work around the compiler, so that all active threads get scheduled regardless, avoiding any possible deadlocks in program execution ...
 
There were some interesting points in that presentation. If I’m understanding correctly you can actually build producer/consumer graphs using execute indirect today but you have to take care of all of the memory allocation, dispatch logic and synchronization which is probably too much for an average dev.
The point was that you have to allocate up-front on the CPU with ExecuteIndirect. There's no allocator in HLSL. If the CPU doesn't know the number of processed data elements (which it doesn't with ExecuteIndirect), then it has to allocate the conservative maximum that the algorithm requires. The aggregate of conservative max allocations over all ExecuteIndirect stages of the algorithm might be larger than your GPU memory. Alternatively, if you cap the memory, you limit the number of elements the algorithm supports.

Imagine you have 10000 materials, a full-screen buffer, and 4 stages. If you bin materials you could have 3*10000*screensize in conservative temporary memory allocations, because nobody knows which materials are used: each of the 10000 materials could be in every pixel at some time. The algorithm couldn't possibly know up-front. You also have 10003 dispatch calls in your C++ code with 10003 distinct state objects, but fortunately only 4 barriers.
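To put a rough (made-up but plausible) number on it: at 4K that's 3 * 10000 * 3840 * 2160 ≈ 2.5e11 worst-case element slots, and even at only 4 bytes per element that's on the order of a terabyte of temporary allocations, for a working set that will only ever touch a tiny fraction of it.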

Because of that you have severe hard limits on the complexity of your ExecuteIndirect pipelines, hard limits you cannot break in practice. In WG you have a built-in allocator for node-to-node data passing. As long as your data size doesn't surpass the limits in the spec, you can write algorithms of unrestricted complexity. The work graphs scheduler can cap the memory consumption of the running data set (vs. the memory consumption of the whole timeline's data set) while ensuring forward progress.
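In HLSL terms, that built-in allocator is the output-record API. A hedged sketch of a binning-style node (record contents, counts and the downstream ShadeMaterial node are invented for illustration):

struct MaterialBin
{
    uint materialId;
    uint pixelOffset;
};

[Shader("node")]
[NodeLaunch("coalescing")]
[NumThreads(64, 1, 1)]
void BinMaterials(
    [MaxRecords(64)] GroupNodeInputRecords<MaterialBin> inputs,
    [MaxRecords(64)] NodeOutput<MaterialBin> ShadeMaterial,
    uint tid : SV_GroupIndex)
{
    bool live = tid < inputs.Count();

    // Ask the graph's allocator for exactly as many records as survive;
    // no conservative CPU-side buffer anywhere in sight.
    ThreadNodeOutputRecords<MaterialBin> outRec =
        ShadeMaterial.GetThreadNodeOutputRecords(live ? 1 : 0);
    if (live)
    {
        outRec.Get().materialId  = inputs.Get(tid).materialId;
        outRec.Get().pixelOffset = inputs.Get(tid).pixelOffset;
    }
    outRec.OutputComplete();
}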
 