Unreal Engine 5, [UE5 Developer Availability 2022-04-05]

raytracingfan · Mar 30, 2025

Karamazov said:
Indiana Jones uses the MOTOR engine, which is based on id Tech 7.

I wonder how id Tech would have evolved if id Tech 5 and later versions were licensed to third parties like their predecessors, or at least if Arkane, Tango Gameworks, and MachineGames collaborated with id Software on developing a common version of id Tech to meet the needs of all four studios instead of forking id Tech to make their own engines. The engine would still be focused on first-person shooters, but it would presumably be more versatile and easier to adapt to different games. Would it still have a good reputation for performance? Or would it end up being pushed into use cases it doesn't perform well in like Capcom's RE Engine, or be burdened with supporting a wide variety of use cases like Unreal and Unity?

Albuquerque · Mar 30, 2025

cwjs said:
Implementation of a shared API feature is by definition driver and platform dependent. That's like... what a shared api is, lol. The linked blog post is mostly about the differences in the amd and nvidia implementations he tested!

(I'm not claiming it's faster -- don't see why it would be -- but I also don't see why it would be in principle slower (which it clearly was for the cases tested by the poster))

Your reply insinuated some sort of expectation of a driver limitation or a change in the APU/GPU performance as it might relate to some non-desktop part.

I'm not convinced either of these two items are somehow holding back a level of performance which would decisively help XBOX.

cwjs · Mar 30, 2025

Albuquerque said:
I'm not convinced either of these two items are somehow holding back a level of performance which would decisively help XBOX.

It's highly likely that a compatibility-breaking console gpu implements an (at the time) vendor-specific feature, from the same company that makes the console, differently than a desktop gpu does. "help" is apples to oranges; all gpus have different fast or slow paths. My issue here is with misunderstanding the context in which the (original, linked) blog post was shared and extrapolating it to something entirely unrelated. I don't care if mesh shaders are fast on xbox or not (for now -- we don't use them at work!)

iroboto · Mar 30, 2025

Lurkmass said:
but there's definitely some room for skepticism for the other parts of the mesh shading API itself like amplification shaders because it is somewhat doubtful that AMD HW sees any performance advantages when it's implemented as compute shaders under the hood so there's not a whole lot of "special hardware" going on to accelerate the functionality ...

I received a 404 off your link but I’m assuming it was this one:

Task shader driver implementation on AMD HW

Previously, I gave you an introduction to mesh/task shaders and wrote up some details about how mesh shaders are implemented in the driver. But I left out the important details of how task shaders (aka. amplification shaders) work in the driver. In this post, I aim to give you some details about...

timur.hu

My understanding of graphics coding is fairly limited, but the reason I say hardware is required is that a task shader on AMD HW, despite being dispatched on the asynchronous compute queue, runs in parallel with the mesh shader on the 3D queue, and both queues communicate with each other to complete the job. It’s just something that I don’t think you can normally do with API calls.

The compute queue cannot communicate in flight with what’s happening on the 3D queue, and vice versa.

There’s a ton of extremely heavy lifting being done by the firmware to support this. I think even Timur indicates it’s very difficult to get this working driver-side due to the level of interaction between the two queues. More importantly, the work has to be submitted to both queues at the same time (“gang submit”), which may actually be the requirement here for task shaders; otherwise it’s going to be emulated by submitting to the queues one after the other.

I think it’s something that Sony would have to provide to their developers. But I don’t believe a developer could emulate a task shader by manually running code on async compute with the mesh shader on the 3D side.

I agree that we don’t know what the performance benefits are, but I don’t agree that it’s as simple as a compute shader.

Potato Head · Mar 30, 2025

Where’s Andrew been? It’s been three weeks since he’s posted.

Lurkmass · Mar 30, 2025

iroboto said:
There’s a ton of extremely heavy lifting being done by the firmware to support this. I think even Timur indicates it’s very difficult to get this working driver-side due to the level of interaction between the two queues. More importantly, the work has to be submitted to both queues at the same time (“gang submit”), which may actually be the requirement here for task shaders; otherwise it’s going to be emulated by submitting to the queues one after the other.

I think it’s something that Sony would have to provide to their developers. But I don’t believe a developer could emulate a task shader by manually running code on async compute with the mesh shader on the 3D side.

What do you mean by this ? Open source driver developers actually DID implement task shaders WITHOUT firmware support ...

We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver we’ve never had this kind of close cooperation between graphics and compute.

When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders. We must also ensure correct synchronization between these two command buffers according to application barriers ― because of the API mismatch it must work as if the internal compute cmdbuf was part of the graphics cmdbuf. We also need to emit the same descriptors and push constants, etc. When the application submits the graphics queue, this new, internal compute command buffer is then submitted to the async compute queue.

Thus far, this sounds pretty logical and easy.

The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.

So, of course I had to do some crazy refactor to enable this. At the time of writing the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.
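To make the quoted description a bit more concrete, here is a toy Python model of the idea: a draw with a task shader records into the graphics command buffer and into a hidden compute command buffer, and without gang submit the two are handed to their queues one after the other. Every class and function name here is hypothetical and has nothing to do with RADV's real code, and the submission order (compute first) is my assumption, since the post only says the submits are sequential.

```python
class CmdBuf:
    """Hypothetical command buffer: just a named list of packets."""
    def __init__(self, name):
        self.name = name
        self.packets = []


class GraphicsCmdBuf(CmdBuf):
    def __init__(self):
        super().__init__("gfx")
        self.hidden_compute = None  # created only if a task shader is used

    def draw_mesh_tasks(self, push_constants):
        # First task-shader draw: create the internal compute command buffer.
        if self.hidden_compute is None:
            self.hidden_compute = CmdBuf("internal-compute")
        # The same push constants are emitted on both sides.
        self.hidden_compute.packets.append(("task_dispatch", push_constants))
        self.packets.append(("mesh_dispatch", push_constants))


def submit(cmdbuf):
    """Return (queue, cmdbuf name) pairs in submission order.

    Models the sequential fallback, not a real gang submit.
    Assumption: the compute side is submitted first.
    """
    order = []
    if cmdbuf.hidden_compute is not None:
        order.append(("async_compute_queue", cmdbuf.hidden_compute.name))
    order.append(("graphics_queue", cmdbuf.name))
    return order
```

The application only ever sees the graphics command buffer, which is the API mismatch the post talks about.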

And it's not like developers can't implement task shader functionality on their own with existing APIs already! Task shaders have only one job and that's to generate an indirect draw buffer for consumption by mesh shaders but you don't need the firmware for this. Just using compute shaders alone to generate your indirect command buffer and barriers will get you most of the way there. The firmware is only helpful in the context of the API implementation's memory consumption (bounded memory) and for resolving fine-grained dependencies (minimize barrier use) which is what the hardware's compute/graphics paired queue ring buffer is for ...
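A minimal CPU-side sketch of that two-pass idea, in Python for readability: one function stands in for the culling compute dispatch that writes the indirect arguments, the other for the indirect mesh dispatch that consumes them. The function names, the depth-range test, and the one-workgroup-per-meshlet mapping are all assumptions made for illustration; on a GPU there would be a barrier between the two passes.

```python
def cull_pass(bounds, near, far):
    """Stand-in for a compute dispatch that culls meshlets.

    bounds: list of (zmin, zmax) depth ranges, one per meshlet (hypothetical layout).
    Returns the compacted list of surviving meshlet indices and the
    indirect arguments (group_count_x, group_count_y, group_count_z).
    """
    visible = [i for i, (zmin, zmax) in enumerate(bounds)
               if zmax >= near and zmin <= far]
    indirect_args = (len(visible), 1, 1)
    return visible, indirect_args


def mesh_pass(visible, indirect_args):
    """Stand-in for the indirect mesh dispatch: one workgroup per survivor."""
    group_count_x, _, _ = indirect_args
    return [f"mesh workgroup for meshlet {visible[g]}" for g in range(group_count_x)]


# A barrier would sit here on real hardware: the mesh pass must not start
# until the cull pass has finished writing its arguments.
visible, args = cull_pass([(0.5, 1.0), (5.0, 6.0), (200.0, 300.0)], near=1.0, far=100.0)
work = mesh_pass(visible, args)
```

Note that the whole cull pass finishes before any mesh work starts, which is exactly the coarse dependency the paired-queue ring buffer is meant to avoid.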

Suffice to say, this firmware capability poses some interesting implications with regards to future API design. The ExecuteIndirect API has very similar problems with indeterminate memory consumption and potentially excessive use of barriers between dispatches. Why are graphics ('mesh') nodes in Work Graph designed to start with mesh shaders specifically rather than amplification shaders ? Could it be that both the amplification/mesh pipeline and Work Graphs share the same/similar "hardware path" on AMD and that the latter API was intentionally designed to reuse this path (compute/graphics ring buffer) ?

iroboto · Mar 31, 2025

Lurkmass said:
What do you mean by this ? Open source driver developers actually DID implement task shaders WITHOUT firmware support ...

Now, it's definitely possible that I've read this wrong. So, don't take my writing as something I've completely understood. I read, but my knowledge in this area is extremely limited. But my reading is as follows: in the paragraph just before the one you quoted, Timur discusses how task shaders actually work on AMD HW.

The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:

Compute queue launches up to as many task workgroups as it has space available in the ring buffer.

Graphics queue waits until a task workgroup is finished and can launch mesh shader workgroups immediately. Execution of mesh dispatches from a finished task workgroup can therefore overlap with other task workgroups.

When a mesh dispatch from a task workgroup is finished, its slot in the ring buffer can be reused and a new task workgroup can be launched.

When the ring buffer is full, the compute queue waits until a mesh dispatch is finished, before launching the next task workgroup.
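As a rough illustration of the four steps above, here is a toy Python model of the ring-buffer hand-off. It is a single-threaded sketch with made-up names, so it only shows the ordering constraints (tasks launch while there is space, a slot is freed when its mesh dispatch finishes), not the real parallelism or anything about the actual firmware.

```python
from collections import deque


def simulate_task_mesh(num_task_workgroups, ring_slots):
    """Toy model of the task/mesh ring buffer; returns events in order.

    Hypothetical names throughout. Each event is ("task", id) or ("mesh", id).
    """
    if ring_slots < 1:
        raise ValueError("ring buffer needs at least one slot")
    events = []
    pending = deque(range(num_task_workgroups))  # task workgroups not yet launched
    ring = deque()                               # slots holding task output
    while pending or ring:
        # Compute queue: launch task workgroups while the ring has space.
        while pending and len(ring) < ring_slots:
            wg = pending.popleft()
            ring.append(wg)
            events.append(("task", wg))
        # Graphics queue: run the mesh dispatch for the oldest task
        # workgroup, which frees its slot for reuse.
        wg = ring.popleft()
        events.append(("mesh", wg))
    return events
```

For example, `simulate_task_mesh(3, 2)` launches two task workgroups, then has to wait for the first mesh dispatch to finish before the third task workgroup gets a slot.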

Then he continues on to discuss the difficulty of the implementation:

Side note, getting some implementation details wrong can easily cause a deadlock on the GPU. It is great fun to debug these.

The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it), and that task shaders are executed on an async compute queue and that the driver now has to submit compute and graphics work in parallel.

So my understanding of his writing here is that the API submits the task shaders on the async queue, the driver (which Timur is developing) then submits both compute and graphics work in parallel, and it is left to the firmware to manage the two queues working in parallel as quoted above. The latter is the hard part of the implementation.

He is then explicit in the following:

Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.

So with respect, my perspective is that if there is actual firmware required to do some of this work, if there is something on the firmware side that is allowing these queues to work together without stalling out the GPU, I think there is something there from a hardware perspective that older generations do not have. Otherwise we could have back ported amplification shaders to 5700XT for instance.

Lurkmass said:
Task shaders have only one job and that's to generate an indirect draw buffer for consumption by mesh shaders but you don't need the firmware for this.

Now once again, this could be an incorrect understanding of mine, I'm not here to challenge you on the difference in our understanding of graphics rendering, I clearly know very little compared to yourself, based on how you write. I suspect you work in the mobile space at the very least, or PC indie scene. But Timur writes:

Squeezing a hidden compute pipeline in your graphics
In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:

Create a compute pipeline from the task shader.

Submit the task shader work on the async compute queue while at the same time also submit the mesh and pixel shader work on the graphics queue.

At least from my perspective, combined with the highlights above, without the firmware for amplification shader support, I don't think there is a way for a developer to emulate a task shader & mesh shader combo without explicitly calling an API whose function is for a task shader. I don't disagree that there are other ways to do this however. I'm just saying, we haven't seen it officially leveraged, but IIRC, Remedy indicated that it would be on their next release and that they found amplification shaders to be useful.

It will take some time to move the entire geometry pipeline. I'm not expecting much until the end of the generation.

Slifer · Apr 16, 2025

Charlietus · Apr 16, 2025

Slifer said:

Looking at those framerates, a 5090 would maybe get a stable 30 fps

Dampf · Apr 16, 2025

Nice they actually made it downloadable this time.

Download here: https://dlss.download.nvidia.com/demos/zorah/ZorahSample_UE5_Source_1.0.0.7z

SlmDnk · Apr 16, 2025

Dampf said:
Nice they actually made it downloadable this time.

Download here: https://dlss.download.nvidia.com/demos/zorah/ZorahSample_UE5_Source_1.0.0.7z

Where did you get that link? Is there perhaps a source page for it?

Dampf · Apr 16, 2025

NVIDIA RTX Kit

Render game assets with AI, create game characters with photo-realistic visuals, and more.

developer.nvidia.com

trinibwoy · Apr 17, 2025

The Mega geometry GDC talks were interesting. Mega geometry really needs to be part of the core UE5 offering. It appears to be designed to work seamlessly with Nanite, even using the same cluster topology. There’s a lot of Nvidia code in there though that should probably be owned by Epic.

raytracingfan · Apr 17, 2025

I don't see that happening until Mega Geometry becomes a cross-vendor standard. Too bad DXR 1.3 didn't include it, hopefully the next DXR version does.

Scott_Arm · Apr 17, 2025

raytracingfan said:
I don't see that happening until Mega Geometry becomes a cross-vendor standard. Too bad DXR 1.3 didn't include it, hopefully the next DXR version does.

Nvidia is pushing for it to become part of the standard. We’ll see.

trinibwoy · Apr 17, 2025

raytracingfan said:
I don't see that happening until Mega Geometry becomes a cross-vendor standard. Too bad DXR 1.3 didn't include it, hopefully the next DXR version does.

Not sure why that should be a pre-requisite. Nanite isn’t a cross-vendor standard. It’s just code Epic wrote. Epic can do the same with mega geometry. Basically bring Nanite clustering to RT. It doesn’t require any new hardware features.

Scott_Arm · Apr 17, 2025

trinibwoy said:
Not sure why that should be a pre-requisite. Nanite isn’t a cross-vendor standard. It’s just code Epic wrote. Epic can do the same with mega geometry. Basically bring Nanite clustering to RT. It doesn’t require any new hardware features.

Hmmm, mega geometry would need driver support. It uses NVAPI. It wouldn't work on AMD or Intel without some kind of standard implementation. I'm not sure how Epic would raise the driver-level code to the application layer in UE.

trinibwoy · Apr 17, 2025

Scott_Arm said:
Hmmm, mega geometry would need driver support. It uses NVAPI. It wouldn't work on AMD or Intel without some kind of standard implementation. I'm not sure how Epic would raise the driver-level code to the application layer in UE.

Isn’t NVAPI just the way to access custom Nvidia libraries? It’s not necessarily low level driver stuff.

Scott_Arm · Apr 17, 2025

trinibwoy said:
Isn’t NVAPI just the way to access custom Nvidia libraries? It’s not necessarily low level driver stuff.

NVAPI is a direct interface to the drivers and gpu. I would guess there's a lot of low level stuff going on with mega geometry. A lot of mega geometry is about how memory is allocated and transformed on the gpu for BVH building and refitting. No idea how it couldn't be low level and require driver interfacing.

trinibwoy · Apr 17, 2025

Scott_Arm said:
NVAPI is a direct interface to the drivers and gpu. I would guess there's a lot of low level stuff going on with mega geometry. A lot of mega geometry is about how memory is allocated and transformed on the gpu for BVH building and refitting. No idea how it couldn't be low level and require driver interfacing.

Yeah you’re right. BVH stuff in DXR is a black box so all the partitioning and CLAS stuff has to be in the driver.
