
Indiana Jones runs on MachineGames' MOTOR engine, which is based on id Tech 7.
I wonder how id Tech would have evolved if id Tech 5 and later versions had been licensed to third parties like their predecessors were, or at least if Arkane, Tango Gameworks, and MachineGames had collaborated with id Software on a common version of id Tech that met the needs of all four studios, instead of forking id Tech to make their own engines. The engine would still be focused on first-person shooters, but it would presumably be more versatile and easier to adapt to different games. Would it still have a good reputation for performance? Or would it end up being pushed into use cases it doesn't perform well in, like Capcom's RE Engine, or be burdened with supporting a wide variety of use cases, like Unreal and Unity?
 
Implementation of a shared API feature is by definition driver and platform dependent. That's like... what a shared API is, lol. The linked blog post is mostly about the differences in the AMD and NVIDIA implementations he tested!

(I'm not claiming it's faster -- I don't see why it would be -- but I also don't see why it would be slower in principle, even though it clearly was in the cases the poster tested.)
Your reply insinuated some sort of expectation of a driver limitation, or a change in APU/GPU performance as it might relate to some non-desktop part.

I'm not convinced either of these two items are somehow holding back a level of performance which would decisively help XBOX.
 
I'm not convinced either of these two items are somehow holding back a level of performance which would decisively help XBOX.

It's highly likely that a console GPU, which is free to break compatibility, implements an (at the time) vendor-specific feature differently than a desktop GPU, even when the same company makes both. "Help" is apples to oranges; all GPUs have different fast and slow paths. My issue here is with misunderstanding the context in which the (original, linked) blog post was shared and extrapolating it to something entirely unrelated. I don't care whether mesh shaders are fast on Xbox or not (for now -- we don't use them at work!)
 
But there's definitely some room for skepticism about other parts of the mesh shading API itself, like amplification shaders: it's somewhat doubtful that AMD HW sees any performance advantage when they're implemented as compute shaders under the hood, so there's not a whole lot of "special hardware" going on to accelerate the functionality ...
I received a 404 off your link but I’m assuming it was this one:

My understanding of graphics coding is fairly limited, but the reason I say hardware is required is that a task shader on AMD HW, despite being run on the asynchronous compute queue, runs in parallel with the mesh shader on the 3D queue, and both queues communicate with each other to complete the job. It's just something I don't think you can do normally with API calls.

The compute queue cannot communicate in-flight with what's happening in the 3D queue, and vice versa.
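
For context, a minimal sketch of the only cross-queue communication an application can normally express (this is my own illustration, assuming Vulkan 1.3 synchronization2; the queue and semaphore handles are hypothetical): a semaphore orders whole submissions, never individual in-flight workgroups.

#include <vulkan/vulkan.h>

// Split work across two queues, ordered by a semaphore. The graphics submit
// waits for the *entire* compute submission to reach the signaled stage --
// nothing finer-grained than submission granularity is expressible here.
void submitSplitWork(VkQueue computeQueue, VkQueue graphicsQueue, VkSemaphore sem)
{
    VkSemaphoreSubmitInfo signal{VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO};
    signal.semaphore = sem;
    signal.stageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;

    VkSubmitInfo2 computeSubmit{VK_STRUCTURE_TYPE_SUBMIT_INFO_2};
    computeSubmit.signalSemaphoreInfoCount = 1;
    computeSubmit.pSignalSemaphoreInfos    = &signal;
    // ... pCommandBufferInfos for the compute command buffer goes here ...
    vkQueueSubmit2(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

    VkSemaphoreSubmitInfo wait = signal;   // wait on the same semaphore
    wait.stageMask = VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT;

    VkSubmitInfo2 gfxSubmit{VK_STRUCTURE_TYPE_SUBMIT_INFO_2};
    gfxSubmit.waitSemaphoreInfoCount = 1;
    gfxSubmit.pWaitSemaphoreInfos    = &wait;
    // ... pCommandBufferInfos for the graphics command buffer goes here ...
    vkQueueSubmit2(graphicsQueue, 1, &gfxSubmit, VK_NULL_HANDLE);
}

The per-task-workgroup handoff the firmware does sits below this model entirely; an application can't say anything finer than "this submit waits on that one."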

There’s a to of extremely heavy lifting being done by the firmware to support this, I think even timur indicates it’s very difficult to get this to work driver side due to the level of interaction between the 2 queues. But more importantly to submit this at the same time on both queues “gang submit” which may actually be the requirement here for task shaders otherwise it’s going to be emulated by submitting into the queues one after the other.

I think it’s something that Sony would have to provide to their developers. But I don’t believe a developer could emulate a task
Shader manually running code on async compute with mesh shader on the 3D side.

I agree that we don't know what performance benefits it brings, but I don't agree that it's as simple as a compute shader.
 
There’s a to of extremely heavy lifting being done by the firmware to support this, I think even timur indicates it’s very difficult to get this to work driver side due to the level of interaction between the 2 queues. But more importantly to submit this at the same time on both queues “gang submit” which may actually be the requirement here for task shaders otherwise it’s going to be emulated by submitting into the queues one after the other.

I think it’s something that Sony would have to provide to their developers. But I don’t believe a developer could emulate a task
Shader manually running code on async compute with mesh shader on the 3D side
What do you mean by this? Open source driver developers actually DID implement task shaders WITHOUT firmware support ...
We already had good support for compute pipelines in RADV (as much as the API needs), but internally in the driver we’ve never had this kind of close cooperation between graphics and compute.

When you use a draw call in a command buffer with a pipeline that has a task shader, RADV must create a hidden, internal compute command buffer. This internal compute command buffer contains the task shader dispatch packet, while the graphics command buffer contains the packet that dispatches the mesh shaders. We must also ensure correct synchronization between these two command buffers according to application barriers ― because of the API mismatch it must work as if the internal compute cmdbuf was part of the graphics cmdbuf. We also need to emit the same descriptors and push constants, etc. When the application submits the graphics queue, this new, internal compute command buffer is then submitted to the async compute queue.

Thus far, this sounds pretty logical and easy.

The actual hard work is to make it possible for the driver to submit work to different queues at the same time. RADV’s queue code was written assuming that there is a 1:1 mapping between radv_queue objects and HW queues. To make task shaders work we must now break this assumption.

So, of course I had to do some crazy refactor to enable this. At the time of writing the AMDGPU Linux kernel driver doesn’t support “gang submit” yet, so I use scheduled dependencies instead. This has the drawback of submitting to the two queues sequentially rather than doing everything in the same submit.
And it's not like developers can't already implement task shader functionality on their own with existing APIs! Task shaders have only one job, and that's to generate an indirect draw buffer for consumption by mesh shaders, but you don't need the firmware for this. Just using compute shaders alone to generate your indirect command buffer and barriers will get you most of the way there. The firmware is only helpful in the context of the API implementation's memory consumption (bounded memory) and for resolving fine-grained dependencies (minimizing barrier use), which is what the hardware's compute/graphics paired-queue ring buffer is for ...
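
As a rough sketch of what that emulation looks like (assuming Vulkan with VK_EXT_mesh_shader; the pipeline and buffer names are placeholders of mine, and the begin/end of the rendering scope is omitted), it boils down to dispatch, one coarse barrier, indirect mesh draw:

#include <vulkan/vulkan.h>

// Emulated "task stage": a compute dispatch culls meshlets and writes a
// VkDrawMeshTasksIndirectCommandEXT into argsBuffer; the mesh pipeline then
// consumes it via an indirect draw.
void recordEmulatedTaskStage(VkCommandBuffer cmd,
                             VkPipeline cullPipeline,  // hypothetical compute pipeline
                             VkPipeline meshPipeline,  // hypothetical mesh+pixel pipeline
                             VkBuffer   argsBuffer,    // hypothetical indirect-args buffer
                             uint32_t   meshletGroups)
{
    // Compute pass plays the role of the task shader.
    // (Must be recorded outside a rendering scope.)
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullPipeline);
    vkCmdDispatch(cmd, meshletGroups, 1, 1);

    // One coarse barrier: every compute write must be visible before the
    // indirect draw reads the arguments. This is the synchronization cost the
    // firmware's paired-queue ring buffer avoids with per-workgroup handoff.
    VkMemoryBarrier2 barrier{VK_STRUCTURE_TYPE_MEMORY_BARRIER_2};
    barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT;
    barrier.dstStageMask  = VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT;
    barrier.dstAccessMask = VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT;
    VkDependencyInfo dep{VK_STRUCTURE_TYPE_DEPENDENCY_INFO};
    dep.memoryBarrierCount = 1;
    dep.pMemoryBarriers    = &barrier;
    vkCmdPipelineBarrier2(cmd, &dep);

    // Mesh shaders consume the generated arguments inside a rendering scope.
    // (vkCmdDrawMeshTasksIndirectEXT is typically loaded via vkGetDeviceProcAddr.)
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, meshPipeline);
    vkCmdDrawMeshTasksIndirectEXT(cmd, argsBuffer, 0, /*drawCount*/ 1,
                                  sizeof(VkDrawMeshTasksIndirectCommandEXT));
}

The trade-offs are exactly the two named above: argsBuffer must be sized for the worst case (unbounded memory), and the coarse barrier introduces a pipeline bubble the firmware path doesn't have.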

Suffice it to say, this firmware capability poses some interesting implications with regards to future API design. The ExecuteIndirect API has very similar problems with indeterminate memory consumption and potentially excessive use of barriers between dispatches. Why are graphics ('mesh') nodes in Work Graphs designed to start with mesh shaders specifically rather than amplification shaders? Could it be that both the amplification/mesh pipeline and Work Graphs share the same or a similar "hardware path" on AMD, and that the latter API was intentionally designed to reuse this path (the compute/graphics ring buffer)?
 
What do you mean by this? Open source driver developers actually DID implement task shaders WITHOUT firmware support ...
Now, it's definitely possible that I've read this wrong, so don't take my writing as something I've completely understood; I read it, but my knowledge in this area is extremely limited. My comprehension of his writing is the following:
In the paragraph just before the one you quoted, Timur discusses how task shaders actually work on AMD HW.
The task+mesh dispatch packets are different from a regular compute dispatch. The compute and graphics queue firmwares work together in parallel:

  • Compute queue launches up to as many task workgroups as it has space available in the ring buffer.
  • Graphics queue waits until a task workgroup is finished and can launch mesh shader workgroups immediately. Execution of mesh dispatches from a finished task workgroup can therefore overlap with other task workgroups.
  • When a mesh dispatch from a task workgroup is finished, its slot in the ring buffer can be reused and a new task workgroup can be launched.
  • When the ring buffer is full, the compute queue waits until a mesh dispatch is finished, before launching the next task workgroup.
Then he continues on to discuss the difficulty of the implementation:
Side note, getting some implementation details wrong can easily cause a deadlock on the GPU. It is great fun to debug these.

The relevant details here are that most of the hard work is implemented in the firmware (good news, because that means I don’t have to implement it), and that task shaders are executed on an async compute queue and that the driver now has to submit compute and graphics work in parallel.
So my understanding of his writing here is that the driver (which Timur is developing) submits the task shaders on the async compute queue and the mesh work on the graphics queue in parallel, and leaves it to the firmware to manage the 2 queues working together as quoted above, the latter being the hard part of the implementation.
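
(To make that concrete for myself -- and this is just my plain-C++ analogy of the ring-buffer handshake quoted above, with my own naming, not Timur's -- the two queue firmwares behave roughly like a bounded single-producer/single-consumer ring:)

#include <array>
#include <atomic>
#include <cstdint>

constexpr std::size_t kRingSlots = 8;   // one slot per in-flight task workgroup; bounds memory

struct TaskResult { std::uint32_t meshGroupCount; };   // payload a task workgroup produces

std::array<TaskResult, kRingSlots> ring;
std::atomic<std::uint64_t> produced{0};   // task workgroups finished (compute queue side)
std::atomic<std::uint64_t> consumed{0};   // mesh dispatches finished (graphics queue side)

// "Compute queue": launch the next task workgroup only if a slot is free.
void computeQueueStep(TaskResult result) {
    while (produced.load() - consumed.load() == kRingSlots) { /* ring full: wait */ }
    ring[produced.load() % kRingSlots] = result;
    produced.fetch_add(1);                // publish the finished task workgroup
}

// "Graphics queue": pick up a finished task workgroup and launch its mesh work.
void graphicsQueueStep() {
    while (consumed.load() == produced.load()) { /* nothing finished yet: wait */ }
    TaskResult result = ring[consumed.load() % kRingSlots];
    // launchMeshWorkgroups(result.meshGroupCount);   // overlaps with other task workgroups
    consumed.fetch_add(1);                // slot becomes reusable
}

The point being that neither queue ever fully drains the other; work hands over slot by slot, which is the fine-grained, in-flight cooperation an application can't express through the API.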

He is then explicit in the following:
Keep in mind that the API hides this detail and pretends that the mesh shading pipeline is just another graphics pipeline that the application can submit to a graphics queue. So, once again we have a mismatch between the API programming model and what the HW actually does.

So with respect, my perspective is that if actual firmware is required to do some of this work -- if there is something on the firmware side that allows these queues to work together without stalling out the GPU -- then there is something there from a hardware perspective that older generations do not have. Otherwise we could have backported amplification shaders to the 5700 XT, for instance.
Task shaders have only one job, and that's to generate an indirect draw buffer for consumption by mesh shaders, but you don't need the firmware for this.
Now once again, this could be an incorrect understanding on my part; I'm not here to challenge you on the difference in our understanding of graphics rendering. I clearly know very little compared to you, based on how you write; I suspect you work in the mobile space at the very least, or the PC indie scene. But Timur writes:

Squeezing a hidden compute pipeline in your graphics

In order to use this beautiful scheme provided by the firmware, the driver needs to do two things:

  • Create a compute pipeline from the task shader.
  • Submit the task shader work on the async compute queue while at the same time also submitting the mesh and pixel shader work on the graphics queue.
At least from my perspective, combined with the highlights above, without the firmware for amplification shader support I don't think there is a way for a developer to emulate a task shader & mesh shader combo without explicitly calling an API whose function is for a task shader. I don't disagree that there are other ways to do this, however. I'm just saying we haven't seen it officially leveraged, but IIRC Remedy indicated that it would be in their next release and that they found amplification shaders to be useful.

It will take some time to move the entire geometry pipeline. I'm not expecting much until the end of the generation.
 