Unreal Engine 5, [UE5 Developer Availability 2022-04-05]

I really wonder how they will do it - that would be really impressive if they manage it.

And of course, which hardware are we talking about? Nvidia or Intel with honking great HWRT units, or AMD with their somewhat anaemic HWRT abilities?

My guess would be that they're talking mostly about Nvidia hardware, but it would be nice to see consoles and AMD PC GPUs magic up some stellar bump in performance.
 
Sorry, a noob question: how is it possible that a software solution beats a hardware one?

In terms of software RT versus hardware RT, it comes down to the cost of building the BVH for hardware RT, combined with the geometric complexity of Nanite, which requires using a much coarser geometric representation of the Nanite geometry so that things don't completely bog down when using hardware RT.

So, it's a complex issue for Epic on how to get performant hardware RT without having it completely choke on the geometric complexity that Nanite brings. It's something they've been chipping away at over time. Their first attempt at hardware RT looked like ass, IMO, due to just how coarse the abstract geometry was that they used in order to build the BVH. I.e., the lighting didn't match the Nanite geometry very well because of how simplified it was. That meant they needed a proxy mesh that more closely matched the Nanite geometry, which in turn meant higher computational costs for the BVH.

Regards,
SB
 
I really wonder how they will do it - that would be really impressive if they manage it.

Maybe they’ve finally figured out how to use all those CPU cores :D

Seriously though, they mentioned in the talk posted earlier that BVH management was a primary bottleneck. Likely they’re working on minimizing the number and scope of BVH updates needed each frame. It’s probably still using a proxy mesh. We’re a very long way away from Nanite level geometry in a BVH.
 
Maybe the next Nvidia GPU will have hardware BVH acceleration. I imagine games would have to be designed specifically or be patched to move this off the CPU?
 
Maybe the next Nvidia GPU will have hardware BVH acceleration. I imagine games would have to be designed specifically or be patched to move this off the CPU?

I’m pretty sure Nvidia is doing BVH updates/construction on the GPU. Not hardware accelerated though.
 
Sorry, a noob question: how is it possible that a software solution beats a hardware one?

"Software" here just means it runs on generic parts of the GPU, here the ones that execute most of the job anyway, everything from bloom to lighting to etc. While "hardware" here refers to specially designed parts of the GPU meant to do one task really really fast. The task here is raytracing, using math to take a "ray" through a series of boxes, or "bvh"/bounding volume heirarchy" that are all stacked inside each other until it hits a box with triangles in it. Then that hardware figures out if a ray hits ones of those triangles, and if not it goes back to going through boxes again until it does or the program is told to give up.

The problem for the last 20 years of graphics programming is that it takes a long time for hardware teams to design, build, and ship a specialized piece of hardware that does something really, really fast, while it takes relatively little time for software types to figure out how to tell the generic parts to do that something really fast. So by the time the hardware comes out, there might be a software way of doing it that's so fast it doesn't matter that the hardware team made a really fast piece of hardware, because that hardware is baked around the older, slower way of doing it.
 
Yes I think people get overly fixated and confused by "SW" vs "HW" in the context of Lumen. It's a vast simplification, but they are effectively different data structures and algorithms with different tradeoffs in quality and performance across a variety of situations. DirectX Raytracing-style hardware-accelerated triangle raytracing is often what people are referring to when mentioning "hardware RT", but not always. Even within that realm, the amount that is hardware-accelerated vs. implemented in drivers is quite varied between IHVs and hardware generations. The amount that is in application/user control varies a lot between platforms. Similarly there's a ton of raytracing of various data structures that has been common long before DXR, and will continue to be used alongside triangle raytracing most likely; triangles are simply not appropriate for everything.

All this is to say - don't over-generalize here. Lumen SWRT and HWRT are effectively different algorithms and data structures that target different performance and use case points. Neither is exactly the same as other GI implementations in other games... everything has tradeoffs and ultimately should be evaluated on the end quality and performance it achieves.

I’m pretty sure Nvidia is doing BVH updates/construction on the GPU. Not hardware accelerated though.
Yes AFAIK everyone does the BVH build on the GPU, but currently in the equivalent of compute shaders. Dedicated hardware could potentially help but I think folks are reluctant to bake too much into hardware when there's still a lot of question marks on the API side. On PC the API is opaque so this can change over time and the application/game has no direct control over it. In theory this allows IHVs to evolve the implementations more freely but comes with the downside that the API itself is too high level to work for more complex cases with lots of detailed, dynamic geometry. This will almost certainly need to evolve further for things like full detail Nanite DXR to be feasible in games.
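
For context on just how opaque that API is, here's roughly what the application side of a DXR acceleration-structure build looks like on PC. This is only a sketch: the device, command list, geometry description and buffers are assumed to be created elsewhere, and the wrapper function is made up. The app supplies geometry plus a couple of hint flags, and the driver decides everything about the resulting BVH.

```cpp
#include <d3d12.h>

// Illustrative wrapper; assumes the device, command list, geometry description
// and the scratch/result buffers were created and transitioned elsewhere.
void BuildBlasSketch(ID3D12Device5* device,
                     ID3D12GraphicsCommandList4* cmdList,
                     const D3D12_RAYTRACING_GEOMETRY_DESC& geomDesc,
                     ID3D12Resource* scratch,
                     ID3D12Resource* result)
{
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS inputs = {};
    inputs.Type        = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    // About the only control the app gets: hints like "fast trace" vs "fast build",
    // and whether refitting is allowed (ALLOW_UPDATE). The actual BVH format and
    // build algorithm are entirely up to the driver/IHV.
    inputs.Flags       = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE;
    inputs.DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY;
    inputs.NumDescs    = 1;
    inputs.pGeometryDescs = &geomDesc;

    // Even the memory requirements are opaque; the driver reports how much it wants.
    D3D12_RAYTRACING_ACCELERATION_STRUCTURE_PREBUILD_INFO prebuild = {};
    device->GetRaytracingAccelerationStructurePrebuildInfo(&inputs, &prebuild);

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC build = {};
    build.Inputs = inputs;
    build.ScratchAccelerationStructureData = scratch->GetGPUVirtualAddress();
    build.DestAccelerationStructureData    = result->GetGPUVirtualAddress();

    // The build itself is recorded onto the GPU command list, and typically runs
    // in the equivalent of compute shaders, as discussed above.
    cmdList->BuildRaytracingAccelerationStructure(&build, 0, nullptr);
}
```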
 
"Software" here just means it runs on generic parts of the GPU,
Aside: it's kind of funny that we've eventually gotten to this point. Earlier in my career people were eager to call everything that you ran on the GPU "hardware accelerated" :D Then again when you get down to the details even the simple definitions get cloudy as to what is "hardware" vs "software".
 
Maybe the next Nvidia GPU will have hardware BVH acceleration. I imagine games would have to be designed specifically or be patched to move this off the CPU?
I’m pretty sure Nvidia is doing BVH updates/construction on the GPU. Not hardware accelerated though.
It was my understanding that the RT hardware on Nvidia GPUs is for BVH traversal, with build/update in compute.
 
Sorry, a noob question: how is it possible that a software solution beats a hardware one?

At a high level, it's possible that a software solution can repack the data into a more flexible format than fixed HW requires.
Repacking the data might enable new processing algorithms that are better suited to SW processing than HW processing.

HW is mostly fixed in terms of what data can go in, albeit with support for lots of predetermined formats.
Software can continuously evolve and take advantage of new discoveries made since the HW was implemented.

A simple example: an algorithm that is thought to be single-thread bound is 4x faster on dedicated HW,
and then someone finds a way to thread the same algorithm with effective gains up to 20 threads.
Voila! Now your software solution can scale to that many threads, and suddenly SW is faster than HW.

(Yes, yes, I know I'm oversimplifying a lot here, but it's the principle, not a concrete example.)
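
A toy version of that arithmetic, with made-up numbers just to show where the crossover happens:

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers: dedicated HW runs the old single-threaded
    // algorithm 4x faster than one core...
    const double hwSpeedup = 4.0;

    // ...but a rethought software version of the same algorithm scales
    // to 20 threads at, say, 90% parallel efficiency.
    const int    threads    = 20;
    const double efficiency = 0.9;
    const double swSpeedup  = threads * efficiency;   // 18x over one core

    std::printf("HW: %.1fx, threaded SW: %.1fx -> %s wins\n",
                hwSpeedup, swSpeedup, swSpeedup > hwSpeedup ? "SW" : "HW");
}
```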
 
Sorry, a noob question: how is it possible that a software solution beats a hardware one?
The software one is doing less work for a lower quality result.
If it was just that, they should be able to reduce (or increase in other cases) HW quality so performance matches that of SW Lumen.

In terms of software RT versus hardware RT, it comes down to the cost of building the BVH for hardware RT, combined with the geometric complexity of Nanite, which requires using a much coarser geometric representation of the Nanite geometry so that things don't completely bog down when using hardware RT.

So, it's a complex issue for Epic on how to get performant hardware RT without having it completely choke on the geometric complexity that Nanite brings. It's something they've been chipping away at over time. Their first attempt at hardware RT looked like ass, IMO, due to just how coarse the abstract geometry was that they used in order to build the BVH.
That doesn't make perfect sense either. You say that low-poly proxies give worse results than low-resolution SDF volumes, but I assume the low-poly proxies are a better approximation than the SDF volumes. SDF volumes can't match visual Nanite detail either, without doubt, because of memory limits.
In general, to increase resolution by a factor of 2, volume data needs 8 times the data, while surface data like triangles + BVH only needs 4 times the data. From this we can conclude volume data becomes impractical much more quickly as detail increases, so it's almost impossible that SDF gives them higher quality.
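
A quick back-of-the-envelope version of that scaling (illustrative relative costs only, ignoring constants like bytes per voxel or per triangle):

```cpp
#include <cstdio>

int main() {
    // Doubling the linear resolution k times:
    // a volume (SDF) grows with res^3, a surface (triangles + BVH) roughly with res^2.
    for (int k = 0; k <= 4; ++k) {
        const unsigned res         = 1u << k;         // 1x, 2x, 4x, 8x, 16x linear detail
        const unsigned volumeCost  = res * res * res; // x8 per doubling
        const unsigned surfaceCost = res * res;       // x4 per doubling
        std::printf("detail %2ux -> volume %5ux, surface %4ux\n",
                    res, volumeCost, surfaceCost);
    }
}
```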

I don't know the true reasons either. When the question first came up back then, there was the argument that HW RT has a problem with kit-bashed, intersecting models, like those many layers of rocks used to model the caves in the first UE5 demo. Basically, a ray has to traverse many bottom-level BVH structures, which hurts performance.
And I had assumed SDF has an advantage here, because we could use the distance at ray entry to trace only the closest model while quickly skipping all the others.
But this was wrong. My assumption only holds for the case of a closest-point query, like we would use for physics collisions, for example.
For a ray intersection test, having the distance at ray entry does not help at all. We can't skip other SDF models just because the distance there is larger; they might still have a closer intersection than the model with the shortest distance at entry.
So I was wrong, and as with HWRT, we need to process all models whose bounds the ray intersects. There is no SDF advantage here either.
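
For readers who haven't seen SDF ray marching before, here's a bare-bones sphere-tracing loop against a single analytic SDF, just to show what "stepping by the distance value" means. Lumen samples its mesh and global distance fields from volume textures rather than evaluating them analytically like this, so treat it purely as an illustration.

```cpp
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

static Vec3  add(Vec3 a, Vec3 b)   { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3  mul(Vec3 a, float s)  { return { a.x * s, a.y * s, a.z * s }; }
static float length(Vec3 v)        { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Signed distance to a sphere of given radius at the origin:
// negative inside, positive outside, zero on the surface.
static float sdfSphere(Vec3 p, float radius) { return length(p) - radius; }

// Sphere tracing: step along the ray by the current distance value.
// The distance tells us how far we can safely move without crossing a surface.
static bool sphereTrace(Vec3 origin, Vec3 dir, float tMax, float& tHit) {
    float t = 0.0f;
    for (int i = 0; i < 128 && t < tMax; ++i) {
        const float d = sdfSphere(add(origin, mul(dir, t)), 1.0f);
        if (d < 1e-3f) { tHit = t; return true; }  // close enough: call it a hit
        t += d;                                    // safe step forward
    }
    return false;                                  // gave up or left the scene
}

int main() {
    float tHit;
    if (sphereTrace({0, 0, -3}, {0, 0, 1}, 10.0f, tHit))
        std::printf("hit at t = %.3f\n", tHit);    // ~2.0 for a unit sphere at the origin
}
```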

But there is one case left where an SDF advantage may indeed apply:
In the distance, afaict, UE5 calculates a single global and static SDF from all models. So instead of 1,000 rock models we only have one. No more overlapping of multiple models, and the resolution is low, so SDF tracing will be fast.
But there is probably no such globally merged low-res model for HWRT, removing all the hidden geometry. So a HW ray may still need to traverse multiple BLASs along its way, possibly at a level of detail that is higher than needed.

Maybe the next Nvidia GPU will have hardware BVH acceleration. I imagine games would have to be designed specifically or be patched to move this off the CPU?

We know HWRT has a big CPU cost too, but I don't know why. I speculate that maybe the TLAS is built on the CPU, possibly at higher quality than the BLASs, which are surely built on the GPU.
But that's just guessing. In any case, the API abstraction means IHVs can do what they want, so there would be no need to patch games when that changes.

With BVH acceleration you likely mean fixed-function HW units specifically for building the BVH, like ImgTech already had long before.
But notice this would not solve the problem we have with Nanite. If only one patch on the model changes detail, a HW builder would still need to rebuild the entire BVH for the whole model from scratch.
HW acceleration would maybe be faster than what we have now, but still too inefficient to be practical for LOD.

That's why I hope we'll never see a HW BVH builder. What we really need is the flexibility to change the BVH locally, to reflect those local changes of detail.
There is no way around that. A HW BVH builder would just be a short-sighted waste of time and chip area.
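
To make the rebuild-vs-local-update distinction concrete, here's a toy refit pass over the simple binary BVH sketched earlier in the thread: when geometry merely deforms, you can regrow the boxes bottom-up without touching the tree topology. This is a generic illustration, not how UE5 or any driver does it, and note that a refit alone still can't handle Nanite-style changes where triangles appear or disappear, which is exactly the kind of local restructuring being asked for.

```cpp
#include <vector>
#include <algorithm>

struct Vec3 { float x, y, z; };
struct Box  { Vec3 min, max; };

struct Node {
    Box bounds;
    int left = -1, right = -1;   // -1/-1 marks a leaf
};

Box merge(const Box& a, const Box& b) {
    return { { std::min(a.min.x, b.min.x), std::min(a.min.y, b.min.y), std::min(a.min.z, b.min.z) },
             { std::max(a.max.x, b.max.x), std::max(a.max.y, b.max.y), std::max(a.max.z, b.max.z) } };
}

// Refit pass: assumes leaf bounds were already updated from the new vertex
// positions; this only regrows every inner node's box bottom-up. No sorting,
// no splitting, no topology changes, so it's far cheaper than a from-scratch
// build, but it can't add or remove geometry, and tree quality degrades if
// things move around a lot.
Box refit(std::vector<Node>& nodes, int idx = 0) {
    Node& n = nodes[idx];
    if (n.left >= 0)
        n.bounds = merge(refit(nodes, n.left), refit(nodes, n.right));
    return n.bounds;   // leaf bounds are returned as-is
}
```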
 
If it was just that, they should be able to reduce (or increase in other cases) HW quality so performance matches that of SW Lumen.
They're different algorithms. The SW approach uses coarser representation to do less work and get faster results at lower quality and detail. If you want high quality and detail, you need a different algorithm that matches the real-world geometry a lot closer, requiring a lot more work so it runs a lot slower, even with some of it accelerated by optimised HW units.
 
Aside: it's kind of funny that we've eventually gotten to this point. Earlier in my career people were eager to call everything that you ran on the GPU "hardware accelerated" :D Then again when you get down to the details even the simple definitions get cloudy as to what is "hardware" vs "software".

I may be misremembering but I don’t recall the tech press referring to programmable shading as “hardware accelerated” even in the early DX9 days. Hardware acceleration has usually referred to fixed function non-programmable stuff.
 
They're different algorithms. The SW approach uses coarser representation to do less work and get faster results at lower quality and detail. If you want high quality and detail, you need a different algorithm that matches the real-world geometry a lot closer, requiring a lot more work so it runs a lot slower, even with some of it accelerated by optimised HW units.
Well, maybe. But you would need to point out those differences in algorithms in detail. Currently it sounds more like your personal conclusions and assumptions.
Regardless, my point is that if the goal is to make HW as fast as SW, it should be possible to get there easily by decreasing the detail of the slower method.
Remembering early SW vs. HW comparisons, HW was not better, and it did not look like SW was using a coarser / worse representation of geometry.
The only real difference I know of is that HW handles dynamic objects correctly, while SW ignores them, causing uncanny artifacts.
I guess things like the textured convex hull approach for the surface cache, or the screen-space probes they use for final gather, do not differ much when HWRT is enabled.
The differences I expect are thus the need to build the BVH, including updating dynamic / skinned objects, and finally HWRT vs. SDF sphere tracing.
Like many, I would assume HWRT should be faster in almost any case, at least on NV HW. To me, the question of why it apparently isn't is still open.
Maybe it's an impression based mostly on 'bad' console HWRT performance? But likely the difference between the two approaches is just larger than I think. The UE docs don't really go into details, though.

I may be misremembering but I don’t recall the tech press referring to programmable shading as “hardware accelerated” even in the early DX9 days. Hardware acceleration has usually referred to fixed function non-programmable stuff.
That's how it should have been. But I remember it the other way as well. The term 'hardware acceleration' became totally undefined and useless at that time, and it still is if we don't give context.
 
Well, maybe. But you would need to point out those differences in algorithms in detail. Currently it sounds more like your personal conclusions and assumptions.

;)
 
In the distance, afaict, UE5 calculates a single global and static SDF from all models. So instead of 1,000 rock models we only have one. No more overlapping of multiple models, and the resolution is low, so SDF tracing will be fast.
But there is probably no such globally merged low-res model for HWRT, removing all the hidden geometry. So a HW ray may still need to traverse multiple BLASs along its way, possibly at a level of detail that is higher than needed.
The global SDF is indeed used for a lot of the ray query. From the Lumen docs:
The renderer merges Mesh Distance Fields into a Global Distance Field to accelerate tracing. By default, Lumen traces against each mesh's distance field for the first two meters for accuracy, and the merged Global Distance Field for the rest of each ray.

Projects with extreme overlapping instances can control the method Lumen uses with the project setting Software Ray Tracing Mode. Lumen provides two options to choose from:
  • Detail Tracing is the default method and involves tracing against the individual mesh's signed distance field for the highest quality. The first two meters are used for accuracy and the Global Distance Field for the rest of each ray.
  • Global Tracing only traces against the Global Distance Field for each ray for the fastest traces.

We know HWRT has a big CPU cost too, but I don't know why.
There's a lot of other stuff you have to do for RT than just build the acceleration structures. It will vary a lot depending on the engine/content, but a big chunk is dealing with shaders and the shader binding table (which is effectively the bindings for all materials in the scene). It is fairly common for games to rebuild large chunks of the SBT (or the entire thing) every frame, which is generally required for engines that allocate descriptors dynamically each frame. While this is not exactly ideal, most games tend to do this because this is how things worked on previous APIs, and it is a very disruptive change to make it all more retained-mode style. Even if you had more static descriptors (via bindless or otherwise), there is still a fair amount of unavoidable overhead allocating memory, setting up all the data that RT may need to query every frame, and shuffling it off to the GPU. And if you thought dealing with PSOs was bad, there's an entire additional layer of mess with RT PSOs.
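
To give a rough idea of what rebuilding the SBT involves in DXR terms, here's a hedged sketch of filling a hit-group shader table on the CPU. MaterialRecord, the hit-group names and the buffer handling are all made up for illustration, but the record layout (a 32-byte shader identifier followed by local root arguments, aligned to 32 bytes) is the real DXR requirement. Running a loop like this over every material in the scene each frame is one place that CPU time goes.

```cpp
#include <d3d12.h>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical per-material data the hit shaders read via local root arguments.
struct MaterialRecord {
    D3D12_GPU_VIRTUAL_ADDRESS constants;    // e.g. a CBV with material parameters
    UINT                      textureIndex;
};

// Each SBT record = 32-byte shader identifier + local root arguments,
// padded to the 32-byte record alignment required by DXR.
constexpr size_t kRecordSize =
    (D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES + sizeof(MaterialRecord) +
     D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT - 1) &
    ~size_t(D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT - 1);

// Rewrites the whole hit-group table for this frame. 'dest' is assumed to be a
// mapped upload buffer and 'props' the RT pipeline's ID3D12StateObjectProperties;
// identifiers are assumed valid for the sketch.
void FillHitGroupTable(ID3D12StateObjectProperties* props,
                       const std::vector<std::wstring>& hitGroupNames, // one per material
                       const std::vector<MaterialRecord>& materials,
                       std::uint8_t* dest)
{
    for (size_t i = 0; i < materials.size(); ++i) {
        std::uint8_t* record = dest + i * kRecordSize;
        // Look up the shader identifier for this material's hit group...
        const void* id = props->GetShaderIdentifier(hitGroupNames[i].c_str());
        std::memcpy(record, id, D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES);
        // ...then append its local root arguments right after it.
        std::memcpy(record + D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES,
                    &materials[i], sizeof(MaterialRecord));
    }
}
```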
 
On another topic:

To this day I have not seen a UE5 game that looks truly awe-inspiring or ground-breaking. Great art direction has been in serious recession for a long time now, which is very unfortunate considering the canvas and tools developers have at their disposal.
 
On another topic:

To this day I have not seen a UE5 game that looks truly awe-inspiring or ground-breaking. Great art direction has been in serious recession for a long time now, which is very unfortunate considering the canvas and tools developers have at their disposal.

This is the only UE5 game that's impressed me on occasion.

[Attached screenshot: Sycamore 29_08_2023 22_02_25.png]
 