Unreal Engine 5, [UE5 Developer Availability 2022-04-05]

Just don't use the current APIs. It's a shame to leave the traversal hardware unused (though it's probably programmable, but the odds of that being exposed is nil) but Nanite leaves the vertex shader hardware unused too.
 
Do we have a strong reason to think that the performance of RT without fixed function traversal hardware is actually good enough though? I mean that's basically what you have on AMD and we all know how that compares currently...

I don't think that's really a necessary path for Nanite though. The number of triangles that are actually needed in a given frame is well within stuff you can traverse reasonably - it's just we need to be able to change (and associated BVHs) frame to frame more efficiently. I don't think we really need to do that "just in time" during traversal though... even Nanite raster-based streaming is a feedback loop of course.
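
To make the feedback-loop point concrete, here's a minimal C++ sketch of the shape I mean (all names are made up, nothing from UE5, and the actual async build is elided): last frame's visibility feedback decides which clusters get BVH data streamed in or (re)built.

```cpp
// Minimal sketch of a visibility-feedback loop driving per-cluster BVH
// residency. Purely illustrative; all names here are hypothetical.
#include <cstdint>
#include <unordered_set>
#include <vector>

using ClusterId = uint32_t;

struct FrameFeedback {
    // Clusters the previous frame actually touched (rasterized or traced).
    std::vector<ClusterId> visibleClusters;
};

struct ClusterBvhCache {
    std::unordered_set<ClusterId> resident;

    void update(const FrameFeedback& feedback) {
        // Stream in / (re)build BVH data only for newly visible clusters.
        for (ClusterId c : feedback.visibleClusters) {
            if (resident.insert(c).second) {
                // requestClusterBvhBuild(c);  // hypothetical async build/refit
            }
        }
        // Eviction of long-unseen clusters (LRU, memory budget) would go here.
    }
};
```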
 
Skeletal meshes with Nanite give a 2x to 4x increase in fps compared to skeletal meshes without Nanite, in this particular demo.
Froblins on steroids, at last. The promise of an engine running mostly in compute, finally delivered.

PS. you're late Sweeney.
 
Do we have a strong reason to think that the performance of RT without fixed function traversal hardware is actually good enough though? I mean that's basically what you have on AMD and we all know how that compares currently...
HVVR. Can't really compare tracing rays which mostly stay nicely in a tight frustum for a ray packet, with all the secondary rays being used for GI/AO. Primary/shadow-rays are special, but they aren't what RTX is designed for.
 
HVVR. Can't really compare tracing rays which mostly stay nicely in a tight frustum for a ray packet, with all the secondary rays being used for GI/AO.
I haven't looked at the performance of HVVR specifically, but as we know from GPUs, dropping an additional decompression/unpacking/tracing kernel into the middle of another one like this hardly guarantees that performance won't drop by a fair bit. The issues with things like hit shaders are not entirely about transitions between HW and SW traversal; they are just as much about fundamental GPU issues with worst-case resource allocation and ubershaders, which a software approach like this would hit as well.

Primary/shadow-rays are special, but they aren't what RTX is designed for.
Forgive the double negative, but they are also not *not* what it's designed for. Coherent ray intersections still see large performance benefits even on NVIDIA hardware, due to the many related memory, compression, cache-footprint and shader-execution-coherency improvements that come along for the ride.

We're a bit afield from what I think matters though. Current RT hardware is already fast enough to trace primary/shadow rays from "visible" nanite-level geometry, it just needs to be able to build the acceleration structures faster. That problem also doesn't really go away or get any easier by doing it in the traversal loop. To be clear, it might still be a useful path to support for other cases, but I think it would be more about memory footprint for incoherent random ray casts into offscreen geometry than performance of primary rays. Primary rays are coherent both spatially and temporally - I don't see why you wouldn't just expand those into a structure that is efficient to HW trace as you stream them.
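
Roughly what I have in mind, as a toy sketch (my own illustration, not Epic's code; the actual BLAS build call is elided): flatten the clusters selected by the current cut into a plain vertex/index buffer each frame and feed that to whatever acceleration-structure build the platform provides.

```cpp
// Toy sketch: expand the visible Nanite-style cluster cut into flat
// geometry buffers suitable for a hardware BVH build. All types here are
// hypothetical stand-ins.
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z; };

struct Cluster {
    std::vector<Vertex>   vertices;
    std::vector<uint32_t> indices;   // triangle list, 3 indices per triangle
};

// Append one cluster's triangles into the shared build input, rebasing indices.
void appendCluster(const Cluster& c,
                   std::vector<Vertex>& outVertices,
                   std::vector<uint32_t>& outIndices)
{
    const uint32_t base = static_cast<uint32_t>(outVertices.size());
    outVertices.insert(outVertices.end(), c.vertices.begin(), c.vertices.end());
    for (uint32_t idx : c.indices)
        outIndices.push_back(base + idx);
}

// Per frame: flatten the clusters chosen by the LOD cut; the merged buffers
// would then feed the platform's BLAS build (not shown).
std::vector<uint32_t> buildFrameGeometry(const std::vector<Cluster>& selectedCut,
                                         std::vector<Vertex>& outVertices)
{
    std::vector<uint32_t> outIndices;
    for (const Cluster& c : selectedCut)
        appendCluster(c, outVertices, outIndices);
    return outIndices;
}
```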
 
Do we have a strong reason to think that the performance of RT without fixed function traversal hardware is actually good enough though? I mean that's basically what you have on AMD and we all know how that compares currently...

I don't think that's really a necessary path for Nanite though. The number of triangles that are actually needed in a given frame is well within stuff you can traverse reasonably - it's just we need to be able to change (and associated BVHs) frame to frame more efficiently. I don't think we really need to do that "just in time" during traversal though... even Nanite raster-based streaming is a feedback loop of course.

With relatively unpredictable and incoherent memory access, RT seems likely to become heavily latency bound as soon as you start tracing incoherent rays. Fixed-function units that increase compute throughput per area would seem to be of limited benefit versus a programmable pipeline that would be a bit bigger in die area, when it's the latency/cache structure you have to worry about fairly quickly.

While it's easy to say "what if fixed-function cache..." it's just as easy to respond "why can't programmers have some control over cache behavior?" If there's compression/decompression, why can't the compression be a bit more flexible and allow more data types to be compressed/decompressed on the fly? Delta texture compression is already used everywhere in modern GPU hardware, and it doesn't mess with programmability any.

And a programmable RT pipeline offers fixes that concentrating on hardware box BVHs wouldn't. Go from boxes to spheres and all you have to do is move sphere centers around: much faster refits and rebuilds. Or what if you could move to splats instead of triangles? Now your geometry testing is faster, and you can have simpler acceleration structures altogether because you can afford to brute-force test more geometry. There are tons of possibilities you can dream up, rather than being limited to whatever the hardware guys already gave you.
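
For example, a toy sphere-BVH refit looks roughly like this (purely illustrative, not something any vendor ships): once the primitives are spheres, a refit is just a bottom-up pass recomputing parent bounds after the centers move.

```cpp
// Toy sphere-BVH refit. Illustrative only; not a traversal-quality BVH.
#include <cmath>
#include <vector>

struct Sphere { float x, y, z, r; };

struct Node {
    Sphere bound{};
    int left  = -1;   // child node indices; -1 on leaves
    int right = -1;
    int prim  = -1;   // index into the primitive array; >= 0 marks a leaf
};

// Smallest sphere containing both a and b.
Sphere merge(const Sphere& a, const Sphere& b) {
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    const float d  = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (d + b.r <= a.r) return a;             // b already inside a
    if (d + a.r <= b.r) return b;             // a already inside b
    const float nr = 0.5f * (d + a.r + b.r);  // enclosing radius
    const float t  = (nr - a.r) / d;          // slide center from a toward b
    return { a.x + dx * t, a.y + dy * t, a.z + dz * t, nr };
}

// Bottom-up refit. Assumes children are stored at higher indices than their
// parents, so a reverse sweep visits children before parents.
void refit(std::vector<Node>& nodes, const std::vector<Sphere>& prims) {
    for (int i = static_cast<int>(nodes.size()) - 1; i >= 0; --i) {
        Node& n = nodes[i];
        n.bound = (n.prim >= 0) ? prims[n.prim]
                                : merge(nodes[n.left].bound, nodes[n.right].bound);
    }
}
```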
 
With relatively unpredictable and incoherent memory access, RT seems likely to become heavily latency bound fairly easily, whatever you do. Fixed-function units that increase compute throughput per area would seem to be of limited benefit versus a programmable pipeline that would be a bit bigger in die area, when it's the latency/cache structure you have to worry about fairly quickly.
(Up front do note that the discussion so far has centered on primary rays, so some of my comments have been specifically around that.)

Agreed, compute throughput isn't really the issue here for the incoherent rays (although of course with a real RT pipeline it's all over the place depending on the actual data and rays, which is part of the optimization complexity). That said, while it's clear that some things about the current black box are suboptimal, it's also not clear to me that some sort of tightly coupled software traversal is really much better.

Consider that we already have several proof points and counterpoints:
- RT pipelines exhibit the expected uber-shader issues around occupancy and payload sizes
- Hit shaders run relatively poorly on all hardware even with ray sorting
- AMD's current software traversal is not very good, and while there are certainly some advantages to the flexibility, most of them manifest in terms of not having opaque TLAS/BLAS formats (and thus being able to implement some finer-grained precomputation and streaming algorithms than on PC) rather than any particular traversal cleverness that I've seen.
- "Inline" RT/tail recursion/etc. seems to work pretty well in practice, giving the flexibility to do rescheduling or data-aware programmable stuff between large batches of rays, without trying to inject shader code into performance-sensitive inner traversal loops (rough sketch after this list). Of course IMG/PowerVR would have told us this a decade ago...
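
To be concrete about that last point, here's the kind of batched shape I mean as a host-side C++ caricature (traceInline is a stub; none of these names are a real API): traversal stays opaque and uninterrupted per batch, and the programmable bits (sorting, scheduling) happen between batches.

```cpp
// Caricature of the batched / "inline" pattern: opaque traversal over a
// batch, programmable work between batches. All names are placeholders.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };
struct Hit { uint32_t materialId; float t; };

// Placeholder for a RayQuery-style inline trace; always reports a miss here.
Hit traceInline(const Ray&) { return { 0u, -1.0f }; }

void traceAndShadeBatch(const std::vector<Ray>& rays) {
    // 1. Opaque, uninterrupted traversal over the whole batch: no shader
    //    code injected into the inner loop.
    std::vector<Hit> hits(rays.size());
    for (size_t i = 0; i < rays.size(); ++i)
        hits[i] = traceInline(rays[i]);

    // 2. Programmable work between batches: sort by material so the
    //    following shading pass runs coherently (the "ray sorting" idea).
    std::vector<uint32_t> order(hits.size());
    for (uint32_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return hits[a].materialId < hits[b].materialId;
    });

    // 3. Shade in sorted order (omitted).
    // for (uint32_t i : order) shade(rays[i], hits[i]);
}
```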

I think a reasonable analog here would be anisotropic filtering, which despite several attempts over the years has still survived in hardware form. It is also not particularly compute intensive by modern standards and has very coherent access patterns. The key though is that it is a highly data-dependent, variable-latency operation (sounds a lot like tracing a ray?) and thus funneling requests through a (relatively) long queue to shield the caller from the divergence of execution is still a pretty significant win. This of course has a cost in terms of latency hiding, which falls on one of the most resource-constrained parts of the GPU: the register file. For RT, which has even longer latencies, it seems like most folks are trending towards the conclusion that the cost of keeping shader levels of live state around is not worth the flexibility benefits, but I don't think that question is entirely settled yet.

And a programmable RT pipeline offers fixes that concentrating on hardware box BVHs wouldn't. Just go from boxes to spheres and all you have to do is move sphere centers around, much faster refits and rebuilds! Or what if you could move to splats instead of triangles? Now your geometry testing is faster, and you can have simpler acceleration structures altogether because you can afford to brute-force test more geometry.
It's certainly not something to rule out, but there's also just not that many different types of primitives that are commonly used in these acceleration structures. I do expect there to remain some level of tradeoff between structure update costs and ray traversal costs, but it's not clear to me that it's something that needs to be entirely in user space, especially as we move into a world where "general compute" scaling is rapidly slowing.

So it's definitely an interesting discussion and it could go in a few directions, but I do think the critical questions are really all around acceleration structure update, not tracing. If we need to make some sacrifices on the tracing front to make the acceleration structure stuff faster to build and maintain, that's the only place where I think that discussion really matters a lot. The tracing part seems to be mostly understood already, at least for rigid, opaque geometry.
 
I haven't looked at the performance of HVVR specifically...
Skimmed the paper briefly and while the results are interesting in the context in which they are presented (i.e. very high resolution mobile VR with lens warp), I'm not sure they are as interesting in terms of high-end rendering.

The presented case in which HVVR does very well vs other (mostly offline...) raytracers is using very high resolutions/sample counts (2160x1200 with 32x (!!) MSAA). They aren't super clear in the paper but I presume in the competitive RT cases they are actually ray casting all 32x samples individually. Cool, but then their test scenes have a maximum of 350k triangles total... so this is an exercise in massively oversampling the geometry and acceleration structures.

I'm not saying that result is useless or anything and it's interesting work, but I don't think it's very relevant to the high end rendering cases, especially for content like Nanite where the idea is to have at least as many visible triangles as pixels (with some multiple required to handle streaming latency, depth complexity and edge length approximations). I imagine if these systems were compared with ~2 million rays and ~20 million triangles the results would be quite different.
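
Back-of-the-envelope with the numbers above: 2160 x 1200 x 32 is roughly 83 million samples against ~350k triangles, i.e. on the order of 230+ samples per triangle, whereas ~2 million primary rays against ~20 million visible triangles is about 0.1 rays per triangle. The sampling-to-geometry ratio flips by more than three orders of magnitude between the two scenarios.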
 
Looks like Ark: Survival Ascended added hardware Lumen? In the new version at 4K, the 2080 Ti is faster than the 6900 XT, and the 3080 Ti is much faster than the 6900 XT.

It's also on Game Pass now if people want to check it out without buying the game again at full price.
 
There's no DLSS on the Game Pass version for some reason. I'd imagine HW Lumen is missing too until they patch it.
Why the hell does this keep happening? Palworld had the same problem, and it's happened with some other games too; when it isn't DLSS that's missing, it's RT features, like in The Ascent.
 
The PC Game Pass version is always a few patches behind Steam and other stores because of the certification process, which they don't have to go through with Steam etc.
That can be the excuse for day-one releases, but this has been out on Steam for a while and had DLSS reconstruction and frame gen from day one on Steam. So for this to be a valid excuse here, they would have had to dig out a pre-Steam-release build from months ago to submit for cert?

edit: did a quick check, it released on Steam on 23rd Oct 2023, so uh, yeah, that kinda makes it even more questionable.
 
The PC Game Pass version is always a few patches behind Steam and other stores because of the certification process, which they don't have to go through with Steam etc.

Is this also true for MS developed games?

That can be the excuse for day-one releases, but this has been out on Steam for a while and had DLSS reconstruction and frame gen from day one on Steam. So for this to be a valid excuse here, they would have had to dig out a pre-Steam-release build from months ago to submit for cert?

edit: did a quick check, it released on Steam on 23rd Oct 2023, so uh, yeah, that kinda makes it even more questionable.

That might be the case for this specific game, but I don't think that's always the reason. Some games have delayed patches on other storefronts like EGS and GOG as well, and I believe some games have been abandoned on other platforms entirely, never getting the patches (or even DLC) that the Steam version did. This is kind of a separate issue, but given the state of PC gaming, Steam is likely the priority for most developers.
 
Verse seems really interesting, but I think it's only in UEFN (Unreal Editor for Fortnite) and not in standard UE5. Not sure why.

 

Tencent Games' Lightspeed Studios published results on their customized implementation of Nanite for mobile graphics hardware. Their custom Nanite renderer sounds like it only offers a hardware rasterization path (no visibility buffer / base pass merged with the rasterizer pass), and they only managed 33ms+ frame times on a high-end mobile SoC like the Snapdragon 8 Gen 2 in the Amazon Bistro scene?
 