
When was the last time the fixed-function rasterizer blocks saw any notable improvements?

There have been a few improvements over the years aside from the massive increase in throughput - increased granularity (currently 16 pixels), conservative raster, VRS etc.

Nanite doesn’t use the HW rasterizer, which was my point. I don’t get the point of your question though.
 
There have been a few improvements over the years aside from the massive increase in throughput - increased granularity (currently 16 pixels), conservative raster, VRS etc.

Nanite doesn’t use the HW rasterizer, which was my point. I don’t get the point of your question though.
I was genuinely curious because it seems like not much has been done to improve that aspect.
 
Sure there are alternatives! Throw a high-enough-order basis function at virtual point lights and you've got your alternative, along with a full material representation that follows any BRDF and covers the entire roughness range.
That's basically what I'm doing, but in practice your basis function always lacks support for sharp reflections and hard shadows. I see two options:
1. The basis function has a uniform (but low-res) angular representation of the environment, so a patch of surface can use it to illuminate its whole range of BRDFs. Practical, because we can share the basis data and also share work while generating it. But no high frequencies.
2. The basis function has an importance-sampled environment, fitted to a single pixel's BRDF. Inefficient, because there is no sharing of work or data. We get high frequencies, but it's too slow. Also, we don't really need a basis function if we can't share results with nearby surfaces whose BRDFs differ.
Now we could discuss doing something in between, which I guess is what you have in mind. But in my experience the performance sacrifice is much too big, and handling high frequencies with a second technique (RT) is better, also because we can then make it adaptive to the perf budget.
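To make option 1 concrete, here is a minimal sketch (names and structure are mine, not from any engine) of projecting a patch's incident radiance into a 2-band spherical harmonics basis and reconstructing it for shading; the band limit is exactly why sharp reflections and hard shadows get lost.

```cpp
// Minimal sketch of option 1: project incident radiance into a 2-band
// (4 coefficient) spherical harmonics basis per surface patch, then
// reconstruct it for shading. Anything sharper than the band limit
// (hard shadows, mirror-like reflections) cannot be represented.
#include <array>
#include <vector>

struct Vec3 { float x, y, z; };

// Real SH basis, bands 0 and 1, for a unit direction d.
static std::array<float, 4> shBasis(const Vec3& d) {
    return { 0.282095f,            // Y00
             0.488603f * d.y,      // Y1-1
             0.488603f * d.z,      // Y10
             0.488603f * d.x };    // Y11
}

struct RadianceSample { Vec3 dir; float radiance; };

// Project N environment samples of a patch into 4 SH coefficients.
std::array<float, 4> projectRadiance(const std::vector<RadianceSample>& samples) {
    std::array<float, 4> c = { 0, 0, 0, 0 };
    const float weight = 4.0f * 3.14159265f / float(samples.size()); // uniform sphere samples
    for (const RadianceSample& s : samples) {
        const std::array<float, 4> y = shBasis(s.dir);
        for (int i = 0; i < 4; ++i) c[i] += s.radiance * y[i] * weight;
    }
    return c;
}

// Reconstruct radiance arriving from direction d (band-limited, very blurry).
float evalRadiance(const std::array<float, 4>& c, const Vec3& d) {
    const std::array<float, 4> y = shBasis(d);
    float L = 0.0f;
    for (int i = 0; i < 4; ++i) L += c[i] * y[i];
    return L;
}
```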

And how would you generate those basis functions in real time if not with raytracing? Rasterization gives uniform sampling by definition - it can't provide the variable sample density we would need for importance sampling.
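For contrast, a hedged sketch of what importance sampling asks for: directions drawn with density proportional to the integrand (here a simple cosine lobe), which is exactly the non-uniform sample distribution a fixed raster grid can't provide.

```cpp
// Cosine-weighted hemisphere sampling: sample directions bunch up around
// the surface normal instead of covering a fixed uniform raster grid.
#include <cmath>

struct Vec3 { float x, y, z; };

// Direction in the local frame (normal = +z), pdf = cos(theta) / pi.
// u1, u2 are uniform random numbers in [0, 1).
Vec3 sampleCosineHemisphere(float u1, float u2) {
    const float r   = std::sqrt(u1);
    const float phi = 2.0f * 3.14159265f * u2;
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0f - u1) };
}
```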

Hardware raytracing is really only useful if you have a standard, triangle-only representation.
Yeah, if we again imagine replacing triangles with pixel-sized points, that would become a problem.
Though RT is expensive, so constructing (possibly lower-resolution) geometry just for RT may be justified.
Personally I have that surfel hierarchy, and something like that could be used for point rendering. I also use it for raytracing, where each surfel represents a disc, which is a bit less data than a triangle and also avoids the problem of addressing 3 vertices scattered in memory.
So do I think such surfels would do better for 'insane detail raytracing' than triangles?
Maybe. But discs can't guarantee a closed surface. Rays could go through small holes, so it's imperfect.
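For reference, a hedged sketch of the disc-surfel intersection test implied above (field names are placeholders, not from any particular codebase); one center, normal and radius is less data than three scattered vertices, but nothing ties neighbouring discs into a watertight surface, which is exactly the hole problem mentioned.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Surfel { Vec3 center; Vec3 normal; float radius; };
struct Ray    { Vec3 origin; Vec3 dir; };            // dir assumed normalized

// Returns hit distance t, or a negative value on miss.
float intersectSurfel(const Ray& ray, const Surfel& s) {
    const float denom = dot(ray.dir, s.normal);
    if (std::fabs(denom) < 1e-6f) return -1.0f;      // ray parallel to disc plane
    const float t = dot(sub(s.center, ray.origin), s.normal) / denom;
    if (t < 0.0f) return -1.0f;                      // plane behind the origin
    const Vec3 p = { ray.origin.x + t * ray.dir.x,
                     ray.origin.y + t * ray.dir.y,
                     ray.origin.z + t * ray.dir.z };
    const Vec3 d = sub(p, s.center);
    return (dot(d, d) <= s.radius * s.radius) ? t : -1.0f;   // inside the disc radius?
}
```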

Some other guys may come up with requests for RT support for things like parametric patches or other implicit primitives.
But here we have the problem of divergence, and to solve that, binning ray hits to primitives becomes necessary - and even that cannot guarantee good thread saturation. So we end up rethinking the whole concept of how GPUs should work, and probably giving up their massively parallel processing advantage.

Considering this, I'm fine with triangles. It's just the simplest robust representation of a surface there is. We only need to make sure we can generate them efficiently and dynamically from whatever we have.

But it is hard to get around severe performance limitations, thus the need for some sort of hybrid pipeline.
Yes. There is no good and simple solution for everything. Complexity will only increase, and many options have to be explored. Thus flexibility > performance.
 
It doesn't necessarily close the door if this happens, but it would certainly put some roadblocks in place on PC.
To be clear: Now that we have HW RT, I want reordering in HW too. It's the only option to help with RT performance. Not only for tracing, but also for hit point shading and custom intersection shaders. So we will get it for sure at some point.
But we must ensure that dynamic geometry is possible, that building and modifying a custom BVH is possible, and that streaming BVH is possible.

I don't think vendors need to have matching specs for the BVH data structure. But they probably all use pretty much the same thing anyway, and a BVH API wouldn't be that hard to get.
 
Now that we have HW RT, I want reordering in HW too. It's the only option to help with RT performance. Not only for tracing, but also for hit point shading and custom intersection shaders. So we will get it for sure at some point.
You can do it yourself in compute or other shaders. In fact, reordering by material ID has already been implemented in Battlefield V, UE4 and UE5 for hit point shading, so you can even test it.
It won't help if materials have already been unified and simplified for RT, or if there is no material evaluation at hit points, because sorting takes execution time without reducing shading time. So it's the same kind of thing as SW backface culling - sometimes it makes sense and sometimes it doesn't. I played with it a little in UE5 and it's almost never a win, because there isn't enough shading at hit points to hide the sorting cost.
I guess there is not too much perf left on the table with ray sorting for Turing and Ampere, because both use MIMD cores for traversal, so you can probably win something for incoherent rays when performance is bound by memory accesses, assuming you have enough on-chip storage to sort rays by direction over huge windows of pixels (there is not much you can sort with random sampling over the hemisphere). Ray sorting has been implemented in the 3DMark RT feature test.
The sorting itself seems to be rather cheap on SMs, so I am not sure whether dedicated HW or instructions would benefit a lot or justify the area cost.
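As a deliberately simplified illustration of what "reordering by material ID" for hit point shading can look like on the software side, here is a counting-sort style binning sketch; all names are invented, and, as argued above, it only pays off when there is enough shading work per hit to hide the sort.

```cpp
#include <cstdint>
#include <vector>

struct HitPoint { uint32_t materialId; uint32_t pixel; float t; };

// Bin hit points by material so that consecutive hits share a material and
// a wave shades coherently. Whether this wins depends on the shading cost.
std::vector<HitPoint> binByMaterial(const std::vector<HitPoint>& hits, uint32_t materialCount) {
    std::vector<uint32_t> counts(materialCount + 1, 0);
    for (const HitPoint& h : hits) ++counts[h.materialId + 1];          // histogram
    for (uint32_t m = 1; m <= materialCount; ++m) counts[m] += counts[m - 1]; // prefix sum -> bin offsets
    std::vector<HitPoint> sorted(hits.size());
    std::vector<uint32_t> cursor(counts.begin(), counts.end() - 1);     // write cursor per bin
    for (const HitPoint& h : hits) sorted[cursor[h.materialId]++] = h;  // scatter into bins
    return sorted;
}
```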
 
You can do it yourself in compute or other shaders.
What I mean by reordering is doing it in the traversal loop, not only when launching rays or shading hit points.
The goal would be to bin rays to BVH branches to limit incoherent memory access. I remember a good (but pre-RTX) NV paper where they did this on a chip simulator. IIRC they called it 'treelets', and they got speedups of something like 2x for secondary rays.
To implement it in software, something like traversal shaders would be needed, but I don't think that would be a win at all.
The sorting itself seems to be rather cheap on SMs, so I am not sure whether dedicated HW or instructions would benefit a lot or justify the area cost.
Because sorting and binning have so many applications, I do think it might be worth having HW just for that. It's little ALU and a lot of BW, but so is raytracing.
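A toy illustration of the treelet idea recalled above (nothing like a real traversal kernel, and every structure name is invented): rays that want to descend into the same BVH subtree get queued together, so that subtree's nodes are traversed while hot in cache instead of every ray wandering through memory on its own.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct Ray { /* origin, direction, current node, ... */ uint32_t id; };

struct Treelet {
    std::vector<uint32_t> nodes;   // a small, cache-sized chunk of the BVH
    std::queue<Ray>       pending; // rays waiting to traverse this chunk
};

// treeletOf(ray) tells which treelet a ray needs to enter next (assumed hook).
void traverseBinned(std::vector<Treelet>& treelets, std::vector<Ray>& rays,
                    uint32_t (*treeletOf)(const Ray&)) {
    for (const Ray& r : rays)                       // bin rays to their next treelet
        treelets[treeletOf(r)].pending.push(r);
    for (Treelet& t : treelets) {                   // drain one treelet at a time
        while (!t.pending.empty()) {
            Ray r = t.pending.front(); t.pending.pop();
            // ...traverse only nodes inside t, then re-enqueue r on the treelet
            // of whatever subtree it needs next (omitted in this sketch)...
            (void)r;
        }
    }
}
```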
 
Maybe I’m not following, but AMD already imposes no limits on how things are done. Game engines are free to implement pure compute-based pipelines, as evidenced by Nanite and Lumen.

Investments in hardware accelerated paths won’t close the door on improvements to general compute. We’ve had continuous improvements on both fronts since GPUs have existed.

Exactly, it's a great example of where a general "software" (in this case compute) solution is superior to fixed hardware support.

Fixed hardware support (like the rasterizer or hardware-accelerated RT) can do whatever it is designed to do much quicker than more generalized hardware (like compute), but it can also severely limit what you can do, funneling efforts into the narrow range of activities that the fixed hardware supports.

As general capabilities expand, they allow for more creative solutions that can be superior to doing it in fixed hardware.

If one drives the other then it's a win-win. But if one ends up limiting the other then it becomes questionable whether it's a win or not. The rasterizer has been relatively unchanged (minor improvements in speed and some flexibility) and has limited what is done in 3D rendering. But it was performant enough that the limitation was accepted, and rarely would anyone deviate from using it. But with the past console generation we saw more attempts to move away from over-reliance on it towards different ways of rendering. And here we have an extension of that into a more generalized engine.

RT is still in its infancy WRT performant gaming. The question becomes: will NV's hardware acceleration implementation lead to something similar to triangle rasterization, with relatively little experimentation outside of what could be rasterized? Or will it be the catalyst for increased innovation and experimentation WRT RT beyond what the NV hardware is capable of accelerating?

The NV RT hardware is certainly important and impressive with regard to the level of acceleration it brought to the real-time gaming landscape. But it certainly does have limitations, at least going by what I've read from various developers asking for more flexibility and more control over various aspects of the RT pipeline.

In that respect, the fact that AMD's hardware acceleration is so weak compared to NV's implementation is a sort of backhanded compliment type of boon. Since it's relatively weaker the performance hit of not optimally using it is less than that for the NV hardware. This means that experimentation outside of what it's designed for is less punished as the benefits of using it are less pronounced than on NV hardware.

Of course, the results may also be less impressive than what you can do by adhering to how NV hardware prefers to do RT and how it is designed to handle things within its black box of hardware-accelerated data structures, but it means that you aren't as restricted to having to do it a certain way.

Now imagine that AMD's RT was MUCH more performant; let's say, for argument's sake, that it was 2x as fast as NV's implementation but required doing RT somewhat differently than on NV hardware. Now it isn't something you could easily experiment away from, as performance would become catastrophic if you weren't using it. This would then enforce doing RT a certain way and discourage experimentation WRT RT.

Perhaps there's a better way to do things. But if you are locked into doing it a certain way because of hardware support, then no experimentation is likely to be done to find that. And that experimentation might have led to better or more efficient ways to do it in hardware. Perhaps the original design choices weren't the best. However, because it was early and the most performant, the possibility is there that the entire industry would get locked into it.

IMO, specific hardware acceleration works best if it enables something that wasn't possible before (like hardware support for vertex and pixel shaders) which then leads to implementation in more generalized and flexible form (universal shaders which lead to general compute shaders). Perhaps at some point experimentation is basically exhausted and there is no longer a need for much experimentation and it's beneficial to do specific hardware support for something again...

Regards,
SB
 
Insightful post. In general I would always champion the less performant but more general programmable solution. Amazing things happen when software finds a way to use hardware creatively in a way that the original hardware designers never intended.

However, you occasionally need the quantum leap that a fixed-function solution provides to get to a place where you would have crawled towards over tens of years otherwise -- even if the fixed function solution just serves as a beacon to show you what is achievable. Parts of that fixed-function solution will gradually revert back to being programmable (or not, if it's a simple fundamental operation that serves as a building block for other things). This is a cycle, and the graphics accelerator itself is an example of this cycle. With the slowing down of Moore's Law, the importance of that fixed-function quantum leap has arguably become larger.
 
The rasterizer has been relatively unchanged (minor improvements in speed and some flexibility) and has limited what is done in 3D rendering.
I would not say the rasterizer has any limitations.
The limitation in the early days was the missing option to implement hidden surface removal. But that's no real limitation - it just happened that brute force and accepting heavy overdraw was faster than doing it on the CPU (eventually requiring a CPU software rasterizer as well) and adjusting draw calls.
Same for dynamic LOD: Messiah had it on the CPU, but not fast enough to handle the whole world. Other games had LOD systems for terrain, which also fell out of fashion because brute force did better.
Compute has solved all those issues. But it takes a lot of time until devs get rid of their habit of relying on GPU brute force.
Up to this point, nobody is to blame other than the developers.

Now with RT the situation is different. Everybody should have learned how important flexibility is. But no - all we got is an implementation of exactly the decades-old standard, without a way to diverge. And worse - essential building blocks like acceleration structures are black-boxed. (Traversal is black-boxed too, although I can accept that if it's handled by fixed function.) Those issues also apply to AMD, as long as they do not expose / extend APIs.
So this time HW vendors and API makers are to blame. (Consoles doing well is little consolation to me.)

The question becomes: will NV's hardware acceleration implementation lead to something similar to triangle rasterization, with relatively little experimentation outside of what could be rasterized? Or will it be the catalyst for increased innovation and experimentation WRT RT beyond what the NV hardware is capable of accelerating?
I have not seen much innovation in RT games. There is only adoption of known practices, with obvious realtime optimizations and compromises.
RT research is focused on:
Improving performance (acceleration structures, parallel traversal, reordering), which is HW only.
Sampling strategies (importance sampling, next event estimation, bidirectional PT, etc.), which is something we adopt from offline if at all (NV's reservoir sampling is the only exception coming to my mind, but if there were no RTX, this research would have been presented to address offline needs as usual. Oh, and Eric Heitz's work on decoupled visibility and shading.)
Denoising, which is probably the field with the most contributions from game devs, but the pillars were there before RTX arrived.
Material / BRDF modeling, which also gets a lot of contributions from the games industry. But that's not really an 'RT only' topic and is quite general.

So there is not so much yet, IMO. But there also is not so much to expect. Catching up on and adopting known offline solutions covers all we can do and keeps us busy enough. I don't think we'll invent new and super fast sampling strategies just because we need realtime and RT is new to us.
(Maybe I'll turn out to be wrong; maybe I have missed some things.)

Perhaps there's a better way to do things. But if you are locked into doing it a certain way because of hardware support, then no experimentation is likely to be done to find that. And that experimentation might have led to better or more efficient ways to do it in hardware. Perhaps the original design choices weren't the best. However, because it was early and the most performant, the possibility is there that the entire industry would get locked into it.
Although I'm loud about requesting flexibility, I do not have expectations about 'experimentation' as you put it. I mean, raytracing is just that - intersecting rays with triangles. The common saying that 'rays transport light physically accurately, like in the real world' already reads too much into it, and it's bullshit too. It's just a way to test for visibility, and usually we use it to solve the visibility term in the rendering equation. The other interesting bits, like shading, integration, etc., have nothing to do with RT.
So what do you have in mind with experimentation? Is it still about lighting, making it faster, or something else like... um, particle collisions maybe?

Coming back to the quote, another way to do those things was shown by Crytek. Good results and no long-term limitations or obstacles.
Perhaps this would have been better, yes. Better also in the sense that 'Titan GPUs are the new entry level' would not apply.

IMO, specific hardware acceleration works best if it enables something that wasn't possible before (like hardware support for vertex and pixel shaders) which then leads to implementation in more generalized and flexible form (universal shaders which lead to general compute shaders). Perhaps at some point experimentation is basically exhausted and there is no longer a need for much experimentation and it's beneficial to do specific hardware support for something again...
I know one example:
NV doing acceleration for Bézier patches.
Nobody uses it - removed.
Next try: Tessellation shaders. Very similar, but more flexibility.
Useful, but still can't do that damn Catmull-Clark subdivision everybody wants.
Mesh Shaders.
Yay! Guess everybody is happy now.

So yeah, it can happen. But then comes Nanite, which won't really profit that much from mesh shaders, and which raises the question of whether ROPs are still needed at all.
In summary, the history here has more failures than proper solutions.
 
Yeah, if we again imagine replacing triangles with pixel-sized points, that would become a problem.
Though RT is expensive, so constructing (possibly lower-resolution) geometry just for RT may be justified.
Personally I have that surfel hierarchy, and something like that could be used for point rendering. I also use it for raytracing, where each surfel represents a disc, which is a bit less data than a triangle and also avoids the problem of addressing 3 vertices scattered in memory.
So do I think such surfels would do better for 'insane detail raytracing' than triangles?
Maybe. But discs can't guarantee a closed surface. Rays could go through small holes, so it's imperfect.

I mean, that's the thing though. Dreams works. Sebbi's unlimited-detail SDF tracer works; for that virtualized level of detail they already work. Replacing triangles altogether works. It's why I find SDFs so fascinating. They work for everything, apparently, except very, very thin geometry; that does need to be worked on. There's long been work showing they're much faster than triangles for physics, and now they're showing they're faster than triangles for indirect illumination too, as Lumen performs faster in software mode than in hardware RT for the complex cases you'd actually find in games, and that's on a 3090, where hardware RT should be at its best relative to compute.

But the real thing I'm thinking about is that it's all the same math. Underlying indirect and direct illumination, physics, animation, etc., it's all just querying and moving surfaces. The idea of exploding complexity is based on the idea that different representations are ideal for different tasks, but I don't see that at all. You're doing the same fundamental thing, all the time, for everything. Collision of bodies is no different from collision of light rays, or sound, or AI visibility.

Thus, SDFs with UV maps for applying materials, and maybe you can raymarch a 2D signed distance field along the normal for amplification via an SDF texture. This can result in what is essentially tessellation in a data-efficient manner. It's also how artists like to work, it's incredibly flexible and compressible since you're querying the same material across a project rather than trying to compress everything as an individual instance like Nanite does, and it's still the exact same basic data structure, so it can be queried by anything you want. I'm not even sure you need surfels; what's the difference, after all, between a surfel and a volumetrically mipmapped implicit surface? They're the same thing as far as I can tell.

As for hierarchies, you've got that with mipmapping, until you get holes that start to disappear. That's why I've mentioned the other surface representation you'd need, which is volumetric. Even if it is, by necessity, a simplification that can't represent the final signal you want, that's a tradeoff you might be willing to make to get rid of fundamental complexity. Bit by bit, with mipmaps etc., you're slowly fighting entropy anyway, and you can't win there; there's no reducible complexity beyond a point, so you're always making tradeoffs. But that's the idea with rendering, especially in realtime. And hierarchies of volumetric representations beyond a point, either calculated per cluster, per object, or just built in worldspace altogether, seem ideal. Especially if you're doing something like SGGX, because there you can again just merge and mipmap away in a relatively straightforward manner.

I'm not sure complexity needs to explode, and I'm not sure it's even efficient to do so, is what I'm ultimately getting at.
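For readers who haven't written one, here is a minimal sphere-tracing loop of the kind the SDF approach above relies on; the SDF itself is a stand-in analytic sphere, whereas a real tracer would sample a 3D texture or brick grid with LOD, normals and material lookup.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Stand-in SDF: a unit sphere at the origin (a real tracer would sample a volume).
float sampleSdf(const Vec3& p) {
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z) - 1.0f;
}

// March along the ray, stepping by the distance bound until the surface is found.
bool sphereTrace(Vec3 origin, Vec3 dir, float maxT, float& hitT) {
    float t = 0.0f;
    for (int i = 0; i < 128 && t < maxT; ++i) {      // bounded step count
        const Vec3 p = { origin.x + t * dir.x, origin.y + t * dir.y, origin.z + t * dir.z };
        const float d = sampleSdf(p);                // distance to the nearest surface
        if (d < 1e-3f) { hitT = t; return true; }    // close enough: report a hit
        t += d;                                      // safe to advance by the distance bound
    }
    return false;                                    // missed, or ran out of steps
}
```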
 
Exactly, it's a great example of where a general "software" (in this case compute) solution is superior to fixed hardware support.

Well, this is only true if you confine the problem space to the assets that are compatible with Nanite. It's not a perfect substitute for the classic hardware rasterization path, which in some ways has fewer constraints (e.g. no skeletal meshes in Nanite).

But if you are locked into doing it a certain way because of hardware support, then no experimentation is likely to be done to find that.

I don't think I agree with this point. Developers haven't been locked into doing things a certain way for a while now; just look at all of the workloads that have been enabled by general compute APIs. The only variable is performance, and then you're back to the reason why narrowly focused hardware implementations exist in the first place.

IMO, specific hardware acceleration works best if it enables something that wasn't possible before (like hardware support for vertex and pixel shaders) which then leads to implementation in more generalized and flexible form (universal shaders which lead to general compute shaders).

Isn't that exactly what's happening right now? First generation of RT hardware enables features that weren't possible before. And it will evolve from there and become more flexible over time.
 
I mean, that's the thing though. Dreams works. Sebbi's unlimited-detail SDF tracer works; for that virtualized level of detail they already work. Replacing triangles altogether works. It's why I find SDFs so fascinating. They work for everything, apparently, except very, very thin geometry; that does need to be worked on. There's long been work showing they're much faster than triangles for physics, and now they're showing they're faster than triangles for indirect illumination too, as Lumen performs faster in software mode than in hardware RT for the complex cases you'd actually find in games, and that's on a 3090, where hardware RT should be at its best relative to compute.
If you want to replace surface triangles with an SDF, you also need a 3D texture for the material (at least for UVs). So two volumes. That's much more RAM. And you constantly need to search for the surface. SDF is hard to compress (maybe impossible), because doing something like an octree with a gradient per cell breaks the smooth signal, and the resulting small discontinuities break local-maxima search or sphere-tracing methods, I guess.

But the real thing I'm thinking about is that it's all the same math. Underlying indirect and direct illumination, physics, animation, etc., it's all just querying and moving surfaces. The idea of exploding complexity is based on the idea that different representations are ideal for different tasks, but I don't see that at all. You're doing the same fundamental thing, all the time, for everything. Collision of bodies is no different from collision of light rays, or sound, or AI visibility.
I'd really like to know how collision detection between 2 SDF volumes works. Never came across an algorithm. Could be interesting...
Multiple surface representations are just one source of exploding complexity. There is also diffuse vs. specular, transparent vs. opaque, raster vs. RT, static vs. deforming, fluid vs. rigid...
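On the SDF-vs-SDF collision question above, one commonly used scheme (offered as a sketch and an assumption, not as a definitive answer) is to take sample points on or near the surface of one body and evaluate the other body's SDF there: negative distances mean penetration, and the SDF gradient gives a contact normal.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Stand-ins so the sketch is self-contained: body A is a unit sphere at the origin.
float evalSdfA(const Vec3& p) { return std::sqrt(p.x*p.x + p.y*p.y + p.z*p.z) - 1.0f; }
Vec3  gradSdfA(const Vec3& p) { const float l = std::sqrt(p.x*p.x + p.y*p.y + p.z*p.z) + 1e-6f;
                                return { p.x / l, p.y / l, p.z / l }; }
Vec3  worldToA(const Vec3& p) { return p; }          // identity transform for the sketch

struct Contact { Vec3 point; Vec3 normal; float depth; };

// Sample B's surface points against A's SDF and collect penetrating contacts.
std::vector<Contact> collideBAgainstA(const std::vector<Vec3>& surfacePointsOfB) {
    std::vector<Contact> contacts;
    for (const Vec3& pWorld : surfacePointsOfB) {
        const Vec3 pA = worldToA(pWorld);
        const float d = evalSdfA(pA);
        if (d < 0.0f)                                // inside A: penetration of depth -d
            contacts.push_back({ pWorld, gradSdfA(pA), -d });
    }
    return contacts;
}
```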

Thus, SDFs with UV maps for applying materials, and maybe you can raymarch a 2D signed distance field along the normal for amplification via an SDF texture. This can result in what is essentially tessellation in a data-efficient manner. It's also how artists like to work, it's incredibly flexible and compressible since you're querying the same material across a project rather than trying to compress everything as an individual instance like Nanite does, and it's still the exact same basic data structure, so it can be queried by anything you want. I'm not even sure you need surfels; what's the difference, after all, between a surfel and a volumetrically mipmapped implicit surface? They're the same thing as far as I can tell.
I share these detail-amplification ideas and will try something here. My goal is to move 'compression using instances' from a per-object level to a texture-synthesis level. It requires blending of volumetric texture blocks, where SDF is attractive. Instead of duplicating whole rocks, we could just duplicate a certain crack across rocky surfaces. LF repetition vs. HF repetition. Surely interesting, but very imperfect. It can't do everything, so it's just another addition to ever-increasing complexity.
I've chosen surfels because they can represent thin walls pretty well, even with aggressively reduced LOD. Voxels or SDF can't do that, and volume representations are always bloated if we only care about the surface, and they require brute-force searching. So they only become a win if geometry is pretty diffuse at the scale of interest. (Generally speaking. It always depends...)

I'm not sure complexity needs to explode, and I'm not sure it's even efficient to do so, is what I'm ultimately getting at.
Unfortunately. We already have too much complexity, e.g. think of games using thousands of shaders.
It's not that I rule out searching for the ultimate simple solution which solves everything quickly. But I doubt we'll ever find it. We keep adding new things, removing some older ones, trying for the best compromise.
 
I found this write-up to be illuminating. Apologies if it’s a repost.

http://www.elopezr.com/a-macro-view-of-nanite/

The information above also gives us an insight into cluster size (384 vertices, i.e. 128 triangles), a suspicious multiple of 32 and 64 that is generally chosen to efficiently fill the wavefronts on a GPU. So 3333 clusters are rendered using the hardware, and the dispatch then takes care of the rest of the Nanite geometry. Each group is 128 threads, so my assumption is that each thread processes a triangle (as each cluster is 128 triangles). A whopping ~5 million triangles! These numbers tell us over 90% of the geometry is software rasterized. For shadows the same process is followed, except at the end only depth is output.

One of Nanite’s star features is the visibility buffer. It is a R32G32_UINT texture that contains triangle and depth information for each pixel. At this point no material information is present, so the first 32-bit integer is data necessary to access the properties later.
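As a hedged sketch of what such a two-channel visibility buffer entry can look like, based only on the description above: one 32-bit channel holds depth, the other packs a cluster index together with the 7-bit triangle index inside a 128-triangle cluster. The exact bit layout here is an assumption, not taken from the engine.

```cpp
#include <cstdint>
#include <cstring>

struct VisBufferTexel { uint32_t packedId; uint32_t depthBits; };

constexpr uint32_t kTriangleBits = 7;                        // 128 triangles per cluster
constexpr uint32_t kTriangleMask = (1u << kTriangleBits) - 1u;

// Pack: assumed layout, cluster index in the high bits, triangle index in the low 7 bits.
VisBufferTexel packVisibility(uint32_t clusterIndex, uint32_t triangleIndex, float depth) {
    uint32_t depthBits = 0;
    std::memcpy(&depthBits, &depth, sizeof(depthBits));      // reinterpret float depth as raw bits
    return { (clusterIndex << kTriangleBits) | (triangleIndex & kTriangleMask), depthBits };
}

// Unpack: a later material pass would use clusterIndex to fetch per-cluster properties.
void unpackVisibility(VisBufferTexel t, uint32_t& clusterIndex, uint32_t& triangleIndex) {
    clusterIndex  = t.packedId >> kTriangleBits;
    triangleIndex = t.packedId &  kTriangleMask;
}
```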

The material classification pass runs a compute shader that analyzes the fullscreen visibility buffer. This is very important for the next pass. The output of the process is a 20×12 (= 240) pixels R32G32_UINT texture called Material Range that encodes the range of materials present in the 64×64 region represented by each tile.

We have what looks like a drawcall per material ID, and every drawcall is a fullscreen quad chopped up into 240 squares rendered across the screen. One fullscreen drawcall per material? Have they gone mad? Not quite. We mentioned before that the material range texture was 240 pixels, so every quad of this fullscreen drawcall has a corresponding texel. The quad vertices sample this texture and check whether the tile is relevant to them, i.e. whether any pixel in the tile has the material they are going to render. If not, the x coordinate will be set to NaN and the whole quad discarded.
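A plain C++ illustration of that vertex-shader trick (the real thing runs per quad vertex in HLSL; the names here are invented): if the drawcall's material falls outside the min/max range stored for the tile, the vertex x coordinate becomes NaN and the quad is culled before rasterization.

```cpp
#include <cstdint>
#include <limits>

struct MaterialRangeTexel { uint32_t minMaterial; uint32_t maxMaterial; }; // per 64x64 pixel tile

// Returns the vertex x coordinate to emit: unchanged if the tile contains the
// drawcall's material, NaN otherwise so the GPU discards the whole quad.
float cullTileByMaterial(const MaterialRangeTexel& tile, uint32_t drawcallMaterial, float vertexX) {
    const bool tileUsesMaterial =
        drawcallMaterial >= tile.minMaterial && drawcallMaterial <= tile.maxMaterial;
    return tileUsesMaterial ? vertexX : std::numeric_limits<float>::quiet_NaN();
}
```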
 
My video should be going live today at 17, and at one point I mention a visual artefact with Nanite that I have seen no one else post about before - I do wonder if it is a feature of how Nanite functions or just an issue in the current EA version of UE5. I would be curious to hear what people think could be the cause! Essentially there is some shuffling of Nanite geometry when the camera changes position - not anything like tessellation boiling or a discrete LOD shift, but more as if the world pieces shuffle into place as the camera comes to a rest. Unfortunately that is the best way I can describe it - it needs to be seen in video form really.
 
My video should be going live today at 17, and at one point I mention a visual artefact with Nanite that I have seen no one else post about before - I do wonder if it is a feature of how Nanite functions or just an issue in the current EA version of UE5. I would be curious to hear what people think could be the cause! Essentially there is some shuffling of Nanite geometry when the camera changes position - not anything like tessellation boiling or a discrete LOD shift, but more as if the world pieces shuffle into place as the camera comes to a rest. Unfortunately that is the best way I can describe it - it needs to be seen in video form really.

1700 hours? Which time zone? :)
 