Unreal Engine 5, [UE5 Developer Availability 2022-04-05]

My two cents on how I guess Nanite will behave on systems with lower I/O performance than the next-gen consoles, or even when next-gen I/O isn't enough for a situation that happens to be too demanding.

I'm assuming Epic does not want to make an engine that crashes or stutters when things fail to load in time. I mean, there were 1990s 3D games streaming data from CD-ROMs with streaming systems more robust and smooth than that.

The easiest assumption to make is that when geometry data is needed but hasn't been loaded in time, Nanite will show a lower LOD of it (the highest one it happens to have in memory by then), and a couple (or several) frames later, the required LOD will have been successfully streamed into the virtual geometry cache in RAM and the low-poly model will be swapped for the high-poly one. Hopefully, UE5 will have some form of smooth blending to slowly fade or morph from one LOD into the other in these situations, so as to make the swap less distracting.
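Just to make that idea concrete, here's a minimal sketch of the fallback-then-blend behaviour I'm describing (my own toy code, all names hypothetical, nothing to do with Nanite's actual internals): if the wanted detail level isn't resident, draw the best one already in memory and fade once the streamed level arrives.

```cpp
// Minimal sketch of LOD fallback on a streaming miss (hypothetical names throughout).
#include <algorithm>

struct MeshLevels {
    int residentUpTo = 0;   // highest detail level currently in RAM (0 = coarsest)
    int streamingTo  = 0;   // level that has been requested from storage
};

struct LodChoice {
    int   levelToDraw;      // level we can actually draw this frame
    float blendToNext;      // 0..1 fade factor once the better level lands
};

LodChoice chooseLod(const MeshLevels& m, int wantedLevel, float secondsSinceArrival)
{
    LodChoice c;
    // Draw the best thing we actually have; never stall waiting for I/O.
    c.levelToDraw = std::min(wantedLevel, m.residentUpTo);

    // If the wanted level just streamed in, fade/morph over ~0.25s instead of popping.
    const float fadeTime = 0.25f;
    const bool arrived = m.residentUpTo >= wantedLevel;
    c.blendToNext = arrived ? std::min(1.0f, secondsSinceArrival / fadeTime) : 0.0f;
    return c;
}
```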

Such a fading system can't hide obvious swaps happening with large objects way too close to the camera going from the lowest of the low LODs right into one of the higher ones, granted. But that would be one of the most extreme examples of I/O slowness. An object in the distant background swapping from one LOD level to the next half a second too late, though, would be nearly imperceptible most of the time.

The precedent for such a guess is that this is how most games with a minimally polished streaming system do it today. UE3's own texture streaming works exactly like that. Other open-world games handle 3D mesh streaming the same way.

That said, if we assume this demo was pushing PS5's IO to its very limits, how would it then look on an XBSX?

Well, the same, except that in the moments when the camera is moving fastest (the ones that stress streaming the most), some of the more distant assets would display in slightly lower geometric detail than Nanite's target for roughly twice as many frames as on PS5.

So if, when PS5's I/O gets pushed the most (when the character is flying through the ruins), some of the assets swap to higher levels of detail as they approach the camera 1-2 frames later than they should have, on XBSX that would take 2-4 frames instead.

Not the end of the world in my opinion.

Considering the engine is still in development, there are a lot of things it may be lacking still.

Maybe their streaming is not yet polished enough to avoid stutters and freezes, so they can't run it confidently on hardware that doesn't outperform its requirements 100% of the time.

Maybe they don't have their smooth LOD swapping figured out yet, so those swaps would be more poppy, distracting, and easy to spot (not a good first impression for their "it just works"™ system).

In those cases, a PS5 devkit is the safest bet for making a public showing TODAY, because TODAY that is the machine where you get the most reliable streaming.

Also, PS5's HW compression uses a more generalized data compression scheme that may provide friendlier prototyping. As I understand it, XBSX's decompressor is texture-centric, so to use it with other data one must encode that data in a texture format (which in essence isn't rocket science) and in a way that ends up compression-friendly (that is where things require more testing/refinement). I would guess that whatever encodings Nanite uses for geometry are probably easy to translate into image-based formats; it just takes some extra iteration time to find the sweet spot there.
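As a rough illustration of the "encode other data as a texture" step (purely my own sketch, not how Epic or Microsoft actually do it): packing a flat float stream into RGBA32F texels is the trivial part; the part that needs iteration is laying the data out so the texture codec compresses it well, which this doesn't attempt to show.

```cpp
// Toy packing of arbitrary float data (e.g. x,y,z vertex positions) into RGBA32F
// texel rows so a texture-oriented I/O path could carry it. Hypothetical layout.
#include <cstdint>
#include <cstring>
#include <vector>

struct TexelRGBA32F { float r, g, b, a; };

std::vector<TexelRGBA32F> packFloatsAsTexture(const std::vector<float>& data,
                                              uint32_t width, uint32_t& outHeight)
{
    const size_t texelCount = (data.size() + 3) / 4;            // 4 floats per texel
    outHeight = uint32_t((texelCount + width - 1) / width);     // round rows up
    std::vector<TexelRGBA32F> texels(size_t(outHeight) * width, {0.f, 0.f, 0.f, 0.f});
    if (!data.empty())
        std::memcpy(texels.data(), data.data(), data.size() * sizeof(float));
    return texels;
}
```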

When it comes to PS5 vs PC and Sweeney's comments, as he said, "system integration": it's probably the case that a dev trying to build a robust, high-performance engine will welcome the ability to go low-level when needed, and a console devkit is probably the most comfortable environment to do that in.

That does not mean there can't be ways to achieve the same or similar results on other platforms, but when you are still building your engine, the platform of choice will usually be the one where prototyping, testing, debugging, documentation/engineering consulting and general tinkering are easiest. Prototype there first, and then port to the other stuff.

Under those circumstances, it should not be surprising that PS5 ended up being their preferred dev environment (at least for this demo, for now, and according to T.S.).

In the future, it might be that with XBSX-optimised compression they end up finding equal or even higher performance there than on PS5. They might end up leveraging the extra compute to achieve even higher compression ratios too, thus stressing raw I/O speed less. Anything is possible as of now. Relax.

Kids can go to bed now, there is no monster under the bed.
 
How does a compatible SSD drive help? How does that undo the chain of SSD -> controller -> southbridge -> RAM -> Northbridge -> GDDR -> GPU? How does that change anything? Your comments only demonstrate you don't know how your PC works, or how, for data to transfer from SSD to RAM, you need to involve the device driver for the SSD, the overarching interface driver, and the interconnect bus driver. And that's just for starters. What difference does one SSD or another make?

Please, read the links I posted before. They explain how Windows is architecturally designed.

I'm trying to work my way through the link you provided. It's pretty tough going for someone who's pretty much just taken it all for granted!

What steps do you think XSX will be able to 'bypass' (avoid?)?

[Image: 2grsover.png]


The subsystem used as an example was Win32, so I guess DirectStorage on XSX would just be communicating directly with the I/O system services, or straight to the I/O Manager, with no intermediate use of a PC style "main ram" (PC DDR) driver, or the PCIe bus driver, because you're just going straight into the single pool of GDDR6...?

Edit: I mean ... that's a lot of steps!

"overview of what happens when a subsystem opens a file object representing a data file on behalf of an application"

[Image: 2opendev.png]
 
The easiest assumption to make is that when geometry data is needed but hasn't been loaded in time, Nanite will show a lower LOD of it (the highest one it happens to have in memory by then) ...
I don't think so. There are no LODs. There's just source models. TBH though we need to know how Epic are actually representing the data. ;)

Also, PS5's HW compression uses a more generalized data compression scheme that may provide friendlier prototyping. As I understand it, XBSX's decompressor is texture-centric, so to use it with other data one must encode that data in a texture format (which in essence isn't rocket science) and in a way that ends up compression-friendly (that is where things require more testing/refinement).
XBSX supports ZLib in its hardware block for general compression, and BCPack for textures.

In the future, it might be that with XBSX optimised compression, they end up finding equal or even higher performance there than on PS.
There are quite a few unknowns about the real-world performance of the IOs, so no-one should really assume anything about them; just wait until a better understanding makes comparisons meaningful.
 
I don't think so. There are no LODs. There's just source models. TBH though we need to know how Epic are actually representing the data. ;)

I honestly find it impossible that their system does not have its own form of internal LODing. When they said "no LODs" I understood they meant "no LODs* (*that artists have to consciously worry about, or even know about for that matter)".

And I consider the short moment where they show the debug view of the geometry (with different IDs), plus the tweets that confirm they are using some sort of Compute Triangle Rasterization, as good evidence that they must have some sort of LODing scheme in there.

In the debug mesh view, you can see that most triangles are near pixel-sized, but not precisely single-pixel. Most of them vary from 2-4 pixels in height/width (at least from what I could discern; I didn't get the opportunity to download the highest quality 4K video to really sink my teeth in yet). But assuming my first impressions are correct, if they have this semi-regular triangle density across the entire screen, and said screen has even minimally varying degrees of depth, that can only mean the near surfaces are more tessellated than the further ones. And those further ones will get more tessellated and detailed as the camera approaches. That is a form of LODing. Maybe not the typical form of LODing used in modern engines today, sure, but it's LODing nonetheless.

My biggest assumption is that streaming is needed from one LOD level to the next. It could be that all their LODs are derived in real time from the high-poly original asset, but I don't think that is very likely.

I just don't think that is a smart way to handle game scenes with large environments, which UE5 is aiming to support.
 
XBSX supports ZLib in its hardware block for general compression, and BCPack for textures.

Ooh, I had either forgotten or missed that detail. In that case SX IO compression is less of a juggling act than I thought. Good job MS.


There are quite a few unknowns about the real-world performance of the IOs, so no-one should really assume anything about them; just wait until a better understanding makes comparisons meaningful.

Absolutely. One should only speculate for the pleasure of throwing ideas around, and NOT to try to come to any reliable prediction. The most certain source of how this will turn out is simply waiting for it to actually happen. Don't sweat it guys.
 
How does a compatible SSD drive help? How does that undo the chain of SSD -> controller -> southbridge -> RAM -> Northbridge -> GDDR -> GPU? How does that change anything? Your comments only demonstrate you don't know how your PC works,

There's not really a Northbridge in modern PC platforms, and the Southbridge (generally just referred to as the Chipset these days) doesn't enter the equation on Zen 2 based platforms, since the NVMe SSD connects directly to the CPU IO die just like it would in the PS5. If AMD were to enable direct communication between the GPU and SSD via the HBCC (which the hardware already supports but isn't enabled in drivers), then the path from SSD to GPU would be basically the same as on the console as far as I can see (aside from one additional very high bandwidth PCIe x16 bus). At least for any data that doesn't need to be decompressed before going into graphics memory, anyway.
 
I honestly find it impossible that their system does not have its own form of internal LODing. When they said "no LODs" I understood they meant "no LODs* (*that artists have to consciously worry about, or even know about for that matter)".

And I consider the short moment where they show the debug view of the geometry (with different IDs), plus the tweets that confirm they are using some sort of Compute Triangle Rasterization, as good evidence that they must have some sort of LODing scheme in there.
Okay, you didn't mean a specific model designed for lower quality.

My guess, as it comes to me while typing, is that there has to be a spatial representation, some hierarchy of nodes representing the 3D space. I think this would be shared with Lumen for optimising the lighting, although they may have two different spatial models for different purposes. Nodes will point to the mesh data on storage through a caching system. Viewing frustum analysis with prediction will prefetch nodes' worth of geometry. The representation in storage may have LOD tiers, allowing several tiers above the full model to be fetched as an approximation, which I guess is what you're suggesting too.* Depending on how Epic manage this, the streaming requirements in terms of raw BW could be very low. For scaling the game to lower-spec'd devices, the full models make no sense, so I reckon there's a one-button export that simply stops the 'cooking' at a higher tier of LOD. If each tier has four leaves, say, one tier higher would be 1/4 of the object size and one triangle per four pixels, as it were, and two tiers higher would be 1/16th the size.

As for the large pixels in the demo, that was very obvious, and they didn't subdivide over time. I took it to be Epic being a bit loose with their definition of 'one triangle per pixel', and there are plenty of larger objects in the source material, because it's quite frankly dumb to tessellate an area into small triangles if they all lie on the same plane. So perhaps more a case that the scene can be one triangle per pixel, but with the assets they were using, some triangles are much larger?

* The problem I have with this is the inefficiency in fetching data that should be local on the storage but may be spread over different blocks. The requirements for efficient data access might determine what representations are actually usable. Having a super-fast IO platform for R&D and prototyping would definitely be helpful in that regard.
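To make the tier idea a bit more concrete, here's a toy version of such a hierarchy (names, numbers and the one-pixel threshold are all mine, nothing confirmed by Epic): each node holds a coarse approximation of its four children, and traversal stops as soon as a node projects to about a pixel, so only that tier's payload needs to be fetched.

```cpp
// Toy tiered-hierarchy walk: refine into children only while a node covers > ~1 pixel.
#include <cmath>
#include <vector>

struct GeoNode {
    float    boundsRadius;        // world-space size of what this node covers
    size_t   payloadOffset;       // where this tier's geometry lives on storage
    GeoNode* children[4] = {};    // null for leaf (full-detail) nodes
};

void collectTiers(const GeoNode* n, float distToCamera, float pixelsPerUnitAtOneMeter,
                  std::vector<size_t>& toStream)
{
    if (!n) return;
    // Approximate projected size in pixels; if the whole node is ~1 px, this tier suffices.
    const float projectedPixels = (n->boundsRadius / std::max(distToCamera, 0.001f))
                                  * pixelsPerUnitAtOneMeter;
    if (projectedPixels <= 1.0f || !n->children[0]) {
        toStream.push_back(n->payloadOffset);   // fetch just this approximation
        return;
    }
    for (const GeoNode* c : n->children)        // otherwise refine into the four leaves
        collectTiers(c, distToCamera, pixelsPerUnitAtOneMeter, toStream);
}
```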
 
@Shifty Geezer @milk They have SDF and voxel representations of the scene for lumen. I wouldn't be surprised if they build a rough scene representation as an SDF first, then ray march it to figure out what parts of the polygon meshes will be visible, and use the distance from the screen to figure out what level of detail they want to load in from the parts of the meshes they've selected. Just my layman's guess.

Edit:
I'm sure they do support larger polygons, and that's when they fall back to the GPU raster hardware instead of their software raster shaders. If you were making a scene in a modern building with a seamless tile floor, I'm assuming you'd still use large polygons, since it's all meant to be flat and seamless anyway.
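For what it's worth, here's a layman's sketch of the guess above (mine, definitely not Epic's code): sphere-trace a coarse SDF of the scene to find the first visible surface per ray, then derive a detail level to request from how far away that hit is.

```cpp
// Sphere-trace a coarse scene SDF, then map hit distance to a requested detail level.
// Everything here is a hypothetical illustration of the guess in the post above.
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };
static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// sceneSdf returns the signed distance from a point to the nearest surface.
float sphereTrace(const std::function<float(Vec3)>& sceneSdf, Vec3 origin, Vec3 dir)
{
    float t = 0.0f;
    for (int i = 0; i < 128 && t < 1000.0f; ++i) {
        const float d = sceneSdf(add(origin, mul(dir, t)));
        if (d < 0.001f) return t;   // hit: t is the distance to the visible surface
        t += d;                     // safe to step forward by the SDF value
    }
    return -1.0f;                   // miss
}

int detailLevelFromDistance(float hitDistance)
{
    // Crude rule: drop one detail level per doubling of distance (level 0 = full detail).
    return hitDistance <= 0.0f ? -1 : (int)std::floor(std::log2(1.0f + hitDistance));
}
```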
 
Yeah, but I wonder if the SDFs are also tied to the overall scene representation? Part of the optimisation of which SDFs to load into memory. Or maybe they are all just in RAM, the whole level, Dreams-style, because they are very efficient?

Virtual geometry is dependent on spatial representation, and so is Lumen, so at first glance you'd think the two would want to be unified. But maybe the use cases are so different that that's a bad idea? There'll also be the need to trace light off geometry.

Oh, hang on. Will that even be possible?? How do you build a BVH for hardware tracing with meshes that are being constantly adapted? All geometry would be treated as dynamic without any static scenery you could optimise?
 
There are no LODs. There's just source models.
There must be LODs, otherwise for a model that scales down to a few pixels they'd need to average a million triangles per pixel.
I guess they use some hierarchy over the triangles, where internal nodes store points approximating the triangles at the leaf levels. This way it would only be necessary to stream the hierarchy levels that are actually needed.
That's much more fine-grained than traditional methods. If we take the Spider-Man game, they have big blocks of city, and they need to stream a whole block even if they need just a bit of it.
But my guess has some issues:

Assuming they have no connectivity information in the internal levels (just points or disconnected triangles), they could only switch LOD at the subpixel level, not above it. The demo looks like this, but then I ask: how could they scale to low-power hardware?
And if they had connectivity (similar to traditional discrete LODs), they would have the problem of discontinuities (cracks in the surface) between the levels of the hierarchy. It would also take longer to preprocess the meshes; import from ZBrush and display instantly would not work, and IIRC they said it's instant. (Did they?)
To solve the cracks there are some options: progressive / morphing meshes, having skirts along boundaries to close gaps, or rendering both LODs and mixing stochastically.
All this is pain. Hiding the LOD switch at the subpixel level removes the need for all of it, but it requires enough hardware power.
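Of the crack-hiding options just listed, geomorphing is the easiest to sketch (this is the generic, well-known technique, not something Nanite is confirmed to use): each fine-LOD vertex knows the coarse-LOD position it collapses to, and you lerp between the two during the transition.

```cpp
// Generic geomorphing sketch between two LOD levels (hypothetical data layout).
#include <vector>

struct V3 { float x, y, z; };

// morphFactor: 0 = fully coarse, 1 = fully fine; driven by distance or transition time.
std::vector<V3> geomorph(const std::vector<V3>& finePos,
                         const std::vector<V3>& coarseTargetPos,  // same count as finePos
                         float morphFactor)
{
    std::vector<V3> out(finePos.size());
    for (size_t i = 0; i < finePos.size(); ++i) {
        out[i].x = coarseTargetPos[i].x + (finePos[i].x - coarseTargetPos[i].x) * morphFactor;
        out[i].y = coarseTargetPos[i].y + (finePos[i].y - coarseTargetPos[i].y) * morphFactor;
        out[i].z = coarseTargetPos[i].z + (finePos[i].z - coarseTargetPos[i].z) * morphFactor;
    }
    return out;
}
```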

And second, with all this detail they obviously rely on instancing to have enough content. We can see how the same rock model makes up the whole scene.
But if we assume we zoom out, the advantage from instancing becomes more additional complexity than value. If each rock is only a few pixels large, rendering by drawing those instances is more expensive than having a single huge but low-detail piece of geometry that contains all those rocks and forms a surface.
So it makes sense to assume they have bottom-level instances and a top-level hierarchy that has no triangles at all, just points (or micro-triangles, whatever it is).
At this scale it becomes harder to imagine how instancing could help with reducing storage, unless they used large meshes that shape mountains or buildings, which would consequently lack the close-up detail.

Conclusion: they still rely on a low-poly landscape for the distance, with instanced detail meshes fading out towards the horizon. Hellblade looked like this, and I have some doubts that Nanite can handle LOD at all scales with just one 'simple solution', but I'm not sure.
 
@Shifty Geezer @milk They have SDF and voxel representations of the scene for lumen. I wouldn't be surprised if they build a rough scene representation as an SDF first, then ray march it to figure out what parts of the polygon meshes will be visible, and use the distance from the screen to figure out what level of detail they want to load in from the parts of the meshes they've selected. Just my layman's guess.
I doubt this, because a low-res volume removes detail. Thin or diffuse objects would not be present in the low-res volume, so you could not know they have to be loaded or drawn. It would only work for smooth surfaces like a mountain.

Oh, hang on. Will that even be possible?? How do you build a BVH for hardware tracing with meshes that are being constantly adapted? All geometry would be treated as dynamic without any static scenery you could optimise?
That's what I've been talking about here all the time, long before UE5 :)
I think we can use discrete LODs with RT, using the displacement idea I've mentioned in the alternate geometry thread.
Then we would only need the ability to cache vendor-specific BVHs to disk so they can be streamed later. We would have chunks of discrete LODs and stream them in and out as necessary. Still fine-grained enough to work, but probably with similar storage requirements to the Nanite stuff itself. :O
If my idea does not work, we need traversal shaders, or more flexible HW RT, so we could use the same data structure for both.
This was exactly the reason I was not so excited about HW RT at all.
 
I'm trying to work my way through the link you provided. It's pretty tough going for someone who's pretty much just taken it all for granted!

This. 100%. With a PC you plug something in and everything just works. Mostly. You don't see all the behind-the-scenes work, and you don't really appreciate how much Windows itself is doing to keep it that way. You don't appreciate that the SSD has a driver and its own firmware, that the controller does too even if it's part of a single package with the SSD, nor that the bus you connect the SSD to does as well, and that that bus may sit within a higher bus hierarchy. The SSD is trying to 'help' the native Windows filesystem, which introduces its own levels of abstraction.

What steps do you think XSX will be able to 'bypass' (avoid?)?

I think the biggest change Microsoft could make is having one driver for all of the XSX hardware, or just having a trusted driver model so each 'driver' can share without IRPs (I/O Request Packets), which is how drivers usually communicate with other drivers and the Windows kernel. This would work on Series X because Microsoft is the only driver supplier and their drivers can't be replaced by the user. This cannot work in Windows because drivers come from all sources and IRPs serve a resilience and security need.

The subsystem used as an example was Win32, so I guess DirectStorage on XSX would just be communicating directly with the I/O system services, or straight to the I/O Manager, with no intermediate use of a PC style "main ram" (PC DDR) driver, or the PCIe bus driver, because you're just going straight into the single pool of GDDR6...?

Edit: I mean ... that's a lot of steps!

And this is the point. Doing anything in Windows (or macOS or Linux) has a tremendous amount of overhead. Not in terms of your PC using a significant amount of CPU time, although that can happen, but folks think of their PC as one machine when the reality is that Windows holds together a whole bunch of different components that pass data to and from each other with absolutely no understanding of what the other components are or even what the data is. And this is the only way the PC hardware ecosystem can work.

But seriously, gold star for trying to dig into this stuff. It's a lot to take in but it also makes you appreciate Windows a bit more! :yes:

There's not really a Northbridge in modern PC platforms and the Southbridge (generally just referred to as the Chipset these days) doesn't enter the equation on Zen2 based platforms since the NVMe SSD connects directly to the CPU IO die just like it would in the PS5.

Agreed, but the distinction of there being two buses with their own bus-handling drivers remains. You can put a ton of functionality into a single ASIC, but that does not remove the need for different drivers hooked into the right parts of Windows for it all to work.

If AMD were to enable direct communication between the GPU and SSD via the HBCC (which the hardware already supports but isn't enabled in drivers) then the path from SSD to GPU would be basically the same as the console as far as I can see (aside from one additional very high bandwidth PCIe-16x bus). At least for any data that doesn't need to be decompressed before going into graphics memory anyway.

I don't follow AMD hardware in the PC world, but I wonder what the reason for this is? I can think of many, but it would be speculation. For starters, you really don't want one driver managing two very different types of hardware resources. There are different types of kernel drivers optimised towards different types of device handling, and GPU and SSD seem about as far apart as it's possible to get in terms of I/O priority and IRP priority. Windows manages this itself, so if you do want to do this you now have to second-guess the Windows driver management system. Also, do you want a hiccup with your graphics card to impact your storage system? I'd argue not.

AMD do seem best placed to achieve this; they have the closest end-to-end architecture solution of anybody out there, including Intel. But what they want to do has to work within the architecture of Windows itself. Some of the diagrams @function posted from Microsoft's site look crazy, but this is four decades of evolution of Windows being architected like this. It would take tossing this model out and starting over to 'fix' the issue on PC. I say 'fix' because the PC is not broken; it's just that in one aspect, the SSD being able to serve data to CPU and GPU, the simplicity of next-gen console hardware is better.
 
In the debug mesh view, you can see that most triangles are near pixel-sized, but not precisely single-pixel. Most of them vary from 2-4 pixels in height/width
I guess the debug view is made by taking the atomic max of the triangle ID. Because of the max, most pixels end up white, implying they have overdraw like you would get from pixel splatting to ensure there are no holes.
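A tiny CPU-side sketch of what that could look like (my reconstruction only; in the demo this would presumably be a compute shader writing a per-pixel visibility buffer): each splatted pixel keeps the largest triangle ID that touches it via an atomic max.

```cpp
// Atomic-max splat of triangle IDs into a visibility buffer (CPU stand-in for a
// compute-shader pass; hypothetical, not Epic's actual debug view).
#include <atomic>
#include <cstdint>
#include <vector>

void splatTriangleId(std::vector<std::atomic<uint32_t>>& visBuffer, int bufferWidth,
                     int px, int py, uint32_t triangleId)
{
    std::atomic<uint32_t>& texel = visBuffer[size_t(py) * bufferWidth + px];
    // Keep the largest ID touching the pixel; overlapping splats (overdraw) mean
    // later/larger IDs win, which is why so many debug pixels read as near-white.
    uint32_t prev = texel.load(std::memory_order_relaxed);
    while (prev < triangleId &&
           !texel.compare_exchange_weak(prev, triangleId, std::memory_order_relaxed)) {}
}
```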
 
Sorry, can I ask a really basic question (well one I think is basic)?!

"How much bandwidth does it take to fill a 1440K resolution with 8K textures for a whole second?" I ask this because we're told this demo had 8K textures only and was running at 1440K...or am I totally misunderstanding?

Also, there are too many pages to trawl through, but has everyone seen The Cherno's video about this? I found it interesting... nicely explained so even I understand, lol
 
Okay, you didn't mean a specific model designed for lower quality.

My guess, as it comes to me while typing, is that there has to be a spatial representation, some hierarchy of nodes representing the 3D space. I think this would be shared with Lumen for optimising the lighting, although they may have two different spatial models for different purposes. Nodes will point to the mesh data on storage through a caching system. Viewing frustum analysis with prediction will prefetch nodes' worth of geometry. The representation in storage may have LOD tiers, allowing several tiers above the full model to be fetched as an approximation, which I guess is what you're suggesting too.* Depending on how Epic manage this, the streaming requirements in terms of raw BW could be very low. For scaling the game to lower-spec'd devices, the full models make no sense, so I reckon there's a one-button export that simply stops the 'cooking' at a higher tier of LOD. If each tier has four leaves, say, one tier higher would be 1/4 of the object size and one triangle per four pixels, as it were, and two tiers higher would be 1/16th the size.

As for the large pixels in the demo, that was very obvious, and they didn't subdivide over time. I took it to be Epic being a bit loose with their definition of 'one triangle per pixel', and there are plenty of larger objects in the source material, because it's quite frankly dumb to tessellate an area into small triangles if they all lie on the same plane. So perhaps more a case that the scene can be one triangle per pixel, but with the assets they were using, some triangles are much larger?

* The problem I have with this is the inefficiency in fetching data that should be local on the storage but may be spread over different blocks. The requirements for efficient data access might determine what representations are actually usable. Having a super-fast IO platform for R&D and prototyping would definitely be helpful in that regard.

Yeah, our assumptions match.
You can derive many LODs from a single pool of verts by just changing the indexing (the highest LOD uses all verts; further LODs use subsets of the same verts, just triangulated with progressively coarser geo). Insomniac, for one, used that for some of their PS3 games, and even a Moto GP game on the original Xbox did the same with their track models. It's baffling to me that I haven't heard of more devs going that route.

Even though that does reduce data dramatically, the holy grail would be for this system to pull from storage only the verts that actually end up used by the LOD required at the moment, with the data for finer LODs pulled progressively on top of that. Man, that would be one hell of an elegant system.
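Here's a minimal sketch of that shared-vertex-pool trick as I understand it (a generic illustration, not Insomniac's or Epic's actual layout): one vertex array ordered from coarse to fine, and each LOD is just a different index buffer over a prefix of it, so a coarser LOD only needs part of the pool resident.

```cpp
// Shared vertex pool with per-LOD index buffers; coarser LODs only touch a prefix.
#include <cstdint>
#include <vector>

struct SharedLodMesh {
    std::vector<float>                 positions;    // x,y,z interleaved, finest verts last
    std::vector<std::vector<uint32_t>> lodIndices;   // lodIndices[0] = coarsest triangulation
    std::vector<uint32_t>              vertsNeeded;  // how many verts each LOD actually touches
};

// Only the prefix of the vertex pool used by 'lod' has to be streamed in.
size_t bytesToStreamForLod(const SharedLodMesh& m, size_t lod)
{
    return size_t(m.vertsNeeded[lod]) * 3 * sizeof(float)
         + m.lodIndices[lod].size() * sizeof(uint32_t);
}
```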

I also don't rule out the possibility that they provide tooling to merge multiple instanced meshes into a single low-poly one (also generated automagically) for very extreme distances.
 
@milk What if each model is stored on disk as meshlets (like with Nvidia / DirectX mesh shaders), so each meshlet has its own hierarchy and you can coarsely select which meshlets to load? I suppose you get the case where meshlets are smaller than pixels; then you still need a way to reduce multiple meshlets down to something closer to 1 polygon per pixel.

[Image: meshlets_sample.png]
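Roughly what I picture that looking like in data terms (sizes borrowed from the usual NV/DX12 mesh-shader guidelines, and nothing to do with whatever Nanite really stores): fixed-size meshlets with their own bounds, so the loader can select them coarsely per frame before anything finer-grained happens.

```cpp
// Hypothetical meshlet record plus a coarse per-frame selection pass.
#include <cstdint>
#include <vector>

struct Meshlet {
    float    center[3];
    float    radius;            // bounding sphere for coarse visibility / LOD decisions
    uint32_t vertexOffset;      // into the model's vertex pool
    uint8_t  vertexCount;       // typically capped around 64
    uint8_t  triangleCount;     // typically capped around 124-126
};

std::vector<uint32_t> selectVisibleMeshlets(const std::vector<Meshlet>& meshlets,
                                            const float camPos[3], float maxDistance)
{
    std::vector<uint32_t> visible;
    for (uint32_t i = 0; i < (uint32_t)meshlets.size(); ++i) {
        const float dx = meshlets[i].center[0] - camPos[0];
        const float dy = meshlets[i].center[1] - camPos[1];
        const float dz = meshlets[i].center[2] - camPos[2];
        if (dx * dx + dy * dy + dz * dz <= maxDistance * maxDistance)
            visible.push_back(i);   // only these meshlets get loaded/drawn this frame
    }
    return visible;
}
```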
 
Meshmaps (geometry texture) - an array of meshlets stored in a texture. Call its equivalent texel a 'microcel'.

Goal is to have a microcel approach 1:1 triangle to pixel for micropolygons in REYES.

Use visibility tests to only shade what is needed and save bandwidth.

Shade your geometry surface using meshmaps and draw the microcel. Use your special normal map created from LOD0 to guide you, is my guesstimate.
 
I am very LTTP due to some stuff happening, but I only just watched the demo; I did not have access to my main screen and decided I did not want to watch it for the first time on my phone...

And maybe the expectations were built up way too much after reading people's reactions to it, but... I don't think it's very mind-blowing at all?
Yeah, it looks really good, but nothing *sits* in the environment? And nothing pops, either?
 