DirectX 12: The future of it within the console gaming space (specifically the XB1)

The tightest G-buffer layout is 12 bytes per pixel. That is two ARGB8 render targets plus a 32 bit per pixel depth+stencil buffer.

You store albedo.rgb and roughness in the first ARGB8 render target, and specular (grayscale) and normal.xyz in the second render target. You can store the normal in only two channels if you use some encoding scheme, but 8+8 bits is slightly too low quality for my taste. This way you free one 8 bit channel for some other use.
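
A rough sketch of that packing (made-up names; in practice this happens in the G-buffer pixel shader writing to two MRTs, but the bit layout is the same):

#include <cstdint>

// Hypothetical 12 bytes/pixel layout: two 32 bit render targets
// (the depth+stencil buffer lives elsewhere).
struct GBufferPixel {
    uint32_t rt0;   // albedo.r, albedo.g, albedo.b, roughness (8 bits each)
    uint32_t rt1;   // specular, normal.x, normal.y, normal.z (8 bits each)
};

static uint32_t packUnorm8(float v) {
    // Clamp to [0,1] and quantize to 8 bits.
    v = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    return (uint32_t)(v * 255.0f + 0.5f);
}

GBufferPixel packGBuffer(const float albedo[3], float roughness,
                         float specular, const float normal[3]) {
    GBufferPixel p;
    p.rt0 = packUnorm8(albedo[0])       |
            packUnorm8(albedo[1]) << 8  |
            packUnorm8(albedo[2]) << 16 |
            packUnorm8(roughness) << 24;
    // Normal components remapped from [-1,1] to [0,1] before quantization.
    p.rt1 = packUnorm8(specular)                      |
            packUnorm8(normal[0] * 0.5f + 0.5f) << 8  |
            packUnorm8(normal[1] * 0.5f + 0.5f) << 16 |
            packUnorm8(normal[2] * 0.5f + 0.5f) << 24;
    return p;
}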

Chroma subsampling also frees one extra 8 bit channel per pixel (Cr and Cb are stored for every other pixel). We don't use this, because we have a third g-buffer layer on next gen consoles (total of 16 bytes per pixel).
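
A generic illustration of that idea (not our actual layout or color transform): convert albedo to luma + chroma, store luma for every pixel, alternate the two chroma components per pixel, and reconstruct the missing one from the neighbors when reading:

// Generic sketch: full resolution luma, checkerboarded chroma.
// The freed 8 bit channel can then hold something else.
struct SubsampledAlbedo {
    float luma;     // Y, stored for every pixel
    float chroma;   // Cr on even pixels, Cb on odd pixels
};

static void rgbToYCrCb(const float rgb[3], float& y, float& cr, float& cb) {
    // BT.601-style weights; the +0.5 chroma offset is omitted for simplicity.
    y  =  0.299f  * rgb[0] + 0.587f  * rgb[1] + 0.114f  * rgb[2];
    cr =  0.5f    * rgb[0] - 0.4187f * rgb[1] - 0.0813f * rgb[2];
    cb = -0.1687f * rgb[0] - 0.3313f * rgb[1] + 0.5f    * rgb[2];
}

SubsampledAlbedo storeAlbedo(int x, const float rgb[3]) {
    float y, cr, cb;
    rgbToYCrCb(rgb, y, cr, cb);
    SubsampledAlbedo out;
    out.luma = y;
    out.chroma = (x & 1) == 0 ? cr : cb;   // every other pixel stores Cr, the rest Cb
    return out;
}

// On read, the missing chroma component is taken (or averaged) from the
// horizontal neighbors, which store the opposite component.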

In our case (Trials Evolution & Trials Fusion) we use a 10-10-10-2 format (instead of ARGB8) for the second render target. This gives 10+10 bits for the normal vector (four times the precision compared to 8 bit). We encode our normals with Lambert azimuthal equal-area projection. It costs only a few ALU instructions to encode and decode. Roughness is stored in the third component with 10 bits of precision (we use 2^x roughness for our physically based formula, so all the extra precision is welcome). The remaining 2 bit channel is used to store the selected lighting formula (four different formulas are supported).
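
For reference, the commonly used form of that encoding looks roughly like this (a sketch of the standard spheremap / Lambert azimuthal equal-area transform for view-space normals; not necessarily our exact code):

#include <cmath>

// Encode a unit view-space normal into two [0,1] values.
// Singular only at n = (0,0,-1), i.e. a normal pointing exactly away from the camera.
void encodeNormal(float nx, float ny, float nz, float& ex, float& ey) {
    float f = std::sqrt(8.0f * nz + 8.0f);
    ex = nx / f + 0.5f;
    ey = ny / f + 0.5f;
}

// Decode back to a unit normal.
void decodeNormal(float ex, float ey, float& nx, float& ny, float& nz) {
    float fx = ex * 4.0f - 2.0f;
    float fy = ey * 4.0f - 2.0f;
    float f  = fx * fx + fy * fy;
    float g  = std::sqrt(1.0f - f / 4.0f);
    nx = fx * g;
    ny = fy * g;
    nz = 1.0f - f / 2.0f;
}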

If you use traditional (pixel shader based) deferred rendering, you also need to have a (HDR) lighting buffer in memory at the same time as the g-buffer, as the lighting shader reads the g-buffer and writes to the lighting buffer. This consumes an extra 8 bytes per pixel (ARGB16F). However, with modern compute shader based lighting you can do the lighting "in-place", meaning that you first read the g-buffer into GPU local memory (LDS), do the lighting there, and output the result on top of the existing g-buffer. This way you can do pretty robust deferred rendering with just 16 bytes per pixel (eight 8 bit channels, three 10 bit channels, one 2 bit channel, 24 bit depth, 8 bit stencil).
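
Conceptually, the in-place lighting pass looks something like this CPU-side sketch (on the GPU the tile array lives in LDS and each pixel maps to one compute shader thread; all names are illustrative and the shading/packing functions are stand-ins):

#include <cstdint>
#include <vector>

struct Pixel64 { uint32_t rt0; uint32_t rt1; };   // 8 bytes of packed G-buffer data
struct HDRColor { float r, g, b; };

// Stand-in shading: a real renderer would unpack the material, loop over the
// lights binned to this tile and evaluate the BRDF.
static HDRColor shadePixel(const Pixel64& p) {
    HDRColor c;
    c.r = float((p.rt0      ) & 0xFF) / 255.0f;
    c.g = float((p.rt0 >>  8) & 0xFF) / 255.0f;
    c.b = float((p.rt0 >> 16) & 0xFF) / 255.0f;
    return c;
}

// Stand-in packing: a real renderer would write FP16 values into the same
// 64 bits per pixel that the G-buffer occupied.
static Pixel64 packLighting(const HDRColor& c) {
    auto q = [](float v) { return uint32_t(v < 0.0f ? 0.0f : (v > 1.0f ? 255.0f : v * 255.0f)); };
    Pixel64 out;
    out.rt0 = q(c.r) | (q(c.g) << 8) | (q(c.b) << 16);
    out.rt1 = 0;
    return out;
}

// "In-place" lighting: each tile is read once, lit in local memory, and written
// back over the same buffer, so no separate ARGB16F lighting buffer is needed.
// Assumes width and height are multiples of TILE for brevity.
void lightInPlace(std::vector<Pixel64>& gbuffer, int width, int height) {
    const int TILE = 16;
    Pixel64 tile[TILE * TILE];                       // stands in for LDS
    for (int ty = 0; ty < height; ty += TILE)
        for (int tx = 0; tx < width; tx += TILE) {
            for (int y = 0; y < TILE; ++y)           // load the tile
                for (int x = 0; x < TILE; ++x)
                    tile[y * TILE + x] = gbuffer[(ty + y) * width + (tx + x)];
            for (int i = 0; i < TILE * TILE; ++i)    // light it locally
                tile[i] = packLighting(shadePixel(tile[i]));
            for (int y = 0; y < TILE; ++y)           // store on top of the G-buffer
                for (int x = 0; x < TILE; ++x)
                    gbuffer[(ty + y) * width + (tx + x)] = tile[y * TILE + x];
        }
}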
 
Continued: Obviously on modern GPUs (full rate 64 bpp ROPs) you pack the two ARGB8 render targets into a single ARGB16 render target (to double the fill rate). This also makes it easier to write the output 16 bit float values on top of the existing buffer (no need to split the float values into 8 bit upper and lower parts).
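
To illustrate the difference (a simplified sketch that ignores rounding and NaN/denormal handling):

#include <cstdint>
#include <cstring>

// Simplified float -> IEEE 754 half conversion (truncates the mantissa,
// flushes tiny values to zero, clamps large ones to infinity).
static uint16_t floatToHalf(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t  exp  = int32_t((bits >> 23) & 0xFFu) - 127 + 15;
    uint32_t mant = (bits >> 13) & 0x3FFu;
    if (exp <= 0)  return uint16_t(sign);              // too small: zero
    if (exp >= 31) return uint16_t(sign | 0x7C00u);    // too large: infinity
    return uint16_t(sign | (uint32_t(exp) << 10) | mant);
}

// With a single ARGB16 target, the lighting result goes straight into one channel:
void writeToArgb16Channel(uint16_t* channel, float value) {
    *channel = floatToHalf(value);
}

// With two ARGB8 targets, the same value has to be split into lower/upper bytes
// across two 8 bit channels (and reassembled whenever it is read back):
void writeToTwoArgb8Channels(uint8_t* lo, uint8_t* hi, float value) {
    uint16_t h = floatToHalf(value);
    *lo = uint8_t(h & 0xFFu);
    *hi = uint8_t(h >> 8);
}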

There also exist new deferred rendering techniques that do not store the color/material data in the g-buffer at all. By using a technique like this, you can fit even an 8xMSAA g-buffer into a very tight space. It's going to be interesting to see what kind of innovations all the big studios have come up with when the second wave of next gen games is launched. During the last generation we got a lot of new innovations: HDR lighting (including tone mapping and color grading), deferred shading (many variations), cascaded shadow mapping (and SDSM), variance based shadow filtering, post process antialiasing (many variations), physically based rendering, specular antialiasing (Toksvig and LEAN/CLEAN mapping) and many other new innovations that were almost universally adopted by all studios.
 
SIGGRAPH FTW! :D
 
This way you can do pretty robust deferred rendering with just 16 bytes per pixel (eight 8 bit channels, three 10 bit channels, one 2 bit channel, 24 bit depth, 8 bit stencil).

I'll just mention, at least on AMD hardware a 24bit depth buffer does not share space with stencil - stencil is always a separate buffer. So always use 32bpp depth.
 
If I may ask, sebbbi (don't want you breaking any NDA, so feel free to deny a reply, just say you cannot reply ;) ):
What was it then that made you go for 900p (initially 800p) for the Xbox One version of Trials Fusion when your target was 1080p for both consoles? With 16 bytes per pixel couldn't you achieve 1080p? What problem did you encounter? And do you believe DX 12 can help with it?

And thank you for all the valuable info.
 
I'll just mention, at least on AMD hardware a 24bit depth buffer does not share space with stencil - stencil is always a separate buffer. So always use 32bpp depth.
Also true on Intel FWIW. NVIDIA is the only IHV I know that actually packs the data, and I'm not even sure if they still do (probably have both options). APIs are a bit behind the times on this issue.
 
What was it then that made you go for 900p (initially 800p) for the Xbox One version of Trials Fusion when your target was 1080p for both consoles? With 16 bytes per pixel couldn't you achieve 1080p? What problem did you encounter? And do you believe DX 12 can help with it?

This should explain most of it... http://www.eurogamer.net/articles/digitalfoundry-2014-trials-fusion-tech-interview
 
Question is:
Ryse is 900p 30 fps (barely)... limited by the amount of ESRAM.
Unfortunately the use of CUs for compute shaders or other GPGPU processing requires fast RAM, and the Xbox One has limited availability of it, since the ESRAM is already taken up by the current graphics framebuffer.

As such, with the use of GPGPU, ESRAM availability may become a bigger problem in the future, and I see this observation from Projekt Red as almost a fact. Tiling is of course the solution, but with current development costs, how many will use it?

I might be wrong here, but didn't Crytek state that they went for 900p on Ryse not because of the ESRAM size, but due to available bandwidth? I think something they did in the lighting computation was really bandwidth hungry and at 1080p it would reach impractical numbers, so after evaluating the game at 1080p without that lighting and at 900p with it, they decided 900p was the best choice.
 
Also true on Intel FWIW. NVIDIA is the only IHV I know that actually packs the data, and I'm not even sure if they still do (probably have both options). APIs are a bit behind the times on this issue.
Yeah, APIs are not yet up to date. But the situation after DX 10 is slightly better, since it added at least minimal support for floating point depth. Unfortunately the PC DirectX API still doesn't have any support for 24 bit floating point depth (for hardware that supports it). D24FS8 was a very nice format on Xbox 360. Inverse 24 bit float depth is very close to 32 bit integer depth in quality, and 32 bit inverse float depth is even better (very close to logarithmic 32 bit depth = the best possible distribution). It's enough for huge depth distances (as long as you need... unless you need planetary scale with tiny details and huge zooms).

DXGI_FORMAT_D32_FLOAT_S8X24_UINT is 32 bit depth + 8 bit stencil on most hardware. So it doesn't actually take 64 bits per pixel (unless the hardware packs the depth and stencil together). This is slightly awkward to use on PC, because you can't separate the depth from the stencil for texturing (since the API hides the separate depth + stencil implementation, and pretends that both are interleaved together).
I'll just mention, at least on AMD hardware a 24bit depth buffer does not share space with stencil - stencil is always a separate buffer. So always use 32bpp depth.
Yes, definitely. Always use 32 bit float depth (and modify your projection matrix to store inverse depth values). Much much better than 24 bit integer depth. All our levels are made from small objects, and our level designers love to place the objects so that they intersect each other and the terrain. 24 bit integer depth looks pretty bad (as intersection seams tend to slightly wobble).
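
For reference, the usual reversed-Z setup just swaps the roles of the near and far planes in the projection matrix, so the near plane maps to depth 1 and the far plane to depth 0 (a generic sketch; row/column and handedness conventions differ between engines):

#include <cmath>

// Reversed-Z perspective projection (D3D-style clip space, z in [0,1]),
// row-vector convention: near plane maps to depth 1, far plane to depth 0.
void perspectiveReversedZ(float out[4][4], float fovY, float aspect,
                          float zn, float zf) {
    float yScale = 1.0f / std::tan(fovY * 0.5f);
    float xScale = yScale / aspect;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            out[i][j] = 0.0f;
    out[0][0] = xScale;
    out[1][1] = yScale;
    out[2][2] = zn / (zn - zf);        // regular Z would use zf / (zf - zn)
    out[3][2] = zf * zn / (zf - zn);   // regular Z would use -zn * zf / (zf - zn)
    out[2][3] = 1.0f;
    // Remember to also flip the depth test to GREATER and clear depth to 0.
}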

Unfortunately on the OpenGL side, the depth values are mapped to the [-1,1] range and this makes 32 bit float depth unusable (the inverse depth trick doesn't work, and straight 32 bit float depth is worse than 32 bit integer depth). So if you need to port your game to Mac or Linux, you might run into some problems.
What was it then that made you go for 900p (initially 800p) for the Xbox One version of Trials Fusion when your target was 1080p for both consoles? With 16 bytes per pixel couldn't you achieve 1080p? What problem did you encounter? And do you believe DX 12 can help with it?
We didn't target 1080p. We targeted locked 60 fps (meaning that the game runs most of the time at 70-80 fps and dips down to 60 fps when heavily stressed). During most of our development time, we were 720p on both next gen platforms. Stable 60 fps was very important for our level designers during the development (the game is basically a physics based reaction game). Like I said in my Digital Foundry interview, at the end of the project we upgraded the resolutions as we finalized our engine optimizations. We were very happy to achieve 900p on Xbox One and 1080p on PS4. We didn't need to do any shader quality trade-offs on either platform.

We weren't at 16 bytes per pixel, because Trials Fusion was a cross generation game and we had to support Xbox 360 and DirectX 10 PCs as well, so we still used pixel shader based lighting (tiled deferred lighting, but with CPU based vectorized light binning instead of GPU based compute shader binning). Basically the tiling algorithm was directly ported from Xbox 360 to PC and the other platforms. So we had a separate (HDR) lighting buffer, and that used some extra memory. We also had a 1 byte per pixel stencil buffer.
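
For context, that kind of CPU light binning roughly means computing, for every screen tile, the list of lights whose projected bounds overlap the tile, so the lighting shader only loops over that list (a simplified scalar sketch; the real version is vectorized and also culls against per-tile depth bounds):

#include <algorithm>
#include <vector>

struct ScreenRect { int x0, y0, x1, y1; };   // light bounds projected to screen pixels

// For every tileSize x tileSize screen tile, collect the indices of the lights
// whose projected bounds overlap that tile.
std::vector<std::vector<int>> binLights(const std::vector<ScreenRect>& lights,
                                        int width, int height, int tileSize) {
    int tilesX = (width  + tileSize - 1) / tileSize;
    int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(tilesX * tilesY);
    for (int i = 0; i < (int)lights.size(); ++i) {
        const ScreenRect& r = lights[i];
        int tx0 = std::max(0, r.x0 / tileSize);
        int ty0 = std::max(0, r.y0 / tileSize);
        int tx1 = std::min(tilesX - 1, r.x1 / tileSize);
        int ty1 = std::min(tilesY - 1, r.y1 / tileSize);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(i);   // this light affects this tile
    }
    return bins;
}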
 
So, if you don't mind, what was the main aspect of the system that prevented you from reaching 1080p?
Was it the GPU, memory bandwidth, memory size (of the fast memory), or just some unpredictable slowdowns on the X1 (at least some games seem to have those unpredictable slowdowns)?
 
You'll need to infer the challenges he faced in making Trials Fusion from his second paragraph.
 
There also exist new deferred rendering techniques that do not store the color/material data in the g-buffer at all. By using a technique like this, you can fit even an 8xMSAA g-buffer into a very tight space.

I believe that users who follow your posts already imagine what that technique might be and who is probably going to be implementing it, *wink.
 
So, if you don't mind, what was the main aspect of the system that prevented you from reaching 1080p?
Was it the GPU, memory bandwidth, memory size (of the fast memory), or just some unpredictable slowdowns on the X1 (at least some games seem to have those unpredictable slowdowns)?
The main issue was time (isn't it always :)).

We had a rather small team creating the game simultaneously for four platforms, two of them completely unknown to us (next gen). Optimizing the game to run at constant 60 fps on Xbox 360 took a considerable slice of the optimization resources, as we wanted to have the same gameplay experience on it. PC was also tricky for us, because the code base was designed for locked 60 fps (no frame skipping code) and hard coded for a six thread CPU. On PC, the GPU resource management was the biggest pain related to performance optimization. Our tech streams basically everything (meshes, terrain, vegetation, world, virtual texturing, etc.) from the HDD. PC DirectX (11 and below) hasn't been designed for this kind of use (GPU manufacturers recommend loading everything to GPU memory during loading screens to avoid hiccups). PC discrete GPUs have their own memory pools and the drivers try to do their best to provide seamless data transfers (and memory defragmentation). It's hard to create (multithreaded) streaming code that works without any stalls on all the different PC hardware.

Fortunately DirectX 12 will provide lower level manual resource management and a better resource binding model. This is a godsend for games that do heavy data streaming and have lots of dynamic content.

If you look at other next gen "launch window" games, there aren't many games achieving both locked 60 fps and 1080p. Even the biggest first party titles designed for a single console, such as Killzone Shadow Fall, had to rely on tricks such as rendering every other scanline (aka "the STALKER method": http://forum.beyond3d.com/showthread.php?t=49802) to achieve 1080p at 60 fps on PS4. I think we succeeded quite well considering the cross platform nature of the game and our heavy focus on user created content (all levels done using in-game tools -> we cannot offline bake any lighting or shadows).
 
The main issue was time (isn't it always :)).
Fortunately DirectX 12 will provide lower level manual resource management and a better resource binding model. This is a godsend for games that do heavy data streaming and have lots of dynamic content.

Sebbbi

Of course DX 12 will be an improvement. But in comparison to what's expected in DX 12, how limited is the current Xbox One DX 11.2 altered low level API?
And how does the PS4 API behave in comparison? Is it really as close to the metal as people claim it is?

Sorry for trying to squeeze as much info out of you as we can ;)

PS: As always, don't go breaking any NDA, so feel free not to answer.
 
The main issue was time (isn't it always :)).

I think we all assume resources are finite and fixed; the question is probably about hardware bottlenecks. If I asked why the PS3 version was not 1080p, you would probably have an answer besides time. ;) Thanks for the info.
 
and hard coded for a six thread CPU.

Interesting. Was that driven by the 6 logical threads of Xenon, the 6 physical threads of the new consoles, or all of the above? And how does that translate to the PS3? Can those threads be run on the SPEs just like on a standard CPU core?
 
Interesting. Was that driven by the 6 logical threads of Xenon, the 6 physical threads of the new consoles, or all of the above? And how does that translate to the PS3? Can those threads be run on the SPEs just like on a standard CPU core?

Sebbbi already wrote that it was indeed driven by the 6 logical threads of the X360, and that it was rather easy to port this 6-thread engine to a 6-physical-core engine on both new consoles, with some optimizations for some parts.

I think their previous old gen only game never had a PS3 version.

So, really, the X360 engine helped them develop on the PS4! I assume that was the case for many other developers who took their 6-thread X360 code to help them on both new consoles.
 
Ah, my bad, I missed that. I wonder what changes they had to make for 4-thread PC CPUs then, and whether we're likely to see significantly better performance on 6+ thread CPUs in the future on account of most future games being optimised for that number of (albeit significantly weaker) x86 threads.
 
sebbbi said:
Unfortunately on the OpenGL side, the depth values are mapped to the [-1,1] range and this makes 32 bit float depth unusable
Heh... OpenGL 4.5 fixed this problem yesterday (ARB_clip_control). It took some time, but better late than never :)
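
For reference, assuming a GL 4.5 context and a loader such as glad, switching to the D3D-style [0,1] depth range is a single call:

#include <glad/glad.h>   // assumes a glad-generated loader; any GL 4.5 loader works

void enableZeroToOneDepth() {
    // ARB_clip_control / GL 4.5: keep the lower-left window origin, but map
    // clip-space depth to [0,1] instead of [-1,1], which makes the reversed-Z
    // 32 bit float depth trick usable in OpenGL as well.
    glClipControl(GL_LOWER_LEFT, GL_ZERO_TO_ONE);
}
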
Of course DX 12 will be an improvement. But in comparison to what's expected in DX 12, how limited is the current Xbox One DX 11.2 altered low level API?
And how does the PS4 API behave in comparison? Is it really as close to the metal as people claim it is?
I was talking about the PC situation (discrete GPU + driver managed memory handling). GPU resource management has never been a problem on any console platform I have worked on.

For the answer to your second question, I suggest reading the Chris Norden (SCEA) interview at Digital Foundry (especially the chapter: Low-level access and the "wrapper" graphics API):
http://www.eurogamer.net/articles/digitalfoundry-inside-playstation-4
 