The pros and cons of eDRAM/ESRAM in next-gen

Then the rest of the games are all forward rendering.
Okay, but where's your data from? Is it assumed that if a dev doesn't say 'we're rendering deferred' then they are using a forward renderer, or is that info available from somewhere (e.g. an insider source, or profiling)?
 
Okay, but where's your data from? Is it assumed that if a dev doesn't say 'we're rendering deferred' then they are using a forward renderer, or is that info available from somewhere (e.g. an insider source, or profiling)?

Pretty much that: if deferred rendering were the norm, the devs wouldn't go "look, we are doing deferred with hundreds of lights".
 
Okay, but where's your data from? Is it assumed that if a dev doesn't say 'we're rendering deferred' then they are using a forward renderer, or is that info available from somewhere (e.g. an insider source, or profiling)?

Some developers have technical presentations on their games (e.g. Gears of War, Battlefield 3, Just Cause 2). And it was my mistake: deferred lighting and deferred shading are two different things. If you're not using deferred and not using a forward renderer, are you using software rendering?
 
Pretty much that: if deferred rendering were the norm, the devs wouldn't go "look, we are doing deferred with hundreds of lights".
But most devs don't even talk about their engines. You're only hearing from a few choice AAA developers on their decisions. Unity has a toggle for forward/deferred rendering; it's as easy as a checkbox setting.

I wouldn't be so quick to claim deferred isn't popular if the source is only hearsay on gaming websites. I've heard enough talk over recent years about the XB360 having to deal with deferred framebuffers (lack of 'free' MSAA) to consider deferred, or at least light pre-pass, the de facto standard these days. Maybe it isn't, but we need real data, not a few high-profile news stories.
 
Deferred renderers are *very* common for next-gen titles, especially those in development. With the major middleware providers all using deferred rendering, games using forward rendering are very likely to be the minority from this point on (even considering games using Forward+/Tiled Forward/whatever you want to call it).

Back when we were in the prototype stage we were using a deferred renderer, with a tiled compute-based approach similar to what Frostbite uses. At the time we had a G-Buffer setup like this:

Lighting target: RGBA16f
Normals: RG16
Diffuse albedo + BRDF ID: RGBA8
Specular albedo + roughness: RGBA8
Tangents: RG16
Depth: D32

So if you're looking to target 1920x1080 with that setup, then you're talking about (8 + 4 + 4 + 4 + 4 + 4) * 1920 * 1080 = 55.3MB. On top of that we supported 16 shadow-casting lights which required 16 1024x1024 shadow maps in an array, plus 4 2048x2048 cascades for a directional light. That gives you 64MB of shadow maps + another 64MB of cascade maps, which you'll want to be reading from at the same time you're reading from your G-Buffers. Obviously some of these numbers are pretty extreme (we were still prototyping) and you could certainly reduce that a lot, but I wanted to give an idea of the upper bound on what an engine might want to be putting in ESRAM for their main render pass. However, even without the shadows, it doesn't really bode well for fitting all of your G-Buffers in 32MB at 1080p, which means either decreasing resolution, or making some tough choices about which render targets (or which portions of render targets, if using tiled rendering) should live in ESRAM. Any kind of MSAA at 1080p also seems like a no-go for fitting in ESRAM, even for forward rendering. Just having an RGBA16f target + D32 depth buffer at 2xMSAA requires around 47.5MB at 1920x1080.
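(For reference, a quick C++ sketch of the arithmetic above so the numbers are easy to reproduce; this is my own illustration using the layout and resolutions MJP lists, not his actual code, and it assumes 32-bit depth per shadow texel, which matches the 64MB figures.)

Code:
#include <cstdio>

// Rough render-target footprint calculator. bytesPerPixel is the sum of
// the per-pixel sizes of the targets involved; msaa multiplies per-pixel
// storage, since every sample is stored.
static double targetMB(int width, int height, int bytesPerPixel, int msaa = 1)
{
    return double(width) * height * bytesPerPixel * msaa / (1024.0 * 1024.0);
}

int main()
{
    // G-Buffer: RGBA16f (8) + RG16 (4) + RGBA8 (4) + RGBA8 (4) +
    // RG16 (4) + D32 (4) = 28 bytes per pixel.
    printf("G-Buffer @1080p:   %.1f MB\n", targetMB(1920, 1080, 28));     // ~55.3 MB
    printf("16 shadow maps:    %.1f MB\n", 16 * targetMB(1024, 1024, 4)); // 64 MB
    printf("4 cascades:        %.1f MB\n", 4 * targetMB(2048, 2048, 4));  // 64 MB
    // Forward rendering with 2xMSAA: RGBA16f (8) + D32 (4).
    printf("2xMSAA fwd @1080p: %.1f MB\n", targetMB(1920, 1080, 12, 2));  // ~47.5 MB
    return 0;
}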
 
I personally make a distinction between full deferred (Killzone) and deferred lighting (Halo, Unity). As I mentioned earlier, I see deferred lighting being used more often than full deferred, but I don't disagree that more devs are using this technique.
 
Lighting target: RGBA16f
Normals: RG16
Diffuse albedo + BRDF ID: RGBA8
Specular albedo + roughness: RGBA8
Tangents: RG16
Depth: D32

6 RTs lol. Even if they could fit, I think you'd use up virtually all of the ESRAM's bandwidth (if not more). But how much of a performance hit are we expecting when placing 2 or 3 of the RTs in the main X1 RAM?
 
I personally make a distinction between full deferred (Killzone) and deferred lighting (Halo, Unity). As I mentioned earlier, I see deferred lighting being used more often than full deferred, but I don't disagree that more devs are using this technique.

Regardless, there's still a space issue when considering things aside from the main scene render targets...

On top of that we supported 16 shadow-casting lights which required 16 1024x1024 shadow maps in an array, plus 4 2048x2048 cascades for a directional light. That gives you 64MB of shadow maps + another 64MB of cascade maps, which you'll want to be reading from at the same time you're reading from your G-Buffers

Any thoughts on that tiled resource demo involving shadowmaps?
 
I personally make a distinction between full deferred (Killzone) and deferred lighting (Halo, Unity). As I mentioned earlier, I see deferred lighting being used more often than full deferred, but I don't disagree that more devs are using this technique.
Okay, but the context is about fitting buffers in ESRAM. Whether it's fully deferred or deferred lighting, once the buffers exceed the 32 MB of ESRAM you hit the same problem. Half your ESRAM is taken up by the colour render target at 1080p (an RGBA16f target is ~15.8 MB).
 
Okay, but the context is about fitting buffers in ESRAM. Whether it's fully deferred or deferred lighting, once the buffers exceed the 32 MB of ESRAM you hit the same problem. Half your ESRAM is taken up by the colour render target at 1080p (an RGBA16f target is ~15.8 MB).

What if I told you that my question is not really a question, and I was just too lazy to come up with the examples/numbers?
 
But how much of a performance hit are we expecting when placing 2 or 3 of the RTs in the main X1 RAM?

I couldn't really tell you, I've never actually worked with the XB1 hardware (I only work on PS4).

Any thoughts on that tiled resource demo involving shadowmaps?

Well, from what I recall that demo was more aimed at saving performance by caching a large "virtual" shadow map and then selectively pulling in certain tiles when needed for rendering (but I could be wrong, it's been a while since I looked at it). I would imagine that either way you would still need a rather large pool of resident shadow map memory, since what's on-screen can easily pull from a lot of shadow map tiles. It would depend a lot on how many shadow-casting lights can be on-screen at once, and what level of texel density you want from the shadow maps.
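(Purely illustrative, and my own numbers rather than anything from that demo: assuming D3D-style tiled resources with 64 KB tiles, which for a 32bpp depth format is 128x128 texels per tile, the resident footprint scales with the number of mapped tiles rather than with the virtual size of the shadow map.)

Code:
#include <cstdio>

// Estimate resident memory for a hypothetical partially-resident shadow
// map. Assumes 64 KB hardware tiles; at 32bpp that's 128x128 texels/tile.
int main()
{
    const int virtualDim  = 16384;                    // 16K x 16K virtual map
    const int tilesPerDim = virtualDim / 128;         // 128 tiles per side
    const int totalTiles  = tilesPerDim * tilesPerDim;

    // Suppose only ~10% of the tiles are needed for what's on screen.
    const int residentTiles = totalTiles / 10;

    printf("Fully resident: %.1f MB\n", totalTiles    * 64.0 / 1024.0); // 1024 MB
    printf("10%% resident:  %.1f MB\n", residentTiles * 64.0 / 1024.0); // ~102 MB
    return 0;
}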

Cool, some real-world numbers. Do those all really need to be done in the same pass?

Probably not, if you bucketed by material type. For instance, tangents were only used by anisotropic materials and hair. But then of course you get into tradeoffs regarding how many times you want to read from the G-Buffer and write to the output target, and I would assume there would also be tradeoffs involved with moving RT data in and out of ESRAM.
 
Is it because you are used to a tiled system that you cannot recycle the same RT for all the shadowmaps?
And aren't hybrid deferred engines, à la Fox Engine's or Destiny's, where they build a full fat G-buffer but still render lights to a separate light buffer and shade later, a good fit for the One? You can render your whole G-buffer in a single pass, but keep only depth, normals and shadowmaps in ESRAM for lighting, then forget about all that and keep only the lighting buffer, bringing albedo, specular and roughness back for the final shading pass. (Rough numbers on that idea below.)
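(To put rough numbers on that scheme, which is only this poster's hypothetical: the format choices here are my own assumptions, i.e. D32 depth, a 32-bit packed normal target, an RGBA16f light buffer and two RGBA8 material targets at 1080p.)

Code:
#include <cstdio>

// Rough ESRAM budget for the hybrid scheme suggested above
// (hypothetical formats and numbers, 1920x1080).
static double mb(int bytesPerPixel)
{
    return 1920.0 * 1080.0 * bytesPerPixel / (1024.0 * 1024.0);
}

int main()
{
    const double esram = 32.0;
    double lighting = mb(4) + mb(4);          // depth + normals: ~15.8 MB
    printf("Lighting phase: %.1f MB used, %.1f MB left for shadow maps\n",
           lighting, esram - lighting);
    double shading = mb(8) + mb(4) + mb(4);   // light buffer + albedo + spec
    printf("Shading phase:  %.1f MB used, %.1f MB left\n",
           shading, esram - shading);
    return 0;
}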
 
I looked at the Chipworks die shot of the Xbox One chip, and the 32MB of eSRAM is actually two 16MB eSRAM macros/blocks, if I'm correct. The Xbox One CPU also has an L3 cache/eSRAM sitting between the two CPU modules, so I wonder whether that L3 cache/eSRAM is directly accessible to the GPU, and how big it is...

In case it is accessible, could it be possible to off-load some information there?
 
sebbbi for president!

The lighting step doesn't need an extra 8 MB for the 4x16f RT, because a compute shader can simultaneously read and write to the same resource, allowing you to do lighting "in-place", writing the output over the existing g-buffer. This is also very cache friendly, since the read pulls the cache lines into L1 and the write thus never misses L1 (GCN has fully featured read & write caches).

Seems obvious now you mention it, but I'd not thought about this ...

This layout is highly efficient for both g-buffer rendering and lighting. And of course also for post processing, since all your heavy data fits in the fast memory. Shadow maps obviously need to be sampled from main memory during the lighting, but this is actually a great idea, since the lighting pass wouldn't otherwise use any main memory BW at all (it would be completely unused = wasted).

And again, it's about making good use of the split BW.

How good have MS been about communicating this kind of information to developers? Or is it something that any developer worth their salt should already be well aware of and have implemented from their first game onwards?
 
How good have MS been about communicating this kind of information to developers? Or is it something that any developer worth their salt should already be well aware of and have implemented from their first game onwards?
That early on, be it for the PS4 or the XB1, and considering the potential market size, I would think developers may not have devoted much energy to customizing things for both systems. The "straightforwardness" of the PS4 may partly cover for that.

Now, like everybody, I'm eagerly waiting for sebbbi to share his POV on the matter :)
 
MJP's g-buffer layout is actually only two RTs in the g-buffer rendering stage and one RT in the lighting stage. And a depth buffer of course. Quite normal stuff.

On GCN you want to pack your data into 64 bpp (4 x 16 bit integer) render targets, because that doubles your fill rate compared to using more traditional 32 bpp RTs (GCN can do 64 bit filling at the same ROP rate as 32 bit filling).

I assume that the packing is like this:
Gbuffer1 = normals + tangents (64 bits)
Gbuffer2 = diffuse + brdf + specular + roughness (64 bits)
Depth buffer (32 bits)

Without any modifications this takes 40 megabytes of memory at 1080p (8 + 8 + 4 = 20 bytes per pixel x 1920 x 1080 ≈ 39.6 MB).

The lighting step doesn't need an extra 8 MB for the 4x16f RT, because a compute shader can simultaneously read and write to the same resource, allowing you to do lighting "in-place", writing the output over the existing g-buffer. This is also very cache friendly, since the read pulls the cache lines into L1 and the write thus never misses L1 (GCN has fully featured read & write caches).
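(A minimal plain-C++ sketch of that in-place pattern, my own illustration and not sebbbi's code; on the GPU this would be a compute shader with the g-buffer texture bound as a read/write UAV, one thread per texel.)

Code:
#include <cstdint>
#include <cstdio>
#include <vector>

// Placeholder "lighting": scale the four byte channels in the low 32 bits
// of a packed 64-bit g-buffer texel, leaving the high 32 bits untouched.
// Real lighting would unpack normals/albedo and evaluate the BRDF.
static uint64_t shadeTexel(uint64_t g, float intensity)
{
    uint64_t out = g & 0xFFFFFFFF00000000ull;
    for (int c = 0; c < 4; ++c) {
        uint32_t v = uint32_t((g >> (c * 8)) & 0xFF);
        out |= uint64_t(uint32_t(v * intensity) & 0xFF) << (c * 8);
    }
    return out;
}

// The key point: read and write the *same* buffer, so no separate
// lighting target is ever allocated.
static void lightInPlace(std::vector<uint64_t>& gbuffer)
{
    for (uint64_t& texel : gbuffer)
        texel = shadeTexel(texel, 0.5f); // read, light, overwrite in place
}

int main()
{
    std::vector<uint64_t> gbuffer(1920 * 1080, 0x0123456789ABCDEFull);
    lightInPlace(gbuffer);
    printf("first texel after lighting: 0x%016llX\n",
           (unsigned long long)gbuffer[0]);
    return 0;
}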

It's also trivial to get this layout down from 40 MB to 32 MB. Replace gbuffer1 with a 32 bit RT (32 MB target reached at 1080p). Store the normal as 11+11 bits using a Lambert azimuthal equal-area projection; you can't see any quality difference. 5+5 bits for the tangents is enough (4 bits for exponent = mip level + 1 bit mantissa): 11+11+5+5 = 32. Also, if you only use the tangents for shadow mapping / other planar projections, you don't need them at all, since you can analytically calculate the derivatives from the stored normal vector.
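(And a sketch of that packing, again my own illustration rather than sebbbi's code: the normal encoded to 11+11 bits via a Lambert azimuthal equal-area projection, with 5+5 bits left over for the tangents.)

Code:
#include <cmath>
#include <cstdint>
#include <cstdio>

// Encode a unit normal to two [0,1] values (Lambert azimuthal equal-area
// projection; nz == -1 is the single degenerate direction).
static void encodeNormal(float nx, float ny, float nz, float& u, float& v)
{
    float f = std::sqrt(8.0f * nz + 8.0f);
    u = nx / f + 0.5f;
    v = ny / f + 0.5f;
}

// Decode back to a unit normal.
static void decodeNormal(float u, float v, float& nx, float& ny, float& nz)
{
    float fx = u * 4.0f - 2.0f, fy = v * 4.0f - 2.0f;
    float f  = fx * fx + fy * fy;
    float g  = std::sqrt(1.0f - f / 4.0f);
    nx = fx * g; ny = fy * g; nz = 1.0f - f / 2.0f;
}

// Pack the 11+11 bit normal and two 5-bit tangent fields into 32 bits.
static uint32_t packGBuffer1(float u, float v, uint32_t tanU, uint32_t tanV)
{
    uint32_t nu = uint32_t(u * 2047.0f + 0.5f) & 0x7FF;
    uint32_t nv = uint32_t(v * 2047.0f + 0.5f) & 0x7FF;
    return nu | (nv << 11) | ((tanU & 0x1F) << 22) | ((tanV & 0x1F) << 27);
}

int main()
{
    float u, v, nx, ny, nz;
    encodeNormal(0.0f, 0.6f, 0.8f, u, v);
    decodeNormal(u, v, nx, ny, nz); // round-trips to (0, 0.6, 0.8)
    printf("packed = 0x%08X, decoded = (%.3f, %.3f, %.3f)\n",
           packGBuffer1(u, v, 0, 0), nx, ny, nz);
    return 0;
}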

This layout is highly efficient for both g-buffer rendering and lighting. And of course also for post processing, since all your heavy data fits in the fast memory. Shadow maps obviously need to be sampled from main memory during the lighting, but this is actually a great idea, since the lighting pass wouldn't otherwise use any main memory BW at all (it would be completely unused = wasted).

Thanks!

But what about AA? Is it possible to have a proper AA solution alongside this stuff in eSRAM @1080p?
 
Thanks!

But what about AA? Is it possible to have a proper AA solution alongside this stuff in eSRAM @1080p?

If you mean MSAA, then I would say it's not likely at 1080p. MSAA doesn't really work well with deferred rendering anyway. I'd personally prefer advances in post-AA, like what Ryse is doing.

Correct me if I'm wrong, but 2x MSAA doubles the size of the depth buffer and 4x MSAA quadruples it (at 1080p a D32 buffer goes from ~7.9 MB to ~15.8 MB at 2x and ~31.6 MB at 4x).
 