Xenos/C1 and Deferred Rendering (G-Buffer)

Assuming you have four 64-bit buffers and the shaders are not that complex, the deferred pass shifts the load towards read bandwidth, compared to normal forward rendering with mostly compressed textures.
Sure, but like I said, RSX has to use more BW during that part. When reading the G-buffer for a light, it also has to read the current value to accumulate it and then write the new value to RAM as well. This gets multiplied for multiple lights per pixel (which is likely since you decided on DR instead of FR).

4x64-bit is rather excessive. KZ2 is 4x32 plus Z, AFAIK.

The flaw in your idea is that you cannot transfer out any Z-buffer optimization the hardware has built in. The stencil/Z-buffer without those is not a performance win, as you probably know.
Says who? When you read a Z-buffer from RAM into the EDRAM, the HiZ is updated as well.
 
Says who? When you read a Z-buffer from RAM into the EDRAM, the HiZ is updated as well.

HiZ gets updated automatically only if you output oDepth from the pixel shader, which is excessively slow. There are ways to bring the Z-buffer in as an RGBA texture and then manually update the HiZ, but this is certainly not straightforward.
 
HiZ gets updated automatically only if you output oDepth from the pixel shader, which is excessively slow. There are ways to bring the Z-buffer in as an RGBA texture and then manually update the HiZ, but this is certainly not straightforward.
http://download.microsoft.com/downl...324bad9b1/Xbox 360 GPU Performance Update.zip

It doesn't look that complicated to do it the fast way, and the easy way really isn't that slow anyway (0.35 ms for 720p, and twice that for 2xAA is still only 2% of your frame time at 30fps).
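Back-of-envelope on that, assuming a 32-bit depth/stencil buffer and a 30fps budget (the 0.35 ms figure is from the MS slides; everything else here is my own sketch):

    #include <cstdio>
    int main() {
        // Rough cost of restoring a 720p Z-buffer from RAM into EDRAM.
        // Assumptions (mine): 32-bit depth/stencil, 30fps frame budget.
        const double mb      = 1280.0 * 720.0 * 4.0 / (1024 * 1024); // ~3.5 MB copied
        const double restore = 0.35;           // ms for one 720p restore, per the slides
        const double frame   = 1000.0 / 30.0;  // 33.3 ms per frame at 30fps
        printf("%.1f MB, %.1f%% of frame, %.1f%% with 2xAA\n",
               mb, 100.0 * restore / frame, 100.0 * 2.0 * restore / frame);
        return 0;
    }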
 
Sure, but like I said, RSX has to use more BW during that part. When reading the G-buffer for a light, it also has to read the current value to accumulate it and then write the new value to RAM as well. This gets multiplied for multiple lights per pixel (which is likely since you decided on DR instead of FR).
Blending is cheap compared to transferring 8 times the amount of data for the G-buffer.

4x64-bit is rather excessive. KZ2 is 4x32 plus Z, AFAIK.
At the same time, they say G-buffer + depth + RGBA8 is 36MB for them; with 4x32 it should be just about 15MB + depth + RGBA8.

Says who?
I said that, and it's still true: you cannot write out the optimizations.

When you read a Z-buffer from RAM into the EDRAM, the HiZ is updated as well.
Just with an extra pass, which is extra cost. In the end you can do everything on any hardware, but the cost rises with every emulation: here half a millisecond for updating the HiZ, there for downloading render targets, here some more passes for tiles.

That's why the X360 is not really suitable for deferred. I didn't say it's impossible; you're just wasting time that you could better spend on simple forward rendering.
 
Blending is cheap compared to transferring 8 times the amount of data for the G-buffer.
What math are you using? Your 4x64b example is ridiculous.

KZ2 is 5x32bit for the g-buffer, and we're not going to see much more than that this gen due to the memory it gobbles up. Blending is 2x32-bit. The latter is a very significant savings - a lot more than the paltry 2% frame time for the Z-buffer load you're complaining about.

At the same time, they say G-buffer + depth + RGBA8 is 36MB for them; with 4x32 it should be just about 15MB + depth + RGBA8.
Read the KZ2 presentation again, and don't forget about 2xAA.

Just with an extra pass, which is extra cost.
Read the slides I linked. Not a big deal.

In the end you can do everything on any hardware, but the cost rises with every emulation: here half a millisecond for updating the HiZ, there for downloading render targets, here some more passes for tiles.

That's why the X360 is not really suitable for deferred. I didn't say it's impossible; you're just wasting time that you could better spend on simple forward rendering.
You can't ignore the pros and just look at the cons. 360 will use less BW during the G-buffer creation stage and even more so during the lighting stage. In each pass, geometry processing is substantially faster on 360, and "more passes for tiles" is misleading because you have much less geometry in each pass.

I already said that tiling is the only additional con for deferred vs. forward on 360. That's it, though. All the other cons of deferred rendering waste just as much time on PS3.

So the render target + Z-buffer is like 14MB. That would be even less suitable for the X360, but yeah, it can be done in two passes.
There's no extra cost for more passes here, because there's no geometry except for simple light volumes.
 
I didn't say it's impossible; you're just wasting time that you could better spend on simple forward rendering.

Simple forward rendering doesn't seem to cut it these days. Multipass forward rendering, sure, but that's not any simpler than deferred.
It's possible to fit enough G-buffer information in the EDRAM to have a fully capable deferred renderer.
Take for example the Insomniac approach:
Store just Normal and Spec Power. Then do the light accumulation and store separate diffuse and specular terms. Then in a final pass do the albedo and specular intensity modulation.
So, for the G-buffer you'd need just one 4-channel FP16 render target, and then for the light accumulation you'd need 2 RGBA8 render targets. The EDRAM can fit that for a slightly non-HD resolution of 1200x720. I can certainly live with that.
There are many other ways to set up your deferred renderer, including the light indexing method (LIDR) from another thread on this forum, or straight-up deferred with albedo and 8-bit storage for normals and spec with compression, etc.
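A quick sanity check on the EDRAM math, assuming a 32-bit Z buffer stays resident during both passes (my assumption, not from the Insomniac material):

    #include <cstdio>
    int main() {
        // Does the layout fit in 10MB of EDRAM at 1200x720, no tiling?
        const int pixels  = 1200 * 720;
        const int gbuffer = pixels * (8 + 4);     // 4xFP16 target + 32-bit Z
        const int light   = pixels * (4 + 4 + 4); // 2x RGBA8 targets + 32-bit Z
        const int edram   = 10 * 1024 * 1024;
        printf("g-buffer pass %.2f MB, light pass %.2f MB, EDRAM %.2f MB\n",
               gbuffer / 1048576.0, light / 1048576.0, edram / 1048576.0);
        return 0; // both passes come to ~9.89 MB, just under the limit
    }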
 
What math are you using?
I assumed KZ2's 36MB is 4x64-bit buffers; sorry for not explaining that in detail, I didn't expect anyone couldn't understand that. (Thanks to Thowllly for the hint that it's 2xAA.)

KZ2 is 5x32bit for the g-buffer, and we're not going to see much more than that this gen due to the memory it gobbles up. Blending is 2x32-bit. The latter is a very significant savings - a lot more than the paltry 2% frame time for the Z-buffer load you're complaining about.
It doesn't save anything more than other architectures save; it's just overhead.


You can't ignore the pros and just look at the cons. 360 will use less BW during the G-buffer creation stage and even more so during the lighting stage. In each pass, geometry processing is substantially faster on 360, and "more passes for tiles" is misleading because you have much less geometry in each pass.
It's not free; you have huge overhead processing each pass. Even if you reject geometry, you're still doing all the other setup, which is suboptimal.

I already said that tiling is the only additional con for deferred vs. forward on 360. That's it, though. All the other cons of deferred rendering waste just as much time on PS3.
And all I said is the same: tiling is a big overhead for deferred rendering, not giving you much benefit over normal forward rendering on the X360.

There's no extra cost for more passes here, because there's no geometry except for simple light volumes.
Yes, that's what I'm talking about all the time. There is quite some cost for reloading the Z-buffer at least twice (for deferred and for alpha forward rendering), the G-buffer dump, and other reloading of buffers.
Compared to simple forward rendering, I don't see enough advantage to justify the overhead.
 
Simple forward rendering doesn't seem to cut it these days. Multipass forward rendering, sure, but that's not any simpler than deferred.
That's what I said in the quote: the X360 is very good at simple forward rendering. Even with multiple passes it does quite a good job. It would already have to be under a lot of load not to handle the rendering in 30ms, but in that case you won't do any better with the deferred overhead.
It's possible to fit enough G-buffer information in the EDRAM to have a fully capable deferred renderer.
Take for example the Insomniac approach:
Store just Normal and Spec Power. Then do the light accumulation and store separate diffuse and specular terms. Then in a final pass do the albedo and specular intensity modulation.
So, for the G-buffer you'd need just one 4-channel FP16 render target, and then for the light accumulation you'd need 2 RGBA8 render targets. The EDRAM can fit that for a slightly non-HD resolution of 1200x720. I can certainly live with that.
There are many other ways to set up your deferred renderer, including the light indexing method (LIDR) from another thread on this forum, or straight-up deferred with albedo and 8-bit storage for normals and spec with compression, etc.
Yes, a light pre-pass is nice; it's described in full detail here: http://diaryofagraphicsprogrammer.blogspot.com/

If you lower the resolution, you can of course save the tiling overhead, but 1280x720 @ 2xAA with this approach might hurt you more than the benefits are worth.
 
I assumed KZ2's 36MB is 4x64-bit buffers; sorry for not explaining that in detail, I didn't expect anyone couldn't understand that.
If your assumption is wrong then how can anyone be expected to understand it?

It doesn't save anything more than other architectures save; it's just overhead.
What is that supposed to mean? You're basically saying, "It doesn't save anything but it does". It's not just overhead; it's an additional load equal to 40% of the big G-buffer BW load that you're going on about, per pixel per light.
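To spell out where the 40% comes from, assuming you read all five 32-bit buffers per light:

    G-buffer read per light per pixel:   5 x 32-bit = 20 bytes
    RSX blend in RAM (read dst + write): 2 x 32-bit =  8 bytes
    8 / 20 = 40% extra traffic that EDRAM blending avoids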

It's not free; you have huge overhead processing each pass. Even if you reject geometry, you're still doing all the other setup, which is suboptimal.
You're only duplicating setup work for geometry in objects that overlap both tiles. Everything else is identical to FR.

And all I said is the same: tiling is a big overhead for deferred rendering, not giving you much benefit over normal forward rendering on the X360.
No, I said that tiling is the only disadvantage. You've been arguing about other disadvantages of DR on 360.

Yes, that's what I'm talking about all the time. There is quite some cost for reloading the Z-buffer at least twice (for deferred and for alpha forward rendering), the G-buffer dump, and other reloading of buffers.
This is a silly way to do things. Alpha geometry can be rendered between tiles after all the lights are accumulated, just like people always do with tiled forward rendering. You already have to sort alpha geometry, so flagging tiles is extremely low cost by comparison.
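A minimal sketch of that flagging, with hypothetical names, assuming the alpha draws are already depth-sorted and you know each one's screen bounds:

    #include <vector>

    struct Rect { int x0, y0, x1, y1; };
    struct AlphaDraw { Rect bounds; /* mesh, material, depth key, ... */ };

    // Bucket depth-sorted alpha draws by the screen tiles they touch, so each
    // tile's alpha can be drawn after its lights are accumulated, before the
    // tile is resolved. One rectangle overlap test per draw per tile.
    std::vector<std::vector<const AlphaDraw*>>
    flagAlphaPerTile(const std::vector<AlphaDraw>& sorted,
                     const std::vector<Rect>& tiles) {
        std::vector<std::vector<const AlphaDraw*>> buckets(tiles.size());
        for (const AlphaDraw& d : sorted)
            for (size_t t = 0; t < tiles.size(); ++t)
                if (d.bounds.x0 < tiles[t].x1 && d.bounds.x1 > tiles[t].x0 &&
                    d.bounds.y0 < tiles[t].y1 && d.bounds.y1 > tiles[t].y0)
                    buckets[t].push_back(&d);
        return buckets;
    }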

Compared to simple forward rendering, I don't see enough advantage to justify the overhead.
Comparing DR and FR is not that simple. If you have a lot of local dynamic lights, it doesn't matter if you need more tiling, because DR can still be faster. At the same time, you can have situations where even on RSX, DR will be slower because it needs so much more BW per pixel.

If you lower the resolution, you can of course save the tiling overhead, but 1280x720 @ 2xAA with this approach might hurt you more than the benefits are worth.
You don't get it. Even with forward rendering you need two tiles for 1280x720 @ 2xAA. Barbarian's suggestion has the same tiling requirement.

Basically, this method of DR has no extra tiling overhead. It does require two full passes of geometry, though, even on RSX, but many people do a z-prepass anyway in FR and DR so this isn't really a disadvantage.
 
Basically, this method of DR has no extra tiling overhead. It does require two full passes of geometry, though, even on RSX, but many people do a z-prepass anyway in FR and DR so this isn't really a disadvantage.

Exactly. The point being that Insomniac (a PS3-only developer) chose a set of tradeoffs for their PS3 DR engine that happen to fit in EDRAM without modifications and work perfectly on the Xbox 360, suggesting that the Xbox 360 is not any less suited to deferred rendering.
 
If your assumption is wrong then how can anyone be expected to understand it?
Sorry, I dunno why others understood it and could tell me that I'm wrong, but you did not. I'll try to put it more simply.

What is that supposed to mean? You're basically saying, "It doesn't save anything but it does".
Maybe you don't get it because you read too much into my words. I just said it doesn't save more than on other architectures, but to have it you have additional work. Additional work that you don't need to do on other systems is called overhead.

It's not just overhead; it's an additional load equal to 40% of the big G-buffer BW load that you're going on about, per pixel per light.
Additional work, but it's not overhead, sure ;).



You're only duplicating setup work for geometry in objects that overlap both tiles. Everything else is identical to FR.
If you use tiled rendering, you don't set up two command buffers; you have one buffer that is executed twice. Even if the objects don't overlap all tiles, the command buffer needs to be executed to have the right states all the time. Just the draw calls are skipped, not their setup.

No, I said that tiling is the only disadvantage. You've been arguing about other disadvantages of DR on 360.
No? I didn't say that tiling and dumping EDRAM to main memory is a problem? I don't know what you understood from my first post in this thread, but that was what I tried to say.

This is a silly way to do things. Alpha geometry can be rendered between tiles after all the lights are accumulated, just like people always do with tiled forward rendering. You already have to sort alpha geometry, so flagging tiles is extremely low cost by comparison.
I thought you referred to KZ2's way of doing it, where it's impossible to do object motion blur with alpha-blended stuff already on the shaded geometry. The only way I know is:
render G-buffer
shade
motion blur
alpha
It's only silly if you don't think about it.


Comparing DR and FR is not that simple. If you have a lot of local dynamic lights, it doesn't matter if you need more tiling, because DR can still be faster. At the same time, you can have situations where even on RSX, DR will be slower because it needs so much more BW per pixel.
Having local dynamic lights is the best case for DR; the worst case would be global lights. Handling global lights deferred is a real bandwidth waste, while local lights would just touch the areas that are affected by them. But again, I still think it would run faster without DR on the X360. Assigning x lights and having a shader with dynamic branching, so the number of loops over the lights is dynamic, shouldn't be slower if your geometry is of reasonable size. Rendering the whole scene in one draw call with x lights would of course be maximum overhead.
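Roughly what I mean, sketched as C-style stand-in code rather than real shader code (all names are illustrative):

    #include <cmath>

    struct Vec3 { float x, y, z; };
    struct Light { Vec3 pos, color; float radius; };

    static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    // One shader, variable light count per object: dynamic branching makes the
    // loop trip count a runtime input instead of baking a shader per light
    // combination. C++ stand-in for the pixel shader loop.
    Vec3 shade(Vec3 pos, Vec3 normal, Vec3 albedo,
               const Light* lights, int lightCount) {
        Vec3 result = {0, 0, 0};
        for (int i = 0; i < lightCount; ++i) {            // dynamic branch
            Vec3 toLight = {lights[i].pos.x - pos.x,
                            lights[i].pos.y - pos.y,
                            lights[i].pos.z - pos.z};
            float dist  = std::sqrt(dot(toLight, toLight));
            float ndotl = dist > 0 ? dot(normal, toLight) / dist : 0;
            float atten = 1.0f - dist / lights[i].radius; // simple falloff
            if (ndotl > 0 && atten > 0) {
                float s = ndotl * atten;
                result.x += albedo.x * lights[i].color.x * s;
                result.y += albedo.y * lights[i].color.y * s;
                result.z += albedo.z * lights[i].color.z * s;
            }
        }
        return result;
    }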

You don't get it. Even with forward rendering you need two tiles for 1280x720 @ 2xAA. Barbarian's suggestion has the same tiling requirement.
So why the heck :oops: is he saying
So, for the G-buffer you'd need just one 4-channel FP16 render target, and then for the light accumulation you'd need 2 RGBA8 render targets. The EDRAM can fit that for a slightly non-HD resolution of 1200x720.
You're right, I don't get it. I thought he was talking about "a slightly non-HD resolution" so "the EDRAM can fit" it.


Basically, this method of DR has no extra tiling overhead. It does require two full passes of geometry, though, even on RSX, but many people do a z-prepass anyway in FR and DR so this isn't really a disadvantage.
Just in case you use a lower resolution; but again, "if you lower the resolution, you can of course save the tiling overhead", you still have other overhead due to the Xbox architecture.
And like I explained above, there is a limited number of lights affecting the objects, and they can be handled very well by the shader units. You won't save much doing that deferred, probably less than the overhead you'd add.
I think on RSX it's the other way around: generating all the needed shaders or using dynamic branching is a big hit, while deferred rendering simplifies things by avoiding those problems, and having less overhead from not reloading the buffers x times is supposed to be faster in most cases.
 
Maybe you don't get it because you read too much into my words. I just said it doesn't save more than on other architectures, but to have it you have additional work. Additional work that you don't need to do on other systems is called overhead.
But you're wrong because it does save more - 64 bits per light per pixel. That's a lot!

If you use tiled rendering, you don't set up two command buffers; you have one buffer that is executed twice. Even if the objects don't overlap all tiles, the command buffer needs to be executed to have the right states all the time. Just the draw calls are skipped, not their setup.
If the draw call is skipped, then all vertex shading and triangle setup is skipped. As for renderstates, if you're worried about that in a deferred renderer then you have severely misplaced priorities. I don't even think you could measure that overhead unless you have a really crappy engine.

No? I didn't say that tiling and dumping EDRAM to main memory is a problem? I don't know what you understood from my first post in this thread, but that was what I tried to say.
You're not reading my post. I am not denying that tiling is a cost, nor am I denying that you mentioned it. What I am saying is that I feel it is the only issue (in particular the IQ sacrifice compared to the same number of tiles with forward rendering), whereas you are putting too much emphasis on other issues like external BW, renderstates, z-buffer issues, etc.

I thought you referred to KZ2's way of doing it, where it's impossible to do object motion blur with alpha-blended stuff already on the shaded geometry. The only way I know is:
render G-buffer
shade
motion blur
alpha
It's only silly if you don't think about it.
You're not going to notice a big difference in image quality if you motion blur after blending. First of all, KZ2 is using rather rudimentary motion blur in that polygon edges aren't blurred like Lost Planet's advanced technique. Secondly, their alpha buffer compositing already results in artifacts. Finally, there is very little fine detail in their alpha effects, so incidental blur won't be very detrimental.

Having local dynamic lights is the best case for DR; the worst case would be global lights. Handling global lights deferred is a real bandwidth waste, while local lights would just touch the areas that are affected by them. But again, I still think it would run faster without DR on the X360. Assigning x lights and having a shader with dynamic branching, so the number of loops over the lights is dynamic, shouldn't be slower if your geometry is of reasonable size. Rendering the whole scene in one draw call with x lights would of course be maximum overhead.
I actually agree with you here. Dynamic branching is another factor that helps 360 with variable-light forward rendering, and even though I forgot about that in this thread, I've argued the same thing with nAo and others.

In terms of simplification, though, DR is still king if you don't need material flexibility.

so why the heck:oops: is he saying
you're right, i dont get it, I thought he was talking about "a slightly non-HD resolution" so 'The EDRAM can fit" it.
The point is that with Uncharted-type DR, 1200x720 needs no tiling, and 1200x720@2xAA needs 2 tiles, and both scenarios are the same with FR. Only if you really, really want those 5% more pixels does this form of DR need more tiling than FR. Also, KZ2 uses screen-space normals, so you don't need 4xFP16. 2xFP16 + 1x8-bit is enough for normal + specpower, and you don't need to reduce resolution.
 
But you're wrong because it does save more - 64 bits per light per pixel. That's a lot!
Aren't we talking about reloading Z and Z-optimizations at this point? If not, I'm lost.

If the draw call is skipped, then all vertex shading and triangle setup is skipped. As for renderstates, if you're worried about that in a deferred renderer then you have severely misplaced priorities. I don't even think you could measure that overhead unless you have a really crappy engine.
I thought we were talking about all the tiles that you get when rendering those four 1280x720 2xAA MRTs.
I wouldn't even think about the command buffer for the deferred pass; you'd have a crappy engine if you minded some draw calls of two triangles. There is like zero pipeline overhead, and invisible pixels aren't even processed ;)

You're not reading my post. I am not denying that tiling is a cost, nor am I denying that you mentioned it. What I am saying is that I feel it is the only issue (in particular the IQ sacrifice compared to the same number of tiles with forward rendering), whereas you are putting too much emphasis on other issues like external BW, renderstates, z-buffer issues, etc.
I'm just saying that the cost of all that BW, renderstates, etc. is higher than the benefits, due to the X360's good FR ability. That's all.

You're not going to notice a big difference in image quality if you motion blur after blending. First of all, KZ2 is using rather rudimentary motion blur in that polygon edges aren't blurred like Lost Planet's advanced technique. Secondly, their alpha buffer compositing already results in artifacts. Finally, there is very little fine detail in their alpha effects, so incidental blur won't be very detrimental.
Maybe you should check the Guerrilla paper that you wanted me to check: motion blur is just a post-processing effect, and they don't save any object IDs, so they blur whatever is on screen (but it seems like they have a cutoff, maybe using the depth). I've tried using motion blur after drawing alpha objects (because my laziness hoped it would work that easily), but it was completely fucked, especially if the alpha objects have enough intensity to cover the objects moving behind them; then it looked like the alpha surface had some Predators moving in front of it.

I actually agree with you here. Dynamic branching is another factor that helps 360 with variable-light forward rendering, and even though I forgot about that in this thread, I've argued the same thing with nAo and others.
Like I said, you just read too much into my writing ;). I'm not saying the X360 is bad, just that it's well suited to FR and gets no benefit from DR, and there is barely any point doing more work for no benefit.

In terms of simplification, though, DR is still king if you don't need material flexibility.
Actually, DR is good for simplifying light passes (which, like we both agree, isn't an issue for the X360 shader units anyway), but it's very bad for materials, as you need either some kind of suboptimal general shader emulating all kinds of materials approximately, or an übershader with all possible materials and material IDs in the G-buffer. Additionally, you have a very limited set of individual material inputs: no diffuse/specular color, no anisotropy, no subsurface scattering with e.g. a skin layer, or special shading for plants.
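e.g. the übershader path ends up looking something like this (a C-style stand-in with illustrative names, not any shipped engine's code):

    #include <cmath>

    enum MaterialId { MAT_DEFAULT, MAT_SKIN, MAT_FOLIAGE };

    // Ubershader-style deferred lighting: the lighting pass no longer knows
    // which object wrote a pixel, so a material ID stored in the G-buffer has
    // to select an approximate shading model per pixel, per light.
    float shadePixel(MaterialId id, float ndotl, float ndoth, float specPower) {
        switch (id) {
        case MAT_SKIN: {   // cheap wrap-lighting term standing in for subsurface
            float wrap = (ndotl + 0.5f) / 1.5f;
            return wrap > 0.0f ? wrap : 0.0f;
        }
        case MAT_FOLIAGE:  // two-sided diffuse approximation
            return std::fabs(ndotl);
        default:           // generic diffuse + Blinn-Phong specular
            return std::fmax(ndotl, 0.0f)
                 + std::pow(std::fmax(ndoth, 0.0f), specPower);
        }
    }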


The point is that with Uncharted-type DR, 1200x720 needs no tiling, and 1200x720@2xAA needs 2 tiles, and both scenarios are the same with FR. Only if you really, really want those 5% more pixels does this form of DR need more tiling than FR. Also, KZ2 uses screen-space normals, so you don't need 4xFP16. 2xFP16 + 1x8-bit is enough for normal + specpower, and you don't need to reduce resolution.
Yeah, that's a possible workaround for one issue with DR on the X360. You're right. (BTW, you don't need the 1x8-bit for normals ;) )
 
Yeah, that's a possible workaround for one issue with DR on the X360. You're right. (BTW, you don't need the 1x8-bit for normals ;) )

If you're referring to the fact that you can store just 2 components of the normal because the third can be reconstructed - this is only partially correct. First, the reconstruction is relatively expensive, and second, you still need the correct sign EVEN in view space (interpolated vertex normals and normal maps basically throw any assumptions about the normal out the window).
 
Aren't we talking about reloading Z and Z-optimizations at this point? If not, I'm lost.
Nope, I'm talking about what I emphasized at the beginning of these posts:
http://forum.beyond3d.com/showpost.php?p=1168313&postcount=41
http://forum.beyond3d.com/showpost.php?p=1169671&postcount=47

Your replies on this matter eventually devolved into discussions on overhead. First, you agreed with me that BW is not the limiting factor in G-buffer creation (since ROP speed is), but then you said it is a big factor in the lighting part. However, 360 saves substantial BW here.

I thought we were talking about all the tiles that you get when rendering those four 1280x720 2xAA MRTs.
I am. Command buffer overhead really shouldn't be a problem unless you have a terrible engine. Pixel shader states aren't going to vary much compared to a FR, as every pixel has to output similar basic G-buffer properties.

I'm just saying that the cost of all that BW, renderstates, etc. is higher than the benefits, due to the X360's good FR ability. That's all.
Now that you put it that way and are considering dynamic branching too, I say that this is very possible. This is quite different from your original posts on the subject, though, where you were talking about why PS3 is better suited to DR.

Maybe you should check the Guerrilla paper that you wanted me to check: motion blur is just a post-processing effect, and they don't save any object IDs, so they blur whatever is on screen (but it seems like they have a cutoff, maybe using the depth).
I know, and that's why I said it's rudimentary.
I've tried using motion blur after drawing alpha objects (because my laziness hoped it would work that easily), but it was completely fucked, especially if the alpha objects have enough intensity to cover the objects moving behind them; then it looked like the alpha surface had some Predators moving in front of it.
Well, it depends on a lot of things, like I said. You probably didn't have a quarter resolution alpha buffer composited on top with artifacts and filled with soft alpha particles.

Anyway, I've said again and again that Z copying is cheap, and proved it with data from that MS presentation.

Yeah, that's a possible workaround for one issue with DR on the X360. You're right. (BTW, you don't need the 1x8-bit for normals ;) )
The 1x8-bit was for specular power, as per Barbarian's original post on the technique.
 
If you're referring to the fact that you can store just 2 components of the normal because the third can be reconstructed - this is only partially correct. First, the reconstruction is relatively expensive, and second, you still need the correct sign EVEN in view space (interpolated vertex normals and normal maps basically throw any assumptions about the normal out the window).
Well, KZ2 is doing it, so that's why I threw it out there. As for the sign, you can just clamp the normal's view direction component before renormalization. It shouldn't be pointing that way anyway, so image quality won't be negatively affected.

Alternatively, you could use that specpower channel to store the sign of the world-space normal's z component.
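A quick sketch of that reconstruction (Vec3 and the calling convention are just mine for illustration):

    #include <cmath>

    struct Vec3 { float x, y, z; };

    // Rebuild a view-space normal from the two stored FP16 components,
    // clamping z to the camera-facing hemisphere as described above. The sign
    // for the rare back-facing case could be stashed in the spec-power channel
    // instead, passed in here as zSign.
    Vec3 decodeNormal(float nx, float ny, float zSign /* +1, or -1 if stored */) {
        float z2 = 1.0f - nx * nx - ny * ny;
        float nz = zSign * std::sqrt(z2 > 0.0f ? z2 : 0.0f); // clamp, then...
        float len = std::sqrt(nx * nx + ny * ny + nz * nz);  // ...renormalize
        return len > 0.0f ? Vec3{nx / len, ny / len, nz / len}
                          : Vec3{0.0f, 0.0f, 1.0f};
    }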
 
Here's an offline normal compression scheme that compresses normals to just 17 bits with essentially no visible quality loss: Java3D normal compression

It turns the restoration process into a lookup into a 2000-element table, however. If you're willing to suffer a negligible loss of quality, you could probably get this down to a 256- or 512-entry table.
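The general shape of such a scheme, though not the exact Java3D encoding (the table construction and names here are just illustrative):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Vec3 { float x, y, z; };

    // Table-based normal compression in the spirit of the Java3D scheme (not
    // its exact encoding): 3 sign bits pick the octant, the remaining bits
    // index a precomputed table of unit vectors covering the positive octant.
    struct NormalCodec {
        std::vector<Vec3> table;
        explicit NormalCodec(int grid) {        // roughly grid^2 / 2 entries
            for (int i = 0; i < grid; ++i)
                for (int j = 0; j < grid - i; ++j) {
                    float x = (i + 0.5f) / grid, y = (j + 0.5f) / grid;
                    float z = 1.0f - x - y;
                    if (z < 0.0f) continue;
                    float len = std::sqrt(x * x + y * y + z * z);
                    table.push_back({x / len, y / len, z / len});
                }
        }
        uint32_t encode(Vec3 n) const {         // brute-force nearest entry
            uint32_t signs = (n.x < 0) | ((n.y < 0) << 1) | ((n.z < 0) << 2);
            Vec3 a = {std::fabs(n.x), std::fabs(n.y), std::fabs(n.z)};
            uint32_t best = 0; float bestDot = -1.0f;
            for (size_t k = 0; k < table.size(); ++k) {
                float d = a.x*table[k].x + a.y*table[k].y + a.z*table[k].z;
                if (d > bestDot) { bestDot = d; best = (uint32_t)k; }
            }
            return (signs << 16) | best;
        }
        Vec3 decode(uint32_t code) const {      // restoration = one table lookup
            Vec3 n = table[code & 0xFFFF];
            if (code & (1u << 16)) n.x = -n.x;
            if (code & (2u << 16)) n.y = -n.y;
            if (code & (4u << 16)) n.z = -n.z;
            return n;
        }
    };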
 
I'm talking about that :/
Mintmaster said:
Says who? When you read a Z-buffer from RAM into the EDRAM, the HiZ is updated as well.

Your replies on this matter eventually devolved into discussions on overhead. First, you agreed with me that BW is not the limiting factor in G-buffer creation (since ROP speed is), but then you said it is a big factor in the lighting part. However, 360 saves substantial BW here.
And I said it doesn't save more than other architectures do, but it has an overhead reloading the Z-buffer/HiZ.

I am. Command buffer overhead really shouldn't be a problem unless you have a terrible engine. Pixel shader states aren't going to vary much compared to a FR, as every pixel has to output similar basic G-buffer properties.
Sadly, it is a noticeable overhead. It's not just moving your get pointer along the commands; you need to push them down the pipe, and this takes time, especially for shader binaries and constants. The command buffer processor isn't made for high performance, because usually you're not limited at that point. But setting dozens of states for draw calls that are never issued will make your GPU idle, waiting for the next draw call.

Now that you put it that way and are considering dynamic branching too, I say that this is very possible. This is quite different from your original posts on the subject, though, where you were talking about why PS3 is better suited to DR.
I was saying it the other way around: why DR is well suited to the PS3, and why it's not a benefit for the X360 (all the time). But you seem to read a lot more into my text than I actually write.

I know, and that's why I said it's rudimentary.
Well, it depends on a lot of things, like I said. You probably didn't have a quarter resolution alpha buffer composited on top with artifacts and filled with soft alpha particles.
Particles are just part of the equation; there are other 'solid' alpha objects.
But yeah, I had soft particles, and they are kind of lower resolution (there are faster ways than a quarter-res buffer). But that makes no difference if you've got some troopers moving at some distance behind the alpha objects; it will make some kind of 'Predator stealth effect'.

Anyway, I've said again and again that Z copying is cheap, and proved it with data from that MS presentation.
And it doesn't change that it's still overhead. Even if it were just 1-2ms that you waste moving buffers around, it's an overhead you don't have on other machines, and you might not win it back going DR over FR.
 