Xenos/C1 and Deferred Rendering (G-Buffer)

If you're referring to the fact that you can store just 2 components of the normal because the third can be reconstructed - this is only partially correct. First, the reconstruction is relatively expensive, and second, you still need the correct sign EVEN in view space (interpolated vertex normals and normal maps basically throw any assumptions about the normal out the window).
No matter if you're saving the sign bit or not, you still need to reconstruct the normals from view space. Besides that, in theory there shouldn't be any normals facing away from the screen; with bumpmapping it's possible, but clamping the screen-space normal's z to 0 is not noticeable most of the time.
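For illustration, a minimal C++ sketch of the reconstruction in question; the clamp is the "clamp z to 0" compromise mentioned above (plain CPU code, not shader code):
Code:
#include <algorithm>
#include <cmath>
#include <cstdio>

// Reconstruct a view-space normal from its two stored components.
// Z is forced non-negative (facing the camera); back-facing Zs
// produced by normal mapping get clamped to 0, as discussed above.
struct Vec3 { float x, y, z; };

Vec3 unpackNormal(float x, float y) {
    // z^2 = 1 - x^2 - y^2; the max() guards against small negative
    // values caused by quantization of x and y in the G-buffer.
    float z2 = std::max(0.0f, 1.0f - x * x - y * y);
    return { x, y, std::sqrt(z2) };
}

int main() {
    Vec3 n = unpackNormal(0.6f, 0.3f);
    std::printf("n = (%f, %f, %f)\n", n.x, n.y, n.z);
}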
 
If you're referring to the fact that you can store just 2 components of the normal because the third can be reconstructed - this is only partially correct. First, the reconstruction is relatively expensive, and second, you still need the correct sign EVEN in view space (interpolated vertex normals and normal maps basically throw any assumptions about the normal out the window).

As an alternative to storing view-space X and Y of the normal, you can store the theta and phi of the normal as a spherical coordinate. Makes packing and unpacking a little more expensive, though.
 
As an alternative to storing view-space X and Y of the normal, you can store the theta and phi of the normal as a spherical coordinate. Makes packing and unpacking a little more expensive, though.
Both can be implemented with lookup tables (a cube map to pack, and a 2D texture to unpack)
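For reference, a minimal C++ sketch of that spherical encoding, assuming theta/phi are remapped to [0,1]; a real shader would use the lookup tables instead of these transcendentals:
Code:
#include <cmath>
#include <cstdio>

// Pack a unit normal as (theta, phi) and unpack it again.
// theta = atan2(y, x), phi = acos(z); both remapped to [0,1] so
// they could live in two G-buffer channels.
struct Vec3 { float x, y, z; };
const float PI = 3.14159265358979f;

void pack(const Vec3& n, float& u, float& v) {
    u = std::atan2(n.y, n.x) / (2.0f * PI) + 0.5f; // theta -> [0,1]
    v = std::acos(n.z) / PI;                       // phi   -> [0,1]
}

Vec3 unpack(float u, float v) {
    float theta = (u - 0.5f) * 2.0f * PI;
    float phi   = v * PI;
    float s     = std::sin(phi);
    return { s * std::cos(theta), s * std::sin(theta), std::cos(phi) };
}

int main() {
    Vec3 n = { 0.267f, 0.535f, 0.802f };  // roughly unit length
    float u, v;
    pack(n, u, v);
    Vec3 r = unpack(u, v);
    std::printf("(%f, %f, %f)\n", r.x, r.y, r.z);
}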
 
And I said it doesn't save more than other architectures do, but it has an overhead reloading the Z-buffer/HiZ.
Why do you keep saying it "doesn't save" when I keep showing you the opposite? The accumulation buffer needs blending. That's 2x32b of RAM access per sample per light on PS3 and zero on 360.

Z-buffer reloading is 0.17ms per 720p copy. Even for 2xAA and doing it two times per frame, it's only 2% of frame time.
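(A quick back-of-envelope check of that figure, assuming the copy is bound by the 360's ~22.4GB/s main-memory bandwidth - an assumption on my part, not a measured number:)
Code:
#include <cstdio>

// Back-of-envelope check of the 0.17ms claim: resolving a 720p
// 32-bit Z buffer out of EDRAM, assuming the copy is limited by
// the 360's ~22.4 GB/s main-memory bandwidth.
int main() {
    double bytes = 1280.0 * 720.0 * 4.0;        // ~3.5 MB per copy
    double bw    = 22.4e9;                      // bytes per second
    double ms    = bytes / bw * 1000.0;
    std::printf("720p Z copy: %.2f ms\n", ms);  // ~0.16 ms
    std::printf("2xAA, twice per frame: %.2f ms\n", ms * 4.0);
}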

Sadly it is a noticeable overhead.
Well I'll just have to disagree with you here. A good engine will not notice this beyond a fraction of a percent in framerate.

I was talking the other way around: why DR is well suited to PS3, and why it's not a benefit for X360 (all the time). But you seem to read a lot more into my text than I actually write.
Go read this post of yours and tell me what you see. You're saying DR will be faster on PS3 than on 360, plain and simple, due to higher theoretical RAM bandwidth.

Particles are just part of the equation; there are other 'solid' alpha objects.
Okay, fine, you win. 360 needs another Z-copy. Instead of 1% overhead, it's 2%.

And it doesn't change that it's still overhead. Even if it's just 1-2ms that you waste moving buffers, it's an overhead you don't have on other machines and that you might not win back with DR over FR rendering.
You're exaggerating the cost. It's well under 1ms. Other advantages of EDRAM outweigh that cost several times over.
 
Both can be implemented with lookup tables (a cube map to pack, and a 2D texture to unpack)

That's an interesting idea. Angles have better distribution compared to the normal's axis projections. It might be possible to even store the angles as 8bit per channel in the G-buffer. This way the unpack texture would have to be a maximum of 256x256 3-channel FP16 or something more compressed.
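A sketch of how that unpack texture could be generated offline, assuming the two 8-bit channels map linearly to theta and phi (a hypothetical layout, not taken from any shipping engine):
Code:
#include <cmath>
#include <cstdio>
#include <vector>

// Build the 256x256 unpack lookup table: texel (i, j) maps 8-bit
// (theta, phi) back to a unit normal. Stored as plain floats here;
// in practice it would be converted to FP16 or something more
// compressed when uploaded as a texture.
const float PI = 3.14159265358979f;

int main() {
    std::vector<float> lut(256 * 256 * 3);
    for (int j = 0; j < 256; ++j) {           // phi rows
        float phi = (j / 255.0f) * PI;
        for (int i = 0; i < 256; ++i) {       // theta columns
            float theta = (i / 255.0f) * 2.0f * PI - PI;
            float* t = &lut[(j * 256 + i) * 3];
            t[0] = std::sin(phi) * std::cos(theta);
            t[1] = std::sin(phi) * std::sin(theta);
            t[2] = std::cos(phi);
        }
    }
    float* c = &lut[(128 * 256 + 128) * 3];
    std::printf("texel(128,128) = (%f, %f, %f)\n", c[0], c[1], c[2]);
}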
 
Deferred rendering is possible. We use it on Alone in the Dark. A memory grab of a patched version of GTA4 allowed me to capture textures (render targets too) and shaders that show a deferred rendering engine.

10MB of EDRAM only means that we'll have to do more tiles for MRT rendering. More tiles means that geometry will be processed more times, but the X360 handles vertices without any problems. Deferred renderers are almost always fillrate and pixel-processing bound.

The drawbacks of a deferred renderer are:
1. Alphas, which are difficult to integrate nicely with the deferred part
2. MSAA, which is really difficult
* on AITD, I managed to only light twice the samples that are really different
* on GTA4, I don't have the full process yet, but they generate a channel in one MRT with a really nice edge detection and do a final post effect to blur them
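Neither game's exact filter is public, but a generic depth-discontinuity edge detect gives the flavour of that edge channel (a hypothetical illustration only):
Code:
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical sketch of the "edge channel" idea: flag a pixel when
// its depth differs too much from a neighbour's, then only flagged
// pixels get the per-sample lighting / post blur. The real AITD and
// GTA4 filters are not public.
int main() {
    const int W = 8, H = 8;
    std::vector<float> depth(W * H, 1.0f);
    depth[3 * W + 4] = 0.2f;                 // one foreground pixel
    std::vector<unsigned char> edge(W * H, 0);

    const float threshold = 0.1f;
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x) {
            float d = depth[y * W + x];
            // compare against the 4 direct neighbours
            if (std::fabs(d - depth[y * W + x - 1]) > threshold ||
                std::fabs(d - depth[y * W + x + 1]) > threshold ||
                std::fabs(d - depth[(y - 1) * W + x]) > threshold ||
                std::fabs(d - depth[(y + 1) * W + x]) > threshold)
                edge[y * W + x] = 255;
        }

    for (int y = 0; y < H; ++y, std::printf("\n"))
        for (int x = 0; x < W; ++x)
            std::printf("%c", edge[y * W + x] ? '#' : '.');
}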
 
Also, a comparison between PS3 and X360 regarding deferred:

Fillrates:
PS3: RSX can write 20GB/s
X360: a theoretical 256GB/s; it's 8 parallel paths of 32GB/s each, which are used when you do MRT or MSAA. If you only use one RT without MSAA, you use just 1/8 of the EDRAM write bandwidth, so deferred rendering's MRTs put it to good use.

Additive lighting part:
PS3 alpha blend takes 1.8 times longer than X360, since the EDRAM implements free alpha blending (setting ALPHABLEND_ENABLE to false tells the hardware to do an alpha blend of (ONE, ZERO)). Lights use a heavy pixel shader, so it's not really a big issue here...

PS3: to achieve good fillrate, you have to use RTs in tiled regions. But texture reading is better from swizzled textures :(

X360: a 1010102F render target to do HDR in 32 bits; PS3 only has 16161616F, so PS3 loses more bandwidth on that too :(

Shadowmaps: the consoles don't support rendering to an R16F target, but X360 allows resolving from R32F to R16F. So reading shadows in the light pass is less texture-limited on X360.

ZCull and SCull are less efficient than X360's HiZ and HiStencil, so we shade more useless pixels :(
 
Deferred rendering is possible. We use it on Alone in the Dark. A memory grab of a patched version of GTA4 allowed me to capture textures (render targets too) and shaders that show a deferred rendering engine.

10MB of EDRAM only means that we'll have to do more tiles for MRT rendering. More tiles means that geometry will be processed more times, but the X360 handles vertices without any problems. Deferred renderers are almost always fillrate and pixel-processing bound.

The drawbacks of a deferred renderer are:
1. Alphas, which are difficult to integrate nicely with the deferred part
2. MSAA, which is really difficult
* on AITD, I managed to only light twice the samples that are really different
* on GTA4, I don't have the full process yet, but they generate a channel in one MRT with a really nice edge detection and do a final post effect to blur them

It is very nice to hear this perspective.
Is the type of DR (Alone in the Dark, GTA4) the same as Killzone 2's?
What do you think about Killzone 2's engine? Why does it look so good?
 
Killzone 2 is not here yet. And they don't talk about their resolution, and they gloss over the AA quickly too. So wait and see.

Their motion vectors raise the RT count to 5, so fillrate and texture fetches in the light pass will increase. And we know that fragment processing on PS3 is about 15-20% more costly than on 360 (I measured it myself with several PIX and GCMReplay captures of dedicated test scenes).


For your other question, DR is always a similar process: output the needed data, and use it in the lighting pass.

For example, we kept some lightmaps, GTA doesn't, but we both have a "preshadowmap" (only stored at the vertex level in GTA); it's a precomputed factor to modulate the lights (it can be seen as an ambient occlusion factor).

We have a prelight RT where cubemaps, emissive, lightmaps and some other things end up. GTA stores vector data to do a global reflection step as a single post process (they use a low-res city with forward rendering to create the spherical reflection map).
 
Why do you keep saying it "doesn't save" when I keep showing you the opposite?
So tell me, what does the Z optimization save on Xbox 360 over other architectures, e.g. PS3/RSX? (Cause that's all I say all the time and you say it's wrong. I never doubted that Z optimizations save something on X360; I just say they don't save more than on other architectures, but they require the reloading.)

The accumulation buffer needs blending. That's 2x32b of RAM access per sample per light on PS3 and zero on 360.
So you have
5*32bit for G-buffer reading + 0 for framebuffer on X360 at ~20GB/s
and
5*32bit for G-buffer reading + 2*32bit r/w for framebuffer on PS3 at ~40GB/s.
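Taking those numbers at face value, the per-light traffic for one full-screen 720p light works out as below (whether those bandwidth figures are achievable is exactly what we're arguing about):
Code:
#include <cstdio>

// Per-light bandwidth accounting for the figures above: 5x32-bit
// G-buffer reads per sample on both machines, plus a 2x32-bit
// framebuffer read/write for blending on PS3 only. The ~20 and
// ~40 GB/s totals are the numbers under debate, not measurements.
int main() {
    const double samples = 1280.0 * 720.0;  // one full-screen light
    double x360 = samples * 5 * 4;          // bytes per light
    double ps3  = samples * (5 + 2) * 4;
    std::printf("X360: %.1f MB/light -> %.2f ms at 20 GB/s\n",
                x360 / 1e6, x360 / 20e9 * 1000.0);
    std::printf("PS3 : %.1f MB/light -> %.2f ms at 40 GB/s\n",
                ps3 / 1e6, ps3 / 40e9 * 1000.0);
}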



Z-buffer reloading is 0.17ms per 720p copy. Even for 2xAA and doing it two times per frame, it's only 2% of frame time.
With alpha done separately for correct motion blur like KZ2 has, it's 0.68ms more per frame on X360 to reload it, and probably the same time to save it (correct me if that's wrong). And this is just for Z. Are there any public numbers available for the timing of dumping buffers to main memory? Then we could calculate it for KZ2's kind of rendering, just for the sake of knowing what we're arguing about ;)

Well I'll just have to disagree with you here. A good engine will not notice this beyond a fraction of a percent in framerate.
It's all a fraction of the framerate, but the sum of all these "tiny overheads" costs you some ms in the end; time you don't want to waste if you can get the same with FR on X360.




Go read this post of yours and tell me what you see. You're saying DR will be faster on PS3 than on 360, plain and simple, due to higher theoretical RAM bandwidth.
Sorry :(, but maybe you should read more carefully what I write instead of reading things into my words that I didn't say. I will just copy it.
you won't save much doing that deferred; probably less than the overhead you'd add.
I think on RSX it's the other way around: generating all the needed shaders or using dynamic branching is a big hit, while deferred rendering simplifies things by avoiding these problems, and with less overhead from not reloading the buffers x times, it's supposed to be faster in most cases.
So again:
DR on X360 will probably be slower than FR on 360.
DR on PS3 will probably be faster than FR on PS3.
(Talking about more complex lighting situations.)

No comparison of whether X360 or PS3 is faster/slower; that's up to the fanboys (which we both aren't ;)).
 
Also, a comparison between PS3 and X360 regarding deferred:

Fillrates:
PS3: RSX can write 20GB/s
X360: a theoretical 256GB/s; it's 8 parallel paths of 32GB/s each, which are used when you do MRT or MSAA. If you only use one RT without MSAA, you use just 1/8 of the EDRAM write bandwidth, so deferred rendering's MRTs put it to good use.

Additive lighting part:
PS3 alpha blend takes 1.8 times longer than X360, since the EDRAM implements free alpha blending (setting ALPHABLEND_ENABLE to false tells the hardware to do an alpha blend of (ONE, ZERO)). Lights use a heavy pixel shader, so it's not really a big issue here...

PS3: to achieve good fillrate, you have to use RTs in tiled regions. But texture reading is better from swizzled textures :(
Thanks for that interesting data :)
Did you split render targets and the G-buffer between VMem and MainMem?

X360: a 1010102F render target to do HDR in 32 bits; PS3 only has 16161616F, so PS3 loses more bandwidth on that too :(
Hey, that sounds bad :( Have you tried to get by with RGBA8 on PS3? I'm just interested in how much performance you lose being on 16161616F.

Shadowmaps: the consoles don't support rendering to an R16F target, but X360 allows resolving from R32F to R16F. So reading shadows in the light pass is less texture-limited on X360.
Why didn't you use D16? Precision issues?

ZCull and SCull are less efficient than X360's HiZ and HiStencil, so we shade more useless pixels :(
Did you do any ZCull optimization pass? Did you use depth bounds?
 
PS3: RSX can write 20GB/s
Colour compression helps though.
PS3 alpha blend takes 1.8 times longer than X360, since the EDRAM implements free alpha blending (setting ALPHABLEND_ENABLE to false tells the hardware to do an alpha blend of (ONE, ZERO)). Lights use a heavy pixel shader, so it's not really a big issue here...
That's why I'm a big fan of DR à la Drake's Fortune + low-res particle effects.
PS3: to achieve good fillrate, you have to use RTs in tiled regions. But texture reading is better from swizzled textures :(
G-buffer sampling is so regular that I'd be quite surprised to find out that tiled textures are any slower than swizzled textures.
X360: a 1010102F render target to do HDR in 32 bits; PS3 only has 16161616F, so PS3 loses more bandwidth on that too :(
You can always be a bit more creative at storing your HDR colours ;)
Shadowmaps: the consoles don't support rendering to an R16F target, but X360 allows resolving from R32F to R16F. So reading shadows in the light pass is less texture-limited on X360.
Maybe you want to double-check your documentation...
ZCull and SCull are less efficient than X360's HiZ and HiStencil, so we shade more useless pixels :(
ZCull reload for the win....
 
So tell me, what does the Z optimization save on Xbox 360 over other architectures, e.g. PS3/RSX? (Cause that's all I say all the time and you say it's wrong. I never doubted that Z optimizations save something on X360; I just say they don't save more than on other architectures, but they require the reloading.)
I wasn't talking about Z. I was talking about blending, and was very clear about it. This is from your own post:
Mintmaster said:
First, you agreed with me that BW is not the limiting factor in G-buffer creation (since ROP speed is), but then you said it is a big factor in the lighting part. However, 360 saves substantial BW here.
And I said it doesn't save more than other architectures do, but it has an overhead reloading the Z-buffer/HiZ.
It's very clear. Now...
So you have
5*32bit for G-buffer reading + 0 for framebuffer on X360 at ~20GB/s
and
5*32bit for G-buffer reading + 2*32bit r/w for framebuffer on PS3 at ~40GB/s.
I'd like to see you achieve 40GB/s read from RSX. Even straight transfer over FlexIO won't get you an extra 20GB/s.

Moreover, how much time do you lose by rendering geometry directly into XDR during G-buffer creation? With that kind of random access writing it's not going to be pretty. I'm quite curious what render-to-XDR speed is.

With alpha done separately for correct motion blur like KZ2 has, it's 0.68ms more per frame on X360 to reload it, and probably the same time to save it (correct me if that's wrong).
You only transfer Z out of eDRAM once, and RSX has to write this too (uncompressed because it needs to be accessed as a texture later). 0.68 ms is the cost of two transfers - one for light accumulation, one for alpha.

Also, now that I think about it, you don't need the first copy. After you render the first tile of the G-buffer, you can keep the Z there and light it, and then move on to the next tile.

So we're back to 0.34 ms.

Are there any public numbers available for the timing of dumping buffers to main memory?
It should be as fast as memory allows. Possibly capped at 16GB/s, if I remember the schematic correctly. The total 30MB G-buffer will take 2 ms. Remember that RSX is also writing that same data to memory, but in a scattered and less efficient manner.

Sorry :(, but maybe you should read more carefully what I write instead of reading things into my words that I didn't say. I will just copy it.
Yeah, you said that later, but you never recanted what you said earlier.
 
Various answers:

Color and Z compression: we don't use MSAA on PS3, so the compression modes don't help in any case. The docs say that sometimes they can help a little, but not in our case.

Mixing local and main memory for MRTs: incompatible with tiling :(

64-bit vs 32-bit render targets: we could do things like RGBE, but it's far from the quality of a real HDR surface. For raw perf between the two, I don't remember the stats offhand.
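For anyone unfamiliar with the RGBE idea, a minimal C++ sketch of a shared-exponent encoding (a generic Radiance-style scheme, not our actual format):
Code:
#include <algorithm>
#include <cmath>
#include <cstdio>

// Generic RGBE: store a shared exponent in the alpha channel so an
// RGBA8 target can hold HDR colours. Blending into this format is
// the hard part, as discussed elsewhere in the thread.
struct RGBE { unsigned char r, g, b, e; };

RGBE encode(float r, float g, float b) {
    float m = std::max(r, std::max(g, b));
    if (m < 1e-6f) return { 0, 0, 0, 0 };
    int ex;
    std::frexp(m, &ex);                     // m = mantissa * 2^ex
    float f = std::ldexp(255.0f, -ex);      // scale = 255 / 2^ex
    return { (unsigned char)(r * f), (unsigned char)(g * f),
             (unsigned char)(b * f), (unsigned char)(ex + 128) };
}

void decode(const RGBE& c, float& r, float& g, float& b) {
    if (c.e == 0) { r = g = b = 0.0f; return; }
    float f = std::ldexp(1.0f, (int)c.e - 128) / 255.0f;
    r = c.r * f; g = c.g * f; b = c.b * f;
}

int main() {
    float r, g, b;
    decode(encode(3.5f, 0.25f, 0.01f), r, g, b);
    std::printf("decoded: %f %f %f\n", r, g, b);
}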

ZCull reload: of course we do a Z prepass + ZCull reload to optimize Zfar (and destroy Znear, but early accept isn't really useful).

One really helpful friend on PS3 is conditional rendering: counting pixels during the front-to-back drawing of the Z pass and using an index for each primitive to store the result (thanks to 2.20 for bringing indices from 2048 up to 1<<20).

Half-resolution particles: a big help, and we use it for fire effects since that was a big feature on Alone. Here again, rebuilding ZCull is 0.4ms on PS3, and less than 0.08ms on X360 (outputting point sprites of 4x4 size to initialize HiZ).

X360 unified shader pipeline: DR is great because it uses lightweight shaders for filling the MRTs, and I can give few GPRs to the vertex shader to get a real speedup on pixels (AITD's balance is 32/96).

Tiled surface sampling in the lighting pass: for a directional light it's not a big issue, but when you have to light a spot or point light with weird priming on stencil (like alpha-tested foliage), it can put more pressure on texture fetch than you think.

D16 isn't precise enough for our needs.

And finally, one black mark for DR on X360, or I'll be flagged as an X360 fanboy: the cost of the resolve operation, which can be several ms since we have to resolve each fragment unmerged.
 
I'd like to see you achieve 40GB/s read from RSX. Even straight transfer over FlexIO won't get you an extra 20GB/s.

Even if it's lower, there's also the CPU, which can consume a large amount of bandwidth out of that 20GB/s on 360, whereas on PS3 there is zero sharing within one pool (+ the bandwidth of the additional pool).

So the bandwidth argument cannot be neglected here.
 
Mixing local and main memory for MRTs: incompatible with tiling :(
What exactly are you talking about? Memory tiling of the rendertarget? If so, are texture fetches slow without it?

64-bit vs 32-bit render targets: we could do things like RGBE, but it's far from the quality of a real HDR surface. For raw perf between the two, I don't remember the stats offhand.
The biggest problem with RGBE, I think, is dealing with blending, which is needed for light accumulation in DR. Uncharted divides the screen into tiles and determines which lights affect which tiles, and then runs a shader with N(i,j) lights on the (i,j)th tile. It's a bit complicated to manage and not as efficient in saving pixels as stencil volume culling, but blending to FP16 isn't cheap either.
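A sketch of that tile classification as I understand it from the description (my reconstruction, not Naughty Dog code; the tile size and light bounds are made up):
Code:
#include <algorithm>
#include <cstdio>
#include <vector>

// Bin each light's screen-space bounding rect into fixed-size tiles;
// the (i,j)th tile is then shaded once with its own list of N(i,j)
// lights instead of blending one pass per light.
struct Rect { int x0, y0, x1, y1; };   // screen-space light bounds

int main() {
    const int W = 1280, H = 720, TILE = 64;
    const int TX = (W + TILE - 1) / TILE, TY = (H + TILE - 1) / TILE;
    std::vector<std::vector<int>> tileLights(TX * TY);

    Rect lights[] = { { 100, 100, 300, 260 }, { 900, 500, 1200, 700 } };
    for (int li = 0; li < 2; ++li) {
        const Rect& r = lights[li];
        for (int ty = r.y0 / TILE; ty <= std::min(r.y1 / TILE, TY - 1); ++ty)
            for (int tx = r.x0 / TILE; tx <= std::min(r.x1 / TILE, TX - 1); ++tx)
                tileLights[ty * TX + tx].push_back(li);
    }

    // Each non-empty tile would get one shader pass over its lights.
    for (int t = 0; t < TX * TY; ++t)
        if (!tileLights[t].empty())
            std::printf("tile (%d,%d): %zu light(s)\n",
                        t % TX, t / TX, tileLights[t].size());
}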

One really helpful friend on PS3 is conditional rendering: counting pixels during the front-to-back drawing of the Z pass and using an index for each primitive to store the result (thanks to 2.20 for bringing indices from 2048 up to 1<<20).
Could you explain this in more detail? It sounds interesting, but it's hard for me to tell what you're doing here.
 
What exactly are you talking about? Memory tiling of the rendertarget? If so, are texture fetches slow without it?
Slower, yes. There are several ways to set up a surface: linear, swizzled and tiled. Tiled memory is for efficient writing; swizzled is for efficient reading; and linear is for when you can't do the other two ^^ So on PS3 you can never be optimal for every case. You have to make a choice.

The biggest problem with RGBE, I think, is dealing with blending, which is needed for light accumulation in DR.
I never said that it's easy ^^ But this is an option against the cost of FP16 blending.

Uncharted divides the screen into tiles and determines which lights affect which tiles, and then runs a shader with N(i,j) lights on the (i,j)th tile. It's a bit complicated to manage and not as efficient in saving pixels as stencil volume culling, but blending to FP16 isn't cheap either.
I never saw a white paper about the UDF rendering pipe! Where did you find it?

Could you explain this in more detail? It sounds interesting, but it's hard for me to tell what you're doing here.
*Snip* : I've removed this as you're talking about RSX which looks like NDA violation to me. As a precaution for your protection, I've removed the content just in case Sony ninjas see it as reason to target you. If you're happy to have what you posted up, and believe it's not going to adversely affect your employment prospects, let me know and I'll paste it back in. ;)
 
Could you explain this in more detail? It sounds interesting, but it's hard for me to tell what you're doing here.
This stuff is supported in OpenGL by NVIDIA through an extension.
Basically, the number of pixels generated by a certain rendering command can be used to conditionally skip another rendering command.
If you have a Z pre-pass, you can conditionally skip all those rendering commands that have not generated any pixels in the Z pre-pass.
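Roughly what that looks like through the NVIDIA extension (GL_NV_conditional_render; the equivalent calls later went core in OpenGL 3.0). A shape sketch that assumes a current GL context, a query from glGenQueries, and stand-in draw callbacks:
Code:
#include <GL/glew.h>

// Z pre-pass: wrap the depth-only draw in an occlusion query so the
// GPU counts how many samples the draw actually produced.
void zPassWithQuery(GLuint query, void (*drawDepthOnly)(void)) {
    glBeginQuery(GL_SAMPLES_PASSED, query);
    drawDepthOnly();
    glEndQuery(GL_SAMPLES_PASSED);
}

// Main pass: the expensive shaded draw is skipped entirely when the
// query recorded zero samples in the Z pre-pass.
void shadedPassConditional(GLuint query, void (*drawShaded)(void)) {
    glBeginConditionalRenderNV(query, GL_QUERY_WAIT_NV);
    drawShaded();
    glEndConditionalRenderNV();
}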
 
This stuff is supported in OpenGL by NVIDIA through an extension.
Basically, the number of pixels generated by a certain rendering command can be used to conditionally skip another rendering command.
If you have a Z pre-pass, you can conditionally skip all those rendering commands that have not generated any pixels in the Z pre-pass.
Oh, neat. So basically every draw call in the Z-pass is an occlusion query with a different ID#, and it still writes Z. In later passes, every draw call is conditionally rendered by referring to the appropriate ID#.

It's like predicated rendering on 360 based on Z-buffer visibility as opposed to tile visibility. I wonder if Microsoft can add this ability to predicated rendering transparently.
 
Oh, neat. So basically every draw call in the Z-pass is an occlusion query with a different ID#, and it still writes Z. In later passes, every draw call is conditionally rendered by referring to the appropriate ID#.

It's like predicated rendering on 360 based on Z-buffer visibility as opposed to tile visibility. I wonder if Microsoft can add this ability to predicated rendering transparently.

Predicated tiling is just a way to read the command buffer several times. The automatic rejection of primitives on untouched tiles uses a similar process to the PS3's.

This process is screen extents. It lets the GPU build three pairs of values describing (MinX,MaxX), (MinY,MaxY) and (MinZ,MaxZ). If a draw doesn't touch a dimension or is outside the viewport, Min > Max.

What the graphics libs do for you is register one screen extent per draw call and write the result somewhere. Then, after the first pass (tile 0, or the one-pass Z pass), they run a callback that patches the predicated operations in the command buffer based on the results.
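A sketch of the predicate test that patching implements: a draw whose recorded extents ended with Min > Max on any axis never touched the pass, so its later draw can be patched out (hypothetical struct, just to show the logic):
Code:
#include <cstdio>

// The GPU records a (min, max) pair per axis for each draw during
// the first pass; Min > Max on any axis means the draw touched
// nothing there and its shaded draw can be skipped.
struct ScreenExtents { int minX, maxX, minY, maxY, minZ, maxZ; };

bool keepDraw(const ScreenExtents& e) {
    return e.minX <= e.maxX && e.minY <= e.maxY && e.minZ <= e.maxZ;
}

int main() {
    ScreenExtents visible  = { 10, 200, 40, 300, 5, 90 };
    ScreenExtents rejected = { 1, 0, 0, 0, 0, 0 };  // minX > maxX
    std::printf("draw 0: %s\n", keepDraw(visible)  ? "keep" : "patch out");
    std::printf("draw 1: %s\n", keepDraw(rejected) ? "keep" : "patch out");
}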

I had to redo this myself (reverse engineering the screen extents commands that were inserted) because when I switched to the low-level API, D3D couldn't do it itself.

Note that HiZ rejection needs a one-pass Z pass, so the maximum resolution for it is 1280x720 with 2xMSAA (limited by the HiZ memory size).

The X360 GPU also implements conditional rendering, but it's limited to 64 indices, uses only HiZ, and needs some draws between the survey and the conditional draw (if there isn't enough time left, the survey is useless).
 