Xenos/C1 and Deferred Rendering (G-Buffer)

Oh, neat. So basically every draw call in the Z-pass is an occlusion query with a different ID#, and it still writes Z. In later passes, every draw call is conditionally rendered by referring to the appropriate ID#.

It's like predicated rendering on 360 based on Z-buffer visibility as opposed to tile visibility. I wonder if Microsoft can add this ability to predicated rendering transparently.
I didn't read more than the overview nAo linked, but it sounds like the predicated rendering that's part of DX10. I'm not sure about the "by region" mode though.
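The ID-per-draw scheme described above can be sketched as a tiny CPU-side simulation (all names here are ours, purely illustrative — the actual console APIs are under NDA): the Z-only pass records a visible-sample count per draw ID, and later passes skip any draw whose counter is zero.

```c
#include <assert.h>

#define MAX_DRAWS 2048  /* per the thread, firmware 2.20 raises this to 1 << 20 */

/* Visible-sample count per draw ID, filled during the Z-only pass. */
static unsigned visible_samples[MAX_DRAWS];

/* Z-pass side: each draw runs as an occlusion query with its own ID and
   records how many samples passed the depth test (a GPU counter in reality). */
static void zpass_record(unsigned draw_id, unsigned samples_passed)
{
    visible_samples[draw_id] = samples_passed;
}

/* Later passes: the same draw is predicated on its Z-pass counter, so
   fully occluded objects are skipped without a CPU round trip. */
static int should_draw(unsigned draw_id)
{
    return visible_samples[draw_id] > 0;
}
```

The GPU-side win is that the predication happens in the command stream, without reading the query result back to the CPU.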
 
I'd like to see you achieve 40GB/s read from RSX. Even a straight transfer over FlexIO won't get you an extra 20GB/s.
You won't get the peak 40GB/s, just as you won't get the 20GB/s on X360. But you can get roughly the same percentage of 40GB/s as you can get out of 20GB/s on X360. (Besides that, there are additional benefits to using both memories.)

Moreover, how much time do you lose by rendering geometry directly into XDR during G-buffer creation? With that kind of random access writing it's not going to be pretty. I'm quite curious what render-to-XDR speed is.
Telling you numbers is against NDA. But it might surprise you to hear that you might even win time if you render to XDR. It depends on what you do and how well you balance the load.

You only transfer Z out of eDRAM once, and RSX has to write this too (uncompressed because it needs to be accessed as a texture later). 0.68 ms is the cost of two transfers - one for light accumulation, one for alpha.
If you render directly to the final memory, you can probably hide a lot of that time behind shader work, for example. It's not an extra pass that always costs the same.

Also, now that I think about it, you don't need the first copy. After you render the first tile of the G-buffer, you can keep the Z there and light it, and then move on to the next tile.
Indeed, this way you have just the overhead of one reload, if you don't do anything that uses neighbouring pixels.

It should be as fast as memory allows. Possibly capped at 16GB/s, if I remember the schematic correctly. The total 30MB G-buffer will take 2 ms. Remember that RSX is also writing that same data to memory, but in a scattered and less efficient manner.
That's why you can use tiles on RSX like galopin mentioned. It will save you those 2ms on the RSX side even with that 'scattered' writing. You have the same issue on eDRAM during writing: it's not about bandwidth, but about the latency of switching addresses, a DRAM issue. If you want to know more about that tiling, look at the homebrew docs for the PSP; there is a nice visualization of it for the Z-buffer.
 
One really helpful friend on PS3 is conditional rendering: counting pixels during the front-to-back drawing of the Z pass and using an index for each primitive to store the result (thanks to 2.20 for bringing the indices from 2048 up to 1<<20).
That's what I'm playing around with currently as well :). Judging by gcmreplay, perfect occlusion culling can save a big part of the rendering (depending on the scene). It also allowed me to implement sync this way, which results in no bandwidth cost (especially vmem reads) for sync.

Half-resolution particles: it's a big help and we use it for fire effects, since it was a big feature on Alone. Here again, rebuilding a zcull is 0.4ms on PS3, and less than 0.8ms on X360 (outputting point sprites of 4x4 size to initialise HiZ)
I wonder why you need an additional reload for that? You can render half-res to the normal rendertarget ;)
I also wonder if that's a typo: 0.8ms on X360 and 0.4ms on PS3? Do you mean 0.08ms on X360? Or is this the cost of dump + resolve to a low-res Z-buffer + reload?

X360 unified shader pipeline: DR is great because it uses lightweight shaders for filling the MRTs, and I can give few GPRs to the vertex shader to give a real speed-up to pixels (AITD's balance is 32/96)

Tiled surface sample in the lighting pass: for a directional light, not a big issue, but when you have to light a spot or point light with weird priming on stencil (like alpha-tested foliage), it can put pressure on texture fetch more than you think
If you did the alpha test in the Z-pass, you shouldn't need it in the shading pass anymore. Sometimes this might lead to artefacts, but in 99.9% of cases it's just a speedup without artefacts.

D16 isn't precise enough for your needs
I'd have expected that in some situations (from what I've seen in AitD), like indoors, D16 might have been OK.


And finally, one black mark for DR on X360, or I'll be flagged as an X360 fanboy: the cost of the resolve operation, which can be several ms since we have to resolve each fragment unmerged.
that's what bothers me with DR on x360 all the time :(.

I wonder if you considered using just FR, and what aspects made you decide for DR. I guess it was more about flexibility and unified way of rendering on both systems, than about speed?
 
Yes, I was meaning 0.08ms. I don't count the Z texture resolve since it's needed for the light pass and some post effects, so we already have it.

I need to do a zcull "reload" from the depth texture onto a half-resolution depth buffer to allow zcull to reject ROPs before shading. When you have a heavy fillrate-bound particle effect, it really saves your life :)
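Building that half-resolution depth can be sketched as a conservative 2x2 downsample (a plain C sketch of the idea under our own naming, not the console-specific path): keeping the *farthest* depth of each quad means a coarse less-than test never rejects a particle that the full-res test would have kept.

```c
#include <assert.h>

/* Conservative 2x2 -> 1 depth downsample: keep the farthest depth of each
 * quad so a coarse less-than zcull test only ever over-accepts, never
 * over-rejects. Names are illustrative, not an SDK API. */
static float max4(float a, float b, float c, float d)
{
    float m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

static void downsample_depth(const float *src, int w, int h, float *dst)
{
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x)
            dst[y * (w / 2) + x] = max4(
                src[(2 * y)     * w + 2 * x],
                src[(2 * y)     * w + 2 * x + 1],
                src[(2 * y + 1) * w + 2 * x],
                src[(2 * y + 1) * w + 2 * x + 1]);
}
```

On the GPU this would be one small fullscreen pass over the resolved depth texture rather than a CPU loop.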

Yes, resolve is a pain in the ass. Deferred on current consoles isn't ready for a 60fps engine, but it is a real solution for 30fps games :)

RGBE+blending: since we store a custom value in the alpha channel, hardware blending can't work anymore. To do additive blending, you must read the framebuffer, unpack, add the new light, pack again and store the fully blended value yourself!
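That manual read-unpack-add-pack loop can be sketched in plain C with a Ward-style RGBE layout (8 bits per channel plus an 8-bit shared exponent); the exact in-game format is not stated in the thread, so treat this as an assumption:

```c
#include <assert.h>
#include <math.h>

/* Pack linear HDR RGB into RGBE (shared 8-bit exponent in the alpha byte). */
static void rgbe_pack(const float rgb[3], unsigned char out[4])
{
    float m = rgb[0];
    if (rgb[1] > m) m = rgb[1];
    if (rgb[2] > m) m = rgb[2];
    if (m < 1e-32f) { out[0] = out[1] = out[2] = out[3] = 0; return; }
    int e;
    float scale = frexpf(m, &e);        /* m = scale * 2^e, scale in [0.5, 1) */
    scale = scale * 256.0f / m;
    out[0] = (unsigned char)(rgb[0] * scale);
    out[1] = (unsigned char)(rgb[1] * scale);
    out[2] = (unsigned char)(rgb[2] * scale);
    out[3] = (unsigned char)(e + 128);
}

static void rgbe_unpack(const unsigned char in[4], float rgb[3])
{
    if (in[3] == 0) { rgb[0] = rgb[1] = rgb[2] = 0.0f; return; }
    float f = ldexpf(1.0f, (int)in[3] - (128 + 8));   /* 2^(e-128) / 256 */
    rgb[0] = in[0] * f;
    rgb[1] = in[1] * f;
    rgb[2] = in[2] * f;
}

/* "Hardware additive blend" done by hand: read, unpack, add, repack, write. */
static void rgbe_add_light(unsigned char pixel[4], const float light[3])
{
    float rgb[3];
    rgbe_unpack(pixel, rgb);
    rgb[0] += light[0]; rgb[1] += light[1]; rgb[2] += light[2];
    rgbe_pack(rgb, pixel);
}
```

In a pixel shader the same three steps run per fragment, which is exactly why the blend unit can't be used: it would add the raw exponent bytes.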

Central Park had a huge number of alpha-tested trees, so we never put the alpha test in the Z-pass. Add to that, I switched to the low-level API late (only that API would allow me to set a simplified shader that just reads the alpha).

I warn all of you: no matter how many RTs are bound, if your shader outputs more than COLOR0, it'll suffer a performance drop just as if the extra RT were there :(

Resolves are evil, but sometimes they're great too. I overuse the ability to overlap a no-AA RT onto a 4xAA one to do cheap downsampling. AITD uses a lot of blurring effects, and sometimes full precision isn't needed, and it's far cheaper than a full resolve plus a shader downsample :)

DR wins against FR for one single reason: some scenes push the on-screen light count to 200 Oo. Artists are mad T_T
 
I need to do a zcull "reload" from the depth texture onto a half-resolution depth buffer to allow zcull to reject ROPs before shading. When you have a heavy fillrate-bound particle effect, it really saves your life :)
That's the 'slow path', but there is a way to do it that gives a 100% correct zcull reload without having to perform any zcull reload.
RGBE+blending: since we store a custom value in the alpha channel, hardware blending can't work anymore. To do additive blending, you must read the framebuffer, unpack, add the new light, pack again and store the fully blended value yourself!
You can easily give up HDR alpha particles and composite your LDR (or fake HDR) transparent stuff in one go while storing your opaque pixels in an exotic render target format.
 
That's the 'slow path', but there is a way to do it that gives a 100% correct zcull reload without having to perform any zcull reload.
It's more a zcull "load", since it never existed at that resolution in the first place. Maybe I can alias it with 4xMSAA, but I don't have time to research this further for the moment! And as far as my current knowledge goes, none of the talks on ps3.scedev.net reference other solutions!

You can easily give up HDR alpha particles and composite your LDR (or fake HDR) transparent stuff in one go while storing your opaque pixels in an exotic render target format.
I didn't talk about particles but about light accumulation! Particles are so bound in every way that keeping HDR for them is nonsense ^^
 
It's more a zcull "load", since it never existed at that resolution in the first place. Maybe I can alias it with 4xMSAA, but I don't have time to research this further for the moment! And as far as my current knowledge goes, none of the talks on ps3.scedev.net reference other solutions!
No MSAA or posts on devnet — use your imagination, Luke! :)

I didn't talk about particles but about light accumulation! Particles are so bound in every way that keeping HDR for them is nonsense ^^
Go deferred as ND did with Drake's Fortune and you won't need to composite multiple lights (in fact they store colours using LogLuv, as we did on HS a couple of years ago ;) )
 
Mintmaster said:
The biggest problem with RGBE, I think, is dealing with blending, which is needed for light accumulation in DR.
Light accumulation being a screen-space operation, you can perform blending in-place through the PS, so I don't think you're losing anything relative to FP16. Obviously you'd still prefer to avoid accumulation in general though.
 
Light accumulation being a screen-space operation, you can perform blending in-place through the PS, so I don't think you're losing anything relative to FP16. Obviously you'd still prefer to avoid accumulation in general though.
Well, you're losing culling ability, particularly when you have 200 on-screen local lights like galopin suggested. You can recover some of it by tiling multi-light tiles like Uncharted, but it's not as good as stencil-marked pixels.
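The tile-based recovery mentioned above can be sketched as simple screen-space binning (a toy layout of our own, not Uncharted's actual scheme): project each light's bounds to a screen rect, then each tile shades only the lights whose rects overlap it.

```c
#include <assert.h>

#define TILE         32   /* tile size in pixels (arbitrary choice here) */
#define MAX_PER_TILE 16

typedef struct { int x0, y0, x1, y1; } Rect;  /* light's screen-space bounds */

typedef struct {
    int count;
    int light[MAX_PER_TILE];
} TileList;

/* Bin lights into tiles by overlapping their screen rects with the tile
 * grid; tiles[] must hold ceil(w/TILE) * ceil(h/TILE) entries.
 * Rects are assumed clamped to the screen already. */
static void bin_lights(const Rect *lights, int nlights,
                       int width, int height, TileList *tiles)
{
    int tx = (width  + TILE - 1) / TILE;
    int ty = (height + TILE - 1) / TILE;
    for (int t = 0; t < tx * ty; ++t) tiles[t].count = 0;

    for (int i = 0; i < nlights; ++i) {
        int bx0 = lights[i].x0 / TILE, bx1 = lights[i].x1 / TILE;
        int by0 = lights[i].y0 / TILE, by1 = lights[i].y1 / TILE;
        if (bx1 >= tx) bx1 = tx - 1;
        if (by1 >= ty) by1 = ty - 1;
        for (int y = by0; y <= by1; ++y)
            for (int x = bx0; x <= bx1; ++x) {
                TileList *t = &tiles[y * tx + x];
                if (t->count < MAX_PER_TILE)
                    t->light[t->count++] = i;
            }
    }
}
```

The shading pass then draws one quad per non-empty tile with its short light list, instead of one stencil-marked fullscreen pass per light.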
 
mixing Local and Main memory for MRTs: Incompatible with tiling :(
are you sure? i think there is an optimization paper with hints about tiles.

Tiled surface sample in the lighting pass: for a directional light, not a big issue, but when you have to light a spot or point light with weird primming on stencil (like alpha test folliage, it can put pressure on texture fetch more than you think)
I wonder why stenciling might have an impact on the texture fetches — could you elaborate on that? Would you recommend not stenciling primed surfaces?
 
are you sure? i think there is an optimization paper with hints about tiles.

Yes. I won't quote the RSX documentation, but I have the line right here saying that MRT is impossible if you mix two locations for the colour buffers. The restriction doesn't seem to affect the possibility of putting the depth buffer in Main while the colour buffers are in Local. It needs a test to be sure, since the documentation is far from accurate :(

I wonder why stenciling might have an impact on the texture fetches — could you elaborate on that? Would you recommend not stenciling primed surfaces?

No, stencil is a real need (S-cull in fact, since the real stencil test is done after shading). On the performance issue, I only have the X360 background showing that texture bandwidth is far from optimal when the stencil primes "random" pixels with lots of small holes between them (this happens because I mark the quads where the fragments really differ, so the second MSAA fragment is lit only when useful)

nAo: you're still talking about UDF DR, but I never saw a paper about their techniques!
 