RSX isn't capable of HDR+FSAA?

Inane_Dork said:
These are just theoretical reasons. Heck, they may not even be possible:

1) You don't want to spend the memory on all the buffers.
As nAo mentioned, you don't have to have separate memory for separate buffer types. Simply allocate the largest type and re-use the memory with various format changes.
Memory aliasing is of course largely a console thing, but it removes this issue almost completely. Especially when you consider you can fit two NAO32 buffers in the same memory space as one FP16 buffer...
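For readers unfamiliar with the format: NAO32 is generally described as a LogLuv-style encoding, packing log2 luminance into 16 bits and CIE (u',v') chroma into 8 bits each, which is why two such buffers fit in the space of one FP16 target. The thread doesn't spell out the exact layout, so the luminance range remapping below is an assumption; a minimal Python sketch:

```python
import math

# Linear RGB -> CIE XYZ (sRGB primaries, D65 white point)
M = [[0.4124, 0.3576, 0.1805],
     [0.2126, 0.7152, 0.0722],
     [0.0193, 0.1192, 0.9505]]

def encode_logluv32(r, g, b):
    """Pack linear HDR RGB into 4 bytes: 16-bit log luminance + 8+8 chroma.
    Assumes a non-black input (the chroma divide needs X+15Y+3Z > 0)."""
    X = M[0][0]*r + M[0][1]*g + M[0][2]*b
    Y = M[1][0]*r + M[1][1]*g + M[1][2]*b
    Z = M[2][0]*r + M[2][1]*g + M[2][2]*b
    denom = X + 15.0*Y + 3.0*Z
    u, v = 4.0*X / denom, 9.0*Y / denom        # CIE 1976 (u', v') chroma
    # log2 luminance remapped to [0, 1]; the [2^-16, 2^16) range is an assumption
    le = (math.log2(Y) + 16.0) / 32.0
    Le = min(max(int(le * 65535.0 + 0.5), 0), 65535)   # 16 bits (two 8-bit channels)
    return Le, int(u * 255.0 + 0.5), int(v * 255.0 + 0.5)

def decode_logluv32(Le, ue, ve):
    """Unpack back to linear RGB."""
    Y = 2.0 ** (Le / 65535.0 * 32.0 - 16.0)
    u, v = ue / 255.0, ve / 255.0
    X = Y * 9.0*u / (4.0*v)
    Z = Y * (12.0 - 3.0*u - 20.0*v) / (4.0*v)
    # XYZ -> linear RGB (inverse of M)
    return ( 3.2406*X - 1.5372*Y - 0.4986*Z,
            -0.9689*X + 1.8758*Y + 0.0415*Z,
             0.0557*X - 0.2040*Y + 1.0570*Z)
```

In a real renderer both directions would of course live in the pixel shader, with the 16-bit luminance split across two 8-bit channels of an RGBA8 target.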
Inane_Dork said:
2) Rendering everything to FP16 buffers is not the biggest bottleneck.
Even if not the bottleneck, it still takes (at least) twice the bandwidth, which on most hardware is a concern (you can never have enough bandwidth). Also, most hardware's ROPs produce FP16 fragments at (at least) half rate, and things like framebuffer compression are often disabled (which is why the bandwidth cost is usually >2x).
 
Fafalada said:
There are actually compelling reasons (unrelated to HDR and/or any specific hardware) to render fillrate-guzzling particle sprites like smoke etc. into offscreen targets and combine them with the opaque stuff in a later pass.
In which case you could also render them in a different HDR space, one that may even support fixed hw blending.
Especially when you consider that most particles rendered into MSAA buffers aren't getting AAed anyway.
MSAA only does polygon edges, most particles are alphablended/tested so you never see any polygon edges...
Kind of defeating the point (and cost) of MSAA
 
nAo said:
This one surprises me. Without blending, FP16 is majorly lamed. If so, there really would be no reason to use it instead of NAO32 for opaque objects.

Yes, you would need to use the ROP blending units for that... but they can't blend FP16.
I was under the impression that the sister chip did the AA resolve upon writing contents to RAM. I guess I was wrong.



DeanoC said:
As nAo mentioned, you don't have to have separate memory for separate buffer types. Simply allocate the largest type and re-use the memory with various format changes.
I understand this somewhat, but wouldn't you need two non-overlapping sections of memory for the NAO32 and the FP16 buffers? If they overlapped, you might write data to the FP16 buffer which overwrites a portion of the NAO32 buffer, which could screw up future pixels.

Even if not the bottleneck, it still takes (at least) twice the bandwidth, which on most hardware is a concern (you can never have enough bandwidth). Also, most hardware's ROPs produce FP16 fragments at (at least) half rate, and things like framebuffer compression are often disabled (which is why the bandwidth cost is usually >2x).
I definitely agree it's a cost worth considering. I was mostly thinking that if you did a Z pre-pass, you'd only write opaque color info once. If you write to NAO32 once and then copy to FP16, why not render that once to FP16 and save all implied resources?

But I assume that's so simple and obvious that it's missing important details. :p
 
DeanoC said:
MSAA only does polygon edges, most particles are alphablended/tested so you never see any polygon edges...
Kind of defeating the point (and cost) of MSAA
And in a lot of soft-particle situations, I expect a quarter-size buffer could be used and upscaled to fit the screen. In fact a quick test shows even a 1/16th resolution image could be used, with a blur filter applied after upscaling if there's any banding going on.

There's lots of scope to mix up methods and resources to optimize each step of the rendering process. Use of a single buffer, single format, is unlikely to be the most efficient solution for any game.
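The low-res particle idea can be sketched in a few lines. This is a minimal illustration rather than any particular engine's code: `blend_over` and `composite_upsampled` are hypothetical names, the upsample is nearest-neighbour for brevity (a real pass would filter, as noted above), and premultiplied alpha is assumed so the low-res pass combines with a single "over" operation.

```python
# Minimal sketch: accumulate particles at reduced resolution, then upsample
# and composite over the full-res frame in one pass.
def blend_over(dst_rgb, dst_a, src_rgb, src_a):
    """Premultiplied-alpha 'over' operator: src over dst."""
    rgb = [s + d * (1.0 - src_a) for s, d in zip(src_rgb, dst_rgb)]
    a = src_a + dst_a * (1.0 - src_a)
    return rgb, a

def composite_upsampled(frame, particles, scale=2):
    """frame: full-res rows of RGB triples; particles: low-res rows of
    (premultiplied RGB, alpha). Nearest-neighbour upsample for brevity."""
    out = []
    for y, row in enumerate(frame):
        new_row = []
        for x, px in enumerate(row):
            p_rgb, p_a = particles[y // scale][x // scale]
            rgb, _ = blend_over(px, 1.0, p_rgb, p_a)
            new_row.append(rgb)
        out.append(new_row)
    return out
```

Because only the low-res buffer eats the heavy particle overdraw, the full-res frame pays just one read-modify-write per pixel in the composite.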
 
Inane_Dork said:
I definitely agree it's a cost worth considering. I was mostly thinking that if you did a Z pre-pass, you'd only write opaque color info once. If you write to NAO32 once and then copy to FP16, why not render that once to FP16 and save all implied resources?
Because the opaque geometry is far and away most prevalent. The savings on that one step could free up lots of BW for other tasks. If you're wanting to get the most out of a machine, saying 'this works and is simple and is only one step' isn't going to achieve that. You want to divide everything into separate processes and focus on optimizing those where possible. Don't lump opaque and transparent rendering into one step if, by dividing it into several (opaque, smoke, foliage), you can tackle each of those optimally and save a hefty load of resources. Plus there's the quality benefit of non-RGB rendering for opaque lighting, where even if all things were equal, it would be worth considering depending on your lighting engine.
 
Shifty Geezer said:
Because the opaque geometry is far and away most prevalent.
I don't believe this at all, or we'd be seeing much higher framerates in current games. ATI and NVidia have produced hardware that truly does get near its peak capability, and some quick calculations show you that simple opaque rendering is a piece of cake for even budget cards. On an X1600XT, for example, each pixel pipe can spend 40 cycles per pixel on a 1MP screen @ 60fps. Divide for overdraw (which should be minimal with rough object sorting) and you still have tons of cycles left. Remember, you can also execute a tex lookup, 3 vec3, and 3 scalar instructions per cycle!
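The "40 cycles per pixel" budget falls straight out of a clock/fillrate calculation; a quick sketch with assumed X1600XT figures (~590 MHz core, 4 pixel output pipes):

```python
# Back-of-envelope check of the "40 cycles per pixel" budget.
# Assumed figures: X1600XT core ~590 MHz, 4 pixel pipes, 1 MP screen, 60 fps.
clock_hz = 590e6
pipes    = 4
pixels   = 1_000_000
fps      = 60

cycles_per_frame = clock_hz / fps          # ~9.8M cycles per frame
pixels_per_pipe  = pixels / pipes          # each pipe covers 250k pixels
cycles_per_pixel = cycles_per_frame / pixels_per_pipe
print(round(cycles_per_pixel, 1))          # ~39.3, i.e. roughly 40 cycles/pixel
```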

Transparent objects can gobble rendering time because there's so much overdraw that can't be eliminated with a Z prepass. Many, if not most, of framerate-dipping special effects like smoke, fire, explosions, etc. require alpha blending.

IMO, "far and away" is rather inaccurate.
 
nAo and DeanoC, most of your points are good, and I used many of the same points back when the X800 series was introduced without FP blending. However, I've lost faith in these arguments because of what I'm seeing with HDR support in current PC games. Maybe the developers are a tad lazy, and don't want to special case these effects. Maybe it has something to do with having the engine support both HDR and non-HDR rendering modes. I just assumed that blending was too essential for practical HDR usage rather than it being one of these other reasons.

I can see how LDR effects like smoke can be blended in a second 8-bit-per-channel buffer and combined afterwards, but that severely constrains your choices. Even FarCry-esque foliage would need to store HDR values for plants in the sun.

A few other points:
DeanoC said:
What I meant is that for opaque geometry there is no good reason to use a HDR RGB colour space except lack of shader instructions. Given you can always convert to another space for alpha, why would you use HDR RGB?
Well, because conversion isn't free. For the whole screen you need a NAO32 read, conversion, and FP16 write on top of the original NAO32 conversion and write. For opaque rendering, real overdraw should be minimal, especially with a z-prepass, so just sticking with FP16 from the beginning will often need less bandwidth. Even with 1.5x real overdraw this will be the case (since 2.5*64 < 2.5*32+32+64). Furthermore, ping-pong with NAO32 should require the same bandwidth as blending with FP16 due to the copy step in the former.
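The inequality above is just bits-per-pixel accounting; the same sum in a few lines (1.5x real overdraw means ~2.5 colour writes per pixel on average, FP16 = 64 bits, NAO32 = 32 bits):

```python
# Bits-per-pixel traffic behind "2.5*64 < 2.5*32 + 32 + 64".
writes_per_pixel = 2.5     # 1.5x real overdraw -> 2.5 layers touched
FP16, NAO32 = 64, 32       # bits per pixel for each format

straight_fp16 = writes_per_pixel * FP16                  # render FP16 directly
via_nao32     = writes_per_pixel * NAO32 + NAO32 + FP16  # render NAO32, then
                                                         # read + convert + write FP16
print(straight_fp16, via_nao32)   # 160.0 176.0 -> direct FP16 moves less data here
```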

The only way I can see you gaining anything is if your scene requirements are such that you don't need a fullscreen FP16 buffer to handle your alpha blending. I'd love a counterexample if you can think of one. :???:

MrWibble said:
In which case, I see no reason you shouldn't resolve your AA buffer (random colour-space or otherwise) down to a normal one in an RGB colour space (but retain HDR) and then render your translucent geometry as usual.
I agree, but you'll still be blending in an HDR buffer, which is the point I was making. AA is another matter.
EDIT: Just remembered one more important thing: Your Z-buffer matches the original AA buffer, and can't be used in the resolved buffer. Transparent objects still need to be occluded by other things in the scene. Therefore this is not an option.


IMO it is AA that tips the scales in favour of NAO32, not bandwidth. Once you take into account the extra copying and conversion, I have a hard time believing you get any bandwidth savings at all over FP16.
 
Inane_Dork said:
I definitely agree it's a cost worth considering. I was mostly thinking that if you did a Z pre-pass, you'd only write opaque color info once. If you write to NAO32 once and then copy to FP16, why not render that once to FP16 and save all implied resources?

But I assume that's so simple and obvious that it's missing important details. :p
Answered my own question (and found that Mintmaster had already posted it). You could AA your NAO32 buffer and resolve it upon copying to the FP16 buffer. The FP16 buffer really doesn't need to be AAed, I think.



Shifty Geezer said:
Because the opaque geometry is far and away most prevalent. The savings on that one step could free up lots of BW for other tasks. If you're wanting to get the most out of a machine, saying 'this works and is simple and is only one step' isn't going to achieve that. You want to divide everything into separate processes and focus on optimizing those where possible. Don't lump opaque and transparent rendering into one step if, by dividing it into several (opaque, smoke, foliage), you can tackle each of those optimally and save a hefty load of resources.
I don't think I suggested anything against that. I was just saying that if you have a Z prepass, your opaque overdraw should be very small. Hence, rendering to NAO32 is a rather unnecessary in-between stage. If you skip to FP16, you eliminate filling the NAO32 buffer and you also eliminate reading from it.

That's my theory, anyway.
 
dukmahsik said:
So we can conclude that, yes, RSX can do a form of HDR and a form of AA at the same time?
Since WarHawk is apparently doing both, it seems as though that is the logical conclusion.
 
PSman said:
Well, I heard that Xenos is capable of doing HDR+FSAA without any hit on performance

It's capable of getting lower performance hits than normal, but that's because they spent part of their transistor budget on EDRAM that could've been used elsewhere. So again, nothing is free.
 
Mintmaster said:
Well, because conversion isn't free. For the whole screen you need a NAO32 read, conversion, and FP16 write on top of the original NAO32 conversion and write. For opaque rendering, real overdraw should be minimal, especially with a z-prepass, so just sticking with FP16 from the beginning will often need less bandwidth.
Even with 1.5x real overdraw this will be the case (since 2.5*64 < 2.5*32+32+64). Furthermore, ping-pong with NAO32 should require the same bandwidth as blending with FP16 due to the copy step in the former.
On paper your computation is correct; in the real world, imho, it's not.
I believe modern GPUs are still more efficient at handling 32-bit colour reads from/writes to memory; moreover, a full-screen pass would be way more efficient at using the available bandwidth than a colour pass. I'm not saying you're wrong, though I'm not sure that inequality holds in the real world; you can't just add bandwidth that way.
We also have to consider that an FP16 colour buffer has to be cleared as well, so you should factor in an additional 4 bytes per pixel of cost. And what about fast clears on FP16 render targets? ;)
IMO it is AA that tips the scales in favour of NAO32, not bandwidth. Once you take into account the extra copying and conversion, I have a hard time believing you get any bandwidth savings at all over FP16.
AA in this case is a big win, no doubt about it. Though I remember the first time I played with 32-bit-per-pixel HDR rendering and no AA on some early dev kit: it was noticeably faster than FP16 (and yep, there was a z-prepass too :) )
 
Mintmaster said:
A few other points:
Well, because conversion isn't free. For the whole screen you need a NAO32 read, conversion, and FP16 write on top of the original NAO32 conversion and write. For opaque rendering, real overdraw should be minimal, especially with a z-prepass, so just sticking with FP16 from the beginning will often need less bandwidth. Even with 1.5x real overdraw this will be the case (since 2.5*64 < 2.5*32+32+64). Furthermore, ping-pong with NAO32 should require the same bandwidth as blending with FP16 due to the copy step in the former.
But it's not just bandwidth that's costing you with FP16... It's that ROP units HATE (with a passion) FP16. Take a 7800GTX: the effective no-blending rate is 16 ARGB8 pixels per clock vs. 8 FP16. That's separate from the extra bandwidth, lack of compression, no fast clears, etc.
If we assume a z-prepass for perfect 1-hit colour writes, the conversion will cost half a framebuffer's worth of writes (as you wrote the opaque buffer at twice the speed, but the FP16 conversion still runs at the halved ROP rate).
Most hardware has a relatively low triangle setup rate, so the z-prepass isn't necessarily a good idea, and then the case for NAO32 gets even stronger... I.e. let's say overdraw is 1.5x: if we normalise the FP16 rate of 8 ROPs to 1, we get FP16 = 1.5 (1.5*1) and NAO32 = 1.75 (1.5*0.5 + 1*1). So it's now just an extra 1/4 of framebuffer writes, and that gap shrinks the more overdraw we have...
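DeanoC's normalised figures fit in a few lines; assuming, as he does, that FP16 writes cost 1.0 per pixel (8 ROPs), ARGB8/NAO32 writes cost 0.5 (16 ROPs), and the NAO32 path pays one extra full-screen conversion pass at FP16 rate:

```python
# FP16 vs NAO32 colour-write cost in DeanoC's normalised ROP units.
def fp16_cost(overdraw):
    return overdraw * 1.0            # every write at the slow FP16 rate

def nao32_cost(overdraw):
    return overdraw * 0.5 + 1.0      # cheap 32-bit writes + one conversion pass

# At 1.5x overdraw: FP16 = 1.5, NAO32 = 1.75 (the "extra 1/4" of writes);
# the two paths cross at 2.0x overdraw, and NAO32 wins beyond that.
for d in (1.5, 2.0, 3.0):
    print(d, fp16_cost(d), nao32_cost(d))
```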

In practice (at least on our data...) all the other stuff that is lost (and hard to get figures on) with FP16 adds up very quickly.

Ignoring MSAA, with a z-prepass it might just be worth using FP16 over NAO32, but turn on MSAA and turn off the z-prepass and the balance moves well into NAO32's favour IMHO.
 
DeanoC said:
But it's not just bandwidth that's costing you with FP16... It's that ROP units HATE (with a passion) FP16. Take a 7800GTX: the effective no-blending rate is 16 ARGB8 pixels per clock vs. 8 FP16. That's separate from the extra bandwidth, lack of compression, no fast clears, etc.
nAo said:
We also have to consider that an FP16 colour buffer has to be cleared as well, so you should factor in an additional 4 bytes per pixel of cost. And what about fast clears on FP16 render targets? ;)
Why are you guys clearing the color buffer??? Don't you touch each pixel with at least a skybox? Fast clears apply to the Z-buffer AFAIK. Color compression only applies for AA, which isn't supported with FP16 anyway, so that's a separate argument. If you want AA, FP16 isn't an option.

DeanoC, RSX would likely only realistically fill 5 ARGB8 pixels per clock with Z enabled (see the B3D 7600GT review), so the fillrate reduction is moot since bandwidth is the limiting factor. Half ROP rate has no consequence here. It's the bandwidth that matters.

nAo said:
On paper your computation is correct; in the real world, imho, it's not.
I believe modern GPUs are still more efficient at handling 32-bit colour reads from/writes to memory; moreover, a full-screen pass would be way more efficient at using the available bandwidth than a colour pass. I'm not saying you're wrong, though I'm not sure that inequality holds in the real world; you can't just add bandwidth that way.
I know the GF6 series (and hence probably the early PS3 dev kits) had a problem with 64-bit writes, and NVidia even has a paper on their site about using a pair of 2-channel FP16 textures instead of a 4-channel FP16 texture for improved efficiency during HDR rendering. I assumed GF7 and RSX improved here, but maybe not.

As for not being able to "add bandwidth that way", note that this way emphasizes the advantages of NAO32 the most. Once you throw in all the other factors, they reduce the impact of NAO32. I'm basically saying that while RSX is writing the extra 4 bytes per pixel for FP16, the rest of the chip is completely stalled. See my next post for a more clear explanation.

DeanoC said:
If we assume a z-prepass for perfect 1-hit colour writes, the conversion will cost half a framebuffer's worth of writes (as you wrote the opaque buffer at twice the speed, but the FP16 conversion still runs at the halved ROP rate).
Most hardware has a relatively low triangle setup rate, so the z-prepass isn't necessarily a good idea, and then the case for NAO32 gets even stronger... I.e. let's say overdraw is 1.5x: if we normalise the FP16 rate of 8 ROPs to 1, we get FP16 = 1.5 (1.5*1) and NAO32 = 1.75 (1.5*0.5 + 1*1). So it's now just an extra 1/4 of framebuffer writes, and that gap shrinks the more overdraw we have...
I really have no idea what you're talking about here. Are you making the argument with respect to ROP rate? RSX will never reach a rate of 8 pix/clk even for ARGB8.

In practice (at least on our data...) all the other stuff that is lost (and hard to get figures on) with FP16 adds up very quickly.

Ignoring MSAA, with a z-prepass it might just be worth using FP16 over NAO32, but turn on MSAA and turn off the z-prepass and the balance moves well into NAO32's favour IMHO.
Well, my calculations ignored a z-prepass. 1.5x real overdraw (2.5x in your terminology) is a lot when hardware has top-of-the-pipe Z reject. As for MSAA, well, you can't have FP16 with it, so NAO32 is the only option.


Just want to clear something up: I'm not dissing NAO32 or the work you've done. The quality improvement from being able to enable AA is awesome. I'm just saying that the performance advantage, if present, speaks more about NVidia's FP16 deficiencies. I'm sure some people in this forum think NAO32 halves the bandwidth requirement of HDR and is a perfect substitute for FP16, when things aren't so simple.
 
To summarize everything, let me do another more general calculation for a 1MP screen.

You have X opaque pixels written per frame after Z-reject. Assume the extra 4 bytes of data written per pixel (for using FP16 instead of NAO32) only operates at 50% efficiency and stalls the entire pipeline. RSX has 41 bytes of data access per cycle at its disposal. Assume the conversion pass for NAO32 to FP16 runs at 100% memory efficiency, with no shader bottleneck. Sound fair? ;)

So with this worst case FP16 writing, the opaque rendering costs 4*X/(50%*41) = 0.19X cycles more over NAO32.

The ideal conversion pass needs a NAO32 read and a FP16 write. That's 1M*(4+8)/(100%*41) = 0.3M cycles.

The result: To go from 55fps to 60fps, you'd have to render 6 million opaque pixels per frame, each pixel passing the z-test. To go from 29fps to 30fps, it's 5MP. That's a lot of overdraw for a 1MP screen. You can throw in off-screen buffers for dynamic cube maps or reflections or whatever, but they'll only affect the result so much.
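The break-even figures above can be reproduced directly from the stated assumptions (~41 bytes of traffic per cycle, 50% efficiency on the extra FP16 bytes, a 100%-efficient conversion pass); the ~550 MHz clock is an assumption based on RSX's announced speed:

```python
# Reproducing the break-even figures from the stated assumptions.
clock = 550e6                 # assumed RSX core clock
bytes_per_cycle = 41.0        # ~22.4 GB/s at 550 MHz

def extra_fp16_cycles(x_pixels):
    # Extra 4 bytes/pixel for FP16 over NAO32, written at 50% efficiency
    # while the rest of the pipeline is assumed fully stalled.
    return 4 * x_pixels / (0.5 * bytes_per_cycle)

# 1 MP conversion pass: read NAO32 (4 B) + write FP16 (8 B), 100% efficient.
conversion_cycles = 1e6 * (4 + 8) / bytes_per_cycle

def breakeven_pixels(fps_lo, fps_hi):
    # Opaque pixel count X where skipping the conversion pass but paying the
    # FP16 write penalty exactly buys the frame-time difference:
    #   extra_fp16_cycles(X) - conversion_cycles == saved  =>  solve for X
    saved = clock / fps_lo - clock / fps_hi
    return (saved + conversion_cycles) * 0.5 * bytes_per_cycle / 4

print(round(breakeven_pixels(55, 60) / 1e6, 1))   # ~5.8 MP, i.e. ~6 MP
print(round(breakeven_pixels(29, 30) / 1e6, 1))   # ~4.7 MP, i.e. ~5 MP
```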

NAO32 is excellent for AA, and great if you don't need to convert to FP16 for blending. Otherwise, IMO the improvement over FP16 is very marginal. Unless FP16 rendering is even more crippled on RSX than my assumptions above, I just don't see the benefit.
 
Mintmaster said:
I really have no idea what you're talking about here. Are you making the argument with respect to ROP rate? RSX will never reach a rate of 8 pix/clk even for ARGB8.
But RSX isn't a 7600GT; it can magic up more bandwidth from the split memory pools if bandwidth were the limit... ROP rate is a much more interesting figure IMO.
Colour compression does work in non-AA mode BTW; it's just not very effective (for obvious reasons).

Mintmaster said:
Just want to clear something up: I'm not dissing NAO32 or the work you've done. The quality improvement from being able to enable AA is awesome. I'm just saying that the performance advantage, if present, speaks more about NVidia's FP16 deficiencies. I'm sure some people in this forum think NAO32 halves the bandwidth requirement of HDR and is a perfect substitute for FP16, when things aren't so simple.
No, we understand you're not dissing it; it's always good to discuss the pros and cons of things. We all come and chat on boards to discuss things with good technical peeps. I've certainly had to think more about it here, and have seen that FP16 might be closer/better in theory than I first thought.

But as you say, it's designed for MSAA, and in the non-AA case it may well not be worth it (I never really thought about it that hard; nAo worked it out to get AA, and I assume no AA isn't acceptable these days, so I didn't think about it that much...).
 