DeanoC said:
But it's not just bandwidth that's costing you with FP16... It's that ROP units HATE (with a passion) FP16. Take a 7800GTX: effective rate with no blending is 16 ARGB8 pixels per clock versus 8 FP16. That's separate from the extra bandwidth, lack of compression, no fast clears, etc.
nAo said:
We also have to consider that an FP16 frame color buffer has to be cleared as well, so you should factor in an additional 4 bytes per pixel cost. And what about fast clears on FP16 render targets?
Why are you guys clearing the color buffer??? Don't you touch each pixel with at least a skybox? Fast clears apply to the Z-buffer AFAIK. Color compression only applies for AA, which isn't supported with FP16 anyway, so that's a separate argument. If you want AA, FP16 isn't an option.
DeanoC, RSX would likely only realistically fill 5 ARGB8 pixels per clock with Z enabled (see the B3D 7600GT review), so the fillrate reduction is moot since bandwidth is the limiting factor. Half ROP rate has no consequence here. It's the bandwidth that matters.
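To spell out the reasoning, here's a rough sketch of the "whichever limit is lower wins" model I have in mind. The function name is made up for illustration, the 16 and 8 pix/clk ROP rates are DeanoC's 7800GTX figures, and the 5 pix/clk cap is the ARGB8-with-Z figure mentioned above. I don't have a measured bandwidth cap for FP16, so I'm reusing the ARGB8 one here, which is an assumption; the real FP16 cap would be lower still, since each pixel writes more bytes.

```python
def effective_fill_rate(rop_rate, bandwidth_rate):
    """Pixels per clock actually achieved: the lower of the two limits.

    rop_rate:       peak pixels/clock the ROPs can output
    bandwidth_rate: pixels/clock the memory system can sustain
    """
    return min(rop_rate, bandwidth_rate)


# Assumed bandwidth-limited rate (pix/clk), taken from the ARGB8 + Z figure.
BANDWIDTH_CAP = 5

argb8 = effective_fill_rate(16, BANDWIDTH_CAP)  # full ROP rate
fp16 = effective_fill_rate(8, BANDWIDTH_CAP)    # half ROP rate

print(argb8, fp16)
```

As long as the halved ROP rate stays above the bandwidth cap, halving it changes nothing in this model.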
nAo said:
On paper your computation is correct; in the real world, imho, it's not.
I believe modern GPUs are still more efficient at handling 32-bit color reads from/writes to memory; moreover, a full-screen pass would be far more efficient at using the available bandwidth than a color pass. I'm not saying you're wrong, but I'm not sure that inequality holds in the real world; you can't add bandwidth that way.
I know the GF6 series (and hence probably the early PS3 dev kits) had a problem with 64-bit writes, and NVidia even has a paper on their site about using a pair of 2-channel FP16 textures instead of a 4-channel FP16 texture for improved efficiency during HDR rendering. I assumed GF7 and RSX improved here, but maybe not.
As for not being able to "add bandwidth that way", note that this way emphasizes the advantages of NAO32 the most. Once you throw in all the other factors, they reduce the impact of NAO32. I'm basically saying that while RSX is writing the extra 4 bytes per pixel for FP16, the rest of the chip is completely stalled. See my next post for a clearer explanation.
DeanoC said:
If we assume a z-prepass for perfect 1-hit colour writes, the conversion will cost half a framebuffer's worth of writes (as you wrote the opaque buffer at twice the speed, but the FP16 conversion will still be at the half ROP rate).
Most hardware has a relatively low triangle setup rate, so the z-prepass isn't necessarily a good idea, and then the case for NAO32 gets even stronger... I.e. let's say overdraw is 1.5x. If we normalise the FP16 case of 8 ROPs to 1, we get FP16 = 1.5 (1.5*1) and NAO32 = 1.75 (1.5*0.5 + 1*1). So it's now just an extra 1/4 of framebuffer writes, which will reduce the more overdraw we have...
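For anyone following along, the normalised arithmetic in the quote can be sketched like this. It's only a reading of the quoted numbers, not a measurement: the function names are made up, one FP16 full-screen write is taken as cost 1, ARGB8/NAO32 writes as cost 0.5, and the NAO32 path is assumed to pay one extra full-screen pass at the FP16 rate for the conversion.

```python
def fp16_cost(overdraw):
    """Every shaded pixel is written at the FP16 (half) ROP rate: cost 1 each."""
    return overdraw * 1.0


def nao32_cost(overdraw):
    """Shaded pixels go out at the ARGB8 rate (cost 0.5 each),
    plus one full-screen conversion pass at the FP16 rate (cost 1)."""
    return overdraw * 0.5 + 1.0


for overdraw in (1.0, 1.5, 2.0):
    extra = nao32_cost(overdraw) - fp16_cost(overdraw)
    print(f"overdraw {overdraw}: FP16 = {fp16_cost(overdraw)}, "
          f"NAO32 = {nao32_cost(overdraw)}, extra = {extra}")
```

Under these assumptions the extra cost is 1/2 a framebuffer with a perfect z-prepass (overdraw 1.0), 1/4 at 1.5x, and it disappears at 2x overdraw.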
I really have no idea what you're talking about here. Are you making the argument with respect to ROP rate? RSX will never reach a rate of 8 pix/clk even for ARGB8.
In practice (at least on our data...) all the other stuff that is lost (and hard to get figures on) with FP16 adds up very quickly.
Ignoring MSAA, with a z-prepass it might just be worth using FP16 over NAO32, but turn on MSAA and turn off the z-prepass and the balance moves well into NAO32's favour IMHO
Well, my calculations ignored a z-prepass. 1.5x real overdraw (2.5x using your terminology) is a lot when hardware has top-of-the-pipe Z reject. As for MSAA, well, you can't have it with FP16, so NAO32 is the only option.
Just want to clear something up: I'm not dissing NAO32 or the work you've done. The quality improvement from being able to enable AA is awesome. I'm just saying that the performance advantage, if present, speaks more about NVidia's FP16 deficiencies. I'm sure some people in this forum think NAO32 halves the bandwidth requirement of HDR and is a perfect substitute for FP16, when things aren't so simple.