NV40: Why doesn't MSAA work with FP Blending?

I think what we're missing here is a definite answer to this:
Mintmaster said:
Is it just FP blending that doesn't work with MSAA? I thought it was FP rendering in general that doesn't work with MSAA on all DX9 cards, but of course I could be wrong.
 
No MSAA on FP render targets. That's not related to blending (you can turn blending on or off any time you like).

And AFAIK NV40 can only do FP16 blending on 4 pixels per clock.
 
Are you sure it's 4? I'd been pushing Kirk on the fillrate hit, but he was cagey about it, merely stating that "fillrate scales with bandwidth", indicating that if this uses twice the bandwidth then it would consume twice the fillrate (i.e. drop to 8 pixels per clock). ATI's texture samplers operate on FP textures by combining components: they can sample four 8-bit components in a clock, but for FP they sample those four components in 2 cycles for FP16 and 4 for FP32. I'd assumed the hit would be the same for NV40's output.
 
DaveBaumann said:
Are you sure its 4?
Well... In theory, enabling blending decreases performance by 50%, because the GPU has to read memory as well as write it. If you use 16 bits per component you decrease performance by another 50%. So it's possible that on 16-pipeline hardware, 16-bit RGBA blending runs slower.

In other words:
You have a 256-bit memory bus and memory is about twice as fast as the GPU, so in theory you can move 512 bits in a single GPU clock. Each pixel takes 64 bits (in 16-bit integer or float formats). If you enable blending you have to read as well as write memory. So 512 / (64 × 2) = 4!

That doesn't mean things will be rendered 4 times slower or so!
If you work with floating-point buffers you usually use fairly complicated shaders, so the GPU has plenty of free cycles in which to hide the blending operations.
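The arithmetic above can be sketched as a back-of-envelope model (the bus width, clock ratio, and pixel size are the figures from the post; this is an estimate, not a measured result):

```python
# Peak FP16 blend rate implied by memory bandwidth alone.
BUS_BITS = 256          # 256-bit memory bus
MEM_TO_GPU_RATIO = 2    # memory runs at roughly twice the GPU clock
BITS_PER_PIXEL = 64     # FP16 RGBA: 4 components x 16 bits

bits_per_gpu_clock = BUS_BITS * MEM_TO_GPU_RATIO   # 512 bits per GPU clock
traffic_per_pixel = BITS_PER_PIXEL * 2             # blending = read + write

pixels_per_clock = bits_per_gpu_clock // traffic_per_pixel
print(pixels_per_clock)  # 4
```

Which reproduces the 4 pixels/clock figure without assuming anything about the number of blend units.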
 
Dave, you're right about texture sampling. NV40 and R300/R420 seem to process FP textures the same way (without FP16 filtering on the ATI GPUs, of course): more cycles for higher-precision formats.

NV40:
8-bit textures: 1 cycle (with or without filtering)
FP16 textures: 2 cycles (with or without filtering)
FP32 textures: 4 cycles (without filtering)

If the driver detects that not all components are used then it's faster.

FP16 textures with 2 components used in the shader: 1 cycle
FP32 textures with 3 components used in the shader: 3 cycles
etc...

I presume that's the same with R300/R420.
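All of the cycle counts listed above are consistent with a simple hypothetical model in which the sampler fetches 32 bits of texel data per cycle and skips unused components (my own guess at the mechanism, not a confirmed description of the hardware):

```python
import math

def sample_cycles(bits_per_component: int, components_used: int) -> int:
    # Assumed model: 32 bits of texel data fetched per cycle,
    # rounded up to whole cycles.
    return math.ceil(bits_per_component * components_used / 32)

print(sample_cycles(8, 4))    # 8-bit RGBA            -> 1 cycle
print(sample_cycles(16, 4))   # FP16 RGBA             -> 2 cycles
print(sample_cycles(32, 4))   # FP32 RGBA             -> 4 cycles
print(sample_cycles(16, 2))   # FP16, 2 components    -> 1 cycle
print(sample_cycles(32, 3))   # FP32, 3 components    -> 3 cycles
```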


Of course blending has nothing to do with texturing, and it makes sense to have fewer blenders.
 
Tridam said:
Of course blending has nothing to do with texturing, and it makes sense to have fewer blenders.

No, there isn't necessarily a direct correlation; my thought process was just that if you can combine two integer components to read one FP16 component, or four integer components to read one FP32 component, similar things might happen at the output.

However, I guess the 4 number is correct for FP16 blending if we remember DemoCoder's comment that while the 6800 can write 16 colour values per clock, it can only blend 8. If a similar thing is happening at the back end as at the sampling end, then it seems logical that it would blend 4 FP16 pixels per clock (although possibly still write 8 FP16 values per clock).
 
While on the topic of blending, does anyone know if NV40 can blend more pixels per clock for <4 channel surfaces? DemoCoder said there are 8 blending units for either fixed or floating point, and now Dave is suggesting maybe only 4 are FP capable (which makes perfect sense from a bandwidth point of view). Either way, does anyone think throughput can be increased for single or double channel rendering?


Anyway, back to MSAA. I can see why it isn't on the R3xx generation (and hence R420) because I don't think StretchRect was available or even in the works at the time, so offscreen MSAA buffers were rather pointless (FP buffers have to be offscreen).

For NV40, I think they could have implemented it, but it might have meant disabling FB compression, not a big deal IMO for longer shaders. The required per sample blending slows things down quite a bit if you were to only use the available blenders instead of adding new ones, but again, not a big deal IMO. The final downsampling could be done by the FP16 filtering units during a StretchRect call.

My guess is that NVidia thought it wasn't worth the headache to tie up all the loose ends involved. They'd need a slightly different rendering path down the hardware for the reduced performance FP MSAA I mentioned above. Given that FP rendering already has a big performance hit, gamers probably won't want to enable FSAA as well.

I think NVidia made the right decision. The FP blending allows developers to experiment now, and the next gen will give them, as Dave mentioned, a fully orthogonal solution.
 
Minty (;)) - I'm not suggesting only 4 of the blending units are FP capable. I would suggest that all 8 are, but that each only writes two components per cycle (i.e. 8 pixels in two cycles, bandwidth aside).
 
Well, if fixed-function, single-cycle blending is limited in such a way, the question is whether the throughput can be "bought back" by having a long shader.

That is, if there are enough math instructions in the shader that does the blending to remove most of the memory bandwidth limitation, will it be possible to attain the same throughput with 8-bit blending or 16-bit blending? I really hope so.
 