NV40: Why doesn't MSAA work with FP Blending?

Discussion in 'Architecture and Products' started by LeStoffer, Jul 6, 2004.

  1. Rolf N

    Rolf N Recurring Membmare
    Veteran

    Joined:
    Aug 18, 2003
    Messages:
    2,494
    Likes Received:
    55
    Location:
    yes
    I think what we're missing here is a definite answer to this:
     
  2. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    No MSAA on FP render targets. That's not related to blending (you can turn blending on or off any time you like).

    And AFAIK NV40 can only do FP16 blending on 4 pixels per clock.
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Are you sure it's 4? I'd been pushing Kirk on the fillrate hit, but he was cagey about it, merely stating that "fillrate scales with bandwidth", indicating that if this uses twice the bandwidth then it would cost twice the fillrate (i.e. 8 pixels per clock). ATI's texture samplers operate on FP textures by combining the components, i.e. they can sample four 8-bit components in a clock, but for FP they would sample those 4 components in 2 cycles for FP16 and 4 cycles for FP32. I'd assumed the hit would be the same for NV40's output.
     
  4. _arsil

    Newcomer

    Joined:
    Mar 7, 2003
    Messages:
    11
    Likes Received:
    0
    Well... In theory, enabling blending decreases performance by 50%, because the GPU has to both read and write memory. If you use 16 bits per component, you decrease performance by another 50%. So it is possible that on 16-pipeline hardware, 16-bit RGBA blending runs slower.

    In other words:
    You have a 256-bit memory bus, and the memory is about 2 times faster than the GPU. In theory you can modify 512 bits in a single GPU clock. Each pixel takes 64 bits (in 16-bit integer or float formats). If you enable blending, you have to both read and write each pixel. So 512 / (64 * 2) = 4!

    It doesn't mean that things will be rendered 4 times slower or so!
    If you work with floating point buffers, you usually use some complicated shaders, so the GPU has a lot of free ticks to hide the blending operations.
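
The arithmetic in this post can be written out as a small sketch. The function name and parameter names below are made up for illustration, and the numbers are the post's own assumptions (256-bit bus, memory at roughly twice the GPU clock, one read plus one write per blended pixel), not measured NV40 figures.

```python
def pixels_per_gpu_clock(bus_bits=256, mem_to_gpu_clock_ratio=2,
                         bits_per_pixel=64, blending=True):
    """Bandwidth-only upper bound on pixels written per GPU clock.

    Hypothetical helper: it ignores caches, compression and the number
    of blend units, and only divides available bits by bits needed.
    """
    # Bits the memory bus can move during one GPU clock.
    bits_per_gpu_clock = bus_bits * mem_to_gpu_clock_ratio
    # Blending needs a read of the destination plus a write of the result.
    accesses_per_pixel = 2 if blending else 1
    return bits_per_gpu_clock / (bits_per_pixel * accesses_per_pixel)


print(pixels_per_gpu_clock())                    # 4.0  (64-bit pixels, blended)
print(pixels_per_gpu_clock(blending=False))      # 8.0  (64-bit pixels, write only)
print(pixels_per_gpu_clock(bits_per_pixel=32))   # 8.0  (32-bit pixels, blended)
```

This is only the bandwidth ceiling; as the post says, long shaders can hide much of it.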
     
  5. Dave Baumann

    Arsil - I'm not talking about the bandwidth constraints, but the physical chip constraints.
     
  6. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    Dave, you're right about texture sampling. NV40 and R300/R420 seem to process FP textures the same way (without FP16 filtering on ATI GPUs, of course) -> more cycles for higher-precision formats.

    NV40
    8-bit textures: 1 cycle (with or without filtering)
    FP16 textures: 2 cycles (with or without filtering)
    FP32 textures: 4 cycles (without filtering)

    If the driver detects that not all components are used then it's faster.

    FP16 textures with 2 components used in the shader: 1 cycle
    FP32 textures with 3 components used in the shader: 3 cycles
    etc...

    I presume that's the same with R300/R420.


    Of course blending has nothing to do with texturing, and it makes sense to have fewer blenders.
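
The cycle counts listed in this post are all consistent with a sampler that fetches 32 bits of texel data per cycle. That 32-bit figure is an inference from the numbers above, not something stated in the thread, and the names in this sketch are hypothetical.

```python
import math

# Bits per component for the formats mentioned in the post.
BITS_PER_COMPONENT = {"int8": 8, "fp16": 16, "fp32": 32}


def sample_cycles(fmt, components_used=4, bits_per_cycle=32):
    """Cycles to sample one texel, assuming a fixed number of bits per cycle.

    bits_per_cycle=32 is an assumption chosen to match the listed figures.
    """
    texel_bits = BITS_PER_COMPONENT[fmt] * components_used
    return math.ceil(texel_bits / bits_per_cycle)


for fmt, used in [("int8", 4), ("fp16", 4), ("fp32", 4), ("fp16", 2), ("fp32", 3)]:
    print(fmt, used, sample_cycles(fmt, used))
```

Under that assumption the loop reproduces the five cases given above: 1, 2, 4, 1 and 3 cycles.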
     
  7. Dave Baumann

    No, there isn't necessarily a direct correlation; my thought process was just that if you can combine two integer components to read one FP16 component, or 4 integer components to read an FP32 one, similar things might happen at the output.

    However, I guess the figure of 4 is correct for FP16 blending if we remember DemoCoder's comment that while the 6800 can write 16 colour values, it can only blend 8. If a similar thing is happening at the back end as at the sampling end, then it seems logical that it would blend at 4 pixels per clock (although possibly still write at 8 FP16 values per clock).
     
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    While on the topic of blending, does anyone know if NV40 can blend more pixels per clock for <4 channel surfaces? DemoCoder said there are 8 blending units for either fixed or floating point, and now Dave is suggesting maybe only 4 are FP capable (which makes perfect sense from a bandwidth point of view). Either way, does anyone think throughput can be increased for single or double channel rendering?


    Anyway, back to MSAA. I can see why it isn't on the R3xx generation (and hence R420) because I don't think StretchRect was available or even in the works at the time, so offscreen MSAA buffers were rather pointless (FP buffers have to be offscreen).

    For NV40, I think they could have implemented it, but it might have meant disabling FB compression, not a big deal IMO for longer shaders. The required per-sample blending would slow things down quite a bit if you only used the available blenders instead of adding new ones, but again, not a big deal IMO. The final downsampling could be done by the FP16 filtering units during a StretchRect call.

    My guess is that NVidia thought it wasn't worth the headache to tie up all the loose ends involved. They'd need a slightly different rendering path down the hardware for the reduced performance FP MSAA I mentioned above. Given that FP rendering already has a big performance hit, gamers probably won't want to enable FSAA as well.

    I think NVidia made the right decision. The FP blending allows developers to experiment now, and the next gen will give them, as Dave mentioned, a fully orthogonal solution.
     
  9. Dave Baumann

    Minty (;)) - I'm not suggesting only 4 of the blending units are FP capable. I would suggest that all 8 are, but each only writes two components per cycle (i.e. 8 pixels in two cycles, bandwidth aside).
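
As a rough sketch of the model suggested in this post: 8 blend units, each handling two FP16 components per cycle, so a four-component pixel occupies a unit for two cycles. The function and parameter names are hypothetical, and the model is the poster's speculation rather than confirmed NV40 behaviour.

```python
import math


def blended_pixels_per_clock(blend_units=8, components_per_pixel=4,
                             components_per_unit_per_clock=2):
    """Blend rate under the speculative 'two components per cycle' model.

    Bandwidth is deliberately ignored, as in the post.
    """
    # Cycles one unit spends on one pixel.
    cycles_per_pixel = math.ceil(components_per_pixel / components_per_unit_per_clock)
    return blend_units / cycles_per_pixel


print(blended_pixels_per_clock())                        # 4.0  (FP16 RGBA)
print(blended_pixels_per_clock(components_per_pixel=2))  # 8.0  (two-channel FP16)
```

If the model held, one- and two-channel FP16 surfaces would blend at the full 8 pixels per clock, which is what Mintmaster's question above is asking about; the thread does not confirm this.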
     
  10. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, if fixed-function, single-cycle blending is limited in such a way, the question is whether the throughput can be "bought back" by having a long shader.

    That is, if there are enough math instructions in the shader that does the blending to remove most of the memory bandwidth limitation, will it be possible to attain the same throughput with 8-bit blending or 16-bit blending? I really hope so.
     