Fixed vs. programmable

Discussion in 'Architecture and Products' started by Ethatron, Oct 5, 2011.

  1. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    What is the current consensus on e.g. blending vs. pixel-shader work? Which is faster? I searched for an article on this but found nothing. There are quite a few operations you can do (at least partly) in shaders:

    ztest vs. discard (only on source-value)
    alphatest vs. discard (only on source-alpha)
    blending vs. mul/mad (possible with ping-pong rts)

    I'm asking because I want to do the SSAO multiply with blending (it allows me to use a single-channel rendertarget, which in turn allows Fetch-4/gather, which in turn allows me to do a less costly or wider blur), but I want to be sure it's worth the effort.
    Any nice articles comparing fixed vs. programmable, in terms of what's possible and how fast it is in both cases? It's a traditional topic for a B3D article, right?
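    For concreteness, here is a minimal numeric sketch of one of the pairs above, alpha-test vs. shader-side discard. All names are hypothetical; only the predicate equivalence is the point, not where the rejection happens in the pipeline.

```python
# Hypothetical per-pixel predicates: fixed-function alpha test vs. a
# shader-side discard. Both reject the fragment on the same condition,
# so the visible result is identical; only the pipeline stage at which
# the rejection happens differs.

ALPHA_REF = 0.5  # reference value, e.g. as set via D3DRS_ALPHAREF

def alpha_test_pass(src_alpha):
    """Fixed-function alpha test with a GREATEREQUAL compare."""
    return src_alpha >= ALPHA_REF

def shader_discard_pass(src_alpha):
    """Shader emulation: discard (clip() in HLSL) when below the ref."""
    discarded = src_alpha < ALPHA_REF
    return not discarded  # fragment survives only if not discarded

# Both predicates agree on every input:
for a in [0.0, 0.25, 0.5, 0.75, 1.0]:
    assert alpha_test_pass(a) == shader_discard_pass(a)
```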
     
    #1 Ethatron, Oct 5, 2011
    Last edited by a moderator: Oct 5, 2011
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I don't think the fixed-function pipeline is there anymore. So you are really comparing your own emulation of the fixed-function stuff against the driver writer's.

    Which is why I would expect the FF path to win, but not due to hardware differences.
     
  3. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    16,872
    Likes Received:
    4,196
    You think a shader emulating fixed function is faster than a shader doing the shading?
     
  4. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Only on source value? I assume discard shouldn't be too bad then, although z-test would still be faster. ROPs are ridiculously good at all depth operations.
    alphatest *is* discard ever since the G80/R600 generation; it's just software emulation.
    Ping-pong RTs are far from cheap in a real-world scenario, unless there's a technique I'm not aware of...

    More interestingly perhaps, Tegra already supports fully programmable blending via an OpenGL extension - there's no HW blender, it's all done in the pixel shader no matter what. PowerVR SGX also does it all in the pixel shader, but it is not exposed on any existing device - it almost certainly will be on the Sony PS Vita, though. I know this doesn't apply to you, but it does hint at the possibility of PC GPUs supporting programmable blending in the future.

    I suppose it mostly depends on how expensive the ping-pong RTS is for you? I admit I don't really know.
     
  5. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    I mean you just have the actual z-value available; you can't discard against the depth-buffer z-value. Well, in DX9 at least.

    Of course. :)

    I was considering that with the FF hardware less data is moved around. With ping-ponging I have to read the entire previous framebuffer. With alpha blending I could also rely on alpha == 1.0 (my function is dst = src * ao) being optimized away without any predicate instruction.

    I hope it's not in HLSL then. :) Well, nowadays you could probably map any high-level language onto some primitive transistor-network configuration. I hope it's fast otherwise.

    Difficult to predict without numbers, and I don't have any at hand. If I remember right, all the raw-number benchmarks I found were incomplete, not testing all types of transfers. And none did shader emulations of the FF paths.

    Currently I have:

    a ) framebuffer read, calc AO, multiply, framebuffer write
    vs.
    b ) calc AO, framebuffer blend (ADD, SRC ZERO, DST SRCALPHA)

    Normally on FP16 targets. I think the read-latency may be hidden behind the AO. Then it's one multiply + write vs. blend.
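    A small numeric sketch of the two paths (assuming, per the blend state above, ADD with SRCBLEND = ZERO and DESTBLEND = SRCALPHA, so the blender computes dst * src.alpha). Both paths land on framebuffer * ao:

```python
# Sketch of the two SSAO-multiply paths, assuming path (b) writes the
# AO term to source alpha. Both produce framebuffer * ao.

def path_a(fb_color, ao):
    # a) shader reads the framebuffer, multiplies by AO, writes back
    return [c * ao for c in fb_color]

def path_b(fb_color, ao):
    # b) shader outputs AO in alpha; the fixed-function blend does
    #    ADD with SRCBLEND=ZERO, DESTBLEND=SRCALPHA:
    #    new = src*0 + dst*src.alpha = dst * ao
    src = (0.0, 0.0, 0.0, ao)          # rgb unused, alpha carries AO
    return [0.0 * s + d * src[3] for s, d in zip(src[:3], fb_color)]

fb = (0.8, 0.6, 0.4)
assert path_a(fb, 0.5) == path_b(fb, 0.5) == [0.4, 0.3, 0.2]
```

    The difference is only where the framebuffer read happens: in (a) it is a tex fetch issued by the shader, in (b) it is done by the ROP/blender.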
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    It really depends.

    If you are sampler (tex) or texture-cache bound, the blender version is the better choice (one less tex instruction).
    If you are ALU bound, you save one multiply instruction by using the blender.
    If you are BW bound, both should be equally fast (the blender needs to read the framebuffer pixels just like your shader does, and there's an equal amount of writes in both cases).
    If you are fill bound (the shader has just a few ALU and tex instructions), both should be equally fast (performance is limited by the fill rate).

    I don't know that much about the specifics of the newest low-end PC graphics hardware (esp. Intel GMA/HD), but some hardware may have extra penalties for blending (for example halving the fill rate, which of course doesn't matter if your shader is complex enough not to be fill bound at half rate). On Xbox you should definitely use the blender (blending to EDRAM is free).
     
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    So our little consensus is that utilizing (supposedly) FF functionality has a better best case but no worse worst case?

    Edit: Awesome, I did it with HDAO as a proof of concept (it's one-pass, perfect): 1.34 ms with blending, 2.02 ms with regular i/o. Damn, what a fast AO. I emulate a cross-shaped gather via two Fetch-4 fetches, so it had better be quick. :p

    Setup is 1400x1050, FP16 target, INTZ depth, with alpha-test, Fetch-4.
     
    #7 Ethatron, Oct 6, 2011
    Last edited by a moderator: Oct 6, 2011
  8. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Ping-pong doesn't work well with overlapping geometry as you won't be able to blend with things heading down the pipeline. That's a benefit of the fixed-function path and it guarantees ordering too.
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    It depends on what you are looking for. The fixed-function blender is pretty limited: it can only take two 4d inputs (the existing pixel and the output from the pixel shader) and has a very limited set of operations it can perform on those inputs. If you could feed it two outputs from the pixel shader (an fma blend mode), or if it could do dot products, it would be much more useful for many tasks. A fully programmable blender would open up a lot of new opportunities (or, having the possibility to get the existing render-target color as an input to the pixel shader could of course make the blender completely useless, as every desired blend mode could then be implemented directly in the pixel shader). But this would be pretty hard to implement efficiently in hardware (lots of new dependencies if ordering must be preserved).
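    The limitation sebbbi describes can be sketched as a tiny model of the classic blend equation: per channel, result = src * sf + dst * df, with the factors picked from a small fixed menu. The menu below is a hypothetical subset, just enough to show what fits and what doesn't.

```python
# Minimal model of the fixed-function blender: two 4d inputs only
# (shader output 'src', render-target value 'dst'), blend factors from
# a small fixed menu, combined per channel with ADD.

FACTORS = {
    "ZERO":      lambda s, d, i: 0.0,
    "ONE":       lambda s, d, i: 1.0,
    "SRCALPHA":  lambda s, d, i: s[3],
    "DESTCOLOR": lambda s, d, i: d[i],
}

def ff_blend_add(src, dst, sf, df):
    """result[i] = src[i]*sf + dst[i]*df  (the ADD blend op)."""
    return [src[i] * FACTORS[sf](src, dst, i) +
            dst[i] * FACTORS[df](src, dst, i)
            for i in range(4)]

# The SSAO multiply fits this model (ZERO / SRCALPHA)...
src = (0.0, 0.0, 0.0, 0.5)  # AO carried in source alpha
dst = (0.8, 0.6, 0.4, 1.0)
assert ff_blend_add(src, dst, "ZERO", "SRCALPHA")[:3] == [0.4, 0.3, 0.2]

# ...but e.g. a dot product of src and dst cannot be expressed: each
# output channel only ever sees src[i] and dst[i] scaled by one factor;
# there is no cross-channel sum anywhere in the fixed menu.
```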
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,021
    Likes Received:
    119
    I just don't see the blend unit getting fully programmable; making it look like a shader ALU sounds like a waste. Getting the render target as an input into the shader ALUs sounds like a much more forward-looking idea, even if it looks like quite some effort to keep operations ordered and the data coherent.
     
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Pixel shaders in current games are really complex (most are 50+ instructions), while the fixed-function blender does just two or three ALU ops. Adding those ops to the end of the pixel shader when blending is needed wouldn't be a huge cost, and we could save all the transistors needed for the current fixed-function blender. The bigger problems are the datapaths, latency and ordering. ALU shouldn't be a problem anymore (and even less so in the future, as ALU performance has always increased a lot faster than memory bandwidth).
     
  12. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    In some devices, the "blend unit" is fully programmable in that it is just part of the shader.
     
  13. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,021
    Likes Received:
    119
    Yes SGX doesn't have a problem getting the render target color into the shader alus...
    So I think "more traditional GPUs" will solve the problem by getting the render-target data into the shader rather than by trying to make the blend unit itself more programmable (hence the "blend unit" won't really get programmable; it will just disappear at the hardware level).
     
    #13 mczak, Oct 7, 2011
    Last edited by a moderator: Oct 7, 2011
  14. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    Do drivers optimize the shader to put the discard as early as possible, or just dump it at the end? If the latter, building it into the shader yourself should be faster. Same with discard vs. z-test: z-test happens after the shader, while discard can happen very early in the shader.
     
  15. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Z rejection can happen before the shader in many cases.
     
  16. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    In most architectures per pixel/sample z/stencil-rejection occurs after the shader. Only more coarse hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once). PowerVR of course does things very differently...

    I have achieved noticeable performance improvements by using dynamic branching with discard. I detect the discard condition as soon as possible, and if the pixel is discarded I dynamically branch over the rest of the instructions. It doesn't seem to matter where you put the discard instruction itself; discard seems to only set a flag that culls the pixel after the pixel shader is complete. Drivers could optimize the shader by doing the dynamic branch automatically, but that would require cost/benefit analysis, and you never know the shader usage patterns at compile time. For example, dynamic branching with discard improves our tree-branch rendering very much (so many pixels are clipped), but it would decrease performance on a highly repetitive checkerboard clip pattern (branching granularity is not fine enough).
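    The granularity argument can be sketched numerically: a wavefront only skips the branched-over work if all of its lanes take the discard path. The 64-lane width, the block size and the two masks below are assumptions for illustration.

```python
# Sketch of discard-branch granularity: a wavefront skips the expensive
# code only when ALL of its lanes discard. Compare a large coherent
# discard region (sky) with a fine checkerboard pattern.

WAVE = 64  # assumed lanes per wavefront

def wavefronts_doing_work(discard_mask):
    """discard_mask: per-pixel booleans, True = pixel discarded."""
    waves = [discard_mask[i:i + WAVE]
             for i in range(0, len(discard_mask), WAVE)]
    # A wave branches out entirely only when every lane discards.
    return sum(1 for w in waves if not all(w))

pixels = 64 * 64  # a 4096-pixel block, i.e. 64 wavefronts

# Large coherent region (e.g. sky): second half of the block discarded.
sky = [i >= pixels // 2 for i in range(pixels)]
# Fine checkerboard: every other pixel discarded.
checker = [i % 2 == 1 for i in range(pixels)]

print(wavefronts_doing_work(sky))      # half the waves skip the work
print(wavefronts_doing_work(checker))  # every wave still runs it all
```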
     
  17. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    PowerVR isn't the only exception. See page 99 of the R300 register spec.
     
  18. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    I thought both AMD and Nvidia have something like early-Z rejection? If the pixel shader doesn't modify Z (and some other boundary conditions wrt alpha, I suppose), then it should be able to drop pixels completely if it can be determined up front that they won't be visible.

    (See here: http://www.slideshare.net/pjcozzi/z-buffer-optimizations)

    I don't know to what extent modern graphics fall within those boundary conditions, though...
     
  19. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    That was true for the GeForce 3 and Radeon 8500 iirc, but things are very different nowadays and as OpenGL Guy said, even the R300 could do the fine-grained Z-Test systematically before the pixel shader if allowed to.

    Now that I think about it I wouldn't be surprised if NVIDIA only added that in the G80 timeframe, although it's been too long for me to trust my memory on that... And that reminds me I wonder how Tegra handles it - apparently it can reject 4 pixels per clock, but is that a tile-aligned quad or just four arbitrary pixels like in a very thin triangle (excluding quad-based shading inefficiencies)? Not that it really matters in the end with that kind of granularity.

    Also interesting info on how dynamic branching helps discard/alpha test performance, I must admit I never thought about that, cheers!
     
  20. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    Yes, in the case of my AO experiment I discard-branch e.g. the sky. If I used a branched return-a-value it'd be slower (DX9 has no return-a-value branch-out; it runs the full shader, then the conditional). The area of the sky is often big enough to short-cut entire wavefronts, I assume; all of them seem to branch out as intended. I sped up HBAO by 25% this way (discard + blend + alpha test). HBAO is also tex-fetch limited, so a branch should reduce cache-thrashing in this specific case.
     