#1 |
|
Member
Join Date: Jan 2010
Posts: 375
|
What is the current consensus on, for example, blending vs. pixel shader? Which is faster? I searched for an article on this but found nothing. There are quite a few fixed-function operations (at least half of them) that you can also do in shaders:
- z-test vs. discard (only on the source value)
- alpha-test vs. discard (only on the source alpha)
- blending vs. mul/mad (possible with ping-pong render targets)

I'm asking because I wanted to do the SSAO multiply with blending (it lets me use a single-channel render target, which in turn allows Fetch-4/gather, which in turn lets me do a cheaper or wider blur), but I want to be sure it's worth the effort. Are there any good articles comparing fixed-function vs. programmable, in terms of what's possible and how fast it is in both cases? It's a traditional topic for a B3D article, right?
Last edited by Ethatron; 05-Oct-2011 at 08:04.
#2 |
|
Senior Member
|
I don't think the fixed-function pipeline is there anymore. So you are really comparing your own shader code vs. the driver writer's emulation of the fixed-function stuff.
Which is why I would think the FF path would win, but not due to HW differences.
#3 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,489
|
You think a shader emulating fixed function is faster than a shader doing shading?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
#4 | ||||
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
Quote:
Quote:
More interestingly perhaps, Tegra already supports fully programmable blending through an OpenGL extension: there's no HW blender, it's all done in the pixel shader no matter what. PowerVR SGX also does it all in the pixel shader, but this is not exposed on any existing device; it almost certainly will be on the Sony PS Vita though. I know this doesn't apply to you, but it does hint at the possibility of PC GPUs supporting programmable blending in the future. Quote:
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
#5 | |||
|
Member
Join Date: Jan 2010
Posts: 375
|
I mean you only have the fragment's own z-value available in the shader; you can't discard against the z-value stored in the depth buffer. Well, in DX9 that is.
Quote:
Quote:
I hope it's not in HLSL then. Quote:
Currently I have:
a) framebuffer read, calc AO, multiply, framebuffer write
vs.
b) calc AO, framebuffer blend (ADD, SRC ZERO, DST SRCALPHA)
Normally on FP16 targets. I think the read latency may be hidden behind the AO calculation. Then it's one multiply + write vs. blend.
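To make the comparison of the two paths concrete, here is a minimal Python sketch of the blend setup named above (ADD op, source factor ZERO, destination factor SRCALPHA) next to the shader-side multiply. The function names are hypothetical and only model the arithmetic; real hardware also clamps and rounds according to the render target format.

```python
def blend_add(src, dst, src_factor, dst_factor):
    """Fixed-function ADD blend per channel: src * src_factor + dst * dst_factor."""
    return tuple(s * sf + d * df
                 for s, d, sf, df in zip(src, dst, src_factor, dst_factor))


def ao_via_blend(dst_rgba, ao):
    """Path b): the shader only outputs the AO term in alpha, the blender multiplies."""
    src = (0.0, 0.0, 0.0, ao)      # colour channels are ignored because of ZERO
    zero = (0.0,) * 4              # source factor ZERO
    src_alpha = (ao,) * 4          # destination factor SRCALPHA
    return blend_add(src, dst_rgba, zero, src_alpha)


def ao_via_shader(dst_rgba, ao):
    """Path a): the shader reads the framebuffer value and multiplies it itself."""
    return tuple(c * ao for c in dst_rgba)


framebuffer_pixel = (0.8, 0.6, 0.4, 1.0)
print(ao_via_blend(framebuffer_pixel, 0.5))
print(ao_via_shader(framebuffer_pixel, 0.5))
```

Both paths produce the same value per pixel, so the difference being discussed is only where the read, multiply and write happen.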
#6 | |
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
#7 |
|
Member
Join Date: Jan 2010
Posts: 375
|
So our little consensus is that using (supposedly) FF functionality has a better best case but no worse worst case?
Edit: Awesome, I did it with HDAO as a proof of concept (it's one-pass, perfect). 1.34 ms with blending, 2.02 ms with regular I/O. Damn, what a fast AO. I emulate the cross-shaped gather via two Fetch-4 fetches, so it had better be quick. :P The setup is 1400x1050, FP16 target, INTZ depth, with alpha-test and Fetch-4.
Last edited by Ethatron; 06-Oct-2011 at 13:30.
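For scale, the two timings quoted above can be turned into a relative saving with a few lines of Python. Only the two millisecond figures and the resolution come from the post; the per-pixel number is a plain division and says nothing about where the time actually goes.

```python
blend_ms = 1.34        # HDAO with fixed-function blending (from the post)
shader_io_ms = 2.02    # HDAO with explicit framebuffer read/write (from the post)
pixels = 1400 * 1050   # resolution from the post

saved_ms = shader_io_ms - blend_ms
saved_fraction = saved_ms / shader_io_ms           # about one third of the pass time
speedup = shader_io_ms / blend_ms                  # about 1.5x
ns_saved_per_pixel = saved_ms * 1e6 / pixels       # a bit under half a nanosecond

print(f"saved {saved_ms:.2f} ms ({saved_fraction:.1%}), speedup {speedup:.2f}x, "
      f"{ns_saved_per_pixel:.2f} ns per pixel")
```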
#8 |
|
Senior Member
|
Ping-pong doesn't work well with overlapping geometry as you won't be able to blend with things heading down the pipeline. That's a benefit of the fixed-function path and it guarantees ordering too.
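The overlap problem can be shown with a toy model of a single pixel covered by two fragments of the same draw. This is only a sketch with made-up function names: the ping-pong path is assumed to read a texture that was written in an earlier pass, so it never sees the other fragment's result.

```python
def fixed_function_multiply(dst, fragments):
    """ROP blending: each fragment is applied in order to the current framebuffer value."""
    for factor in fragments:
        dst = dst * factor
    return dst


def ping_pong_multiply(read_target, fragments):
    """Ping-pong: every fragment reads the old target and writes the other one."""
    write_target = read_target           # assumed: write target starts as a copy
    for factor in fragments:
        # The read target is not updated during the pass, so the last write wins.
        write_target = read_target * factor
    return write_target


overlapping = [0.5, 0.5]                 # two fragments landing on the same pixel
print(fixed_function_multiply(1.0, overlapping))  # both factors are applied
print(ping_pong_multiply(1.0, overlapping))       # only one factor survives
```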
__________________
I speak only for myself. |
#9 |
|
Member
Join Date: Nov 2007
Posts: 938
|
It depends on what you are looking for. The fixed-function blender is pretty limited: it can only take two 4D inputs (the existing pixel and the output from the pixel shader) and can perform only a very limited set of operations on those inputs. If you could feed it two outputs from the pixel shader (an FMA blend mode), or if it could do dot products, it would be much more useful for many tasks. A fully programmable blender would open up a lot of new opportunities. Alternatively, being able to read the existing render target color as an input to the pixel shader would make the blender completely redundant, as every desired blend mode could be implemented directly in the pixel shader. But this would be pretty hard to implement efficiently in hardware (lots of new dependencies if ordering must be preserved).
#10 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
#11 |
|
Member
Join Date: Nov 2007
Posts: 938
|
Pixel shaders in current games are really complex (most are 50+ instructions). The fixed-function blender does just two or three ALU ops. Adding those ops to the end of the pixel shader when blending is needed wouldn't be a huge cost, and we could save all the transistors needed for the current fixed-function blender. The bigger problems are the datapaths, latency and ordering. ALU shouldn't be a problem anymore (and even less so in the future, as ALU performance has always increased a lot faster than memory bandwidth).
#12 |
|
Tea maker
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,379
|
In some devices, the "blend unit" is fully programmable in that it is just part of the shader.
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson "I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay |
#13 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
So I think "more traditional GPUs" will solve the problem by getting the render target data into the shader rather than by trying to make the blend unit itself more programmable (hence the "blend unit" won't really become programmable; it will just disappear at the hardware level). Last edited by mczak; 07-Oct-2011 at 15:36.
#14 |
|
Crazy coder
|
Do drivers optimize the shader to put the discard as early as possible, or do they just dump it at the end? If the latter, building it into the shader should be faster. The same goes for discard vs. z-test: the z-test happens after the shader, while discard can happen very early in the shader.
#15 |
|
Senior Member
|
Z rejection can happen before the shader in many cases.
#16 |
|
Member
Join Date: Nov 2007
Posts: 938
|
In most architectures, per-pixel/per-sample z/stencil rejection occurs after the shader. Only the coarser hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once). PowerVR of course does things very differently...
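As a rough illustration of coarse hi-z rejection, the sketch below keeps one conservative depth value per tile and rejects an incoming block of pixels against it before any shading. The tile size, the LESS depth test and the function names are assumptions made for the example, not a description of any particular GPU.

```python
def build_hiz(depth, width, height, tile=8):
    """Store the farthest (max) depth of every tile x tile block of the depth buffer."""
    hiz = {}
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            hiz[(tx // tile, ty // tile)] = max(
                depth[y][x]
                for y in range(ty, min(ty + tile, height))
                for x in range(tx, min(tx + tile, width)))
    return hiz


def coarse_reject(hiz, tile_xy, incoming_min_depth):
    """With a LESS test, the whole block fails if even its nearest depth is not
    closer than the farthest depth already stored in the tile."""
    return incoming_min_depth >= hiz[tile_xy]


depth = [[0.5] * 8 for _ in range(8)]           # one 8x8 tile, uniformly at 0.5
hiz = build_hiz(depth, 8, 8)
print(coarse_reject(hiz, (0, 0), 0.7))          # behind everything: culled as a block
print(coarse_reject(hiz, (0, 0), 0.3))          # in front: must be shaded and tested
```

If a single pixel in the tile is farther away than the incoming block, the coarse test can no longer reject the block, and the decision falls back to the finer per-pixel test.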
I have achieved noticeable performance improvements by using dynamic branching with discard. I detect the discard condition as soon as possible, and if the pixel is discarded I dynamically branch past the rest of the instructions. It doesn't seem to matter where you put the discard instruction; discard seems to only set a flag that culls the pixel after the pixel shader has completed. Drivers could optimize the shader by adding the dynamic branch automatically, but that would require a cost/benefit analysis, and you never know the shader usage patterns at compile time. For example, dynamic branching with discard improves our tree branch rendering very much (so many pixels are clipped), but it would decrease performance on a highly repeating checkerboard clip pattern (the branching granularity is not fine enough).
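The tree vs. checkerboard contrast can be reproduced with a toy cost model. Everything numeric here is an assumption chosen for illustration (a 64-pixel branching granularity, instruction counts, a small branch overhead), and pixels are laid out in a flat list rather than in 2D; the point is only that a group can skip the rest of the shader when all of its pixels are discarded.

```python
def shading_cost(discard_mask, use_branch, wave_size=64,
                 test_cost=4, rest_cost=50, branch_overhead=2):
    """Sum a per-group cost in arbitrary instruction units."""
    total = 0
    for start in range(0, len(discard_mask), wave_size):
        wave = discard_mask[start:start + wave_size]
        total += test_cost                   # evaluating the discard condition
        if use_branch:
            total += branch_overhead         # the branch itself is not free
            if not all(wave):                # any live pixel keeps the group busy
                total += rest_cost
        else:
            total += rest_cost               # discard only flags the pixel
    return total


clustered = [True] * 512 + [False] * 512             # large clipped area, like foliage
checkerboard = [i % 2 == 0 for i in range(1024)]     # alternating clip pattern

for name, mask in (("clustered", clustered), ("checkerboard", checkerboard)):
    print(name, shading_cost(mask, use_branch=False), shading_cost(mask, use_branch=True))
```

With these assumed numbers the branch pays off for the clustered mask and costs slightly more than no branch for the checkerboard, where no group is ever fully discarded.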
#17 | |
|
Senior Member
|
Quote:
#18 | |
|
Senior Member
Join Date: Mar 2006
Posts: 1,682
|
Quote:
(See here: http://www.slideshare.net/pjcozzi/z-...-optimizations) I don't know to what extent modern graphics fall within those boundary conditions, though... |
#19 | |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
Now that I think about it I wouldn't be surprised if NVIDIA only added that in the G80 timeframe, although it's been too long for me to trust my memory on that... And that reminds me I wonder how Tegra handles it - apparently it can reject 4 pixels per clock, but is that a tile-aligned quad or just four arbitrary pixels like in a very thin triangle (excluding quad-based shading inefficiencies)? Not that it really matters in the end with that kind of granularity. Also interesting info on how dynamic branching helps discard/alpha test performance, I must admit I never thought about that, cheers!
#20 |
|
Member
Join Date: Jan 2010
Posts: 375
|
Yes, in the case of my AO experiment I discard-branch on, for example, the sky. If I used a branched return-a-value instead, it would be slower (DX9 has no return-a-value branch-out; it runs the full shader and then applies a conditional). The area of the sky is often big enough to short-cut entire wavefronts, I assume; all of them seem to branch out as intended. I sped up HBAO by 25% this way (discard + blend + alpha-test). HBAO is also tex-fetch limited, so a branch should reduce cache thrashing in this specific case.
#21 | |
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
Pixel shader calculates pixel alpha value, so early alpha culling is not possible (you always need to run the shader). Stencil and depth are the only values that are known before the pixel shader (unless pixel shader writes to oDepth). Last edited by sebbbi; 10-Oct-2011 at 13:41. |
#22 | |
|
Crazy coder
|
Of course. My understanding of the OP was that we were only talking about having a z value available in the shader, so no ROP level culling.
Quote:
A lot of hardware detects whether all pixels in a quad are dead and, in that case, skips subsequent texture fetching. So while all instructions are still executed, it eases the pressure on the texture cache and memory. There will typically be no writes on the backend either.
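A small counting sketch shows what the quad-level check saves. It assumes 2x2 quads, even image dimensions, and that a quad with at least one live pixel still issues fetches for all four of its pixels; the function name and the fetch count per pixel are made up for the example.

```python
def texture_fetches_issued(dead, width, height, fetches_per_pixel=1):
    """Count fetches when fully dead 2x2 quads skip texturing (even width/height)."""
    issued = 0
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            quad = [dead[y][x] for y in (qy, qy + 1) for x in (qx, qx + 1)]
            if not all(quad):                 # one live pixel keeps the whole quad fetching
                issued += 4 * fetches_per_pixel
    return issued


left_half_dead = [[x < 2 for x in range(4)] for _ in range(4)]
checkerboard = [[(x + y) % 2 == 0 for x in range(4)] for y in range(4)]

print(texture_fetches_issued(left_half_dead, 4, 4))   # dead quads are skipped
print(texture_fetches_issued(checkerboard, 4, 4))     # no quad is fully dead
```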
#23 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
#24 |
|
Member
Join Date: Jan 2010
Posts: 375
|
Which can be logged in a stencil buffer, which can then lead to additional performance improvements. Needless to say, stencil buffers are FF as well.
#25 |
|
Member
Join Date: Nov 2007
Posts: 938
|
I am still living in the world of 2005 GPUs
But the hardware in the 2005 consoles was actually pretty modern in many ways. Memexport is more flexible than DX10 stream out. We had to wait for DX11 compute shaders to have something exceeding it. PC CPUs with vector FMA haven't even been released yet (Bulldozer and Haswell are the first to support it). And the PC received many other nice CPU instructions much later than the current-generation consoles. I just hope we will get hardware that is just as forward-looking in the next console generation. I would like to get my hands dirty with 512-bit AVX / gather and to finally do something productive with high-end compute shaders (and not only some sorting/data processing algorithms for research purposes).