Old 05-Oct-2011, 05:44   #1
Ethatron
Member
 
Join Date: Jan 2010
Posts: 375
Fixed vs. programmable

What is the current consensus on, e.g., fixed-function blending vs. doing the same in the pixel shader? Which is faster? I searched for an article but found nothing. There are quite a few fixed-function operations you can do (at least halfway) in shaders:

ztest vs. discard (only on source-value)
alphatest vs. discard (only on source-alpha)
blending vs. mul/mad (possible with ping-pong rts)

I'm asking because I wanted to do SSAO-multiply with blending (allows me to use a single-channel rendertarget, which in turn allows fetch-4/gather, which in turn allows me to do a less costly or wider blur), but I want to be sure it's worth the effort.
Any nice articles comparing fixed vs. programmable, in terms of what's possible and how fast it is in both cases? It's a traditional topic for a B3D-article, right?

Last edited by Ethatron; 05-Oct-2011 at 08:04.
Old 05-Oct-2011, 05:49   #2
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070

I don't think the fixed-function pipeline is there anymore. So you are really comparing your own emulation vs. the driver writer's emulation of fixed-function stuff.

Which is why I would think the ff path would win, but not due to hw differences.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 05-Oct-2011, 09:42   #3
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,489

you think a shader emulating fixed function is faster than a shader doing shading ?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 05-Oct-2011, 09:42   #4
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877

Quote:
ztest vs. discard (only on source-value)
Only on source value? I assume discard shouldn't be too bad then although ztest would still be faster. ROPs are ridiculously good at all depth operations.
Quote:
alphatest vs. discard (only on source-alpha)
alphatest *is* discard ever since the G80/R600 generation. It's just software emulation.
Quote:
blending vs. mul/mad (possible with ping-pong rts)
ping-pong RTS is far from cheap in a real-world scenario unless there's a technique I'm not aware of...

More interestingly perhaps, Tegra already supports fully programmable blending with an OpenGL extension - there's no HW blender, it's all done in the pixel shader no matter what. PowerVR SGX also does it all in the pixel shader but it is not exposed on any existing device - it nearly certainly will be on the Sony PS Vita though. I know this doesn't apply for you but it does hint at the possibility of PC GPUs supporting programmable blending in the future.

Quote:
I'm asking because I wanted to do SSAO-multiply with blending (allows me to use a single-channel rendertarget, which in turn allows fetch-4/gather, which in turn allows me to do a less costly or wider blur), but I want to be sure it's worth the effort.
I suppose it mostly depends on how expensive the ping-pong RTS is for you? I admit I don't really know.
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Old 05-Oct-2011, 10:55   #5
Ethatron
Member
 
Join Date: Jan 2010
Posts: 375

Quote:
Originally Posted by Arun View Post
Only on source value?
I mean you only have the fragment's own z-value available; you can't discard against the depth-buffer z-value. Well, in DX9 that is.

Quote:
Originally Posted by Arun View Post
I assume discard shouldn't be too bad then although ztest would still be faster. ROPs are ridiculously good at all depth operations.
Of course.

Quote:
Originally Posted by Arun View Post
ping-pong RTS is far from cheap in a real-world scenario unless there's a technique I'm not aware of...
I was considering that in the FF-hardware less data is moved around. With ping-ponging I have to read the entire previous framebuffer. With alpha-blend I could also rely on alpha==1.0 (my function is dst = src * ao) being optimized away without any predicate-insn.

Quote:
Originally Posted by Arun View Post
... supports fully programmable blending ...
I hope it's not in HLSL then. Well, nowadays you could probably map any high-level language onto some primitive transistor-network configuration. I hope it's fast otherwise.

Quote:
Originally Posted by Arun View Post
I suppose it mostly depends on how expensive the ping-pong RTS is for you? I admit I don't really know.
Difficult to predict without numbers. I don't have any at hand I think. If I remember right I found all raw number benchmarks incomplete, not testing all types of transfers. And none doing shader-emulations of some FF things.

Currently I have:

a ) framebuffer read, calc AO, multiply, framebuffer write
vs.
b ) calc AO, framebuffer blend (ADD, SRC ZERO, DST SRCALPHA)

Normally on FP16 targets. I think the read-latency may be hidden behind the AO. Then it's one multiply + write vs. blend.
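
A minimal CPU-side sketch of why the two variants give the same pixel, assuming the blend state in (b) means result = src * ZERO + dst * srcAlpha. The function names (`blend`, `variant_a`, `variant_b`) are made up for illustration and say nothing about performance:

```python
def blend(src, dst, src_factor, dst_factor):
    """Fixed-function style ADD blend: src * src_factor + dst * dst_factor, per channel."""
    return tuple(s * src_factor + d * dst_factor for s, d in zip(src, dst))


def variant_a(framebuffer_rgb, ao):
    """(a) The shader reads the previous framebuffer, multiplies by AO, writes the result."""
    return tuple(c * ao for c in framebuffer_rgb)


def variant_b(framebuffer_rgb, ao):
    """(b) The shader only outputs AO in alpha; the blender computes dst * srcAlpha."""
    src_rgb = (0.0, 0.0, 0.0)  # colour output is irrelevant because the SRC factor is ZERO
    src_alpha = ao
    return blend(src_rgb, framebuffer_rgb, src_factor=0.0, dst_factor=src_alpha)


pixel = (0.8, 0.5, 0.25)
for ao in (0.0, 0.5, 1.0):
    assert variant_a(pixel, ao) == variant_b(pixel, ao)
```

Since the colour channels of the shader output are unused in (b), the AO term fits in a single channel, which is what makes the single-channel rendertarget idea from the first post possible.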
Old 05-Oct-2011, 11:24   #6
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938

Quote:
Originally Posted by Ethatron View Post
Currently I have:

a ) framebuffer read, calc AO, multiply, framebuffer write
vs.
b ) calc AO, framebuffer blend (ADD, SRC ZERO, DST SRCALPHA)

Normally on FP16 targets. I think the read-latency may be hidden behind the AO. Then it's one multiply + write vs. blend.
It really depends. If you are sampler (tex) or texture cache bound, the blender version seems to be the best choice (one less tex instruction). If you are alu bound, you will save one multiply instruction by using the blender. If you are BW bound, then both should be equally fast (blender needs to read the framebuffer pixels just like you need in your shader, and there's equal amount of writes in both cases). And if you are fill bound (the shader has just a few alu and tex instructions) then both should be equally fast (performance is limited by the fill rate). I don't know that much about specifics on newest low end PC graphics hardware (esp Intel GMA/HD), but it could be that some hardware has some extra penalties in blending (for example halving the fill rate - of course this doesn't matter if your shader is complex enough and is not fill bound at half rate). On Xbox you should definitely use the blender (blending to EDRAM is free).
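
The reasoning above can be written down as a toy bottleneck model: per-pixel time is the maximum over the four limits, and the shader variant costs one extra tex and one extra ALU op while moving the same bytes. Everything here (the `Pass` class, the op counts, the rates) is a hypothetical placeholder in arbitrary units, not a measurement of any GPU:

```python
from dataclasses import dataclass


@dataclass
class Pass:
    tex_ops: int      # texture fetches per pixel
    alu_ops: int      # ALU instructions per pixel
    bytes_moved: int  # framebuffer bytes read + written per pixel


def pass_time(p, tex_rate, alu_rate, bandwidth, fill_time):
    """Per-pixel time is set by whichever unit is the bottleneck."""
    return max(p.tex_ops / tex_rate, p.alu_ops / alu_rate,
               p.bytes_moved / bandwidth, fill_time)


# Assumed AO shader: 8 fetches, 20 ALU ops; FP16 RGBA = 8 bytes read + 8 bytes written.
shader_variant = Pass(tex_ops=8 + 1, alu_ops=20 + 1, bytes_moved=16)  # extra fetch + multiply
blend_variant = Pass(tex_ops=8, alu_ops=20, bytes_moved=16)           # blender reads dst itself


def compare(tex_rate, alu_rate, bandwidth, fill_time):
    """Return (shader_time, blend_time) for one set of made-up hardware limits."""
    limits = (tex_rate, alu_rate, bandwidth, fill_time)
    return pass_time(shader_variant, *limits), pass_time(blend_variant, *limits)


print("tex bound :", compare(1.0, 100.0, 1000.0, 0.1))
print("alu bound :", compare(100.0, 1.0, 1000.0, 0.1))
print("bw bound  :", compare(100.0, 100.0, 1.0, 0.1))
print("fill bound:", compare(100.0, 100.0, 1000.0, 5.0))
```

In this model the blender only wins in the tex-bound and ALU-bound cases and ties otherwise; it ignores latency hiding and any hardware-specific blending penalty such as the halved fill rate mentioned above.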
Old 06-Oct-2011, 13:13   #7
Ethatron
Member
 
Join Date: Jan 2010
Posts: 375

So our little consensus is that using (supposedly) FF functionality has a better best case but no worse worst case?

Edit: Awesome, I did it with HDAO as a proof-of-concept (it's one-pass, perfect). 1.34ms with blending, 2.02ms with regular i/o. Damn, what a fast AO. I emulate cross-shape gather via two Fetch-4 fetches, so it better be quick. :P

Setup is 1400x1050, FP16 target, INTZ depth, with alpha-test, Fetch-4.
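
For scale, the arithmetic on the two timings above, using only the numbers from this post (variable names are just for illustration):

```python
width, height = 1400, 1050             # resolution from the post
t_blend_ms, t_regular_ms = 1.34, 2.02  # blending vs. regular i/o

pixels = width * height
saved_ms = t_regular_ms - t_blend_ms
speedup = t_regular_ms / t_blend_ms
saved_fraction = saved_ms / t_regular_ms


def ns_per_pixel(ms):
    """Convert a full-screen pass time in milliseconds to nanoseconds per pixel."""
    return ms * 1e6 / pixels


print(f"pixels: {pixels}")
print(f"saved: {saved_ms:.2f} ms ({saved_fraction:.1%}), speedup {speedup:.2f}x")
print(f"blend: {ns_per_pixel(t_blend_ms):.3f} ns/px, regular: {ns_per_pixel(t_regular_ms):.3f} ns/px")
```

That is roughly a 1.5x speedup, or about a third of the pass time saved.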

Last edited by Ethatron; 06-Oct-2011 at 13:30.
Old 06-Oct-2011, 19:12   #8
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291

Ping-pong doesn't work well with overlapping geometry as you won't be able to blend with things heading down the pipeline. That's a benefit of the fixed-function path and it guarantees ordering too.
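
A small simulation of that point, assuming a four-pixel 1D "framebuffer" and two overlapping multiplicative draws submitted in one pass. The function names are hypothetical and real hardware is obviously not a Python loop:

```python
def draw_with_blender(framebuffer, draws):
    """Fixed-function path: each draw multiplies into the live target, in submission order."""
    fb = list(framebuffer)
    for pixels, factor in draws:
        for p in pixels:
            fb[p] *= factor
    return fb


def draw_with_ping_pong(framebuffer, draws):
    """Single ping-pong pass: every draw reads the previous buffer and writes the new one."""
    previous = list(framebuffer)
    current = list(framebuffer)
    for pixels, factor in draws:
        for p in pixels:
            current[p] = previous[p] * factor  # stale read: earlier draws are not visible
    return current


framebuffer = [1.0, 1.0, 1.0, 1.0]
draws = [([0, 1, 2], 0.5), ([1, 2, 3], 0.5)]  # pixels 1 and 2 are covered twice

print(draw_with_blender(framebuffer, draws))
print(draw_with_ping_pong(framebuffer, draws))
```

The blender result darkens the overlap twice, while the ping-pong pass loses the first draw on the overlapping pixels. For a one-pass full-screen effect with no overlap the two agree.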
__________________
I speak only for myself.
Old 06-Oct-2011, 23:29   #9
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938

Quote:
Originally Posted by Ethatron View Post
So our little consensus is that using (supposedly) FF functionality has a better best case but no worse worst case?
It depends what you are looking for. The fixed function blender is pretty limited, it can only have two 4d inputs (existing pixel and the output from the pixel shader) and has very limited operations it can perform to those inputs. If you could feed it two outputs from pixel shader (fma blend mode) or if it could do dot products it would be much more useful for many tasks. Fully programmable blender would open up a lot of new opportunities (or having the possibility to get the existing render target color as input to pixel shader could of course make the blender completely useless, as every wanted blend mode could be implemented directly in the pixel shader). But this would be pretty hard to implement efficiently in the hardware (lots of new dependencies if ordering must be preserved).
Old 07-Oct-2011, 03:50   #10
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434

Quote:
Originally Posted by sebbbi View Post
It depends what you are looking for. The fixed function blender is pretty limited, it can only have two 4d inputs (existing pixel and the output from the pixel shader) and has very limited operations it can perform to those inputs. If you could feed it two outputs from pixel shader (fma blend mode) or if it could do dot products it would be much more useful for many tasks. Fully programmable blender would open up a lot of new opportunities (or having the possibility to get the existing render target color as input to pixel shader could of course make the blender completely useless, as every wanted blend mode could be implemented directly in the pixel shader). But this would be pretty hard to implement efficiently in the hardware (lots of new dependencies if ordering must be preserved).
I just don't see the blend unit getting fully programmable: once it starts to look like a shader ALU, it sounds like a waste. Getting the render target as input into the shader ALUs sounds like a much more forward-looking idea, even if it looks like quite some effort to keep operations ordered and the data coherent.
Old 07-Oct-2011, 12:13   #11
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938

Quote:
Originally Posted by mczak View Post
I just don't see the blend unit getting fully programmable: once it starts to look like a shader ALU, it sounds like a waste.
Pixel shaders in current games are really complex (most are 50+ instructions). The fixed function blender does just two or three ALU ops. Adding those ops to the end of the pixel shader when blending is needed wouldn't be a huge cost. And we could save all the transistors needed for the current fixed function blender. The bigger problems are the datapaths, latency and ordering. ALU shouldn't be a problem anymore (and even less so in the future, as ALU performance has always increased a lot faster than memory bandwidth).
Old 07-Oct-2011, 14:07   #12
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,379

Quote:
Originally Posted by mczak View Post
I just don't see the blend unit getting fully programmable .
In some devices, the "blend unit" is fully programmable in that it is just part of the shader.
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Old 07-Oct-2011, 15:22   #13
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434

Quote:
Originally Posted by Simon F View Post
In some devices, the "blend unit" is fully programmable in that it is just part of the shader.
Yes SGX doesn't have a problem getting the render target color into the shader alus...
So I think "more traditional gpus" will solve the problem with the render target data getting into the shader rather than trying to make the blend unit itself more programmable (hence the "blend unit" won't really get programmable it will just disappear at the hardware level).

Last edited by mczak; 07-Oct-2011 at 15:36.
Old 09-Oct-2011, 20:35   #14
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,216

Quote:
Originally Posted by Arun View Post
alphatest *is* discard ever since the G80/R600 generation. It's just software emulation.
Do drivers optimize the shader to put the discard as early as possible, or just dump it at the end? If the latter, building it into the shader should be faster. Same with discard vs z-test, z-test happens after the shader, discard can happen very early in shader.
__________________
[ Visit my site ]
I speak for myself and only myself.
Old 09-Oct-2011, 21:40   #15
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291

Quote:
Originally Posted by Humus View Post
Do drivers optimize the shader to put the discard as early as possible, or just dump it at the end? If the latter, building it into the shader should be faster. Same with discard vs z-test, z-test happens after the shader, discard can happen very early in shader.
Z rejection can happen before the shader in many cases.
Old 10-Oct-2011, 07:15   #16
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938

Quote:
Originally Posted by OpenGL guy View Post
Z rejection can happen before the shader in many cases.
In most architectures per pixel/sample z/stencil-rejection occurs after the shader. Only more coarse hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once). PowerVR of course does things very differently...

Quote:
Originally Posted by Humus View Post
Do drivers optimize the shader to put the discard as early as possible, or just dump it at the end? If the latter, building it into the shader should be faster. Same with discard vs z-test, z-test happens after the shader, discard can happen very early in shader.
I have managed noticeable performance improvements by using dynamic branching with discard. I detect the discard condition as soon as possible, and if the pixel is discarded I dynamically branch past the rest of the instructions. It doesn't seem to matter where you put the discard instruction; discard seems to only set a flag to cull the pixel after the pixel shader is complete. Drivers could optimize the shader by doing the dynamic branch automatically, but it would require cost/benefit analysis, and you never know the shader usage patterns at compile time. For example, dynamic branching with discard improves our tree branch rendering very much (so many pixels are clipped), but it would decrease performance on a highly repeating checkerboard clip pattern (branching granularity is not fine enough).
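
A rough sketch of the granularity argument, assuming pixels are shaded in fixed-size groups and a group can only skip the expensive path when every pixel in it takes the discard branch. The group size of 64 and the 1D pixel stream are assumptions for illustration; the extra cost of the branch itself is not modelled:

```python
def shaded_pixels(discard_mask, group_size):
    """Count pixels that still pay the full shader cost.

    A group (a wavefront/warp-like batch) skips the expensive path only
    when all of its pixels are discarded; otherwise the whole group runs it.
    """
    cost = 0
    for start in range(0, len(discard_mask), group_size):
        group = discard_mask[start:start + group_size]
        if not all(group):
            cost += len(group)
    return cost


GROUP = 64  # assumed batch size, hardware dependent
n = 1024
contiguous = [i < n // 2 for i in range(n)]    # half discarded, in one big block
checkerboard = [i % 2 == 0 for i in range(n)]  # half discarded, alternating pixels

print(shaded_pixels(contiguous, GROUP))
print(shaded_pixels(checkerboard, GROUP))
```

Both masks discard half the pixels, but only the contiguous one lets whole groups exit early; with the checkerboard every group still runs the full shader, so the branch is pure overhead.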
Old 10-Oct-2011, 07:29   #17
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,291

Quote:
Originally Posted by sebbbi View Post
In most architectures per pixel/sample z/stencil-rejection occurs after the shader. Only more coarse hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once). PowerVR of course does things very differently...
PowerVR isn't the only exception. See page 99 of the R300 register spec.
Old 10-Oct-2011, 07:33   #18
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 1,682

Quote:
Originally Posted by sebbbi View Post
In most architectures per pixel/sample z/stencil-rejection occurs after the shader. Only more coarse hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once).
I thought both AMD and Nvidia have something like early Z rejection? If the pixel shader doesn't modify the Z (and some other boundary condition wrt alpha, I suppose) then it should be able to drop pixels completely if it can be declared up front that it won't be visible.

(See here: http://www.slideshare.net/pjcozzi/z-...-optimizations)

I don't know to what extent modern graphics fall within those boundary conditions, though...
Old 10-Oct-2011, 09:48   #19
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877

Quote:
Originally Posted by sebbbi View Post
In most architectures per pixel/sample z/stencil-rejection occurs after the shader. Only more coarse hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once). PowerVR of course does things very differently...
That was true for the GeForce 3 and Radeon 8500 iirc, but things are very different nowadays and as OpenGL Guy said, even the R300 could do the fine-grained Z-Test systematically before the pixel shader if allowed to.

Now that I think about it I wouldn't be surprised if NVIDIA only added that in the G80 timeframe, although it's been too long for me to trust my memory on that... And that reminds me I wonder how Tegra handles it - apparently it can reject 4 pixels per clock, but is that a tile-aligned quad or just four arbitrary pixels like in a very thin triangle (excluding quad-based shading inefficiencies)? Not that it really matters in the end with that kind of granularity.

Also interesting info on how dynamic branching helps discard/alpha test performance, I must admit I never thought about that, cheers!
Old 10-Oct-2011, 10:46   #20
Ethatron
Member
 
Join Date: Jan 2010
Posts: 375

Quote:
Originally Posted by Arun View Post
Also interesting info on how dynamic branching helps discard/alpha test performance, I must admit I never thought about that, cheers!
Yes, in the case of my AO experiment I discard-branch, e.g., the sky. If I used a branched return-a-value it'd be slower (DX9 has no return-a-value branch-out; it runs the full shader, then applies a conditional). The area of the sky is often big enough to short-cut entire wavefronts, I assume; all of them seem to branch out as intended. I sped up HBAO by 25% this way (discard + blend + alpha-test). HBAO is also tex-fetch limited, so a branch should reduce cache thrashing in this specific case.
Old 10-Oct-2011, 13:25   #21
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938

Quote:
Originally Posted by silent_guy View Post
I thought both AMD and Nvidia have something like early Z rejection? If the pixel shader doesn't modify the Z (and some other boundary condition wrt alpha, I suppose) then it should be able to drop pixels completely if it can be declared up front that it won't be visible.
HiZ is an early Z optimization technique. It occurs before the pixel shader. If you wanted to do per pixel early Z culling, you would need to read the Z-buffer pixels from the memory before spawning the pixel shader threads. HiZ on the other hand is just a small optimization structure. It doesn't require memory accesses (can be kept on chip) as it has coarse resolution and coarse bit depth. I don't know if new PC hardware has other early Z functionality besides HiZ. It's entirely possible, if the chip can hide the latency of fetching depth values before executing the pixel shader. But if there's a mechanism for perfect 1:1 pixel based early depth culling, the same mechanisms could be likely adapted for programmable blending as well (fetch both pixel depth and color before spawning the pixel threads, and feed the color as pixel shader input = programmable blending).

Pixel shader calculates pixel alpha value, so early alpha culling is not possible (you always need to run the shader). Stencil and depth are the only values that are known before the pixel shader (unless pixel shader writes to oDepth).
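
A minimal sketch of the coarse HiZ idea described above, assuming a 1D depth buffer, a LESS depth test, and one conservative max-depth value per tile. The names `build_hiz` and `hiz_rejects` are hypothetical, and real implementations also quantize the stored value:

```python
def build_hiz(depth, tile):
    """Store one conservative value per tile: the farthest (max) depth inside it."""
    return [max(depth[i:i + tile]) for i in range(0, len(depth), tile)]


def hiz_rejects(hiz, tile_index, fragment_min_depth):
    """With a LESS test, a block of fragments is surely hidden if even its nearest
    fragment is not closer than the farthest depth already stored in the tile."""
    return fragment_min_depth >= hiz[tile_index]


depth = [0.2, 0.3, 0.25, 0.4,   # tile 0: everything is fairly close
         0.9, 0.2, 0.2, 0.2]    # tile 1: one far pixel keeps the tile "open"
hiz = build_hiz(depth, tile=4)

print(hiz)
print(hiz_rejects(hiz, 0, 0.5))  # whole block rejected without touching the depth buffer
print(hiz_rejects(hiz, 1, 0.5))  # not rejected; the per-pixel test must sort it out
```

In tile 1, three of the four pixels would fail a per-pixel test at depth 0.5, but the coarse structure cannot know that, which is the gap a fine-grained early Z test would close.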

Last edited by sebbbi; 10-Oct-2011 at 13:41.
Old 10-Oct-2011, 17:18   #22
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,216

Quote:
Originally Posted by OpenGL guy View Post
Z rejection can happen before the shader in many cases.
Of course. My understanding of the OP was that we were only talking about having a z value available in the shader, so no ROP level culling.

Quote:
Originally Posted by Arun View Post
Now that I think about it I wouldn't be surprised if NVIDIA only added that in the G80 timeframe, although it's been too long for me to trust my memory on that...
I could be wrong, but I would say yes on that, given that the RSX only has coarse culling on Z and S. Off the top of my head I think it's the same for Xenon as well.

Quote:
Originally Posted by sebbbi View Post
It doesn't seem to matter where you put the discard instruction, discard seems to only put a flag on to cull the pixel after the pixel shader is complete.
Many HW detect if all pixels in a quad are dead, and in that case skip subsequent texture fetching. So while all instructions are still executed, it eases the pressure on the texture cache and memory. Also there will typically be no writes on the backend either.
Old 10-Oct-2011, 17:50   #23
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434

Quote:
Originally Posted by sebbbi View Post
HiZ is an early Z optimization technique. It occurs before the pixel shader. If you wanted to do per pixel early Z culling, you would need to read the Z-buffer pixels from the memory before spawning the pixel shader threads. HiZ on the other hand is just a small optimization structure. It doesn't require memory accesses (can be kept on chip) as it has coarse resolution and coarse bit depth. I don't know if new PC hardware has other early Z functionality besides HiZ. It's entirely possible, if the chip can hide the latency of fetching depth values before executing the pixel shader. But if there's a mechanism for perfect 1:1 pixel based early depth culling, the same mechanisms could be likely adapted for programmable blending as well (fetch both pixel depth and color before spawning the pixel threads, and feed the color as pixel shader input = programmable blending).
The chips no longer have on-chip HiZ memory (not quite sure about NVIDIA, but this is true since ATI R600, and also for Intel IGPs, which didn't have HiZ before); instead they store the HiZ data alongside the normal depth data. The obvious disadvantage is that this could probably increase the bandwidth needed for depth reads/writes in some cases, but you no longer have problems switching depth buffers and the like. But I believe all current chips do fine-grained early Z culling - you can also program them to do z tests both before and after the pixel shader depending on stencil/z (and shader) state. There are optimization notes even in some AMD manuals saying that early-z might not always be a win, for instance if the shader is short. And it's not just early z reads, it's also writes - there's even an "auto" bit somewhere which decides exactly when z tests run depending on the bound shader (i.e. depending on whether the shader does z writes and/or discard).
Old 11-Oct-2011, 04:00   #24
Ethatron
Member
 
Join Date: Jan 2010
Posts: 375

Quote:
Originally Posted by Humus View Post
Also there will typically be no writes on the backend either.
Which can be logged with a stencil buffer, which in turn can lead to additional performance improvements. Unnecessary to mention that stencil buffers are FF as well.
Old 11-Oct-2011, 12:35   #25
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938

I am still living in the world of 2005 GPUs

But the hardware in the 2005 consoles was actually pretty modern in many ways. Memexport is more flexible than DX10 stream out. We had to wait for DX11 compute shaders to have something exceeding it. PC CPUs with vector FMA haven't yet even been released (Bulldozer and Haswell are first to support it). And PC received many other nice CPU instructions much later than current generation consoles. I just hope we will get as much forward looking console hardware in next generation. I would want to get my hands dirty with 512 bit AVX / gather and to finally do something productive on high end compute shaders (and not only some sorting/data processing algorithms for research purposes).
