Fixed vs. programmable

Discussion in 'Architecture and Products' started by Ethatron, Oct 5, 2011.

  1. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    What is the current consensus on e.g. blending vs. pixel-shader work? Which is faster? I searched for an article on this but found nothing. There are quite a few operations you can do (at least partly) in shaders:

    ztest vs. discard (only on source-value)
    alphatest vs. discard (only on source-alpha)
    blending vs. mul/mad (possible with ping-pong rts)

    I'm asking because I want to do the SSAO multiply with blending (it allows me to use a single-channel rendertarget, which in turn allows Fetch-4/gather, which in turn allows me to do a less costly or wider blur), but I want to be sure it's worth the effort.
    Any nice articles comparing fixed vs. programmable, in terms of what's possible and how fast it is in both cases? It's a traditional topic for a B3D article, right?
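    For concreteness, here is a minimal numeric sketch of one of the pairs above, alpha-test vs. shader-side discard. All names are hypothetical; only the predicate equivalence is the point, not where the rejection happens in the pipeline.

```python
# Hypothetical per-pixel predicates: fixed-function alpha test vs. a
# shader-side discard. Both reject the fragment on the same condition,
# so the visible result is identical; only the pipeline stage at which
# the rejection happens differs.

ALPHA_REF = 0.5  # reference value, e.g. as set via D3DRS_ALPHAREF

def alpha_test_pass(src_alpha):
    """Fixed-function alpha test with a GREATEREQUAL compare."""
    return src_alpha >= ALPHA_REF

def shader_discard_pass(src_alpha):
    """Shader emulation: discard (clip() in HLSL) when below the ref."""
    discarded = src_alpha < ALPHA_REF
    return not discarded  # fragment survives only if not discarded

# Both predicates agree on every input:
for a in [0.0, 0.25, 0.5, 0.75, 1.0]:
    assert alpha_test_pass(a) == shader_discard_pass(a)
```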
     
    #1 Ethatron, Oct 5, 2011
    Last edited by a moderator: Oct 5, 2011
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I don't think the fixed-function pipeline is there anymore. So you are really comparing your own emulation of the fixed-function stuff against the driver writer's.

    Which is why I would expect the FF path to win, but not due to hardware differences.
     
  3. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    16,872
    Likes Received:
    4,196
    You think a shader emulating fixed function is faster than a shader doing the shading?
     
  4. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Only on source value? I assume discard shouldn't be too bad then, although z-test would still be faster. ROPs are ridiculously good at all depth operations.
    alphatest *is* discard ever since the G80/R600 generation; it's just software emulation.
    Ping-pong RTs are far from cheap in a real-world scenario, unless there's a technique I'm not aware of...

    More interestingly perhaps, Tegra already supports fully programmable blending via an OpenGL extension - there's no HW blender, it's all done in the pixel shader no matter what. PowerVR SGX also does it all in the pixel shader, but it is not exposed on any existing device - it almost certainly will be on the Sony PS Vita, though. I know this doesn't apply to you, but it does hint at the possibility of PC GPUs supporting programmable blending in the future.

    I suppose it mostly depends on how expensive the ping-pong RTS is for you? I admit I don't really know.
     
  5. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    I mean you just have the actual z-value available; you can't discard against the depth-buffer z-value. Well, in DX9 at least.

    Of course. :)

    I was considering that with the FF hardware less data is moved around. With ping-ponging I have to read the entire previous framebuffer. With alpha blending I could also rely on alpha == 1.0 (my function is dst = src * ao) being optimized away without any predicate instruction.

    I hope it's not in HLSL then. :) Well, nowadays you could probably map any high-level language onto some primitive transistor-network configuration. I hope it's fast otherwise.

    Difficult to predict without numbers, and I don't have any at hand. If I remember right, all the raw-number benchmarks I found were incomplete, not testing all types of transfers. And none did shader emulations of the FF paths.

    Currently I have:

    a ) framebuffer read, calc AO, multiply, framebuffer write
    vs.
    b ) calc AO, framebuffer blend (ADD, SRC ZERO, DST SRCALPHA)

    Normally on FP16 targets. I think the read-latency may be hidden behind the AO. Then it's one multiply + write vs. blend.
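    A small numeric sketch of the two paths (assuming, per the blend state above, ADD with SRCBLEND = ZERO and DESTBLEND = SRCALPHA, so the blender computes dst * src.alpha). Both paths land on framebuffer * ao:

```python
# Sketch of the two SSAO-multiply paths, assuming path (b) writes the
# AO term to source alpha. Both produce framebuffer * ao.

def path_a(fb_color, ao):
    # a) shader reads the framebuffer, multiplies by AO, writes back
    return [c * ao for c in fb_color]

def path_b(fb_color, ao):
    # b) shader outputs AO in alpha; the fixed-function blend does
    #    ADD with SRCBLEND=ZERO, DESTBLEND=SRCALPHA:
    #    new = src*0 + dst*src.alpha = dst * ao
    src = (0.0, 0.0, 0.0, ao)          # rgb unused, alpha carries AO
    return [0.0 * s + d * src[3] for s, d in zip(src[:3], fb_color)]

fb = (0.8, 0.6, 0.4)
assert path_a(fb, 0.5) == path_b(fb, 0.5) == [0.4, 0.3, 0.2]
```

    The difference is only where the framebuffer read happens: in (a) it is a tex fetch issued by the shader, in (b) it is done by the ROP/blender.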
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    It really depends.

    If you are sampler (tex) or texture-cache bound, the blender version is the better choice (one less tex instruction).
    If you are ALU bound, you save one multiply instruction by using the blender.
    If you are BW bound, both should be equally fast (the blender needs to read the framebuffer pixels just like your shader does, and there's an equal amount of writes in both cases).
    If you are fill bound (the shader has just a few ALU and tex instructions), both should be equally fast (performance is limited by the fill rate).

    I don't know that much about the specifics of the newest low-end PC graphics hardware (esp. Intel GMA/HD), but some hardware may have extra penalties for blending (for example halving the fill rate, which of course doesn't matter if your shader is complex enough not to be fill bound at half rate). On Xbox you should definitely use the blender (blending to EDRAM is free).
     
  7. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    So our little consensus is that utilizing (supposedly) FF functionality has a better best case but no worse worst case?

    Edit: Awesome, I did it with HDAO as a proof of concept (it's one-pass, perfect): 1.34 ms with blending, 2.02 ms with regular i/o. Damn, what a fast AO. I emulate a cross-shaped gather via two Fetch-4 fetches, so it had better be quick. :p

    Setup is 1400x1050, FP16 target, INTZ depth, with alpha-test, Fetch-4.
     
    #7 Ethatron, Oct 6, 2011
    Last edited by a moderator: Oct 6, 2011
  8. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Ping-pong doesn't work well with overlapping geometry as you won't be able to blend with things heading down the pipeline. That's a benefit of the fixed-function path and it guarantees ordering too.
     
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    It depends on what you are looking for. The fixed-function blender is pretty limited: it can only take two 4d inputs (the existing pixel and the output from the pixel shader) and has a very limited set of operations it can perform on those inputs. If you could feed it two outputs from the pixel shader (an fma blend mode), or if it could do dot products, it would be much more useful for many tasks. A fully programmable blender would open up a lot of new opportunities (or, having the possibility to get the existing render-target color as an input to the pixel shader could of course make the blender completely useless, as every desired blend mode could then be implemented directly in the pixel shader). But this would be pretty hard to implement efficiently in hardware (lots of new dependencies if ordering must be preserved).
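    The limitation sebbbi describes can be sketched as a tiny model of the classic blend equation: per channel, result = src * sf + dst * df, with the factors picked from a small fixed menu. The menu below is a hypothetical subset, just enough to show what fits and what doesn't.

```python
# Minimal model of the fixed-function blender: two 4d inputs only
# (shader output 'src', render-target value 'dst'), blend factors from
# a small fixed menu, combined per channel with ADD.

FACTORS = {
    "ZERO":      lambda s, d, i: 0.0,
    "ONE":       lambda s, d, i: 1.0,
    "SRCALPHA":  lambda s, d, i: s[3],
    "DESTCOLOR": lambda s, d, i: d[i],
}

def ff_blend_add(src, dst, sf, df):
    """result[i] = src[i]*sf + dst[i]*df  (the ADD blend op)."""
    return [src[i] * FACTORS[sf](src, dst, i) +
            dst[i] * FACTORS[df](src, dst, i)
            for i in range(4)]

# The SSAO multiply fits this model (ZERO / SRCALPHA)...
src = (0.0, 0.0, 0.0, 0.5)  # AO carried in source alpha
dst = (0.8, 0.6, 0.4, 1.0)
assert ff_blend_add(src, dst, "ZERO", "SRCALPHA")[:3] == [0.4, 0.3, 0.2]

# ...but e.g. a dot product of src and dst cannot be expressed: each
# output channel only ever sees src[i] and dst[i] scaled by one factor;
# there is no cross-channel sum anywhere in the fixed menu.
```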
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,021
    Likes Received:
    119
    I just don't see the blend unit getting fully programmable; making it look like a shader ALU sounds like a waste. Getting the render target as an input into the shader ALUs sounds like a much more forward-looking idea, even if it looks like quite some effort to keep operations ordered and the data coherent.
     
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Pixel shaders in current games are really complex (most are 50+ instructions), while the fixed-function blender does just two or three ALU ops. Adding those ops to the end of the pixel shader when blending is needed wouldn't be a huge cost, and we could save all the transistors needed for the current fixed-function blender. The bigger problems are the datapaths, latency and ordering. ALU shouldn't be a problem anymore (and even less so in the future, as ALU performance has always increased a lot faster than memory bandwidth).
     
  12. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    In some devices, the "blend unit" is fully programmable in that it is just part of the shader.
     
  13. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,021
    Likes Received:
    119
    Yes SGX doesn't have a problem getting the render target color into the shader alus...
    So I think "more traditional GPUs" will solve the problem by getting the render-target data into the shader rather than by trying to make the blend unit itself more programmable (hence the "blend unit" won't really get programmable; it will just disappear at the hardware level).
     
    #13 mczak, Oct 7, 2011
    Last edited by a moderator: Oct 7, 2011
  14. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    Do drivers optimize the shader to put the discard as early as possible, or just dump it at the end? If the latter, building it into the shader yourself should be faster. Same with discard vs. z-test: z-test happens after the shader, while discard can happen very early in the shader.
     
  15. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    Z rejection can happen before the shader in many cases.
     
  16. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    In most architectures per pixel/sample z/stencil-rejection occurs after the shader. Only more coarse hi-z rejection happens before the shader (and it culls bigger blocks of pixels at once). PowerVR of course does things very differently...

    I have achieved noticeable performance improvements by using dynamic branching with discard. I detect the discard condition as soon as possible, and if the pixel is discarded I dynamically branch over the rest of the instructions. It doesn't seem to matter where you put the discard instruction itself; discard seems to only set a flag that culls the pixel after the pixel shader is complete. Drivers could optimize the shader by doing the dynamic branch automatically, but that would require cost/benefit analysis, and you never know the shader usage patterns at compile time. For example, dynamic branching with discard improves our tree-branch rendering very much (so many pixels are clipped), but it would decrease performance on a highly repetitive checkerboard clip pattern (branching granularity is not fine enough).
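    The granularity argument can be sketched numerically: a wavefront only skips the branched-over work if all of its lanes take the discard path. The 64-lane width, the block size and the two masks below are assumptions for illustration.

```python
# Sketch of discard-branch granularity: a wavefront skips the expensive
# code only when ALL of its lanes discard. Compare a large coherent
# discard region (sky) with a fine checkerboard pattern.

WAVE = 64  # assumed lanes per wavefront

def wavefronts_doing_work(discard_mask):
    """discard_mask: per-pixel booleans, True = pixel discarded."""
    waves = [discard_mask[i:i + WAVE]
             for i in range(0, len(discard_mask), WAVE)]
    # A wave branches out entirely only when every lane discards.
    return sum(1 for w in waves if not all(w))

pixels = 64 * 64  # a 4096-pixel block, i.e. 64 wavefronts

# Large coherent region (e.g. sky): second half of the block discarded.
sky = [i >= pixels // 2 for i in range(pixels)]
# Fine checkerboard: every other pixel discarded.
checker = [i % 2 == 1 for i in range(pixels)]

print(wavefronts_doing_work(sky))      # half the waves skip the work
print(wavefronts_doing_work(checker))  # every wave still runs it all
```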
     
  17. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    PowerVR isn't the only exception. See page 99 of the R300 register spec.
     
  18. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    I thought both AMD and Nvidia have something like early-Z rejection? If the pixel shader doesn't modify Z (and some other boundary conditions wrt alpha, I suppose), then it should be able to drop pixels completely if it can be determined up front that they won't be visible.

    (See here: http://www.slideshare.net/pjcozzi/z-buffer-optimizations)

    I don't know to what extent modern graphics fall within those boundary conditions, though...
     
  19. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    That was true for the GeForce 3 and Radeon 8500 iirc, but things are very different nowadays and as OpenGL Guy said, even the R300 could do the fine-grained Z-Test systematically before the pixel shader if allowed to.

    Now that I think about it I wouldn't be surprised if NVIDIA only added that in the G80 timeframe, although it's been too long for me to trust my memory on that... And that reminds me I wonder how Tegra handles it - apparently it can reject 4 pixels per clock, but is that a tile-aligned quad or just four arbitrary pixels like in a very thin triangle (excluding quad-based shading inefficiencies)? Not that it really matters in the end with that kind of granularity.

    Also interesting info on how dynamic branching helps discard/alpha test performance, I must admit I never thought about that, cheers!
     
  20. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    934
    Likes Received:
    379
    Yes, in the case of my AO experiment I discard-branch e.g. the sky. If I used a branched return-a-value it'd be slower (DX9 has no return-a-value branch-out; it runs the full shader, then the conditional). The area of the sky is often big enough to short-cut entire wavefronts, I assume; all of them seem to branch out as intended. I sped up HBAO by 25% this way (discard + blend + alpha test). HBAO is also tex-fetch limited, so a branch should reduce cache-thrashing in this specific case.
     