Why isn't programmable blending more widely supported? (History and discussion)

Since the early days of hardware-accelerated 3D graphics, GPUs could perform a small set of blending functions while drawing polygons: Add, Subtract, Min, Max, etc. with selectable source and destination factors.

What I find strange is that blending has not evolved beyond primitive functions. The first GPUs with programmable pixel shaders were released over 15 years ago, and yet the newest versions of OpenGL, Direct3D and Vulkan lack programmable blending.

Programmable blending seems simple: Just let pixel shaders read the last framebuffer value. It's not as if nobody ever thought of this before. In the GLSL 1.10 specification, there is this issue:

23) Should the fragment shader be allowed to read the current location in the frame buffer?

DISCUSSION: It may be difficult to specify this properly while taking into account multisampling. It
also may be quite difficult for hardware implementors to implement this capability, at least with
reasonable performance. But this was one of the top two requested items after the original release of
the shading language white paper. ISVs continue to tell us that they need this capability, and that it
must be high performance.

RESOLUTION: Yes. This is allowed, with strong cautions as to performance impacts.

REOPENED on December 10, 2002. There is too much concern about impact to performance and
impracticality of implementation.

CLOSED on December 10, 2002

A whole decade later, programmable blending isn't standard. If we want it, we are forced to use hacks involving auxiliary framebuffers or atomic loads/stores.

It seems like manufacturers are slowly, slowly catching on to the fact that programmable blending can be useful and valuable to developers. Recently, the OpenGL extensions GL_ARM_shader_framebuffer_fetch and GL_EXT_shader_framebuffer_fetch appeared. These extensions offer a "gl_LastFragData" variable to pixel shaders with some caveats: Multiple Render Targets may not be supported, depth and stencil data is not available without another extension, and floating-point buffers are not supported.
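
For example, with the EXT extension a custom blend boils down to something like this in an ES 2.0-style fragment shader (a rough sketch; the u_texture/v_uv inputs and the overlay-style blend are just stand-ins I made up):

```glsl
#extension GL_EXT_shader_framebuffer_fetch : require
precision mediump float;

uniform sampler2D u_texture;   // hypothetical source texture
varying vec2 v_uv;

void main()
{
    vec4 src = texture2D(u_texture, v_uv);
    vec4 dst = gl_LastFragData[0];   // previous value of the framebuffer pixel

    // "Overlay"-style blend, one of many that fixed-function hardware can't do.
    vec3 blended = mix(2.0 * src.rgb * dst.rgb,
                       1.0 - 2.0 * (1.0 - src.rgb) * (1.0 - dst.rgb),
                       step(0.5, dst.rgb));

    gl_FragColor = vec4(mix(dst.rgb, blended, src.a), dst.a);
}
```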

The most obvious performance problem with programmable blending is that the GPU can no longer exploit mathematical properties of the blend function. For instance, there is no way to tell whether the blend is commutative or associative, so if a batch of triangles is submitted and some of them overlap in framebuffer space, blending correctly requires rendering the overlapping triangles one at a time. This could be mitigated by extensions such as GL_INTEL_fragment_shader_ordering, which give shaders control over the ordering of memory operations between overlapping fragments.
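
Roughly what that extension enables looks like this (a sketch only; the image binding, the v_srcColor input and the specific blend are my own assumptions, and it relies on GL_INTEL_fragment_shader_ordering together with image load/store):

```glsl
#version 430
#extension GL_INTEL_fragment_shader_ordering : require

// Hypothetical "framebuffer" bound as an image instead of a render target.
layout(binding = 0, rgba8) coherent uniform image2D u_color;

in vec4 v_srcColor;   // premultiplied-alpha source colour from the vertex stage

void main()
{
    // After this call, memory accesses from overlapping fragments of earlier
    // primitives are guaranteed to have completed, so the read-modify-write
    // below happens in primitive order per pixel.
    beginFragmentShaderOrderingINTEL();

    ivec2 p   = ivec2(gl_FragCoord.xy);
    vec4  dst = imageLoad(u_color, p);

    // Any blend function you like; plain premultiplied "over" shown here.
    imageStore(u_color, p, v_srcColor + dst * (1.0 - v_srcColor.a));
}
```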

I feel that standards writers cannot keep ignoring programmable blending. Developers want more flexible blending capabilities instead of being forced into hacks and workarounds. Plus, framebuffer fetch is part of Apple's Metal API, meaning it already ships on millions of devices on the market today.
 
My understanding has always been that framebuffer fetch is undesirable because it inserts a dependency that can be fairly high latency; the read can't happen until all older pixels that can write to the framebuffer have written to the framebuffer. And this means the hardware has to support this type of dependency tracking, although I don't know if this is any real problem with all of the other synchronization types the GPU needs to support. There's a similar issue with the depth buffer and fragment reject but since that's such an entrenched feature GPUs had to support it, albeit not without a big performance cost.

By having blending as a separate function it can be serialized in a way that's decoupled from the fragment shader and pipelined with as little latency as possible. It seems to me it would be preferable to have a different type of shader for blending rather than relying on doing it in the fragment shader with framebuffer fetches.

EDIT: There's a really good description of the issues here:

https://fgiesen.wordpress.com/2011/07/12/a-trip-through-the-graphics-pipeline-2011-part-9/
 
My understanding has always been that framebuffer fetch is undesirable because it inserts a dependency that can be fairly high latency;
I thought some architectures append code at the end of the pixel shader to perform blending? Also why would a memory fetch from the render target be any different from any other memory access in a shader... shouldn't the massive parallelism of the GPU hide this latency?
 
I thought some architectures append code at the end of the pixel shader to perform blending? Also why would a memory fetch from the render target be any different from any other memory access in a shader... shouldn't the massive parallelism of the GPU hide this latency?

If you haven't read the link I posted earlier I highly recommend it, and if you can, also check out the comments; they're very informative.

The only architecture I know of that definitely compiles the blending into the fragment shader is GeForce ULP, which was used in Tegra 1-4. It was a pretty outdated architecture even compared to other mobile GPUs released around the same time. You can read more about it here: http://www.nvidia.com/docs/IO/116757/Tegra_4_GPU_Whitepaper_FINALv2.pdf

The reason framebuffer fetches are a bigger problem than other types of memory reads is that the same fragment shader also writes to the framebuffer. That isn't the case for conventional buffers like textures, and where such read-after-write can happen it's done with explicit barriers and within shared memories of limited size.

To support both reading and writing several pixels at a time without introducing interlocks, two conditions must hold:

1) The pixels are partitioned in such a way that multiple reads and writes happen at different screen offsets and don't conflict with each other
2) The pixels that happen at the same offset are pipelined in lockstep so the reads and writes always happen in submission order

With fixed function blending in the ROPs it's easy to meet these two criteria. With programmable blending in the fragment shader #2 starts to introduce big efficiency problems. Basically, it means all of the threads attached to a ROP need to be coherent. So they have to be running the same part of a shader at the same time. They can't run different shaders (at least not if they read and write pixels), they can't be individually stalled, they can't diverge on branches or loops and so on. All of the operations have to take the same latency or when a stall happens (like on a cache miss) the whole group of threads has to be stalled together. Really the only thing you can do to hide latency at all is have larger warps of the same shader, which starts to run into scalability problems.

For GeForce ULP this probably wasn't a huge deal. It's not unified, so the vertex shaders don't dynamically compete with the fragment shaders. There are only 48 ALUs total, which isn't a lot of parallelism, and that's divided into 3xVLIW4 for each of the four L1/ROP clusters, so it probably doesn't get great utilization anyway. From what I recall it has no support for dynamic flow control, so divergence isn't an issue. Still, by Tegra 4 the design was really showing its limitations and I believe had been scaled beyond the limits of practicality, which is why it was scrapped soon afterwards in favour of Kepler.

The other thing you can do is introduce interlocks so stalls only occur when the framebuffer reads actually do interfere with older writes in progress. The issue with this is that on conventional renderers it has to track either all of the active pixels in parallel or every possible pixel in the framebuffer, both of which are big burdens for anything with much parallelism.

So I don't think GeForce ULP did interlocks. But it is probably used on the mobile tile-based renderers that implement the framebuffer read extension. It's less of a burden here, especially if the tiles are small, because there's a much smaller render target area within the tile that has to be tracked, and because the tile has a guaranteed low latency read and write. Despite this I expect that these GPUs still implement fixed function ROP blending and that framebuffer fetches are only very performant if you can guarantee that the interlocks don't trigger much. So they're useful as an alternative to using render-to-texture when you're iterating over the framebuffer in place, like for processing a G-buffer. But they're probably not that good for actual blending.
 
AMD GPUs have always had separate ROP caches that read/write directly from/to memory. These caches are not coherent with the main L2 cache. The pixel shader sends data to the ROPs (in a fire-and-forget manner; there's no data path back). The driver (DX11/OpenGL) needs to flush the ROP caches when you bind the render target as a texture. In DX12/Vulkan you explicitly do a resource transition barrier to perform this task. There's no guarantee that ROP data is visible in memory before the ROP caches are flushed. Thus memory loads (or texture sampling) from the render target could return stale data, or a mix of new and old data.

Metal on iOS is simple, because it only needs to support PowerVR TBDR GPUs. Triangles are binned to tiles, and each tile is rendered into an on-chip tile buffer. The previous value of each render target pixel in the tile is guaranteed to be available in the on-chip tile buffer. This simplifies things a lot and makes programmable blending highly efficient.

Vega is going to move the ROP caches under the L2 cache. Vega marketing diagrams still show L1 ROP caches. We don't yet know whether these new L1 ROP caches are coherent with the other caches, or whether you still need to flush them before reading a render target. I'd assume that the ROP caches are inclusive in L2, meaning that a flush would simply update the L2 cache according to the L1 ROP cache lines. In this case the flush incurs no memory traffic and would be much faster than ROP cache flushes on current GCN. Also there would be no need for an L2 cache flush (which is by far the slowest operation you can do... except for waiting for full GPU idle).

AMD hasn't yet disclosed whether Vega supports DX12.1, which would mean support for rasterizer ordered views. This feature allows doing programmable blending on DirectX. Both Nvidia and Intel support it. If AMD joins the party, we'd finally have standard programmable blending on all three major PC GPU IHVs. Granted, ROVs aren't exactly programmable blending, since you need to use a UAV instead of a render target. But that's semantics.
 
You can implement the behaviour of programmable blending already by operating on a UAV instead of an RTV. The whole portion of the code which implements the blending is wrapped in an atomic compare-exchange, and the blending is repeated with the new input whenever the exchange fails. It would be awfully slow in comparison with a good hardware-based solution, but it's a good starting point to look into how this could be approached.
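
Something like this, sketched in GLSL image-atomic terms rather than HLSL (the u_color binding, the 8:8:8:8 packing and the "over" blend are just placeholders I picked for illustration):

```glsl
#version 430

// The render target emulated as a UAV-style 32-bit image (8:8:8:8 packed).
layout(binding = 0, r32ui) coherent uniform uimage2D u_color;

in vec4 v_srcColor;   // premultiplied-alpha source colour

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);

    uint seen = imageLoad(u_color, p).x;
    uint expected;
    do {
        expected = seen;

        // Arbitrary programmable blend on the unpacked value ("over" here).
        vec4 dst     = unpackUnorm4x8(expected);
        vec4 blended = v_srcColor + dst * (1.0 - v_srcColor.a);

        // If another fragment updated the pixel in the meantime, the swap
        // fails and we redo the blend against the newer value.
        seen = imageAtomicCompSwap(u_color, p, expected, packUnorm4x8(blended));
    } while (seen != expected);
}
```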
 
You can implement the behaviour of programmable blending already by operating on a UAV instead of an RTV. The whole portion of the code which implements the blending is wrapped in an atomic compare-exchange, and the blending is repeated with the new input whenever the exchange fails. It would be awfully slow in comparison with a good hardware-based solution, but it's a good starting point to look into how this could be approached.
This doesn't work in the common case. UAVs have no guaranteed ordering (that's the U = unordered in the name). If multiple pixel shader instances write to the same address through a UAV, the order of the writes is not guaranteed by any means. Each atomic compare-exchange is of course atomic, but triangle submission order isn't respected. In comparison, ROP-based blending is guaranteed to happen in triangle submission order. ROVs extend UAV writes to guarantee triangle ordering.

You can sidestep the issue by allocating more space (or a linked list) per pixel. But in this case you need to record all pixel shader results and have a separate resolve pass to blend them in the proper order. The order might be completely random during rasterization, so you can't assume that no new data gets blended in between two samples. So there's no way to reduce the amount of data during rasterization if you want to be 100% sure of deterministic output. Thus perfect emulation requires X*Y*N storage, where N is the average overdraw factor. ROVs are the solution to this issue.
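
The store pass of that approach looks roughly like this in GLSL (the bindings and the Node struct are illustrative assumptions; a later full-screen resolve pass would walk each list, sort it into the required order and blend):

```glsl
#version 430

// Per-pixel list heads, cleared to 0xFFFFFFFFu before the geometry pass.
layout(binding = 0, r32ui) coherent uniform uimage2D u_headPtr;

// Global allocator for list nodes.
layout(binding = 0, offset = 0) uniform atomic_uint u_nodeCount;

struct Node
{
    vec4  color;
    float depth;
    uint  next;
};

layout(std430, binding = 1) buffer NodeBuffer { Node nodes[]; };

in vec4 v_srcColor;

void main()
{
    // Allocate a node (overflow check against the buffer size omitted here).
    uint idx = atomicCounterIncrement(u_nodeCount);

    // Atomically make this node the new head of the pixel's list.
    uint prev = imageAtomicExchange(u_headPtr, ivec2(gl_FragCoord.xy), idx);

    nodes[idx].color = v_srcColor;
    nodes[idx].depth = gl_FragCoord.z;
    nodes[idx].next  = prev;
}
```

The list order is whatever the hardware happened to produce, which is exactly why every fragment has to be kept around until the resolve pass does the ordering.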
 
This doesn't work in the common case.

I was thinking of a pure software solution, doing an ordered re-try: waiting for a global counter to reach the "successor" id. It's just a model to look at, nothing that is fast.
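
In GLSL-ish terms the model would look something like this (purely illustrative: v_order is a hypothetical per-fragment submission-order id that plain GLSL doesn't actually give you, and the bindings and blend are made up):

```glsl
#version 430

// Global "now retiring" counter shared by all fragments.
layout(std430, binding = 0) coherent buffer OrderBuf { uint nextToRetire; };

// Hypothetical "framebuffer" bound as an image so we can read and write it.
layout(binding = 0, rgba8) coherent uniform image2D u_color;

in vec4 v_srcColor;     // premultiplied-alpha source colour
flat in uint v_order;   // hypothetical submission-order id of this fragment

void main()
{
    // Busy-wait until it is this fragment's turn (atomicAdd of 0 = atomic read).
    while (atomicAdd(nextToRetire, 0u) != v_order) { }

    // The read-modify-write now happens in submission order.
    ivec2 p   = ivec2(gl_FragCoord.xy);
    vec4  dst = imageLoad(u_color, p);
    imageStore(u_color, p, v_srcColor + dst * (1.0 - v_srcColor.a));

    // Make the write visible, then pass the baton to the successor.
    memoryBarrierImage();
    atomicAdd(nextToRetire, 1u);
}
```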
 
I was thinking of a pure software solution, doing an ordered re-try: waiting for a global counter to reach the "successor" id. It's just a model to look at, nothing that is fast.
Waiting isn't safe in HLSL. Potential deadlock. There's no guarantee that the compute unit doesn't spend all remaining GPU time in running your wait loop, starving everything else, including the thread that you are waiting for.
 
Waiting isn't safe in HLSL. Potential deadlock. There's no guarantee that the compute unit doesn't spend all remaining GPU time in running your wait loop, starving everything else, including the thread that you are waiting for.

Damn, that's bad, no yield(). Wouldn't it make sense for an IHV to guarantee periodic yielding?
In the end it's always possible to use OIT mechanisms. But that's a real working algorithm, not a "software" model that helps to find ideas for chip improvements/features. It's also not memoryless, and I would very much like to think of a stateless/memoryless solution.
 
Damn, that's bad, no yield(). Wouldn't it make sense for an IHV to guarantee periodic yielding?
In the end it's always possible to use OIT mechanisms. But that's a real working algorithm, not a "software" model that helps to find ideas for chip improvements/features. It's also not memoryless, and I would very much like to think of a stateless/memoryless solution.
There is a memoryless solution. It is called rasterizer ordered view (ROV) :)
 
There is a memoryless solution. It is called rasterizer ordered view (ROV) :)

Yeah, you can have 1 of them, very nice. ;) Probably because it's not stateless, or not stateless enough. At least it's a solution, even if it doesn't let you get away with an unbounded number of ROVs in descriptor tables (like UAVs), or even just eight (like RTVs).
 