Fixed vs. programmable

I thought both AMD and Nvidia have something like early Z rejection? If the pixel shader doesn't modify Z (and meets some other boundary conditions wrt alpha, I suppose), then it should be able to drop pixels completely if it can be determined up front that they won't be visible.
HiZ is an early Z optimization technique. It occurs before the pixel shader. If you wanted to do per-pixel early Z culling, you would need to read the Z-buffer pixels from memory before spawning the pixel shader threads. HiZ, on the other hand, is just a small optimization structure. It doesn't require memory accesses (it can be kept on chip) as it has coarse resolution and coarse bit depth. I don't know if new PC hardware has other early Z functionality besides HiZ. It's entirely possible, if the chip can hide the latency of fetching depth values before executing the pixel shader. But if there's a mechanism for perfect 1:1 per-pixel early depth culling, the same mechanism could likely be adapted for programmable blending as well (fetch both the pixel depth and color before spawning the pixel threads, and feed the color in as a pixel shader input = programmable blending).

The pixel shader calculates the pixel's alpha value, so early alpha culling is not possible (you always need to run the shader). Stencil and depth are the only values that are known before the pixel shader runs (unless the pixel shader writes to oDepth).
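To make the programmable blending idea above concrete, here's a minimal sketch of how one could approximate it on DX11-class hardware today: bind the color data as a UAV instead of a render target and do the read-modify-write in the pixel shader. All names are made up, a structured buffer is used to sidestep typed UAV load restrictions, and unlike the fixed function blender there is no ordering guarantee for overlapping pixels in flight, so treat it as an illustration rather than a drop-in replacement.

Code:
// Destination "color buffer" bound as a UAV instead of a render target.
RWStructuredBuffer<float4> destColor : register(u1);
Texture2D    srcTex    : register(t0);
SamplerState linearSmp : register(s0);

cbuffer BlendConsts : register(b0)
{
    float blendFactor;   // arbitrary blend parameter
    uint  destWidth;     // destination width in pixels
}

// No SV_Target output: the "blend" happens entirely inside the shader.
void PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0)
{
    uint2 p   = uint2(pos.xy);
    uint  idx = p.y * destWidth + p.x;

    float4 src = srcTex.Sample(linearSmp, uv);
    float4 dst = destColor[idx];                    // fetch the destination color
    destColor[idx] = lerp(dst, src, blendFactor);   // programmable blend
}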
 
Z rejection can happen before the shader in many cases.

Of course. My understanding of the OP was that we were only talking about having a z value available in the shader, so no ROP level culling.

Now that I think about it I wouldn't be surprised if NVIDIA only added that in the G80 timeframe, although it's been too long for me to trust my memory on that...

I could be wrong, but I would say yes, based on the fact that RSX only has coarse culling on Z and stencil. Off the top of my head, I think it's the same for Xenos as well.

It doesn't seem to matter where you put the discard instruction; discard seems to just set a flag so the pixel is culled after the pixel shader completes.

A lot of hardware detects when all pixels in a quad are dead, and in that case skips subsequent texture fetches. So while all instructions are still executed, it eases the pressure on the texture cache and memory. Also there will typically be no writes on the backend either.
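A small sketch of what's being described, with made-up resource names: the discard is issued as early as possible, but the instructions after it still run; the win is that the later texture fetch (and the backend write) can be dropped once the whole quad is dead.

Code:
Texture2D    maskTex   : register(t0);
Texture2D    albedoTex : register(t1);
SamplerState smp       : register(s0);

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float mask = maskTex.Sample(smp, uv).r;
    if (mask < 0.5)
        discard;                  // only flags the pixel as dead, as noted above

    // Still issued by the program, but the hardware can skip the actual fetch
    // (and the final color write) if all four pixels of the 2x2 quad are dead.
    float4 albedo = albedoTex.Sample(smp, uv);
    return albedo;
}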
 
HiZ is an early Z optimization technique. It occurs before the pixel shader. [...]
The chips no longer have dedicated HiZ on-chip memory (not quite sure about nvidia, but this has been the case since ati r600, and also for intel IGPs, which didn't have HiZ before); instead they store the HiZ data alongside the normal depth data. The obvious disadvantage is that this can increase the bandwidth needed for depth reads/writes in some cases, but you no longer have problems switching depth buffers and the like. I believe all current chips do fine-grained early Z culling as well - you can also program them to do z tests both before and after the pixel shader, depending on stencil/z (and shader) state. There are even optimization notes in some amd manuals that early-z might not always be a win if the shader is short, among other things. And it's not just early z reads, it's also writes - there's even an "auto" bit somewhere which decides exactly when the z tests will run depending on the bound shader (i.e. depending on whether the shader writes depth and/or uses discard).
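For the depth-write case mentioned above, a minimal HLSL sketch of the two situations the hardware has to pick between (the conservative-depth variant is SM5; names and values are illustrative):

Code:
// Writing SV_Depth freely forces the depth test to run after the shader.
float4 PS_LateZ(float4 pos : SV_Position,
                out float depth : SV_Depth) : SV_Target
{
    depth = pos.z * 0.5;                  // arbitrary depth modification
    return float4(1, 0, 0, 1);
}

// A conservative depth output promises the value only moves in one direction,
// so the hardware can keep (coarse) early Z rejection enabled.
float4 PS_ConservativeZ(float4 pos : SV_Position,
                        out float depth : SV_DepthGreaterEqual) : SV_Target
{
    depth = pos.z + 0.001;
    return float4(0, 1, 0, 1);
}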
 
Also there will typically be no writes on the backend either.

Which can be recorded in a stencil buffer, which in turn can lead to additional performance improvements. Needless to say, stencil buffers are fixed function as well.
 
I am still living in the world of 2005 GPUs :D

But the hardware in the 2005 consoles was actually pretty modern in many ways. Memexport is more flexible than DX10 stream out; we had to wait for DX11 compute shaders for something exceeding it. PC CPUs with vector FMA haven't even been released yet (Bulldozer and Haswell are the first to support it), and the PC received many other nice CPU instructions much later than the current generation consoles. I just hope we get equally forward-looking console hardware next generation. I would want to get my hands dirty with 512-bit AVX / gather and to finally do something productive on high end compute shaders (and not only some sorting/data processing algorithms for research purposes).
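For reference, this is roughly the kind of scattered write that memexport allowed and that DX10 stream out (append-only, in order) cannot express, while a DX11 compute shader handles it trivially. The buffer names below are just for illustration.

Code:
StructuredBuffer<float4>   inputData  : register(t0);
StructuredBuffer<uint>     destIndex  : register(t1);   // arbitrary per-element destination
RWStructuredBuffer<float4> outputData : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Scatter: each thread writes to an arbitrary location in the output,
    // something an in-order stream out cannot do.
    outputData[destIndex[id.x]] = inputData[id.x] * 2.0;
}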
 
... and to finally do something productive on high end compute shaders (and not only some sorting/data processing algorithms for research purposes).
The majority of the Battlefield 3 lighting/shading pipeline doesn't count as something "productive"? :)

Agreed that the consoles got a few things right early on, although they made a bunch of big mistakes too. Static (and poorly chosen) EDRAM sizing and non-cached SPU local memory are two examples. Both of those memories would be a hell of a lot more useful as part of a cache hierarchy, and that's hardly impossible to achieve even in the 2005 time-frame.

Overall though they have aged fairly decently. If I had to pick on one piece of hardware it'd be RSX for being behind the times even when it launched.

Back on topic though, I don't think some of the comparisons of "fixed" vs. "programmable" that are being made are really meaningful or fair. It's not really black and white... everything has some element of programmability and associated flexibility, all the way down to 32-bit ALUs. You can't lump high-level features into two categories and come up with any useful conclusions.
 
Back on topic though, I don't think some of the comparisons of "fixed" vs. "programmable" that are being made are really meaningful or fair. [...]
Yes, and it's very convenient (and highly efficient) to have fixed function texture sampling units (bilinear filtering, mip level calculation, mipmap filtering, anisotropic filtering, etc). Calculating mip levels and blending factors in pixel shaders would cost a lot of programmable ALU cycles, and it would increase on-chip data movement (you'd need to transfer four samples to the ALUs for bilinear filtering instead of one sample that is already filtered from four).
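To put a number on that, here's a hedged sketch of what just the bilinear part would look like if emulated in the pixel shader (resource names and setup are assumptions): four point-filtered fetches plus three lerps instead of a single fixed-function sample, and that's before mip selection or anisotropy.

Code:
Texture2D<float4> tex      : register(t0);
SamplerState      pointSmp : register(s0);     // point/nearest sampler
cbuffer Consts : register(b0)
{
    float2 texSize;                            // texture dimensions in texels
}

float4 BilinearManual(float2 uv)
{
    float2 coord = uv * texSize - 0.5;         // texel-space position
    float2 f     = frac(coord);                // bilinear blend factors
    float2 base  = (floor(coord) + 0.5) / texSize;
    float2 texel = 1.0 / texSize;

    // Four samples have to travel to the ALUs instead of one filtered result.
    float4 c00 = tex.SampleLevel(pointSmp, base, 0);
    float4 c10 = tex.SampleLevel(pointSmp, base + float2(texel.x, 0), 0);
    float4 c01 = tex.SampleLevel(pointSmp, base + float2(0, texel.y), 0);
    float4 c11 = tex.SampleLevel(pointSmp, base + float2(texel.x, texel.y), 0);

    // Three lerps of ALU work that the TMU otherwise does for free.
    return lerp(lerp(c00, c10, f.x), lerp(c01, c11, f.x), f.y);
}

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    return BilinearManual(uv);   // vs. a single tex.Sample(bilinearSmp, uv)
}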
 
Not really.
Also d3d10 adds more fixed function stuff such as alpha to coverage.
Alpha to coverage isn't really new; it's part of one of the oldest OpenGL ARB extensions in existence already :) Though granted, d3d9 had no standard way to expose it.
You are correct though that current graphics chips do indeed still have alpha test functionality. The reasons for it escape me though: it can't be exposed by d3d10, and it could easily be emulated with discard for d3d9/ogl (wouldn't that potentially be faster anyway on today's gpus, if you don't always have to run the whole shader?). It saves you recompiling the shader if the alpha test state changes, but gpu drivers do that for other reasons (like changing fog state) anyway.
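For completeness, the old alpha test really is a one-liner when emulated with discard; a minimal sketch with made-up names and an assumed reference value in a constant buffer:

Code:
Texture2D    diffuseTex : register(t0);
SamplerState linearSmp  : register(s0);
cbuffer AlphaTest : register(b0)
{
    float alphaRef;                       // e.g. 0.5, the old ALPHAREF state
}

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float4 color = diffuseTex.Sample(linearSmp, uv);
    clip(color.a - alphaRef);             // clip() compiles to a discard when negative
    return color;
}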
 
More interestingly perhaps, Tegra already supports fully programmable blending with an OpenGL extension - there's no HW blender, it's all done in the pixel shader no matter what.
I guess I'll just leave this here for the convenience of the OP (tegra extensions discussed at the end).
 
Sorry for the thread resurrection, but I have a related question: are int8 render targets still in widespread use, or are most intermediate buffers fp16 nowadays? The reason I ask is that the inevitable move to full speed fp16 blending seems like a good time to consider performing that function on the main ALUs.
 
Sorry for the thread resurrection, but I have a related question: are int8 render targets still in widespread use, or are most intermediate buffers fp16 nowadays? The reason I ask is that the inevitable move to full speed fp16 blending seems like a good time to consider performing that function on the main ALUs.
AMD's ROPs are already capable of full speed FP16 blending, aren't they? Just the filtering in the TMUs is half rate (nV's TMUs can do that at full rate, while their ROPs handle FP16 at half rate).
 
I think all Xbox ports are at least RGBA8 capable. I guess it's more a matter of compatibility, i.e. not removing it, rather than not trusting that it's available. It does complicate the pipeline (two cases) though.
Otherwise, when you don't do HDR in a game, why would you use FP16 anyway?

I was bugging the ATI driver team to expose FP10 targets to me via FOURCC in the next Catalysts (that was in June), but I haven't gotten them. :) I'd love to get my 15% speedup in Oblivion back.
 
AMD's ROPs are already capable of full speed FP16 blending, aren't they? Just the filtering in the TMUs is half rate (nV's TMUs can do that at full rate, while their ROPs handle FP16 at half rate).

Don't think so. They do full speed FP16 writes, not blends.
 
FP16 RT blending in GCN is still half-rate, as in any other contemporary architecture. Write ops are full-rate for GCN, but still can't peak anywhere near the INT8 rates.
Fermi is half-rate for both blends and writes.
 
Most probably.
Cayman is also capable of full-rate FP16 RT writes, and considering its ROP configuration is 1:1 with GCN, the 50% less bandwidth available almost cuts the rate to 2/3.
 
Looking at the tests hardware.fr did, I get the hunch that the lower performance of FP16 writes (without blending) is the result of a less than optimal layout of the framebuffer in memory, or that they can't stream out a complete framebuffer tile in a single rush, or something along those lines.

[hardware.fr fillrate charts (without and with blending)]


It is somewhat bandwidth limited on Cayman (about 165 of 176 GB/s is used for 4xFP16), but it is definitely not on Tahiti (it uses just 198 out of 264 GB/s). On the other hand, it is basically back at the bandwidth limit with blending and a used bandwidth of 232 GB/s, even though the blending (with the switching between read and write on the memory bus) should reduce the bandwidth efficiency slightly (and you can see in the tests there that bandwidth-limited fillrate with blending is usually slightly less than half of that without blending).

Is this just some contention effect from the additional crossbar between ROPs and memory controllers in Tahiti? I would like to see this fillrate benchmark with reduced core clock or increased memory clock, as the theoretical limit for half-rate FP16 blending at default clock would be 14.8 GPixel/s, and Tahiti comes awfully close. Meanwhile, without blending it is quite a bit away from the theoretical limit, even though the bandwidth efficiency should be better (if the framebuffer layout is not suboptimal for FP16 targets).
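(Rough arithmetic, assuming Tahiti's 32 ROPs at 925 MHz: 32 × 0.925 GHz ≈ 29.6 GPixel/s at full rate, hence 14.8 GPixel/s at half rate; and since a blended 4xFP16 pixel costs 8 bytes read plus 8 bytes written, 232 GB/s ÷ 16 bytes ≈ 14.5 GPixel/s, which is right at that limit.)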
 
The chips no longer have dedicated HiZ on-chip memory [...] instead they store the HiZ data alongside the normal depth data. [...]
Even if the Hi-Z memory is not stored on chip anymore, it can always be cached like any other data (z, color, etc.).
Moreover, all current chips must be able to do early z rejection at the fragment level, as in DX11 any pixel shader can be tagged with the [earlydepthstencil] attribute.
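A minimal sketch of that attribute in use, with a made-up UAV counter as the side effect; the attribute guarantees the depth/stencil test runs before the shader, so the side effect only happens for pixels that pass:

Code:
RWStructuredBuffer<uint> visibleCount : register(u1);   // illustrative UAV

[earlydepthstencil]
float4 PSMain(float4 pos : SV_Position) : SV_Target
{
    // Only pixels that already survived the depth/stencil test get here.
    uint prev;
    InterlockedAdd(visibleCount[0], 1, prev);
    return float4(1, 1, 1, 1);
}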
 