Frame buffer (color and Z) emulation via shaders

Would it be better to replace the ROPs and depth/stencil units with shaders that emulate them?

What would the optimum number and clock speed of CUDA cores (and amount of on-die cache) be to emulate 32 ROPs / 256 Z/stencil units @ 500 MHz (a fill rate of 16 GPixels/s)?

If it's not possible to achieve optimal TDP at 40nm with just CUDA cores and TMUs (Texture address and filtering units), then what about 20nm?

Finally, would ATi's shaders or nvidia's CUDA cores be faster at emulating DX11-spec'd ROPs and depth units?

Also, if there's something I'm misunderstanding, just point it out and explain it as best you can. These questions may sound like they're coming from the noob I am, or may not even be in the right ballpark.
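
For what it's worth, here's a back-of-envelope sketch of the arithmetic behind the fill-rate question above. Everything except the 32 ROPs @ 500 MHz figure is an assumption for illustration (the ~8 ALU ops per RGBA8 blend and the 1.5 GHz shader clock are guesses, not measured values):

Code:
// Rough fill-rate / ALU-throughput arithmetic (host-side, compiles as a .cu file).
// Per-pixel op counts and clocks are illustrative guesses, not hardware data.
#include <stdio.h>

int main(void)
{
    const double rop_count     = 32;       // ROPs, as in the question
    const double rop_clock     = 500e6;    // 500 MHz
    const double fill_rate     = rop_count * rop_clock;      // pixels/s

    const double ops_per_pixel = 8.0;      // guessed cost of an RGBA8 blend in scalar MAD/ADDs
    const double shader_clock  = 1.5e9;    // assumed hot-clocked CUDA core, ~1.5 GHz

    const double ops_needed    = fill_rate * ops_per_pixel;  // blend ops/s
    const double cores_needed  = ops_needed / shader_clock;  // at 1 op per core per clock

    printf("Fill rate:        %.1f GPixels/s\n", fill_rate / 1e9);
    printf("Blend throughput: %.1f Gops/s\n", ops_needed / 1e9);
    printf("CUDA cores (blend math only): %.0f\n", cores_needed);
    return 0;
}

That comes out to roughly 85 cores' worth of raw blend math for 16 GPixels/s, which sounds cheap -- but it counts only the math, not the read-modify-write traffic, ordering or depth/stencil work.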
 
Laughabee tried it already -- FF graphics hardware is here to stay for a while. The compression/decompression of the data streams alone is still way faster done in dedicated hardware, let alone higher order functions like colour/depth testing, sampling and blending.
 
Laughabee tried it already -- FF graphics hardware is here to stay for a while. The compression/decompression of the data streams alone is still way faster done in dedicated hardware, let alone higher order functions like colour/depth testing, sampling and blending.
I think some of the ROP functions could indeed easily be done in the shaders; the fact that the ROPs have hardware for them is just down to data-flow reasons. In particular, doing blending in the ROPs looks like a total waste to me, as that's really just "ordinary" math that's easily done with shader MADs (granted, it might not be too expensive since the ROPs only do 8-bit blends at full speed, but still).
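
To make that concrete, here's a minimal sketch of blending done as "ordinary" shader math: a hypothetical CUDA kernel doing SRC_ALPHA / ONE_MINUS_SRC_ALPHA over an RGBA8 target stored as uchar4, one thread per pixel. It's not how any real driver maps ROP blending, just an illustration that the arithmetic itself is a handful of MADs per channel.

Code:
// Hypothetical CUDA kernel: src-over blend done in shader code instead of the ROPs.
// One thread per pixel; dst and src are RGBA8 surfaces of num_pixels pixels each.
__global__ void blend_over(uchar4 *dst, const uchar4 *src, int num_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_pixels) return;

    float4 s = make_float4(src[i].x, src[i].y, src[i].z, src[i].w);
    float4 d = make_float4(dst[i].x, dst[i].y, dst[i].z, dst[i].w);
    float a  = s.w * (1.0f / 255.0f);

    // out = src*a + dst*(1 - a): one MAD-shaped expression per channel.
    float r = s.x * a + d.x * (1.0f - a);
    float g = s.y * a + d.y * (1.0f - a);
    float b = s.z * a + d.z * (1.0f - a);
    float w = s.w * a + d.w * (1.0f - a);

    dst[i] = make_uchar4((unsigned char)(r + 0.5f),
                         (unsigned char)(g + 0.5f),
                         (unsigned char)(b + 0.5f),
                         (unsigned char)(w + 0.5f));
}

The math is trivially cheap; the part this hides is the read of dst[i], i.e. the read-modify-write, which is the data-flow problem the following posts get into.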
 
It would take some serious research to answer most questions in this thread.

Mczak is correct that part of the problem with moving the ROPs into the shaders is data flow. ROPs need to maintain order and perform read-modify-writes, so buffering is required to maintain performance. This buffering is going to be more expensive in the shaders because it's more general. I don't know of any research investigating the tradeoffs of dedicated ROP hardware vs. shader-based ROPs.
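
A hedged illustration of that ordering point, using a float render target purely for simplicity: a commutative blend (additive) can be emulated with plain atomics, because the result doesn't depend on which fragment lands first, but src-over cannot, and that's where the buffering/ordering cost comes from.

Code:
// Order-independent case: additive blend. atomicAdd on floats (compute
// capability 2.0+) gives the right answer regardless of fragment order, so no
// extra ordering machinery is needed.
__global__ void blend_add(float *dst_r, const float *src_r,
                          const int *pixel_index, int num_fragments)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_fragments) return;
    atomicAdd(&dst_r[pixel_index[i]], src_r[i]);   // commutative: order irrelevant
}

// Order-dependent case: src-over, dst = src*a + dst*(1-a), is NOT commutative.
// Two fragments hitting the same pixel must be applied in primitive order, so a
// shader-based ROP has to buffer and serialize per pixel -- exactly the queueing
// that dedicated ROP hardware provides.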
 
Laughabee tried it already -- FF graphics hardware is here to stay for a while. The compression/decompression of the data streams alone is still way faster done in dedicated hardware, let alone higher order functions like colour/depth testing, sampling and blending.

Sampling/texture fetch will likely stay in hardware for the foreseeable future. Everything else is pretty much fair game. There really isn't anything complicated in what ROPs do. Honestly, I'm not sure they even have a performance advantage in this generation, and by next generation, with better caching etc., the area they consume will probably be better spent on more SIMD engines. The paper Intel published showed the ROP portion of the pipeline being fairly minimal, under ~5%.
 
Game performance currently appears to be dominated by various forms of fillrate. Also there's still essentially no analysis of setup-rate bottlenecking for things like early-Z prepass.

ROPs in RV770 are, I estimate, 4.3% of the die. The ALUs, alone (including redundancy) are about 29%, while the entire central "cores" section is about 41% of the die.

So ROPs are fairly area-efficient in ATI. Though I think 8xZ per clock is overdue.

In any kind of processor that's implementing the D3D pipeline there's a requirement for atomics for render target operations. If you're doing Larrabee-style tiled-forward-rendering then the atomics are against L2 cache, so no big deal.

If you have FF ROPs though, atomics against global memory are essential (deliberately ignoring TBDRs, since they aren't competing in the desktop/ultra-high-performance realm in any meaningful fashion).
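
Just to make "atomics against global memory" concrete: without FF ROPs, a shader-based path on today's hardware would have to do something like the following hypothetical read-modify-write against a packed RGBA8 pixel (src channels assumed to be in [0,1]). Note that an atomicCAS loop buys atomicity but not primitive order.

Code:
// Hypothetical software RMW of a packed RGBA8 pixel in global memory. The CAS
// loop retries until our read-modify-write lands without interference from
// another thread. Atomic, but NOT ordered by primitive.
__device__ void blend_pixel_global(unsigned int *pixel, float4 src)  // src in [0,1]
{
    unsigned int old_val = *pixel, assumed;
    do {
        assumed = old_val;
        float a = src.w;
        float4 d = make_float4( assumed        & 0xff,
                               (assumed >>  8) & 0xff,
                               (assumed >> 16) & 0xff,
                               (assumed >> 24) & 0xff);
        unsigned int r = (unsigned int)(src.x * 255.0f * a + d.x * (1.0f - a) + 0.5f);
        unsigned int g = (unsigned int)(src.y * 255.0f * a + d.y * (1.0f - a) + 0.5f);
        unsigned int b = (unsigned int)(src.z * 255.0f * a + d.z * (1.0f - a) + 0.5f);
        unsigned int w = (unsigned int)(src.w * 255.0f * a + d.w * (1.0f - a) + 0.5f);
        unsigned int new_val = r | (g << 8) | (b << 16) | (w << 24);
        old_val = atomicCAS(pixel, assumed, new_val);
    } while (assumed != old_val);
}

Every retry is another round trip through the memory system, which is roughly the cost that the FF path's queueing and compression hide.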

The tricky problem for something like Larrabee is that, in the general case, it needs fast atomics, whether for graphics or anything else. The tiled-forward-rendering algorithm doesn't necessarily apply to these cases (i.e. TFR doesn't require cache coherency and doesn't make numerous trips to global memory per tile), and then Larrabee is stuck with whatever the cache architecture can deliver.

Whereas the FF style of GPU is "good enough": a combination of queueing, re-ordering, compression and caching just about hangs in there.

In truth it seems pretty easy to bring a traditional GPU to its knees with shadow rendering or lots of blending. Hard to know how much of that is merely bandwidth and how much is fillrate.

Way too much emphasis on average frame rates for my liking.

Jawed
 
Also there's still essentially no analysis of setup-rate bottlenecking for things like early-Z prepass.

Not yet published, you mean. It's time we did a status update about what's been going on at B3D (we're not completely dead, mind you). And sorry for the OT, the bit just seemed like a proper hook.
 
I think some of the ROP functions could indeed easily be done in the shaders; the fact that the ROPs have hardware for them is just down to data-flow reasons. In particular, doing blending in the ROPs looks like a total waste to me, as that's really just "ordinary" math that's easily done with shader MADs (granted, it might not be too expensive since the ROPs only do 8-bit blends at full speed, but still).

ROPs are quite simple to get rid of if you have a TBDR; otherwise they are a lot more work. However, I expect them to vanish sooner rather than later, as doing the job of ROPs from on-chip memory is a lot more bandwidth-efficient. And once you are on chip, doing it with ALUs makes so much sense.
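
A hedged sketch of what "doing the job of ROPs from on-chip memory" could look like on current CUDA hardware: one thread block owns one screen tile, keeps that tile of the render target in shared memory, applies all of its fragments there, and touches DRAM only once on the way in and once on the way out. The tile size, the fragment format and the binning that produces frag_offset/frag_count are all made up for illustration; this shows the dataflow, not anybody's actual renderer.

Code:
#define TILE_PIXELS 256                       // e.g. a 16x16 tile; arbitrary for the sketch

struct Fragment { int pixel; float4 color; }; // pixel index within the tile, color in [0,1]

// Launched with one block per tile. The tile lives in shared memory for the whole
// pass, so every blend is an on-chip read-modify-write; DRAM sees exactly one load
// and one store per pixel no matter how much overdraw there is.
__global__ void shade_tile(float4 *framebuffer, const int *tile_base,
                           const Fragment *frags, const int *frag_offset,
                           const int *frag_count)
{
    __shared__ float4 tile[TILE_PIXELS];

    int t    = blockIdx.x;                    // which tile this block owns
    int base = tile_base[t];                  // first framebuffer pixel of this tile

    // Load the tile once.
    for (int p = threadIdx.x; p < TILE_PIXELS; p += blockDim.x)
        tile[p] = framebuffer[base + p];
    __syncthreads();

    // Apply this tile's binned fragments in submission order. A single thread keeps
    // the ordering trivial here; a real renderer would parallelize across pixels.
    if (threadIdx.x == 0) {
        const Fragment *my_frags = frags + frag_offset[t];
        for (int f = 0; f < frag_count[t]; ++f) {
            Fragment fr = my_frags[f];
            float  a = fr.color.w;
            float4 d = tile[fr.pixel];
            tile[fr.pixel] = make_float4(fr.color.x * a + d.x * (1.0f - a),
                                         fr.color.y * a + d.y * (1.0f - a),
                                         fr.color.z * a + d.z * (1.0f - a),
                                         fr.color.w * a + d.w * (1.0f - a));
        }
    }
    __syncthreads();

    // Write the tile back once.
    for (int p = threadIdx.x; p < TILE_PIXELS; p += blockDim.x)
        framebuffer[base + p] = tile[p];
}

The awkward bits the thread keeps circling back to -- binning fragments into tiles, ordering within a tile, and compression -- are exactly what this sketch glosses over.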
 