ROPs would be very latency-intolerant if the atomic ALU operations were done in the PS shader kernel (like Larrabee?). With separate ROP units, perhaps the latency intolerance has to do with needing to maintain correct fragment ordering.
The wrinkle here is trying to discern whether that's talking about ROPs purely for graphics operations being latency-intolerant, or whether it includes the handling of atomics from shaders. I think it's purely graphics, for what it's worth.
In that context, basic scoreboarding of fragment issue and queuing of completed fragments for correctly ordered updates is all that's required, as far as I can tell.
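Just to be concrete about the ordering part, here's a toy sketch (Python, purely illustrative, nothing like how the hardware actually implements it) of per-pixel scoreboarding at issue plus a reorder queue at completion:

```python
from collections import defaultdict

class OrderedROP:
    """Toy model: fragments get per-pixel sequence numbers at issue time,
    and completed fragments are only committed in that order, even if the
    shader results come back out of order."""

    def __init__(self):
        self.framebuffer = {}              # pixel -> colour
        self.issue_seq = defaultdict(int)  # pixel -> next seq to hand out
        self.commit_seq = defaultdict(int) # pixel -> next seq allowed to retire
        self.pending = defaultdict(dict)   # pixel -> {seq: colour}

    def issue(self, pixel):
        """Scoreboarding at fragment issue: record API order per pixel."""
        seq = self.issue_seq[pixel]
        self.issue_seq[pixel] += 1
        return seq

    def complete(self, pixel, seq, colour):
        """Shader finished (possibly out of order): queue, then drain in order."""
        self.pending[pixel][seq] = colour
        while self.commit_seq[pixel] in self.pending[pixel]:
            s = self.commit_seq[pixel]
            self.framebuffer[pixel] = self.pending[pixel].pop(s)
            self.commit_seq[pixel] += 1

rop = OrderedROP()
first = rop.issue((3, 5))                 # older fragment
second = rop.issue((3, 5))                # newer fragment, same pixel
rop.complete((3, 5), second, "green")     # finishes first, waits in the queue
rop.complete((3, 5), first, "red")        # unblocks the queue
assert rop.framebuffer[(3, 5)] == "green" # newest write wins, as API order demands
```

The point being that nothing in there needs to tolerate memory latency as such; it just needs enough buffering to cover out-of-order shader completion.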
Larrabee-style ROP latency is effectively bounded by L2 latency, 10 cycles in absolute terms, which is a trivial amount to hide across 4 threads.
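Back of the envelope (assuming simple round-robin issue across the 4 hardware threads, which is my assumption rather than anything from the Larrabee papers):

```python
# Crude arithmetic, assuming round-robin issue across Larrabee's 4 hardware
# threads (my assumption): a 10-cycle L2 access only stalls the core if the
# other 3 threads can't supply enough independent work to cover it.
l2_latency = 10   # cycles
threads = 4
cover_each = l2_latency / (threads - 1)
print(f"each of the other {threads - 1} threads needs ~{cover_each:.1f} cycles "
      f"of work to fully hide a {l2_latency}-cycle L2 access")
# => ~3.3 cycles each, which a read-modify-write ROP loop should easily provide
```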
ATI has colour, Z and stencil buffer caches. What we don't know is the typical lifetime of pixels in them, or their associativity.
It's interesting that the newest programming guide (for the Linux driver-writing community) suggests that for short shaders late-Z should be used, not early-Z. I guess this means there's no point bottlenecking the setup/interpolator units (i.e. there's latency in early-Z processing between setup and interpolation), and it's better to just let the RBE work it all out. If that's the case, it would imply a fairly meaty bit of caching for these pixel caches. But maybe I'm missing something...
Maybe it just means that the early-Z unit creates too much extra work for the RBE for this to be worthwhile, so it's better to take a single-shot approach to Z rather than dual-shot. That would imply either nothing in particular about the caches, or that it's the caches that would be under strain.
The kind of stuff you'd need to simulate, I guess.
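Something like this toy cost model is roughly what I have in mind, with every number in it completely made up rather than anything ATI has published:

```python
# Toy early-Z vs late-Z cost model. Every per-stage cost here is a made-up
# knob, not an ATI number; the point is only the shape of the trade-off.

def per_pixel_cost(shader_cycles, kill_rate, early_z,
                   z_test=1.0, early_z_overhead=2.0):
    if early_z:
        # Z tested between setup and interpolation; only survivors get shaded.
        return early_z_overhead + z_test + (1.0 - kill_rate) * shader_cycles
    # Late-Z: every pixel is shaded, Z is resolved by the RBE afterwards.
    return shader_cycles + z_test

for shader_cycles in (4, 40):
    e = per_pixel_cost(shader_cycles, kill_rate=0.3, early_z=True)
    l = per_pixel_cost(shader_cycles, kill_rate=0.3, early_z=False)
    print(f"{shader_cycles:>2}-cycle shader: early-Z {e:.1f} vs late-Z {l:.1f}")

# Short shader: early-Z's fixed overhead outweighs the shading it saves.
# Long shader: the saved shading dominates and early-Z comes out ahead.
```

With made-up numbers it obviously proves nothing, but it shows why the crossover could sit wherever the early-Z overhead and the kill rate put it.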
Could get very interesting. I wonder if it would ever be a possibility for NVidia to go beyond a one-to-one pairing of compute chip and memory/ROP hub chip. For instance, pairing two smaller, higher-yield compute chips with a single MEM/ROP hub.
I see compute chips sharing a hub like that as relatively unlikely, simply because bandwidth is still pretty important (thinking of small-scale configurations here).
But it would be interesting if a mesh of compute chips and hubs were formed, creating a shared/distributed memory space of some type. In a sense it's only a strongly-linked version of SLI. But I suppose there'd be options to make the routing more intricate, e.g. with hubs shared by pairs of chips in a ring, and all that other good stuff that Dally dreams about at night.
Jawed