I'll answer instead. Well, these diagrams were leaked before already, so there's nothing new to see here.
I don't even know where to start describing all the differences - I don't think I'm alone in not having expected that many changes.
Yeah, talk about a deep clean.
I think it's fair to say that, apart from doubled-Z in the RBEs, there was a general expectation that RV770 would increase the counts of certain items and that would be it. More ALUs were guaranteed, more TUs were the popular choice and apart from slightly increased clocks, that was pretty much the end of it.
Ha, I led the pessimists' charge, asserting continuation of 16 TUs + 16 ROPs - though not without question.
For a change that isn't architectural, RV770 is a pretty thorough refresh. Actually trying to define what makes for architectural changes would be tough in light of this.
I think it was Dave Orton who said that they didn't have the tools they needed to make R600. Maybe that was just post-justification, or maybe it indicates they knew what they couldn't achieve in the R600 timeframe. Much like the uncertainty over which features intended for D3D10 got pushed back into 10.1, I don't think we'll ever know how much of RV770's changes were pushed out of R600.
If R600 had been released on time then we'd be looking at an 18-20-month refresh period between it and RV770. That length of time could be taken to indicate that RV770 is much as planned and that it does not consist of stuff that couldn't make it into R600 due to "tools problems".
1) tmu organization. No longer shared across arrays, each array has its own quad tmu. I'll bet the sampler thread arbiter had to change with that as well.
Assuming that RV770 uses screen-space tiling for fragments, then this means that each quad TU now "owns" a region of the screen, since each SIMD owns a tile and there's a 1:1 relationship between the two.
I think R6xx uses the ring bus to allow the disjoint L2s (and therefore L1s) to share texels, after any one TU has fetched the texel from memory. But I don't remember anyone saying that this is the case. If true, this is extra work for the ring bus. I presume the ring bus also supports the SIMDs in fetching from "foreign" TUs, attached to other SIMDs, since all SIMDs have to use all TUs to get texture/vertex data.
As far as I can tell R6xx's TUs (L1, L2) each have a local ring stop. This ring stop serves the TU, an RBE and an MC. Connecting them is a crossbar. So not all memory operations by TUs and RBEs travel around the ring, as the local MC is "directly connected".
RV770 has a dedicated crossbar twixt L2s and L1s to enable texel distribution. But due to screen-space tiling, the volume of texels that need to land in multiple L1s should be much less than in R6xx. This is because texels at the borders of screen-space tiles are candidates for multi-L1 sharing, whereas in R6xx texels in every quad of screen space could be candidates for multiple L1s.
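To put a rough number on the border argument, here is a minimal sketch of screen-space tile ownership. Everything in it is an assumption for illustration: the 16-pixel tile edge, the simple modulo interleave and the function names are hypothetical, not anything AMD has disclosed. Only the count of 10 SIMDs (one quad TU each) comes from the discussion.

```python
# Hypothetical parameters: the tile size and the interleave are guesses,
# chosen only to illustrate the border-sharing argument.
NUM_SIMDS = 10   # one quad TU per SIMD, 1:1
TILE = 16        # assumed tile edge in pixels


def owner_simd(x, y, tiles_per_row):
    """Return the SIMD (and so the quad TU) assumed to own pixel (x, y)."""
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_index % NUM_SIMDS


def border_fraction(tile=TILE):
    """Fraction of a tile's pixels that sit on its outer one-pixel edge.

    Only these pixels have filter footprints that can reach texels also
    wanted by a neighbouring tile, i.e. by another TU's L1.
    """
    interior = (tile - 2) ** 2
    return 1 - interior / (tile * tile)


if __name__ == "__main__":
    print(owner_simd(0, 0, 80), owner_simd(16, 0, 80))
    print(border_fraction())
```

With these made-up numbers under a quarter of the pixels in a tile are sharing candidates, and the fraction shrinks as the tile grows. If R6xx really did spread neighbouring quads across TUs, then practically every quad was a candidate there.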
Vertex data normally consists of one or more streams (1D) that are consumed at roughly equal "element frequency". So it would seem to make sense for there to be a single vertex data cache as in RV770. It's not clear if R6xx had multiple instances of vertex data cache (one per SIMD) though.
I'm still wondering how a single vertex data cache is going to support 10 TUs though. Perhaps the SIMDs take it in turns, strictly round-robin?
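By "taking it in turns" I mean something like the following round-robin arbiter. This is purely a sketch of the general technique; the function names and the request model (a count of outstanding fetches per SIMD) are hypothetical, and I have no idea what the real hardware does.

```python
NUM_SIMDS = 10  # SIMDs contending for the single vertex data cache


def arbitrate(pending, last_granted):
    """Pick the next SIMD with a pending request, scanning after last_granted.

    Returns None when nobody is waiting.
    """
    for step in range(1, NUM_SIMDS + 1):
        candidate = (last_granted + step) % NUM_SIMDS
        if pending[candidate] > 0:
            return candidate
    return None


def run(pending):
    """Drain all pending requests, one grant per cycle; return the grant order."""
    pending = list(pending)          # do not modify the caller's list
    order = []
    last = NUM_SIMDS - 1             # so that SIMD 0 is considered first
    while True:
        grant = arbitrate(pending, last)
        if grant is None:
            return order
        pending[grant] -= 1
        order.append(grant)
        last = grant


if __name__ == "__main__":
    # SIMD 0 wants two fetches, SIMD 2 wants one.
    print(run([2, 0, 1, 0, 0, 0, 0, 0, 0, 0]))
```

The point of scanning from the last winner is that no SIMD can be starved: each waits at most nine grants before its turn comes round again.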
2) tmu themselves changed. While they always had separate L1 caches (I think - the picture is misleading), now the separate 4 TA and point sampling fetch units are gone (thus one tmu is 4 TA, 16 fetch, 4 TF instead of 8 TA, 20 fetch, 4 TF). Also, early tests indicate they are no longer capable of sampling fp16 at full speed (dropping to half and one quarter at fp32 IIRC).
I have to say I'm confused by the fp16 situation. The amount of design effort that went into making R600 single-cycle fp16, indeed the conversion of int8 texels into fp16 texels, makes me wary of accepting that they've reverted to an int8 setup. Waiting to find out more.
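For what it's worth, here is the arithmetic the early tests would imply, taking the quoted unit counts at face value (4 TF per TU, 10 TUs) and assuming the reported half-rate fp16 and quarter-rate fp32. The rate factors are unconfirmed and the function name is just for illustration.

```python
# Rate factors as reported by the early tests quoted above; unconfirmed.
RATE_FACTOR = {"int8": 1.0, "fp16": 0.5, "fp32": 0.25}


def filtered_texels_per_clock(num_tus, tf_per_tu, fmt):
    """Peak filtered texels per clock for the whole chip, per texel format."""
    return num_tus * tf_per_tu * RATE_FACTOR[fmt]


if __name__ == "__main__":
    for fmt in ("int8", "fp16", "fp32"):
        print(fmt, filtered_texels_per_clock(10, 4, fmt))
```

So 40 per clock at int8, 20 at fp16 and 10 at fp32, if those test results hold up.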
3) ROPs. They now have 4xZ fill capability (at least in some situations) instead of just 2. The R600 picture indicates a fog/alpha unit which is now gone, though I doubt it really was there in the first place (it doesn't make sense; that should be handled in the shader ALU). The picture also indicates a shared color cache for R600, though I don't know if this was true. Could be though (see next item).
Like the uncertainty over L2 texture cache, I'm unsure whether R6xx has a single colour buffer cache or multiple instances each dedicated to an RBE. I suspect the latter, since screen-space tiling makes colour (and Z and stencil) essentially private to an RBE.
I'm pretty sure that RV770's RBEs only use their local MC, whereas R6xx appeared to allow all RBEs to access all MCs. I guess this means a revised way of tiling Colour, Z and stencil data in memory. Though we've never really had much idea how earlier GPUs tiled render targets...
4) no more ring bus. Clearly with rv770 ROPs are tied to memory channels (just like nvidia G80 and up), and there are per-memory channel L2 texture caches. Instead of one ring-bus it seems there's now different "busses" or crossbars or whatever for different data (it's got a "hub", it's got some path for texture data etc.)
The hub appears to be for low-bandwidth (or low duty-cycle) data. This makes me wonder if we'll see the "unification" of two GPUs' memory as has been long discussed, for the X2 board.
I just don't see how there'll be enough bandwidth through the two hubs (one per GPU) to allow anything other than the transmission of completed render targets, i.e. AFR mode.
5) Other stuff like the local data store, read/write cache etc.
LDS is a big deal. I have a suspicion that AMD has configured this as a read-only share between elements in a hardware thread. I wrote my theory here:
http://forum.beyond3d.com/showpost.php?p=1179619&postcount=4340
Making it read-only means it's "collision free" and latency-tolerant. I reckon this means that thread synchronisation (in order to be able to share data across elements safely) becomes very cheap and a normal part of the Sequencer's task of issuing threads and load-balancing them.
I dare say it's notable that, like GT200, RV770 is lower-clocked. I reckon this reflects the process/yield/die-size scaling issues that led AMD to a multi-chip GPU strategy.
Global data share seems fiddlesome. Now that's asking SIMDs to cooperate, I presume. Though if SIMDs are cooperating in their use of vertex data cache (taking it in turns) perhaps there's a higher level thread synchronisation mechanism in RV770. Something more interesting than the mundane creation and termination of hardware threads by the command processor.
Jawed