I assume then that writes by the ROPs to memory are also issued centrally? At least, that's how it looks on the top-level memory architecture diagram.
If that's the case, then you could summarize by saying: everything is handled centrally except for the read data flow.
Reads consume more bandwidth than writes, I suspect - which is presumably why the architecture is biased this way. Texturing is read-only. Generally you can't write a pixel to a render target without testing Z/stencil first (though there are exceptions).
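To put rough numbers on that bias, here's a back-of-envelope sketch. All the byte counts are my own illustrative assumptions (one bilinear fetch per pixel, 32-bit texels, 32-bit colour and Z, no texture caching or Z/colour compression), not figures from any real chip:

```python
# Back-of-envelope per-pixel traffic with assumed byte counts
# (ignores texture caching, Z/colour compression, MSAA, etc.)
texture_read = 4 * 4   # bilinear fetch: 4 texels x 4 bytes
z_read       = 4       # Z/stencil test reads the existing value
z_write      = 4       # Z update after the test passes
colour_write = 4       # final colour written to the render target

reads  = texture_read + z_read
writes = z_write + colour_write

print(f"reads:  {reads} bytes/pixel")                 # 20 bytes
print(f"writes: {writes} bytes/pixel")                # 8 bytes
print(f"read share: {reads / (reads + writes):.0%}")  # ~71%
```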
And this is what I think doesn't make sense: compared to the issue side of an MC, the return side is trivial and definitely not a hot spot.
The difficult part of an MC is in deciding what to send when to memory. Once the transaction is on its way, it is just a matter of going with the flow and making sure that you never ever have to backpressure the SDRAM, because that's not allowed.
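As a toy illustration of why the issue side is the hard part, here's a generic open-row-first request picker. It's nothing to do with what ATI or NVidia actually implement, and real schedulers also juggle read/write turnaround, refresh deadlines and per-bank timing counters:

```python
# Toy DRAM request picker: prefer a request that hits an already-open row
# in a ready bank (no activate/precharge needed), otherwise take the oldest
# request that can legally be issued this cycle.
def pick_next(pending, open_rows, bank_ready):
    # pending: list of (age, bank, row) tuples, oldest first
    for req in pending:
        age, bank, row = req
        if bank_ready[bank] and open_rows.get(bank) == row:
            return req                 # row hit: cheapest transaction to issue
    for req in pending:
        if bank_ready[req[1]]:
            return req                 # oldest request to a ready bank
    return None                        # nothing can be issued this cycle
```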
I think you're not allowing for the crossbar. Somewhere in a GPU, data to/from one memory channel has to be routed from/to one of many clients, in parallel with all the other clients.
In G80 it would appear that the crossbar is twixt the shader clusters (and their L1 cache) and the ROP/MC units (and their L2 cache). Each ROP appears to have a dedicated memory channel, so its reads and writes never pass through any kind of crossbar. This also means its MC is reduced in complexity, because it only has one client to deal with: the L2 cache (which supplies L1 with texture data, and presumably also caches all colour/Z/stencil buffer operations, whether read or write).
But NVidia still has to implement an 8x6 crossbar within G80 to move data between L1 and L2. At least, that's apparently what's going on based on the architecture diagrams - though it appears as a simple bus. NVidia has successfully kept ROP data from traversing the crossbar (is G80 the first of its GPUs to do this?), so it's only texture read data, early-Z data reads, shaded pixels and computed zixels/samples that contend for the crossbar.
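Here's a sketch of why that crossbar is needed at all: each L1 miss has to find the one partition that owns its address. The 256-byte interleave granularity is my guess, not a known G80 figure:

```python
NUM_PARTITIONS = 6      # G80: 6 ROP/MC partitions, each with its own channel
INTERLEAVE     = 256    # assumed granularity of address interleaving

def partition_for(address):
    """Which ROP/L2/MC partition owns this address."""
    return (address // INTERLEAVE) % NUM_PARTITIONS

# Any of the 8 shader clusters can miss to any of the 6 partitions, hence the
# 8x6 crossbar.  The ROP itself only ever talks to its own partition, so its
# colour/Z traffic never crosses it.
```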
I don't buy the argument that the ring bus has been introduced to prevent hot spots. It's a simple store-and-forward transportation mechanism.
OK, well, what would happen if R600 were based upon a 512-bit internal crossbar? How complex would that be? I dunno, but it seems to me that it's a fairly serious bit of kit, routing data between, say, four shader clusters (including ROPs) and 8 MCs.
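A crude way of putting numbers on "fairly serious bit of kit". All the figures here are assumptions for the sake of the comparison - 64 bits per MC channel per direction, and an R5xx-style pair of 256-bit rings:

```python
# Crude congestion metric: how many wires have to meet in one place?
clients, mcs, channel_bits = 4, 8, 64          # assumed R600-ish figures

# Full crossbar: any client <-> any MC channel, read and write directions,
# so the switch fabric concentrates all of these crossings in one block.
crosspoints    = clients * mcs * 2             # 64 crosspoints
crossbar_wires = crosspoints * channel_bits    # ~4096 wire crossings

# R5xx-style ring: two counter-rotating 256-bit rings.  Each ring stop only
# connects to its two neighbours, so the wiring is spread around the die.
ring_segment_wires = 2 * 256                   # 512 wires per ring segment

print(crossbar_wires, ring_segment_wires)
```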
And the ring stop that feeds the ring is nothing more than a sophisticated FIFO with some decision logic based on a destination tag (and, maybe, the refresh logic for the SDRAM and a few counters). It's trivial in terms of area (and power) compared to the central logic (as is clearly visible on the R520 die shot.)
I'm hardly going to argue this point. But simple is good, no?
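To make "sophisticated FIFO with a destination tag" concrete, here's a toy ring-stop model - purely illustrative, the class and field names are mine:

```python
from collections import deque

class RingStop:
    """Toy ring stop: compare the destination tag, then deliver or forward."""
    def __init__(self, stop_id):
        self.stop_id = stop_id
        self.fifo = deque()        # packets waiting to move on
        self.delivered = []        # packets for this stop's local client

    def accept(self, packet):
        # packet: (dest_tag, payload)
        self.fifo.append(packet)

    def clock(self, next_stop):
        if self.fifo:
            dest, payload = self.fifo.popleft()
            if dest == self.stop_id:
                self.delivered.append(payload)      # drop off locally
            else:
                next_stop.accept((dest, payload))   # forward along the ring
```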
Second, I don't see how the MC, of all things, would be the one that's critical with respect to hot spots. If that's the case, how are you ever going to solve the problems for the shaders?
Each sub-MC may have a whole bunch of inputs from different agents, but in the end the individual throughput is simply the maximum throughput of the SDRAM: you can have bursty periods where a whole bunch of parallel agents are requesting data from the same SDRAM, but on average, only 1 transaction can leave the sub-MC and only 1 can enter. It's as simple as that.
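A few lines make that average-rate argument concrete. The arrival pattern and rates are made up - the only point is that the drain rate is fixed by the SDRAM, however bursty the agents are:

```python
import random
random.seed(1)

SERVICE_PER_CYCLE = 1        # the SDRAM retires one transaction per slot
queue, served = 0, 0

for cycle in range(10_000):
    # bursty arrivals from four agents, averaging ~1.2 requests/cycle
    queue += sum(random.random() < 0.3 for _ in range(4))
    if queue:
        queue -= SERVICE_PER_CYCLE
        served += 1

print(served / 10_000)   # pinned at ~1.0 regardless of how bursty arrivals are
print(queue)             # and the backlog just grows if the average offered
                         # load exceeds what the SDRAM can actually service
```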
I'm confused about what you mean by SDRAM: on-die RAM (whether caches or buffers) or off-die GDDR video memory?
As far as I can tell, the MC, as a whole, implements weighted scoreboards and per-agent queues. I dunno how complex the MC is, but it's at least partly programmable by the driver (even if that's nothing more than setting some registers) and adaptive.
The sub-MCs can't work independently; they need to be overseen. But I've no idea of the complexity/functionality there.
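My mental model of the "weighted scoreboard plus per-agent queues" part looks something like the sketch below, with the weights standing in for the driver-programmed registers. This is entirely a guess at the mechanism, not ATI's actual design:

```python
def weighted_arbiter(queues, weights, credits):
    """Pick the next agent to service.

    queues  : dict agent -> list of pending requests
    weights : dict agent -> driver-programmed priority (the 'registers')
    credits : dict agent -> running credit balance, mutated in place
    """
    # top up credits according to each agent's weight
    for agent, w in weights.items():
        credits[agent] += w
    # service the agent with pending work and the most accumulated credit
    candidates = [a for a in queues if queues[a]]
    if not candidates:
        return None
    winner = max(candidates, key=lambda a: credits[a])
    credits[winner] = 0
    return winner, queues[winner].pop(0)
```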
The point is: an MC is large, but the number of nodes that are toggling is low. In a well-fed ALU, a ton of nodes are toggling all the time. Which part do you think is going to consume the most power? And there's no way you're going to spread a quad around.
Well, it beats me: why is R520's MC over 30M transistors (or, at least, ~10% of the die area), excluding the ring bus? What the hell is it doing? There's more area dedicated to the MC in R520 than to shader ALUs! The ALUs and register file together, in R520, amount to ~32M transistors:
http://www.beyond3d.com/forum/showthread.php?t=37616
I don't believe for a second that the MC is more troublesome wrt hot spots than the shaders.
It isn't, in R5xx. But if instead of R5xx's current architecture, a crossbar and a monolithic 256-bit MC was implemented, how would R5xx perform?
Look at all the CPUs out there: running at much higher speeds and with all the power-hungry logic very concentrated. If they can solve it, why would a GPU have problems?
Which CPU has a 256-bit (or 512-bit) bus to memory? Which CPU has such a complex crossbar? Why does Cell have the EIB instead of a crossbar, even though, theoretically, a crossbar would perform better?
Wires don't consume power. High routing density automatically results in a lower cell utilization ratio. How exactly does that increase the chances of a hot spot?
I presume in the sense that a high density of wires implies a high density of FIFO circuitry and the associated arbitration logic. For example, in X1950XTX with 2GHz effective GDDR4, I presume these wires are humming at 2GHz, which means the FIFOs are also humming at 2GHz - but I don't know...
I said it before: the ring bus is an interesting technical curiosity that for some reason tickles the imagination of a lot of GPU fans, but in the grand scheme of things it's little more than an implementation detail to reduce routing congestion...
And presumably the routing complexity associated with a 512-bit bus is going to be more than double that associated with a 256-bit bus. It doesn't seem to be a trivial problem to solve.
It's notable that NVidia chose to solve this problem differently, as I described earlier, by reducing the number of clients using the crossbar and simplifying the set of agents that the MC has to cater for (basically, the L2 cache). NVidia's managed to cut congestion by cutting down on the number of allowed routes and the clients using those routes.
Jawed