I assume then that writes by the ROPs to memory are also issued centrally? At least, that's how it looks on the top-level memory architecture diagram.
If that's the case, then you could summarize by saying: everything is handled centrally except for the read data flow.
Reads consume more bandwidth than writes, I suspect - which is presumably why the architecture is biased this way. Texturing is read-only. Generally you can't write a pixel to a render target without testing Z/stencil first (though there are exceptions).
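To put rough numbers on that bias, here's a back-of-envelope sketch. All the byte counts are my own illustrative assumptions (one bilinear fetch per pixel, 32-bit texels, 32-bit colour and Z, no texture caching or Z/colour compression), not figures from any real chip:

```python
# Back-of-envelope per-pixel traffic with assumed byte counts
# (ignores texture caching, Z/colour compression, MSAA, etc.)
texture_read = 4 * 4   # bilinear fetch: 4 texels x 4 bytes
z_read       = 4       # Z/stencil test reads the existing value
z_write      = 4       # Z update after the test passes
colour_write = 4       # final colour written to the render target

reads  = texture_read + z_read
writes = z_write + colour_write

print(f"reads:  {reads} bytes/pixel")                 # 20 bytes
print(f"writes: {writes} bytes/pixel")                # 8 bytes
print(f"read share: {reads / (reads + writes):.0%}")  # ~71%
```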
And this is what I think doesn't make sense: compared to the issue side of an MC, the return side is trivial and definitely not a hot spot.
The difficult part of an MC is in deciding what to send when to memory. Once the transaction is on its way, it is just a matter of going with the flow and making sure that you never ever have to backpressure the SDRAM, because that's not allowed.
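As a toy illustration of why the issue side is the hard part, here's a generic open-row-first request picker. It's nothing to do with what ATI or NVidia actually implement, and real schedulers also juggle read/write turnaround, refresh deadlines and per-bank timing counters:

```python
# Toy DRAM request picker: prefer a request that hits an already-open row
# in a ready bank (no activate/precharge needed), otherwise take the oldest
# request that can legally be issued this cycle.
def pick_next(pending, open_rows, bank_ready):
    # pending: list of (age, bank, row) tuples, oldest first
    for req in pending:
        age, bank, row = req
        if bank_ready[bank] and open_rows.get(bank) == row:
            return req                 # row hit: cheapest transaction to issue
    for req in pending:
        if bank_ready[req[1]]:
            return req                 # oldest request to a ready bank
    return None                        # nothing can be issued this cycle
```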
I think you're not allowing for the crossbar. Somewhere in a GPU, data to/from one memory channel has to be routed from/to one of many clients, in parallel with all the other clients.
In G80 it would appear that the crossbar is twixt the shader clusters (and their L1 cache) and the ROP/MC units (and their L2 cache). Each ROP appears to have a dedicated memory channel, so its reads and writes never pass through any kind of crossbar. This also means its MC is reduced in complexity, because it only has one client to deal with: the L2 cache (which supplies L1 with texture data, and presumably also caches all colour/Z/stencil buffer operations, whether read or write).
But NVidia still has to implement an 8x6 crossbar within G80 to move data between L1 and L2. At least, that's apparently what's going on based on the architecture diagrams - though it appears as a simple bus. NVidia has successfully kept ROP data from traversing the crossbar (is G80 the first of its GPUs to do this?), so it's only texture read data, early-Z data reads, shaded pixels and computed zixels/samples that contend for the crossbar.
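Here's a sketch of why that crossbar is needed at all: each L1 miss has to find the one partition that owns its address. The 256-byte interleave granularity is my guess, not a known G80 figure:

```python
NUM_PARTITIONS = 6      # G80: 6 ROP/MC partitions, each with its own channel
INTERLEAVE     = 256    # assumed granularity of address interleaving

def partition_for(address):
    """Which ROP/L2/MC partition owns this address."""
    return (address // INTERLEAVE) % NUM_PARTITIONS

# Any of the 8 shader clusters can miss to any of the 6 partitions, hence the
# 8x6 crossbar.  The ROP itself only ever talks to its own partition, so its
# colour/Z traffic never crosses it.
```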
I don't buy the argument that the ring bus has been introduced to prevent hot spots. It's a simple store-and-forward transportation mechanism.
OK, well, what would happen if R600 were based upon a 512-bit internal crossbar? How complex would that be? I dunno, but it seems to me that it's a fairly serious bit of kit, routing data between, say, four shader clusters (including ROPs) and 8 MCs.
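A crude way of putting numbers on "fairly serious bit of kit". All the figures here are assumptions for the sake of the comparison - 64 bits per MC channel per direction, and an R5xx-style pair of 256-bit rings:

```python
# Crude congestion metric: how many wires have to meet in one place?
clients, mcs, channel_bits = 4, 8, 64          # assumed R600-ish figures

# Full crossbar: any client <-> any MC channel, read and write directions,
# so the switch fabric concentrates all of these crossings in one block.
crosspoints    = clients * mcs * 2             # 64 crosspoints
crossbar_wires = crosspoints * channel_bits    # ~4096 wire crossings

# R5xx-style ring: two counter-rotating 256-bit rings.  Each ring stop only
# connects to its two neighbours, so the wiring is spread around the die.
ring_segment_wires = 2 * 256                   # 512 wires per ring segment

print(crossbar_wires, ring_segment_wires)
```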
And the ring stop that feeds the ring is nothing more than a sophisticated FIFO with some decision logic based on a destination tag (and, maybe, the refresh logic for the SDRAM and a few counters). It's trivial in terms of area (and power) compared to the central logic (as is clearly visible on the R520 die shot.)
I'm hardly going to argue this point. But simple is good, no?
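To make "sophisticated FIFO with a destination tag" concrete, here's a toy ring-stop model - purely illustrative, the class and field names are mine:

```python
from collections import deque

class RingStop:
    """Toy ring stop: compare the destination tag, then deliver or forward."""
    def __init__(self, stop_id):
        self.stop_id = stop_id
        self.fifo = deque()        # packets waiting to move on
        self.delivered = []        # packets for this stop's local client

    def accept(self, packet):
        # packet: (dest_tag, payload)
        self.fifo.append(packet)

    def clock(self, next_stop):
        if self.fifo:
            dest, payload = self.fifo.popleft()
            if dest == self.stop_id:
                self.delivered.append(payload)      # drop off locally
            else:
                next_stop.accept((dest, payload))   # forward along the ring
```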
Second, I don't see how the MC, of all things, would be the one that's critical with respect to hot spots. If that's the case, how are you ever going to solve the problems for the shaders?
Each sub-MC may have a whole bunch of inputs from different agents, but in the end the individual throughput is simply the maximum throughput of the SDRAM: you can have bursty periods where a whole bunch of parallel agents are requesting data from the same SDRAM, but on average, only 1 transaction can leave the sub-MC and only 1 can enter. It's as simple as that.
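A few lines make that average-rate argument concrete. The arrival pattern and rates are made up - the only point is that the drain rate is fixed by the SDRAM, however bursty the agents are:

```python
import random
random.seed(1)

SERVICE_PER_CYCLE = 1        # the SDRAM retires one transaction per slot
queue, served = 0, 0

for cycle in range(10_000):
    # bursty arrivals from four agents, averaging ~1.2 requests/cycle
    queue += sum(random.random() < 0.3 for _ in range(4))
    if queue:
        queue -= SERVICE_PER_CYCLE
        served += 1

print(served / 10_000)   # pinned at ~1.0 regardless of how bursty arrivals are
print(queue)             # and the backlog just grows if the average offered
                         # load exceeds what the SDRAM can actually service
```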
I'm confused about what you mean by SDRAM: on-die RAM (whether caches or buffers) or off-die GDDR video memory?
As far as I can tell, the MC, as a whole, implements weighted scoreboards and per-agent queues. I dunno how complex the MC is, but it's at least partly programmable by the driver (even if that's nothing more than setting some registers) and adaptive.
The sub-MCs can't work independently; they need to be overseen. But I've no idea of the complexity/functionality there.
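My mental model of the "weighted scoreboard plus per-agent queues" part looks something like the sketch below, with the weights standing in for the driver-programmed registers. This is entirely a guess at the mechanism, not ATI's actual design:

```python
def weighted_arbiter(queues, weights, credits):
    """Pick the next agent to service.

    queues  : dict agent -> list of pending requests
    weights : dict agent -> driver-programmed priority (the 'registers')
    credits : dict agent -> running credit balance, mutated in place
    """
    # top up credits according to each agent's weight
    for agent, w in weights.items():
        credits[agent] += w
    # service the agent with pending work and the most accumulated credit
    candidates = [a for a in queues if queues[a]]
    if not candidates:
        return None
    winner = max(candidates, key=lambda a: credits[a])
    credits[winner] = 0
    return winner, queues[winner].pop(0)
```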
The point is: an MC is large, but the number of nodes that are toggling is low. In a well-fed ALU, a ton of nodes are toggling all the time. Which part do you think is going to consume the most power? And there's no way you're going to spread a quad around.
Well, it beats me: why is R520's MC over 30M transistors (or, at least, ~10% of the die area), excluding the ring bus? What the hell is it doing? There's more area dedicated to the MC in R520 than to shader ALUs! The ALUs and register file together, in R520, amount to ~32M transistors:
http://www.beyond3d.com/forum/showthread.php?t=37616
I don't believe for a second that the MC is more troublesome wrt hot spots than the shaders.
It isn't, in R5xx. But if instead of R5xx's current architecture, a crossbar and a monolithic 256-bit MC was implemented, how would R5xx perform?
Look at all the CPUs out there: running at much higher speeds and with all the power-hungry logic very concentrated. If they can solve it, why would a GPU have problems?
Which CPU has a 256-bit (or 512-bit) bus to memory? Which CPU has such a complex crossbar? Why does Cell have the EIB instead of a crossbar, even though, theoretically, a crossbar would perform better?
Wires don't consume power. High routing density automatically results in a lower cell utilization ratio. How exactly does that increase the chances of a hot spot?
I presume in the sense that a high density of wires implies a high density of FIFO circuitry and the associated arbitration logic. For example, in X1950XTX with 2GHz effective GDDR4, I presume these wires are humming at 2GHz, which means the FIFOs are also humming at 2GHz - but I don't know...
I said it before: the ring bus is an interesting technical curiosity that for some reason tickles the imagination of a lot of GPU fans, but in the grand scheme of things it's little more than an implementation detail to reduce routing congestion...
And presumably the routing complexity associated with a 512-bit bus is going to be more than double that associated with a 256-bit bus. It doesn't seem to be a trivial problem to solve.
It's notable that NVidia chose to solve this problem differently, as I described earlier, by reducing the number of clients using the crossbar and simplifying the set of agents that the MC has to cater for (basically, the L2 cache). NVidia's managed to cut congestion by cutting down on the number of allowed routes and the clients using those routes.
Jawed