R6xx Speculation - Implementation of Ring Bus for Multi-chip Interconnect

what happens when both memory controllers want to address the same IC at the same time?

I thought that's what some of those ATi patents were about, and what Jawed has been talking about pertaining to memory address space in relation to multiple cores?
 
Yes, but the IC supports 32-bit or 16-bit addressing, not 16bitx2, from what I understand. Patents are not necessarily used, although they may be filed...
 
Yes, but the IC supports 32-bit or 16-bit addressing, not 16bitx2, from what I understand.

True enough, even if it can only address one core/GPU at a time through a 16-bit connect... it still seems like a decent advancement when the ability is there for one GPU to address more or less of the buffer (even if only as efficient on a per-chip basis) based upon what it is rendering. As always, load balancing will be key.
 
I thought that's what some of those ATi patents were about, and what Jawed has been talking about pertaining to memory address space in relation to multiple cores?
None of those links I posted is directly relevant to the "clamshell" thing as far as I can tell.

What's interesting about clamshell is that it appears to get both memory chips sending/receiving data from a single command that's sent to both of them.

Erm... it seems there's lots of conflicting interpretations of the meaning here. Oh well...

Jawed
 
As I quoted from the whitepaper, this allows configurations of both 512MB and 1024MB using the same bus interface, keeping in mind what Mintmaster said about burst length. Yes, there may be a penalty in bandwidth, but if your controller is designed the right way, this loss of bandwidth is almost a moot point.
Maybe my question wasn't worded clearly, or I'm interpreting the whitepaper incorrectly. But double size memory configurations using the same bus interface and memory modules (just more of them) have been around for a long time now. What I'm after is the difference between:
1) using a shared command/address bus and a shared 32-bit data bus, with a chip select (CS) signal activating the right module
2) the clamshell mode: a shared command/address bus and two separate 16-bit data buses

Since the command/address bus is shared in both cases I don't think there is a latency or bandwidth advantage. But I guess the routing is easier in clamshell mode since you have just 16 traces going to each chip instead of 32 traces that have to be connected to both chips somehow. I'm wondering if there are other advantages.
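To make the comparison concrete, here is a quick sketch of the two wirings. All the figures (512 Mbit chips, a 32-bit data bus) are illustrative assumptions, not from the whitepaper; the point is just that both layouts reach the same capacity, but clamshell needs only half the data traces per DRAM chip.

```python
# Comparing a chip-select layout with GDDR "clamshell" mode.
# In both cases the command/address bus is shared, so per-command
# behaviour is the same; clamshell splits the 32-bit data bus into
# two 16-bit halves, one per DRAM chip, each point-to-point.
# Chip density and bus width below are illustrative assumptions.

def capacity_and_traces(mode, chips=2, chip_mbit=512, data_width=32):
    """Return (total capacity in MB, data traces per DRAM chip)."""
    total_mb = chips * chip_mbit // 8  # Mbit -> MB, summed over chips
    if mode == "chip_select":
        # Both chips hang off the full-width data bus; a CS line picks one.
        return total_mb, data_width
    elif mode == "clamshell":
        # Each chip drives only its half of the data bus.
        return total_mb, data_width // chips
    raise ValueError(mode)

print(capacity_and_traces("chip_select"))  # (128, 32)
print(capacity_and_traces("clamshell"))    # (128, 16)
```

Same capacity and same shared command/address bus either way; the difference shows up only in how the data traces route to each chip.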
 
I dunno. I only have the whitepaper to go by, and nothing else. I suppose that an advantage would be power saved per module, but I am unsure if they allow per-module clock scaling. All I got from the whitepaper is that it's an easier config for designers. I suppose also that trace length would have an effect on the bandwidth, as the ICs still need to "sync up" regardless, but maybe by offering "clamshell", this impact could be less?
 
Here's a random possibility for a multi-die solution: something like the Smithfield P4, or what they do for large CMOS sensors that exceed reticle size.

Take two cores like RV670 and cut them out of the wafer in pairs.

Either the bus is a quick drop from one core down into the package and back up again, or they could stitch them together by linking them with data lines passing through one of the chips' metallization layers.

How many layers would it require in this way? :-|
 
Since the command/address bus is shared in both cases I don't think there is a latency or bandwidth advantage. But I guess the routing is easier in clamshell mode since you have just 16 traces going to each chip instead of 32 traces that have to be connected to both chips somehow. I'm wondering if there are other advantages.

The major advantage (*far* more than fewer PCB traces) is that you only have to deal with point-to-point connections, which makes it much easier to control bus terminations/impedance. Segments with different impedances will cause reflections, and that kills your signal integrity. The problem gets worse with increasing clock speeds, which is why they can still get away with multi-drop for the slower address/command signals.

It's for the same reason that all other high speed busses have moved over to point-to-point.
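The reflections mentioned above can be put in numbers with the standard transmission-line reflection coefficient, Γ = (Z_load − Z_0)/(Z_load + Z_0). The 60-ohm trace impedance below is an assumed illustrative value; the tee-junction case stands in for a shared data bus feeding two DRAMs.

```python
# Reflection coefficient at an impedance discontinuity
# (standard transmission-line formula). Impedance values are
# illustrative assumptions, not from any datasheet.

def reflection_coeff(z_line, z_load):
    """Fraction of the incident wave amplitude that reflects
    where a line of impedance z_line meets a load z_load."""
    return (z_load - z_line) / (z_load + z_line)

# Point-to-point, matched termination: no reflection.
print(reflection_coeff(60.0, 60.0))   # 0.0

# Multi-drop tee: two 60-ohm branches in parallel look like 30 ohms
# at the junction, so a third of the wave reflects back.
print(reflection_coeff(60.0, 30.0))   # -0.333...
```

This is why the multi-drop penalty grows with clock speed: each reflection takes a fixed fraction of the signal, and at higher data rates there is less timing and voltage margin left to absorb it.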
 
How many layers would it require in this way? :-|

I'm not sure.
On one hand, adding more signaling would be more challenging to pack into the current metallization layers, but I don't know to what extent.

It's not unheard of, as large ICs that exceed reticle size already do this, and they use higher and coarser metal layers than the rat's nest that is the low-level interconnect.

In addition, the T-stops would be nothing more than an additional stop on the ring bus, in place of the memory controller stops that would have been present regardless.
They'd probably best be placed on the edge of the die, likely where IO or DRAM pads would have gone anyway, so it's a place where the other metal layers are likely trying to avoid crowding by default.

So it might be one more metal layer, or it is possible with some creative engineering to get away with what is already there.
 
I'm not sure.
On one hand, adding more signaling would be more challenging to pack into the current metallization layers, but I don't know to what extent.

It's not unheard of, as large ICs that exceed reticle size already do this, and they use higher and coarser metal layers than the rat's nest that is the low-level interconnect.

In addition, the T-stops would be nothing more than an additional stop on the ring bus, in place of the memory controller stops that would have been present regardless.
They'd probably best be placed on the edge of the die, likely where IO or DRAM pads would have gone anyway, so it's a place where the other metal layers are likely trying to avoid crowding by default.

So it might be one more metal layer, or it is possible with some creative engineering to get away with what is already there.

I just wondered because of the high bandwidth requirements and the apparent non-optimized layout to begin with. Of course, compared to the socket solution it looks quite nice, but the price to pay, i.e. an extra layer, might be too high.
 
An additional metal layer may not be necessary, depending on how much room is available after removing half the DRAM pads by halving the external bus from a RV670-like chip.

The wiring and vias for IO or DRAM interfaces already punch through the metal layers, so metal leads that connect off-core would pass through an area already meant for external connections.

Even if there were an extra layer, it would be an incremental increase in complexity that would have to be weighed against other costs, such as the cost of implementing an external interface and protocol like HyperTransport or Crossfire.

Crossfire is already present in GPUs, so that seems like a surer bet than HyperTransport in the short term.

Direct die connection or connection through the package would require a less complex protocol. The link within the die could function as a somewhat slower ring-stop, and would not need to be engineered with requirements such as long link length or other wrinkles added to protocols that must travel off-chip.

It's merely a thought, as there are likely reasons why AMD's long-term strategy might favor stronger development of off-chip connections, though I doubt it's the fear of another metal layer that may not even be needed.
 
Following your comment, the inter-die connection could be a straightforward solution for 2 cores. But how about multiple stitched dice? I can't imagine that more of these can be stitched together, not necessarily due to communication problems, but based on manufacturing issues.

Assuming that an off-die connection structure would not be hampered too much by manufacturing, it seems to me that flexibility is gained in the choice of the number of dice. But wouldn't the multiple stops and off-/on-die connections deteriorate performance so much that other communication topologies should be chosen? Maybe a more distributed architecture? Multiple rings? I don't know. :???:
 
The scheme I have in mind is more of an intermediate "quick and dirty" step towards a multi-chip product than a permanent solution.

Scalability is an issue beyond two cores, since one neighbor is enough to halve the DRAM bandwidth per core.
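The halving follows from the earlier point about giving up DRAM pads to the inter-die link. A quick sketch, with an assumed RV670-like 256-bit external bus and the assumption that each neighbour link costs the pad budget of half that interface:

```python
# Why scaling past two cores hurts: each stitched neighbour consumes
# pad/wiring budget that would otherwise carry external DRAM signals.
# The 256-bit bus and the half-interface cost per link are assumptions
# for illustration, not vendor figures.

def dram_width_per_core(full_width_bits, neighbors):
    """External DRAM width left per core after dedicating pad area
    to point-to-point links with `neighbors` adjacent dice."""
    cost_per_link = full_width_bits // 2
    return max(0, full_width_bits - neighbors * cost_per_link)

print(dram_width_per_core(256, 0))  # 256 -> standalone core, full bandwidth
print(dram_width_per_core(256, 1))  # 128 -> one neighbour halves it
print(dram_width_per_core(256, 2))  # 0   -> no DRAM pads left at all
```

Under these assumptions a third core has nowhere left to attach, which is why the scheme tops out at a pair.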

Rather, if AMD wanted to save some time or cash by not implementing a full bus protocol on a GPU or it wanted to save a little on chip packaging, such a solution would be a good hack. Unlike an external link, this solution would offer latencies fast enough for it to be almost completely transparent to software.

The cost would likely be a little additional metal-layer work, a specialized last-level mask, and a yield impact. The yield question is a little difficult to predict.

In theory, they could cut the chip if one half is bad, so there is flexibility there.
Failing that, redundancy might allow one core to be shut off for a value SKU, or, with luck, the bad core could leave its memory controller on so the entire chip can act as a medium-value SKU.

Even if they can't, this method avoids the problem where a fault in a more complex MCM package leads to two good dies being discarded.
A single stitched chip would be simpler to wire.
 
Maybe a n00b question, but is it possible (in some realistic sense) to limit the transfers from one die to another in order not to lower the equivalent bandwidth of the complete ring? (A question under the assumption that slower T-stops are used.)
 
Since the cores would be designed to operate either in pairs or alone, the system would look like two separate rings that might share a few segments.

From the point of view of each core, each ring stop that connects to the other core will look like a slightly slower memory controller. The trip would probably amount to a few extra nanoseconds on a memory access, something the GPUs would have to tolerate anyway.
A possible downside, depending on how well the memory locations are balanced, is that with each T-stop looking like a memory controller, the ring bus in each core would likely need to be full-width, not half-width as it is in RV670.
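The "few extra nanoseconds" can be sketched with a toy latency model. Every number here is an assumption (hop cost, DRAM access time, hop counts); the model only shows how crossing the T-stop adds a handful of ring hops on top of a DRAM access that already dwarfs them.

```python
# Toy model of local vs. remote memory access over the ring.
# ns_per_hop, dram_ns, and the hop counts are illustrative assumptions.

def access_latency_ns(ring_hops, ns_per_hop=1.0, dram_ns=40.0):
    """DRAM access time plus the ring transit to reach the controller."""
    return dram_ns + ring_hops * ns_per_hop

local  = access_latency_ns(ring_hops=2)  # nearby memory controller stop
remote = access_latency_ns(ring_hops=6)  # via the T-stop on the other die

print(local, remote, remote - local)  # 42.0 46.0 4.0
```

Relative to the latency a GPU already hides with thousands of threads in flight, the extra transit is small, which is the sense in which the link could stay nearly transparent to software.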

A message queue on the T-stops would put an upper limit on the number of requests that can be outstanding on the other core.

If it turns out that, for some reason, a shader is only pulling from the remote core, it might be easier to just context switch the program over to the other core.
The R6xx's architecture seems very robust in the handling of contexts.

NUMA systems have dealt with this kind of problem for many years, so the problem of load-balancing is not new to GPUs. With the virtualization capabilities of the architecture, a pair of R6xx cores could be set up to interleave memory locations to balance the load, or software could monitor accesses and try to keep certain memory locations closer to the core that needs them.
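The interleaving option can be sketched in a few lines. The page-granular round-robin scheme and the 4 KiB granularity below are assumptions for illustration, i.e. the classic NUMA interleave policy applied to a two-core pair, not anything from the R6xx documentation.

```python
# Sketch of NUMA-style page interleaving across two cores' memory
# controllers, so neither die's DRAM (or the inter-die link) carries
# all the traffic. Page size and layout are illustrative assumptions.

PAGE_SHIFT = 12  # assume 4 KiB interleave granularity

def home_core(addr, num_cores=2):
    """Which core's memory controller owns this address under
    simple round-robin page placement."""
    return (addr >> PAGE_SHIFT) % num_cores

# Consecutive pages alternate between the two cores:
print([home_core(p << PAGE_SHIFT) for p in range(4)])  # [0, 1, 0, 1]
```

The alternative the post mentions, software monitoring accesses and migrating hot pages toward the core that uses them, trades this static balance for better locality at the cost of bookkeeping.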

The price is mostly in the form of higher latencies, but bandwidth and throughput should be better utilized.
 