ATI's former Ring Bus

swaaye

I've been pondering this a lot since RV770 arrived. ATI really trumpeted their ring bus as total wonderfulness back when they used it and nobody here was talking about how it was fundamentally flawed. Then, along came RV7x0 and they completely did away with it and returned to a more traditional approach.

What happened? Are there any advantages to the ring bus compared to what they've gone back to now?
 
The ring bus was an approach to routing the traces within the chip, not a performance feature in itself. Obviously it wasn't needed anymore for the new GPU -- probably the overall layout didn't justify it.
 
As I remember, a ring bus is a good solution for a situation where you have only a few clients but need a wide bus. The complexity of this bus grows with the number of clients. Older crossbars had the opposite problem: their complexity grows with the width of the bus. RV770 had 2.5x the SIMDs of R600 and needed only a 256-bit (narrower) bus, so an R600-style ring bus wasn't suitable for that design; it would have become too complex.
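To put rough numbers on that scaling argument, here's a toy wiring-cost model (the formulas and constants are purely illustrative assumptions on my part, not anything ATI published):

```python
# Toy wiring-cost model: how ring-bus vs. crossbar "complexity" grows.
# Assumptions (mine, for illustration only): a ring's wire cost scales with
# bus_width * clients (one ring segment per stop), while a full crossbar's
# wire cost scales with bus_width * clients^2 (a width-wide path from every
# source to every destination).

def ring_cost(clients: int, bus_width: int) -> int:
    # One wide ring passing through every stop: width per segment, one segment per client.
    return bus_width * clients

def crossbar_cost(clients: int, bus_width: int) -> int:
    # Full crossbar: a width-wide path from each source to each destination.
    return bus_width * clients * clients

for clients, width in [(4, 1024), (8, 1024), (4, 256), (10, 256)]:
    print(f"{clients:2d} clients, {width:4d}-bit bus: "
          f"ring ~{ring_cost(clients, width):>6d}  "
          f"xbar ~{crossbar_cost(clients, width):>7d} wire-units")
```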

I may be mistaken, but the ring bus was also used for other purposes - e.g. for texel sharing between different quads. The R600 architecture didn't use screen tiling. On R300/R4xx/R5xx/R7xx each quad/SIMD works on its own tile, so you have to care only about the border areas. That's why there is a lot less data sharing and a ring bus isn't necessary there, either. I think a local crossbar is used for this purpose on RV770. RV770 also uses a separate bus for low-bandwidth clients (UVD, PCIe, CF...), which were previously connected to the ring bus.
 
I wouldn't have expected AMD to trumpet the downsides of the ring bus, though there was some commentary on its potential downsides when it came to handling more variable dynamic behavior and congestion at hot spots. It does offer a conceptually simple way to add a lot of bandwidth and clients with only incremental gains in complexity.

Cutting the very large ring bus down for RV670 didn't have a massive performance impact for the most part, which indicated that the big bandwidth numbers the ring bus supposedly provided didn't really matter, given how little of it the rest of the hardware could actually use.
The die size and transistor count were reduced by a fair amount as well, though there wasn't a die shot of the R600 die to show just how much physical space it took up. Nor have I seen anything about exactly what kind of implementation ATI used.

R600's internal data traffic was theoretically more flexible. Traffic between clients and memory controllers was possible in large amounts over the ring bus.

RV770's change to a crossbar came with an apparent reduction in this free flow, along with the adoption of a tiled memory arrangement that apparently fit what performance profiling of R600 had shown.
Maybe this solution is less flexible, but not much publicly available information indicates that R600 was able to yield much from this flexibility anyway.
 
Older crossbars had opposite problem - their complexity grows with the width of the bus.
Crossbars grow in complexity with the product of clients times width ... but there is less co-dependency between the communication of different clients; a bus always has to be over-dimensioned to account for the cases where a lot of clients want to talk to each other at once.

Eventually switched networks will be the only real option, everything else scales worse.
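One way to picture that over-dimensioning point, as a toy sketch with made-up numbers:

```python
# Toy illustration of the co-dependency point: with N clients that all want
# to move `payload` bits in the same cycle to distinct destinations, a
# crossbar delivers them in parallel, while a single shared bus has to either
# serialize them or be made N times wider. All numbers are made up.

def shared_bus_cycles(transfers: int, bus_width: int, payload: int) -> int:
    # All transfers share one bus: total payload divided by the bus width.
    total_bits = transfers * payload
    return -(-total_bits // bus_width)          # ceiling division

def crossbar_cycles(transfers: int, port_width: int, payload: int) -> int:
    # Distinct src->dst pairs proceed in parallel, each on its own port.
    return -(-payload // port_width)

payload = 256                                    # bits each client wants to move
print("shared 256-bit bus :", shared_bus_cycles(8, 256, payload), "cycles")
print("shared 2048-bit bus:", shared_bus_cycles(8, 2048, payload), "cycles")
print("8x8 crossbar, 256b :", crossbar_cycles(8, 256, payload), "cycle(s)")
```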
 
Well this is definitely what I was looking to hear. I did try searching the forum before I posted, but I just couldn't find any in-depth discussion of the change.

It does seem like the ring bus was just another of the mis-targeted aspects of R600 and friends. No one's coming up with scenarios where it was noticeably beneficial, although since it was feeding a GPU that was held back in other ways, maybe it's wrong to say that for certain.
 
Search on 'silent_guy' and 'ringbus'. :)

The best it can do, in terms of performance, is do as well as a crossbar. But there are cases where it will do worse if you're not careful.

It was great marketing though...
 
Search on 'silent_guy' and 'ringbus'. :)

The best it can do, in terms of performance, is do as well as a crossbar. But there are cases where it will do worse if you're not careful.

It was great marketing though...

Sure if you give the xbar exponentially more area/wire tracks...
 
Sure if you give the xbar exponentially more area/wire tracks...
#src * #dest, not exponential. Big difference. ;)

High end switching chips have far more sources and destinations (though the bus is probably not as wide). In addition, you can use distributed placement, make tradeoffs between bus width, clock speed, xbar topology etc.

Let's throw some first order numbers together: say you have a bus of 256 wires per client and a wire pitch of 80nm (a smaller number on lower layers, larger on higher layers) and 4 routing layers, with drivers spread along the track; then you're talking about a channel width of 5um per source.
Say you have 16 sources, that's a combined channel of 80um wide. Not the end of the world, even if I'm off by a factor of 2 or 3.
Placement is even less of an issue as long as you're willing to give up a bit of area density at switching nodes (which could be placed in a distributed way, e.g. one at each destination point.)
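Spelling that arithmetic out (same assumptions as above: 256 wires per client, ~80nm pitch, 4 routing layers, 16 sources):

```python
# Back-of-envelope version of the channel-width estimate above.
wires_per_client = 256        # bus width per client
wire_pitch_nm    = 80         # assumed average wire pitch
routing_layers   = 4          # layers available for the channel
sources          = 16

# Width needed per source: all of its wires, spread over the routing layers.
per_source_um = wires_per_client * wire_pitch_nm / routing_layers / 1000
total_um      = per_source_um * sources

print(f"per-source channel width: {per_source_um:.1f} um")   # ~5 um
print(f"combined channel width  : {total_um:.1f} um")         # ~80 um
```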
 
What's the general scaling of latency of a crossbar?

Earlier research showed that forwarding networks on CPUs typically scaled quadratically with issue width, but I don't know how much that would apply to a memory crossbar.
 
#src * #dest, not exponential. Big difference. ;)

It is effectively order(n*m) ~= order(x^2). Close enough, since we really only need to consider the portion of the curve in the +x/+y quadrant.

High end switching chips have far more source and destinations (though the bus is probably not as wide.) In addition, you can use distributed placement, make tradeoffs between bus width, clock speed, xbar topology etc.

High end switching chips generally aren't pushing bandwidth internally since the off-chip bandwidth is so low (~1x 1 write @ 1-10GT/s). The bandwidth multipliers are such that they can fairly easily time-multiplex many of the external connections over an internal bus, resulting in relatively modest actual xbars. Plus, they also generally don't care about latency at all, so things like 40+ cycle scheduling algorithms are fine as well.

And if you go to a multi-stage topology, you start to get into a whole bunch of interconnection network issues.
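To illustrate the time-multiplexing point above with made-up example numbers (not any particular switch chip):

```python
# Toy example of why a switch chip can time-multiplex its external links over
# a modest internal bus. All figures are illustrative assumptions.
external_ports   = 48
port_rate_gbps   = 10        # per-port line rate, Gb/s
internal_width_b = 512       # internal bus width in bits
internal_clk_ghz = 1.0       # internal bus clock

aggregate_gbps = external_ports * port_rate_gbps
internal_gbps  = internal_width_b * internal_clk_ghz

print(f"aggregate external bandwidth : {aggregate_gbps} Gb/s")
print(f"single internal bus bandwidth: {internal_gbps:.0f} Gb/s")
# 512 Gb/s >= 480 Gb/s, so one shared internal bus can carry every port,
# which is why the actual internal xbar can stay relatively modest.
```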

Let's throw some first order numbers together: say you have a bus of 256 wires per client and a wire density of 80nm (smaller number of lower layers, larger for higher layers) and 4 routing layers, with drivers spread along the track, then you're talking about a channel width of 5um per source.
Say you have 16 sources, that's a combined channel of 80um wide. Not the end of the world, even if I'm off by a factor of 2 or 3.
Placement is even less of an issue as long as you're willing to give up a bit of area density at switching nodes (which could be placed in a distributed way, e.g. one at each destination point.)

First, you realistically need to be assuming upper layers for an xbar. Top signal layer + doubled top-1 is pretty standard.

You're likely off by a factor of 3-4x just on data bus wire tracking, and then the various other overheads factor in. In the real world you are talking significant wire-track overheads in every xbar I've ever seen in actual silicon, to the point that, given the right layering and centralization, they are visible to the naked eye.

I won't even get into the scheduling complexities that arise if you actually want to get peak performance out of an xbar.

I won't argue whether the ATI ring bus was good or bad, but once you get much beyond 4-6 endpoints on an xbar, the costs start to go up fairly fast unless you aren't really implementing an xbar and are instead just doing broadcast and mux-down trees, which I would argue is probably what most of the graphics vendors have been doing.
 
What's the general scaling of latency of a crossbar?

Earlier research showed that forwarding networks on CPUs typically scaled quadratically with issue width, but I don't know how much that would apply to a memory crossbar.

Depends where you are on the knee of the curve. Basically, below some number of endpoints X, the basic overheads outweigh the additional RC/logic scaling costs; past X, the RC/logic scaling costs significantly outweigh the basic overheads.

So until you hit X, the latency stays roughly flat. Once you hit the knee of the curve, growth is roughly cubic/exponential.
 
... unless you aren't really implementing an xbar and are instead just doing broadcast and mux-down trees, which I would argue is probably what most of the graphics vendors have been doing.
Yes, that was my assumption. It makes things much easier.
 
I'm curious what this distinction entails.

So as opposed to a crossbar with oodles of crossover points routing requests, a broadcast scheme would have clients on one side of the system request bus broadcast data to many or all of the consumers, and the broadcast would be filtered by multiplexers at the end?

Would this be helped by the fact that the data traffic for the texture caches is one-way, and that there are fewer broadcasters (L2 texture cache partitions/memory controllers) than there are consumers?
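For what it's worth, here's my reading of the broadcast-and-mux idea as a toy sketch; the structure and names are my own guess, not anything the vendors have documented:

```python
# Toy model of a broadcast-and-mux-down scheme: each producer (e.g. an L2
# partition / memory controller) broadcasts (dest_id, data) to every
# consumer, and a simple mux at each consumer keeps only what is addressed
# to it. No crossover-point routing or global scheduling is needed.
from typing import List, Tuple

Packet = Tuple[int, str]          # (destination consumer id, payload)

def broadcast_and_mux(packets: List[Packet], num_consumers: int) -> List[List[str]]:
    inboxes: List[List[str]] = [[] for _ in range(num_consumers)]
    for dest, data in packets:                 # every packet is seen by all...
        for consumer in range(num_consumers):
            if consumer == dest:               # ...but the mux keeps only matches
                inboxes[consumer].append(data)
    return inboxes

# Two producers, four consumers; traffic is one-way (producers -> consumers).
cycle = [(0, "texels for quad 0"), (3, "texels for quad 3")]
print(broadcast_and_mux(cycle, num_consumers=4))
```

If that reading is right, the one-way traffic and small broadcaster count you mention would be exactly what keeps such a scheme cheap, since only the consumer side needs muxes.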
 
Nehalem EX's ring bus

http://www.semiaccurate.com/2009/08/25/intel-details-becton-8-cores-and-all/

The ring bus is actually four rings, with the data ring being 32 bytes wide in each direction. It looks a lot like the ring in Larrabee, but Intel has not announced the width of that part yet. That said, it is 1024 bits wide, 512b times two directions. There are eight stops on the Becton ring, and data moves across it at one stop per clock.

To pull data off the ring, each stop can only pull one request off per clock. With two rings, this could be problematic if data comes in from both directions at the same time. The data flow would have to stop or the packets would have to go around again. Neither scenario is acceptable for a chip like this.

Intel solved this by putting some smarts into the ring stops, and added polarity to the rings. Each ring stop has a given polarity, odd or even, and can only pull from the correct ring that matches that polarity. The ring changes polarity once per clock, so which stop can read from which ring on a given cycle is easy to figure out.

Since a ring stop knows which other stop it is going to talk to, what the receiver polarity is, and what the hop count is between them, the sender can figure out when to send so that the receiver can actually read it. By delaying for a maximum of one cycle, a sender can assure the receiver can read it, and the case of two packets arriving at the same time never occurs.

In the end, the ring has four times the bandwidth of a similar width unidirectional ring, half the latency, and never sends anything that the receiver can't read. The raw bandwidth available is over 250GBps, and that scales nicely with the number of stops. You could safely speculate that Eagleton will have a 375GBps ring bus if the clocks don't change much.
Looks like Larrabee on a smaller scale, huh?
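The polarity trick in that quote can be sketched as a toy model (my own simplification of the description, not Intel's implementation):

```python
# Toy model of the polarity scheme described above: each ring stop has a
# fixed polarity (even/odd) and can only pull a packet off the ring on
# cycles of matching parity. Knowing the hop count to the receiver, the
# sender delays by at most one cycle so the packet arrives on a readable
# cycle, and two packets never need to be pulled at once.

def send_delay(send_cycle: int, hops: int, receiver_polarity: int) -> int:
    """Return 0 or 1: extra cycles to wait before injecting the packet."""
    arrival = send_cycle + hops                 # one stop per clock
    return 0 if arrival % 2 == receiver_polarity else 1

# Example: a stop sends to a receiver 5 hops away with odd polarity.
for cycle in range(4):
    d = send_delay(cycle, hops=5, receiver_polarity=1)
    print(f"ready at cycle {cycle}: inject at cycle {cycle + d}, "
          f"arrives at cycle {cycle + d + 5}")
```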

Jawed
 