How about a little R700 speculation?

In a sense the G80 is already 8-way multicore. Having glanced through the presentation, I'd say it's about being able to have several application contexts running concurrently on the GPU (concurrently in the parallel-cores sense, not in the fast context-switch timesharing sense).
(disclaimer: I know far more about CPU architectures than I do GPU architectures, so that's what I relate things to. I think it's a natural fit, but it might just be me).

The G80 can very clearly be thought of as a type of multi-core processor, but I realized you can also think about the R520 as having features of multi-core processors that would allow it to span multiple dies. You can almost directly relate the G80 and R520 to the UMA and NUMA memory architectures found in multiprocessor servers.

Quick rundown for those who don't know what UMA and NUMA are:
UMA means Uniform Memory Access. In a UMA system there is a single pool of memory at a uniform distance from every CPU. This is the type of architecture that current Intel Xeon servers use: the memory controller is located on the northbridge and every CPU has a connection to the northbridge. It doesn't matter which CPU is trying to talk to which part of memory, it's always CPU->northbridge->memory.
NUMA means Non-Uniform Memory Access. In a NUMA system there are multiple pools of memory, each of which can be a different distance from a given CPU. This is the type of architecture that current AMD Opteron servers use: the memory controllers are located on each CPU, and the CPUs connect to each other. This means when a CPU accesses memory it can either be local to that CPU (on that CPU's memory controller) or on another CPU, requiring at least 1 CPU->CPU hop to access.
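To make the difference concrete, here's a toy hop-count model (a minimal sketch; the hop costs and the two-socket setup are illustrative assumptions, not measured figures):

```python
# Toy model of UMA vs NUMA access distance (illustrative numbers only).

def uma_hops(cpu, mem_region):
    # Every access goes CPU -> northbridge -> memory, no matter which CPU
    # or which part of memory: the distance is uniform.
    return 2

def numa_hops(cpu, mem_home):
    # Each memory pool is homed on one CPU's controller. Local access is
    # one hop; remote access adds at least one CPU -> CPU hop.
    return 1 if cpu == mem_home else 2  # can be more hops in bigger systems

for cpu in range(2):
    for mem in range(2):
        print(f"CPU{cpu} -> mem{mem}: UMA {uma_hops(cpu, mem)} hops, "
              f"NUMA {numa_hops(cpu, mem)} hops")
```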

Now from the above you can see that all Nvidia cards (including G80) are, and all ATI cards until the R520 were, UMA architectures. It doesn't matter which cluster or quad requests which memory, it's always the same distance from the cluster/quad, through the crossbar, to memory. This uniformity makes it extremely easy to program from a software perspective, since you can run any bundle / quad / thread accessing any texture on any of the computation units. The problem with UMA architectures is the very thing that makes this possible: the crossbar. The crossbar required to interconnect every one of the 8 clusters with every one of the 6 memory controllers must be gigantic.

Now you can get away with a huge crossbar if you stay within the chip, but it becomes prohibitive if you want a multi-die GPU. Let's hypothetically split the G80 into mini-GPUs of some number of clusters each (let's say 2 or 4). Each mini-GPU now needs to connect to something, and this "northbridge"/"master"/crossbar chip becomes your limiting factor, and it isn't reusable. You'd end up making a few different crossbar chips in order to support some number of clusters with some RAM width, and then create the rest of the SKUs you're interested in by selling disabled versions of the next highest crossbar. Even if you could build such a beast of a crossbar chip, you can imagine the pin count required would be insane, since it has to connect to every mini-GPU and every memory module.
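To put rough numbers on it: a full crossbar needs a path for every (client, memory controller) pair, so it grows multiplicatively, while a ring only adds a couple of links per extra stop. A back-of-the-envelope sketch (the 8-cluster / 6-controller counts are the G80 figures from above; the rest is my assumption):

```python
# Back-of-the-envelope wiring count: full crossbar vs ring.
# Assumes 8 shader clusters and 6 memory controllers, as in the G80 example.

def crossbar_links(clients, mem_controllers):
    # A full crossbar needs every client wired to every memory controller.
    return clients * mem_controllers

def ring_links(stops):
    # A ring only needs each stop wired to its two neighbours.
    return stops

print("crossbar:", crossbar_links(8, 6), "client<->controller paths")
print("ring, 8 stops:", ring_links(8), "stop<->stop links")
```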

Now to where the server industry has been going for a while, and what R520 is: NUMA. The R520's ring bus is clearly a NUMA architecture. Every quad is connected to a single client port on the ring. This means that a texture request from a quad could be serviced by the ring stop next to it, or by one as far as 2 stops away. This means that current programs executing on R520 must be able to handle the variable latency of requests.

Now here comes the fun part: what happens when you cut the R520 into quarters, leaving 1 ring stop and all the clients associated with it, and call that a mini-GPU? Ignoring a bit of control logic, you have 1 fully working mini-GPU that has 1/4 the quads and 1/4 the RAM width, aka a low-end SKU. Want a mid-range SKU? Take 2 of these mini-GPUs and join them in a ring. High-end, take the original 4. Super-high-end, take 8 and make a large ring out of them.
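A quick sketch of how the worst-case hop count grows as you chain more of these mini-GPUs into a ring (the 1/2/4/8 chip counts match the SKUs above; the bidirectional-ring assumption is mine):

```python
# Worst-case hop count on a bidirectional ring of N stops: a request can
# travel either way around, so the farthest stop is N // 2 hops away.

def worst_case_hops(stops):
    return stops // 2

for n in (1, 2, 4, 8):
    print(f"{n}-chip ring: worst case {worst_case_hops(n)} hop(s) to reach memory")
```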

I know I'm ignoring a bit of control logic, scheduling, and other extraneous stuff, but none of that should be too hard to overcome. As far as inter-chip bandwidth goes, to match the current ring bus you "only" need 2 times the DDR bandwidth of a ring stop, meaning about 64GB/sec / 4 * 2 = 32GB/sec. HT3 is already spec'd at 40GB/sec, so this should be no problem to keep up with.
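Spelled out (the 64GB/sec aggregate and 4 stops are the same assumed R520-class figures as above):

```python
# Rough check of the inter-chip bandwidth figure quoted above.
total_mem_bw_gb = 64                               # assumed aggregate memory bandwidth, GB/s
ring_stops      = 4                                # assumed number of ring stops
per_stop_bw     = total_mem_bw_gb / ring_stops     # 16 GB/s per stop
inter_chip_bw   = per_stop_bw * 2                  # ring traffic flows both ways
print(inter_chip_bw, "GB/s per chip-to-chip link, vs ~40 GB/s for an HT3 link")
```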

In conclusion, the current ATI architecture lends itself to a fully unified, multi-die, NUMA topology. I admit I was skeptical at first, but after thinking through this post I think it's not only very feasible, but would provide a huge cost advantage. This doesn't mean that Nvidia could not produce something similar, it's just that Nvidia's current architecture wouldn't work for it, while ATI's current architecture can.
 
Umh... imho that's another thing; TheINQ was hinting at multichip configurations, the patent is not about that.
 
My first thought here is that you have a hierarchy of ring-buses. You have a ring-bus for chip-to-chip data sharing, with each chip having a smaller-scale ring-bus (smaller scale compared with R580) because it interfaces to fewer memory chips, i.e. needs fewer ring stops.
That's actually the first thing I was thinking of after reading the original post. It would be a good explanation for why ATI went for the ring-bus idea. I assume AMD can help ATI out as well. They do this sort of communication in 4-way and 8-way systems, right?

It might be one of those things where multiple dies are in the same package. The chip to chip interconnects could be fast and wouldn't need any pins.

A lot of questions, though. Would you want two different types of dies or one? It seems like there would be a fair amount of replication if there wasn't a master die somewhere to handle scanout, AVIVO, the control processor, reading from system memory, and load distribution.

Pretty neat idea if it doesn't follow the SLI paradigm of how to do multi-GPU, but it's not clear if it's feasible.
 
Hopefully SLI will die sooner or later.. it does its job (and it does it well in many cases) but it's sooo ugly
 
Hopefully SLI will die sooner or later.. it does its job (and it does it well in many cases) but it's sooo ugly

Put the side back on the dang case! And quit computing in your underwear! :p
 
If you assume that there's a distributed virtual memory system in place (i.e. each shader chip has a page table for the entire graphics memory attached to all the chips that make up the GPU), then it doesn't seem much of a stretch to assume that load-balancing of the VS workload is done per chip, by whichever chip "owns" those vertices (i.e. those vertices are in the chip's own memory, so it shades them).

The VS workload delivered to the GPU by the CPU is tiled across the chips simply because of the memory space it consumes. The graphics driver can also choose where to put VS workload, since the GPU's virtual memory system (and its threads) is ultimately under the control of the CPU (in WDDM2.1).

Obviously the VS->GS mapping is a bit more tricky, as a single primitive can consist of vertices from multiple chips. But that seems to me to be a buffering problem, where every nth shaded vertex needs to be distributed to other chips (keyed by the virtual memory system).

Finally PS workload is a simple function of screen-space tiling, which is trivially parallelisable across the chips.
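As an example of how trivial that tiling can be, a minimal sketch (the tile size, chip count and round-robin assignment are all arbitrary illustrations, nothing from an actual design):

```python
# Minimal screen-space tiling: statically assign each tile of pixels to a chip.
# Tile size and chip count here are arbitrary illustrations.

TILE  = 32   # 32x32 pixel tiles
CHIPS = 4

def chip_for_pixel(x, y, screen_width):
    tiles_per_row = (screen_width + TILE - 1) // TILE
    tile_id = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_id % CHIPS        # round-robin tiles across the chips

print(chip_for_pixel(100, 200, screen_width=1600))   # -> chip 3
```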

So, it seems that to build a distributed virtual memory system you need to implement a coherency protocol. But I think that because only the ROPs require write-in-place use of memory (via an on-die buffer, and screen-space tiled for privacy), you can dramatically simplify the coherency protocol (i.e. the reader takes a page copy for itself). Most writes to memory in a GPU are streamed, and those writes require completion before reading. "Completion" may well mean "page is filled", but basically it appears that GPU memory doesn't have to support multiple concurrent clients writing to a single page in memory.
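A rough sketch of what that simplified scheme could look like (the class and field names are all made up for illustration; this is just the "copy-on-read, no shared writers" idea above in code form):

```python
# Sketch of the simplified coherency idea: pages are owned by one chip,
# remote readers take a private snapshot, and a streamed-to page must be
# marked complete before anyone reads it. Names are purely illustrative.

class Page:
    def __init__(self, owner):
        self.owner = owner           # chip whose local memory holds the page
        self.data = bytearray(4096)
        self.write_complete = True   # streamed writes clear this while filling

class Chip:
    def __init__(self, chip_id):
        self.chip_id = chip_id
        self.local_copies = {}       # page_id -> private snapshot

    def read(self, page_id, page_table):
        page = page_table[page_id]
        if not page.write_complete:
            raise RuntimeError("page still being streamed; read must wait")
        if page.owner != self.chip_id:
            # Remote page: take a copy for ourselves, no coherency traffic later.
            self.local_copies[page_id] = bytes(page.data)
            return self.local_copies[page_id]
        return page.data

page_table = {0: Page(owner=0), 1: Page(owner=1)}
chip0 = Chip(0)
chip0.read(1, page_table)   # remote read: chip 0 snapshots page 1 locally
```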

I don't think there's any getting around the fact that you want some kind of AVIVO chip and a PCI-Express chip to glue together the shader chips' dangly bits at either end of the rendering pipeline. But I think a symmetrical organisation of shader chips with "collaborative" load-balancing, based upon the location of work and shared resource usage (keyed by the page tables) will work.

Jawed
 
That's actually the first thing I was thinking of after reading the original post. It would be a good explanation for why ATI went for the ring-bus idea. I assume AMD can help ATI out as well. They do this sort of communication in 4-way and 8-way systems, right?
AMD systems don't use a ring-bus, they use a point-to-point connection scheme. I guess it could be made into a ring, though given the restrictions on the number of chips it wouldn't be worth it. The HyperTransport links up to 4-way provide direct connections between every socket. 8-way doesn't allow direct connections between all the chips, but does keep the number of jumps pretty low.
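For concreteness, a hop-count comparison under those assumptions (the fully-connected 4-way is per the post; the 8-way adjacency below is just one plausible ladder-style layout I made up, not any specific board):

```python
# Worst-case hop counts for a fully-connected 4-way vs an illustrative 8-way
# point-to-point topology (the 8-way layout here is a made-up example).
from collections import deque
from itertools import combinations

def worst_case_hops(n, links):
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    worst = 0
    for src in range(n):                 # breadth-first search from every socket
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

four_way = list(combinations(range(4), 2))     # every socket linked to every other
eight_way = [(0, 1), (2, 3), (4, 5), (6, 7),   # ladder rungs
             (0, 2), (1, 3), (2, 4), (3, 5),
             (4, 6), (5, 7), (0, 6), (1, 7)]   # rails wrapped into a loop

print("4-way worst case:", worst_case_hops(4, four_way), "hop")
print("8-way worst case:", worst_case_hops(8, eight_way), "hops")
```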

A ring-bus didn't fit AMD's need for a low-latency interconnect for the number of chips in its target market.

Compared to a direct connection, a ring-bus inserts additional jumps, each of which adds latency.
It's fine if the design is targeted at something that tolerates latency, or if, like IBM, AMD could get people to pay serious cash for an MCM that can route a ring bus between a bunch of chips on a package, but it's not so good otherwise.

On-chip, AMD uses a cross-bar between all components that need access off-chip.

It might be one of those things where multiple dies are in the same package. The chip to chip interconnects could be fast and wouldn't need any pins.
I think IBM does use ring buses for its MCMs, but those aren't cheap.
 
Hopefully SLI will die sooner or later.. it does its job (and it does it well in many cases) but it's sooo ugly

I'd expect you, even more than most, to notice how crowded things are on your screen rather than inside your case :p
 
AMD systems don't use a ring-bus, they use a point-to-point connection scheme. I guess it could be made into a ring, though given the restrictions on the number of chips it wouldn't be worth it. The hypertransport links up to 4-way provide direct connections between every socket. 8-way doesn't allow direct connections between all the chips, but does keep the number of jumps pretty low.
While you can have a fully-connected 4-way system, I know that there are shipping systems that are only a square topology, without the cross connections needed to be fully connected. Opterons in a square are topologically identical to the R520 ring.

The important thing about the R520 design is that they have a working NUMA chip. Going from the current 2-neighbor ring to a 3-neighbor mesh is just engineering.

However I'm not convinced that they need to do anything other than a ring bus for even 8 chips. Graphics cards are very good at hiding latency: simply allow more threads to be in flight at once. It seems like a win to spend very few transistors on a simple 2-port ring connection, and spend extra transistors on the number of threads in flight.
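Roughly speaking, the number of threads you need in flight scales with the latency you're covering; a Little's-law style estimate with made-up numbers:

```python
# Little's-law style estimate of threads needed in flight to hide latency.
# All figures are illustrative assumptions, not measured hardware numbers.

issue_rate_per_clock = 16    # memory requests a chip can issue per clock
local_latency        = 200   # clocks for a local memory access
ring_hop_latency     = 50    # assumed extra clocks per ring hop
worst_case_hops      = 4     # worst case on an 8-chip ring

worst_latency = local_latency + ring_hop_latency * worst_case_hops
print("threads in flight, local only :", issue_rate_per_clock * local_latency)
print("threads in flight, worst case :", issue_rate_per_clock * worst_latency)
```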
 
That's interesting. What did they do with the extra cc-HT link on each Opteron?

A point-to-point interconnect can be made into a ring, but I thought it would be desirable to save the extra hop.
I suppose it would save on board costs by not having to route the last two links.
 
They could probably be used for IO or co-processors; I've seen some server mobos with HTX slots, but I think in others they're just dead links to cut costs.
 
AMD systems don't use a ring-bus, they use a point-to-point connection scheme. I guess it could be made into a ring, though given the restrictions on the number of chips it wouldn't be worth it. The hypertransport links up to 4-way provide direct connections between every socket. 8-way doesn't allow direct connections between all the chips, but does keep the number of jumps pretty low.
Yeah, that's why I said "sort of". I've also seen the "square" topology.

Anyway, my point was that AMD's experience should help. This sort of scheme will definitely want as few jumps as possible. Colour/Z buffer reads/writes should be confined to the memory channel that is connected to each chip, but texture reads could come from anywhere. You don't want exorbitant latency or you'll waste space on thread juggling and register files, and as data travels around, the traffic gets amplified.
 
Opterons in a square are identical to the R520 ring.

The important thing about the R520 design is that they have a working NUMA chip. Going from the current 2-neighbor ring to a 3-neighbor mesh is just engineering.

However I'm not convinced that they need to do anything other than a ring bus for even 8 chips. Graphics cards are very good at hiding latency: simply allow more threads to be in flight at once. It seems like a win to spend very few transistors on a simple 2-port ring connection, and spend extra transistors on the number of threads in flight.
I hate to rain on the parade, but R520/R580 is almost as centralized as any other GPU and hence a long way away from being NUMA: the ring is only used for the read data path, which is by far the easiest problem to solve. Other than that you have a centralized crossbar that gathers requests and write data from the different agents and sends them over a dedicated link straight to the individual memory controllers. And let's all be glad that's the case, because otherwise performance would really suck, or complexity/overhead would be significantly higher, or both.

A complete ring bus that also transports requests can be a workable solution if the ratio of request agents to memory agents is around or smaller than 1 and the maximum number of outstanding requests per request agent is low (which is exactly what a multi-CPU system looks like). But in the case of a GPU, you have tons of request agents, all fighting for their share. It's impossible to schedule transactions efficiently with a device that can only feed 1 request per cycle.
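To put rough numbers on that imbalance (the agent counts here are made-up illustrations, not R520 specifics):

```python
# Illustration of the request-agent imbalance described above (made-up counts).
request_agents   = 16   # quads / texture units all generating requests
memory_agents    = 4    # memory channels servicing them
ring_inject_rate = 1    # requests a single ring port can accept per cycle

print("request agents per memory agent      :", request_agents / memory_agents)
print("requests wanting to enter per cycle  :", request_agents)
print("requests one ring port accepts/cycle :", ring_inject_rate)
```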
 