(disclaimer: I know far more about CPU architectures than I do GPU architectures, so that's what I relate things to. I think it's a natural fit, but it might just be me.)

In a sense the G80 is already 8-way multicore. Having skimmed through the presentation, I'd say it's about being able to have several application contexts running concurrently on the GPU (concurrently in the parallel-cores sense, not in the fast-context-switch timesharing sense).
The G80 can very clearly be thought of as a type of multi-core processor, but I realized you can also think about the R520 as having features of multi-core processors that would allow it to span multiple dies. You can almost directly relate the G80 and R520 to the UMA and NUMA memory architectures found in multiprocessor servers.
Quick rundown for those who don't know what UMA and NUMA are:
UMA means Uniform Memory Access. In a UMA system there is a single pool of memory at a uniform distance from every CPU. This is the type of architecture that current Intel Xeon servers use: the memory controller is located on the northbridge and every CPU has a connection to the northbridge. It doesn't matter what CPU is trying to talk to what part of memory; it's always CPU->northbridge->memory.
NUMA means Non-Uniform Memory Access. In a NUMA system there are multiple pools of memory, and they can be at non-uniform distances from each CPU. This is the type of architecture that current AMD Opteron servers use: the memory controllers are located on each CPU, and the CPUs connect to each other. This means when a CPU accesses memory, the memory can either be local to that CPU (on that CPU's memory controller) or on another CPU, requiring at least 1 CPU->CPU hop to access.
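To make the contrast concrete, here's a toy hop-count model of both schemes. The chain topology for NUMA is my own simplification for illustration; real Opteron systems use various link layouts:

```python
# Toy hop-count model of UMA vs. NUMA (a sketch, not real latencies).

# UMA (Xeon-style): every access goes CPU -> northbridge -> memory,
# so the "distance" is the same no matter which CPU or region is involved.
def uma_hops(cpu, mem_region):
    return 2

# NUMA (Opteron-style): each CPU owns one memory pool. A local access costs
# 1 hop; a remote one adds CPU->CPU hops. Assume a simple chain of CPUs
# where the link distance is just |cpu - owner|.
def numa_hops(cpu, mem_region):
    return 1 + abs(cpu - mem_region)

for cpu in range(4):
    print(cpu, uma_hops(cpu, 0), numa_hops(cpu, 0))
# -> every CPU sees 2 hops under UMA; under NUMA, CPU 0 sees 1 and CPU 3 sees 4
```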
Now from the above you can see that all Nvidia cards (including the G80) are, and all ATI cards until the R520 were, UMA architectures. It doesn't matter which cluster or quad requests what memory; it's always the same distance from the cluster/quad, through the crossbar, to memory. This uniformity makes it extremely easy to program from a software perspective, since you can run any bundle / quad / thread accessing any texture on any of the computation units. The problem with UMA architectures is the very thing that makes this possible: the crossbar. The crossbar required to interconnect every one of the 8 clusters with every one of the 6 memory controllers must be gigantic.
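A quick back-of-the-envelope sketch of why that crossbar gets gigantic, assuming one path per cluster/controller pair:

```python
# A full crossbar needs a path from every cluster to every memory
# controller, so the wiring grows multiplicatively.
def crossbar_paths(clusters, mem_controllers):
    return clusters * mem_controllers

print(crossbar_paths(8, 6))    # G80-like: 48 cluster<->controller paths
print(crossbar_paths(16, 12))  # double both sides and the wiring quadruples: 192
```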
Now you can get away with a huge crossbar if you stay within the chip, but it becomes prohibitive if you want a multi-die GPU. Let's hypothetically split the G80's clusters into mini-GPUs of some size (let's say 2 or 4 clusters each). Each mini-GPU now needs to connect to something, and this "northbridge"/"master"/crossbar chip becomes your limiting factor, and it's non-reusable. You'd end up making a few different crossbar chips in order to support some number of clusters with some RAM width, and then create the rest of the SKUs you're interested in by selling disabled versions of the next-highest crossbar. Even if you could build such a beast of a crossbar chip, you can imagine the pin count required would be insane, since it has to connect to every mini-GPU and every memory module.
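To put a (completely made-up) number on the pin problem: the per-link pin counts below are illustrative assumptions, not real figures, but they show that a standalone crossbar chip's I/O grows linearly with everything attached to it:

```python
# Rough pin count for a hypothetical standalone crossbar chip.
PINS_PER_GPU_LINK = 100    # assumed pins for one mini-GPU interface
PINS_PER_MEM_CHANNEL = 64  # assumed pins for one 64-bit DDR channel

def crossbar_chip_pins(mini_gpus, mem_channels):
    return mini_gpus * PINS_PER_GPU_LINK + mem_channels * PINS_PER_MEM_CHANNEL

for n in (2, 4, 8):
    print(n, crossbar_chip_pins(n, n))  # 328, 656, 1312 -- and that's I/O alone
```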
Now to where the server industry has been going for a while, and what the R520 is: NUMA. The R520's ring bus is clearly a NUMA architecture. Every quad is connected to a single client port on the ring. This means that a texture request from a quad could be serviced by the ring stop next to it, or by one as far as 2 stops away. This means that current programs executing on the R520 must already be able to handle the variable latency of requests.
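A minimal sketch of that distance calculation, assuming a 4-stop bidirectional ring:

```python
# Request distance on an N-stop ring (R520-style: 4 stops).
def ring_distance(src, dst, stops=4):
    d = abs(src - dst) % stops
    return min(d, stops - d)  # traffic can travel either way around the ring

print([ring_distance(0, dst) for dst in range(4)])  # [0, 1, 2, 1]
# -> a quad's request is serviced 0-2 stops away, hence the variable latency
```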
Now here comes the fun part: what happens when you cut the R520 into quarters, leaving 1 ring stop and all the clients associated with it, and call that a mini-GPU? Ignoring a bit of control logic, you have 1 fully working mini-GPU that has 1/4th the quads and 1/4th the RAM width, aka a low-end SKU. Want a mid-range SKU? Take 2 of these mini-GPUs and join them in a ring. High-end? Take the original 4. Super-high-end? Take 8 and make a large ring out of them.
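Here's that SKU math as a sketch. The per-stop numbers are my assumptions for what an R520 quarter would look like: 1 quad and a 64-bit slice of the 256-bit bus per mini-GPU:

```python
# Hypothetical SKU lineup built from identical ring-connected mini-GPUs.
QUADS_PER_STOP = 1      # assumed quads per mini-GPU
BUS_BITS_PER_STOP = 64  # assumed memory width per mini-GPU

def sku(stops):
    worst_hops = stops // 2  # farthest stop on a bidirectional ring
    return stops * QUADS_PER_STOP, stops * BUS_BITS_PER_STOP, worst_hops

for stops, name in [(1, "low-end"), (2, "mid-range"),
                    (4, "high-end"), (8, "super-high-end")]:
    quads, bus, hops = sku(stops)
    print(f"{name}: {quads} quad(s), {bus}-bit bus, worst case {hops} hop(s)")
```

Note the trade-off this exposes: the 8-stop super-high-end ring doubles the worst-case distance to 4 hops, so the latency tolerance already built into R520 programs becomes even more important.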
I know I'm ignoring a bit of control logic, scheduling, and other extraneous stuff, but none of that should be too hard to overcome. As far as inter-chip bandwidth goes, to match the current ring bus you "only" need 2 times the DDR bandwidth of a ring stop, meaning about 64 GB/s / 4 * 2 = 32 GB/s. HT3 is already spec'd at roughly 40 GB/s, so keeping up should be no problem.
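Spelled out, with the round numbers I'm assuming:

```python
# The bandwidth arithmetic from above, made explicit.
total_mem_bw = 64.0             # GB/s, assumed total memory bandwidth
per_stop_bw = total_mem_bw / 4  # 16 GB/s of DDR bandwidth per ring stop
ring_link_bw = 2 * per_stop_bw  # 32 GB/s needed per inter-chip ring link
ht3_bw = 41.6                   # GB/s aggregate for a 32-bit HT3 link at 2.6 GHz
print(ring_link_bw, ring_link_bw <= ht3_bw)  # 32.0 True -> HT3 could carry it
```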
In conclusion, the current ATI architecture lends itself to a fully unified, multi-die, NUMA topology. I admit I was skeptical at first, but after thinking through this post I believe it's not only very feasible, but would provide a huge cost advantage. This doesn't mean that Nvidia couldn't produce something similar; it's just that their current architecture wouldn't work for it, while ATI's current architecture can.