How about a little R700 speculation?

http://www.theinquirer.net/default.aspx?article=35818

Not our favorite source, certainly, but against type for them as well, as they tend to like ZOMG bigger, bigger, faster, faster exaggerations rather than this kind of thing.

ATI R600 will represent the last of its breed, the monolithic GPU.

It will be replaced by a cluster of smaller GPUs with the R700 generation. This is the biggest change in the paradigm since 3Dfx came out with SLI, oh so many years ago.


It takes a good bit of software magic to make this work, but word has it that ATI has figured out this secret sauce. What this means is R700 boards will be more modular, more scalable, more consistent top to bottom, and cheaper to fab. In fact, when they launch one SKU, they will have the capability to launch them all. It is a win/win for ATI.

It would take a good bit of software magic to work as reliably as it should to be a mass market thing. But then the trend has been in this direction for some time, so I'm unwilling to dismiss this out of hand as the Inq smoking the good sh*t again. Particularly for AMD, as this type of modular approach would potentially be supportive of Fusion, and it even matches up with what we all thought was "artistic license" in some of their recent drawings of various usage scenarios, where, for instance, in a graphics-heavy environment they show 4 GPU blocks...
 
The most obvious problem with such a setup would be the problem of retaining scalability of vertex/geometry shader performance. If you partition the framebuffer to maximize pixel shader scalability, you need to either run vertex shading in every core (breaks scalability, as your polygon rate will never be able to exceed what a solitary core on its own can deliver) or distribute vertex shader results across all cores (requires some fairly massive buses; this is fine with on-chip buses in monolithic chips, but sounds quite painful - both to coordinate and to actually move around - once you need to cross chip boundaries.) This may not be the end of the world if you are only partitioning the framebuffer across 2-4 cores, but the smaller you intend to make each individual core, the worse it gets.

There is also the issue that you may need to replicate textures, vertex arrays etc to the local memory modules of each core; this would act as a cost multiplier for onboard memory as well.
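To put rough numbers on both points, here's a back-of-envelope sketch. Every figure in it (vertex count, per-vertex size, number of cores, asset footprint) is an assumption for illustration only, not anything known about R700:

```python
# Back-of-envelope sketch of the two costs described above.
# All numbers are illustrative assumptions, not R700 specifics.

verts_per_frame = 2_000_000        # assumed scene complexity
bytes_per_xformed_vert = 64        # position plus a handful of attributes
fps = 60
cores = 4                          # GPU chips sharing the framebuffer

# Option A: distribute vertex shader results to every other core.
# Worst case, each transformed vertex is needed by every core except the
# one that produced it, so it crosses the inter-chip link (cores - 1) times.
inter_chip_bytes = verts_per_frame * bytes_per_xformed_vert * (cores - 1) * fps
print(f"vertex distribution traffic: {inter_chip_bytes / 1e9:.1f} GB/s")

# Option B: replicate textures/vertex arrays in each core's local memory.
unique_assets_mb = 256             # assumed unique texture + vertex data
print(f"replicated asset footprint: {unique_assets_mb * cores} MB across {cores} chips")
```

Even with fairly modest assumptions the chip-to-chip traffic lands in the tens of GB/s, which is trivial for an on-chip bus but painful over package pins, while the replicated memory footprint scales linearly with the number of chips.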
 
If this thing is possible...

What if it were connected directly to the CPU core as a modular part of the CPU? That way they wouldn't just get a GPU product, they'd get another building block for their CPU products too. Imagine if the integrated GPU module on the CPU core could itself be modular: if you need more graphics power, just add more modules. If so, this would be one of the main reasons AMD bought ATi.

The next few years could get a lot more interesting :eek:

Edit: typo and add some points...
 
The most obvious problem with such a setup would be the problem of retaining scalability of vertex/geometry shader performance. If you partition the framebuffer to maximize pixel shader scalability, you need to either run vertex shading in every core (breaks scalability, as your polygon rate will never be able to exceed what a solitary core on its own can deliver) or distribute vertex shader results across all cores (requires some fairly massive buses; this is fine with on-chip buses in monolithic chips, but sounds quite painful - both to coordinate and to actually move around - once you need to cross chip boundaries.) This may not be the end of the world if you are only partitioning the framebuffer across 2-4 cores, but the smaller you intend to make each individual core, the worse it gets.

There is also the issue that you may need to replicate textures, vertex arrays etc to the local memory modules of each core; this would act as a cost multiplier for onboard memory as well.

Well, the reference to "secret sauce" on the software side is intriguing. Have they figured out an approach to address some of these issues in part or in whole, or just wishful thinking?
 
Maybe that'll be practical in the future with optical interconnects? :p

About replicated vertex work: why not? 3dfx SLI (Voodoo2 and VSA-100) did the triangle setup redundantly. That's quite inefficient on a Voodoo5 6000 or the octo-VSA-100 monsters, but acceptable if what you really need is more pixel power for high resolutions and you don't mind the high end being less efficient (after all, it's still faster, and it's meant to be expensive).
Texture replication: throw RAM quantity at it (the very same thing happens with current SLI/CrossFire).

The thing is, though, I'm kind of describing current multi-GPU...
 
I wanted to post that first, but you beat me to it. I have a question: will these GPUs be multicore or multichip? All this explains why AMD showed two CPU cores next to two GPU cores on Fusion.
 
How different is this, really, from a GPU with a set of clusters? You have to figure out how to handle rasterization, and you pull the L2 cache a bit further away, but this struck me as the future of SLI back when I was reading the patents....

or distribute vertex shader results across all cores (requires some fairly massive buses; this is fine with on-chip buses in monolithic chips, but sounds quite painful

Stuff the rasterizer and Z in a "master" and distribute the pixel load.... In a perfect world the total bandwidth would be ~ the size of the framebuffer plus overdraw, no? A few Gb/s? Big, but not unmanageable.... After all, the current SLI connect needs to shuffle the completed framebuffer around, so the difference is mainly in overdraw plus whatever load-balancing the geometry-shader output requires.
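For what it's worth, the framebuffer-plus-overdraw figure does land in that ballpark with some assumed numbers (resolution, colour depth, overdraw and frame rate below are guesses for illustration):

```python
# Rough estimate of the "framebuffer plus overdraw" traffic mentioned above.
# Resolution, colour depth, overdraw and frame rate are assumptions.
width, height = 1600, 1200
bytes_per_pixel = 4 + 4            # colour + Z
overdraw = 3                       # average depth complexity
fps = 60

bytes_per_frame = width * height * bytes_per_pixel * overdraw
print(f"per frame : {bytes_per_frame / 2**20:.0f} MiB")
print(f"sustained : {bytes_per_frame * fps / 1e9:.1f} GB/s")
```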
 
Ooh, this is going to be fun to speculate about :!: :D

The most obvious problem with such a setup would be the problem of retaining scalability of vertex/geometry shader performance. If you partition the framebuffer to maximize pixel shader scalability, you need to either run vertex shading in every core (breaks scalability, as your polygon rate will never be able to exceed what a solitary core on its own can deliver) or distribute vertex shader results across all cores (requires some fairly massive buses; this is fine with on-chip buses in monolithic chips, but sounds quite painful - both to coordinate and to actually move around - once you need to cross chip boundaries.)
Yeah, we know that Supertiling is often not much of a performance gain in CrossFire, because of the "duplicate computation" and "duplicated data" problems.

This may not be the end of the world if you are only partitioning the framebuffer across 2-4 cores, but the smaller you intend to make each individial core, the worse it gets.
My first thought here is that you have a hierarchy of ring-buses. You have a ring-bus for chip-to-chip data sharing, with each chip having a smaller-scale ring-bus (smaller scale compared with R580) because it interfaces to fewer memory chips, i.e. needs fewer ring stops.

The smallest it's possible to make each core is the width of a single memory channel, I suppose: 32 bits with GDDR3/4. I.e., if we assume a 256-bit GDDR4 R700, then that's eight cores.

At this point I can't help observing that one way of easily building enthusiast GPUs with 512-bit buses is to use multiple cores. That's what SLI/Crossfire already does, in effect. The total area of these cores (which will have to include communications to other cores) will be far higher than a single core with a 512-bit bus, but wafer yield should be much easier when each core is smaller.
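The yield argument can be made concrete with a simple defect-density model; the defect rate and die areas below are assumptions, not foundry data:

```python
import math

# Simple Poisson yield model: yield = exp(-defect_density * die_area).
# Defect density and die areas are illustrative assumptions only.
DEFECTS_PER_CM2 = 0.5

def die_yield(area_mm2: float) -> float:
    return math.exp(-DEFECTS_PER_CM2 * area_mm2 / 100.0)

big_die = 400      # mm^2, one monolithic 512-bit GPU
small_die = 60     # mm^2, one of eight cores, including inter-chip link overhead

print(f"monolithic die yield: {die_yield(big_die):.0%}")     # ~14%
print(f"small core yield    : {die_yield(small_die):.0%}")   # ~74%
# Eight small cores use more total silicon (480 vs 400 mm^2), but each one
# yields far better, which is exactly the trade-off described above.
```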

In R600 I'm expecting to see each cluster having to distribute VS/GS results to those other clusters that need to rasterise the resulting fragments. Presumably G80 already does this, but it appears to have a single rasteriser for the entire GPU. R600, I guess, distributes the rasterisers, one per cluster. This means that R600 will move around significantly less data twixt clusters, since it's per-vertex data (coordinates+attributes) rather than per-fragment data that needs distributing.
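A quick byte count shows why moving per-vertex data between clusters should be much cheaper than moving per-fragment data; the scene complexity, overdraw and attribute sizes below are assumptions for illustration:

```python
# Compare inter-cluster traffic: per-vertex data (distributed rasterisers)
# versus per-fragment data (a single central rasteriser). All figures are
# illustrative assumptions.
verts_per_frame = 500_000
bytes_per_vertex = 64              # coordinates + attributes
pixels = 1600 * 1200
overdraw = 3                       # average depth complexity
bytes_per_fragment = 32            # interpolated attributes per fragment

vertex_traffic = verts_per_frame * bytes_per_vertex
fragment_traffic = pixels * overdraw * bytes_per_fragment

print(f"distribute vertices : {vertex_traffic / 2**20:.0f} MiB per frame")
print(f"distribute fragments: {fragment_traffic / 2**20:.0f} MiB per frame")
```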

There is also the issue that you may need to replicate textures, vertex arrays etc to the local memory modules of each core; this would act as a cost multiplier for onboard memory as well.
G80 seems to have 8x large L1 caches already (one per cluster), 8KB or 16KB, not sure - in addition to L2 (which is split into 6x equal-sized lumps, one per memory channel). It would seem that data will end up being replicated to several L1s quite a lot of the time, simply through the sharing of textures/constants/vertices by multiple threads in flight. Replication of data seems to be a cost of D3D10 at the high end, as pipeline utilisation is more important.

Jawed
 
This makes perfect sense from a yield standpoint; now their base-level design can match the most efficient die area pretty closely.

And I've been thinking that with the advent of unified shaders, even distributing work across the chips shouldn't be nearly as bad as with a traditional GPU. There could be a master chip which dedicates all of its shader units to doing vertex shader work and then ships the data off to the other chips for fragment processing. Essentially it might look a lot like AMD's HyperTransport connecting the chips, with each of them having a small pool of memory they operate in/from but don't need to allocate every texture or the whole framebuffer into, since they can fetch those from their neighbors if necessary.

Of course making this scale efficiently beyond 2 chips would probably take a whole heck of a lot of work.
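As a toy model, the split being described might look something like this; the function names, the round-robin load balancing and everything else here are hypothetical, just to make the data flow concrete:

```python
# Toy model of the master/slave split described above: one chip does all the
# vertex shading, the others receive transformed triangles and shade fragments.
# Purely a hypothetical illustration, not a real driver or API.

def vertex_shade(triangles):
    # Master chip: run the vertex shader on every triangle exactly once.
    return [f"xformed({t})" for t in triangles]

def fragment_shade(chip_id, tri):
    # Slave chip: rasterise and shade only the triangles handed to it.
    return f"chip{chip_id}: pixels of {tri}"

def render_frame(triangles, slave_chips=3):
    xformed = vertex_shade(triangles)                 # done once, on the master
    results = []
    for i, tri in enumerate(xformed):                 # naive round-robin balancing
        results.append(fragment_shade(i % slave_chips, tri))  # ship over the HT-like link
    return results

print(render_frame(["t0", "t1", "t2", "t3", "t4"]))
```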
 
Is there a possibility that we may see a little bit of swarm processing in action with the next generation? Multiple processors swarming in to work on the rendering that requires the most time. Sort of auto load balancing?
 
And I've been thinking that with the advent of unified shaders, even distributing work across the chips shouldn't be nearly as bad as with a traditional GPU. There could be a master chip which dedicates all of its shader units to doing vertex shader work and then ships the data off to the other chips for fragment processing.
Eh? You've just described a non-unified multi-chip traditional GPU. What's the point of that?

Jawed
 
Let's say that going unified does not seem to be the smartest thing you can do if you suddenly want to distribute your geometry workload...
 
You could do a PowerVR: tile and bin.
But since vertex shaders can do anything to the input values, you would have to do it post-transform.
I guess you could use something like stream out to buffer the entire scene (or portions of it) post-transform and use the same unified chips to cycle between vertex and pixel work.
<shrug>
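Just to make the cycling idea concrete, here's a rough sketch with a bounded post-transform buffer; the buffer size, vertex size and scene size are assumptions, not anything from a real part:

```python
# Toy model of the stream-out idea above: transform a chunk of the scene into
# a bounded post-transform buffer, switch the unified units to pixel work on
# that chunk, and repeat until the scene is consumed. Purely illustrative.

BUFFER_BYTES = 2 * 1024 * 1024           # assumed on-chip buffer, ~2 MB
BYTES_PER_XFORMED_VERT = 64              # assumed "fat" post-transform vertex

def render_scene(vertices):
    max_verts = BUFFER_BYTES // BYTES_PER_XFORMED_VERT
    passes = 0
    for start in range(0, len(vertices), max_verts):
        chunk = vertices[start:start + max_verts]
        transformed = [("xformed", v) for v in chunk]    # vertex phase (stream out)
        _pixels = [("shaded", t) for t in transformed]   # pixel phase over this chunk
        passes += 1
    return passes

print(render_scene(list(range(150_000))), "passes")      # -> 5 passes with these numbers
```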
 
Yeah... it just gets very bandwidth-hungry; a post-transform vertex can be very fat :)
EDIT: even though in the end it could eat a few gigabytes per second on complex scenes, that's nothing we can't afford in the near future...
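The gigabytes-per-second figure is easy to sanity-check with assumed numbers (vertex size and scene complexity below are guesses):

```python
# Quick check of the "few gigabytes per second" figure for streamed-out
# post-transform vertices. Vertex size and scene complexity are assumptions.
verts_per_frame = 1_000_000
fps = 60
for bytes_per_vert in (64, 128):         # how "fat" the post-transform vertex is
    gb_per_s = verts_per_frame * bytes_per_vert * fps / 1e9
    print(f"{bytes_per_vert} B/vertex -> {gb_per_s:.1f} GB/s")
```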
 
Yeah... it just gets very bandwidth-hungry; a post-transform vertex can be very fat :)

Yes, I agree, but if we assume they bound the output buffer and potentially loop over the scene many times to complete it, it seems like it might not be a bad use for on-chip or on-die specialised RAM; 2 or 3 MB would probably allow you to process most scenes in under 5 or 6 passes.

I agree though it's a lot of messing around.
 
As soon as I read the article I sent Charlie a message asking him whether this was multicore or multichip, and he said that it is multichip. Now the question is what will happen with Fusion: how will AMD design its GPU? Since the timeframe of Fusion will be the timeframe of R700, there is still a lot to find out. And please note that what he said was just a prediction and the actual design could vary.
 