AMD: R7xx Speculation

One of the truisms/mantras of engineering is something like "optimize for the common case, not the corner case". Maybe that idea is being applied here as well. Especially when that tiny percentage of high-end buyers has proven that cost isn't a major concern for them, you can start building into your models the idea that you can stick them with the extra memory costs associated with SLI/CF-type implementations.

Well, based on my limited understanding of the factors in play here, I don't think we are at the point where there is a cost or scaling benefit to going this route. The primary advantages seem to be reduced tape-out costs and possibly improved yields. However, isn't there an accompanying significant increase in R&D and manufacturing cost and I/O complexity, and a reduction in performance scalability?

Also, does it even make sense at the lower end of the market? What real benefit is there, if any, in doing, say, a 2xRV610 midrange card instead of a single RV630? Wouldn't the former actually end up being more costly? You can amortize tape-out costs over the life of the chip, but I don't see that happening with the increased BOM costs for this multi-chip design.

Am I missing something obvious here?
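
To make that trade-off concrete, here's a minimal cost sketch. All numbers (NRE, volumes, silicon and BOM costs) are made-up assumptions purely for illustration, not real ATI figures:

```python
# Back-of-envelope for the question above: does the amortized tape-out
# saving beat the extra per-unit BOM of a two-chip midrange card?
# Every number below is an illustrative assumption.

def per_unit_cost(tapeout_cost, unit_volume, silicon_cost, extra_bom):
    """Amortized NRE plus per-unit manufacturing cost."""
    return tapeout_cost / unit_volume + silicon_cost + extra_bom

# Single RV630-class die: carries its own tape-out, simple board.
single_chip = per_unit_cost(tapeout_cost=5e6, unit_volume=5e6,
                            silicon_cost=20.0, extra_bom=0.0)

# 2x RV610-class dies: the tape-out is shared with the low-end part
# (so the midrange card carries only a fraction of it), but the board
# needs two packages, wider routing and possibly duplicated memory.
dual_chip = per_unit_cost(tapeout_cost=5e6 * 0.3, unit_volume=5e6,
                          silicon_cost=2 * 12.0, extra_bom=8.0)

print(f"single-chip midrange: ~${single_chip:.2f}")
print(f"dual-chip midrange:   ~${dual_chip:.2f}")
```

With assumptions like these, the amortized NRE saving is pennies per unit while the BOM delta is dollars, which is exactly the worry raised above.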
 
Has there been any analysis of CrossFire scaling in R600? I get the impression that CF scales much better in R600 than it does in R5xx (when it works, that is).

For what it's worth, as I posted earlier in this thread,

http://forum.beyond3d.com/showpost.php?p=1007968&postcount=15

R700's multiple dies could be just like individual unified shader units in R600. Bear in mind that the R600 architectural diagrams do not necessarily show the correct count of units, apart from ALUs, TUs and RBEs. There are quite a few things that actually appear multiple times and so would be ripe for partitioning, one per die - if indeed R700 is a multi-die GPU.

This, then, would not even be a matter of CF across multiple dies. It would simply be "R600 split across dies", perhaps with a PCI-Express-interface, Command-Overseer, AVIVO, CF-interfacing die to act as the GPU's interface to the outside world (or another multi-chip GPU).

Jawed
 
CF does seem to be scaling better with R600 than R580 but for the most part it has just caught up to SLI. A quick glance at hothardware's or bit-tech's dual-GPU numbers would confirm that.

Jawed said:
This, then, would not even be a matter of CF across multiple dies. It would simply be "R600 split across dies", perhaps with a PCI-Express-interface, Command-Overseer, AVIVO, CF-interfacing die to act as the GPU's interface to the outside world (or another multi-chip GPU).

What happens to caching, texture/framebuffer access, inter-chip communication etc?
 
What happens to caching, texture/framebuffer access, inter-chip communication etc?
The R600 diagram doesn't show it, but I suspect R600 splits its texture units equally around the ring-stops, just as R5xx does.

So, each TU has its own vertex cache, its own L1 cache and, I guess, a portion of L2.

Similarly, R600's RBEs are distributed. Remember that R300 introduced the concept of screen-space tiling (which is the basis of supertiling in CF). This tiling means that there isn't one, but 4 hierarchical-Z buffers, 4 rasterisers, 4 RBEs, etc. in R600. Each tile has its own ring stop. Each tile can run independently of the others, sharing data as necessary. e.g. texels will sometimes be needed by more than 1 TU (which is where L2-cache-to-L2-cache sharing comes in, if indeed that's a capability of R600), or AA resolve will need to share AA samples across screen-space boundaries (which would be where read-write memory-caches would share with each other).
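
A toy sketch of that screen-space tiling idea: the screen is carved into small tiles and each tile is owned by one of N independent units (RBE/rasteriser groups, or whole dies). The tile size and the mapping below are assumptions for illustration, not R600's actual scheme:

```python
# Assign each screen tile to one of UNITS independent owners.
from collections import Counter

TILE = 16          # assumed tile size in pixels
UNITS = 4          # e.g. 4 RBE/tile groups, or 4 dies

def owner(x, y):
    """Which unit rasterises/blends the pixel at (x, y)."""
    tx, ty = x // TILE, y // TILE
    return (tx + ty * 3) % UNITS   # offset per row so columns don't alias

# Example: a 64x64 region spreads evenly across all four units.
counts = Counter(owner(x, y) for x in range(64) for y in range(64))
print(counts)   # roughly the same number of pixels per unit
```

Each owner can work on its tiles independently, which is what makes the "one tile group per die" split look plausible.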

Inter-chip communication? Do you mean the ring bus?

It seems to me that the architecture in R600 is just a few snips away from the basic split approach. It's interesting to ponder what aspects of R600 aren't ready for this yet (well, I can't think of any, but there must be something)... A deeper understanding of the buses in R600 would help.

What's got me dubious is the physical implementation of the ring bus. Each chip would need 2 ring-bus ports, and each port, I suspect, would need to be in the region of 50GB/s in each direction.
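
For scale, the link arithmetic behind a figure like that is just width times clock. The widths and clocks below are assumptions chosen to show the size of the signalling problem, not confirmed R600/R700 numbers:

```python
# Bandwidth of one external ring-bus port, per direction.
def link_bandwidth(width_bits, clock_hz):
    return width_bits / 8 * clock_hz / 1e9   # GB/s

# A 512-bit link at 1 GHz lands in the right region...
print(link_bandwidth(512, 1.0e9))    # ~64 GB/s per direction

# ...but doing that chip-to-chip needs either a very wide, slow bus or
# a much narrower, faster serial link, e.g. 64 bits at 8 GHz:
print(link_bandwidth(64, 8.0e9))     # ~64 GB/s per direction
```

Either way, that is an awful lot of pins or an awful lot of SerDes per port, times two ports per chip.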

Jawed
 
Inter-chip communication? Do you mean the ring bus?

Yeah, just wondering what kind of inter-chip communication would be necessary for this kind of setup and what amount of bandwidth is practical between the chips. The comparison to SLI/CF doesn't help IMO because those methods don't require the chips to share resources or communicate directly.
 
Also, does it even make sense at the lower end of the market? What real benefit is there, if any, in doing, say, a 2xRV610 midrange card instead of a single RV630? Wouldn't the former actually end up being more costly? You can amortize tape-out costs over the life of the chip, but I don't see that happening with the increased BOM costs for this multi-chip design.

Am I missing something obvious here?

Well, then I wouldn't expect to see the scaling point chosen where BOM is going to be a major constraint on your ability to offer a competitive part at a competitive price. It just seems to me the top-to-bottom scope of both performance and price is stretching and stretching and stretching. Even today they aren't really scaling down from a single high-end GPU, because the highest-performance option is actually multi-chip already and has been for over 2 years. So what should be the factors that decide where you make that break between single-chip and multi-chip, when you already know that multi-chip will be your ultimate top end? Certainly yield and tape-out are factors. BOM is too, but your flexibility on BOM is a bit different at different price points as you move up the food chain.

All of which is a long-winded way of saying I'd never expect to see "40 x RV610" as the ultimate performance high-end. :LOL: On the other hand, it is not totally obvious to me that R600/G80 is the best place to make that break either.
 
Also, does it even make sense at the lower end of the market? What real benefit is there, if any, in doing, say, a 2xRV610 midrange card instead of a single RV630? Wouldn't the former actually end up being more costly? You can amortize tape-out costs over the life of the chip, but I don't see that happening with the increased BOM costs for this multi-chip design.

Am I missing something obvious here?

RV610, RV630 and R600 are 3 separate designs, even if they include the same modular parts. You need to create 3 mask sets, and you need 3 different wafers...

Now, if you just design each part of a GPU and then interconnect those parts to build an entry, performance or enthusiast GPU, or even a pure compute part, you have different wafers with different chips used across your entire line of products, thus reducing costs.
 
As well as having to design only one chip instead of three, might there be yield advantages, too? If each individual chip is comparatively small and simple, there's less to go wrong. Or, if something does go wrong, you lose a smaller percentage of the wafer than you do if you have to throw away a large chip.

What would be the cooling implications? Is it easier to cool 4 small chips or one large one (assuming total heat output is similar)? Could smaller, simpler chips be individually clocked higher without going into thermal overload?
 
As well as having to design only one chip instead of three, might there be yield advantages, too? If each individual chip is comparatively small and simple, there's less to go wrong. Or, if something does go wrong, you lose a smaller percentage of the wafer than you do if you have to throw away a large chip.

What would be the cooling implications? Is it easier to cool 4 small chips or one large one (assuming total heat output is similar)? Could smaller, simpler chips be individually clocked higher without going into thermal overload?

Yield-wise, smaller chips can go through some validation prior to being packaged, so it's possible to cull out bad chips that would have led to an entire larger die being discarded.

It's also possible to get better bin splits. Smaller geometries have a larger problem with component variation. There was a nifty map somewhere where someone used a color plot to show the clock speeds of various chips on a wafer.

Larger chips are more likely to have areas with transistors that can't hit high clocks, or that leak.
Smaller chips can be mixed and matched.
Separate chips do have to face a higher amount of communication latency, so it may be worth having a larger, lower-clocked chip if communication delay cuts performance down.
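
To put numbers on the yield argument, here is a minimal sketch using the simple Poisson defect model (yield = exp(-defect density x area)). The defect density and die areas are assumptions for illustration only:

```python
import math

D = 0.4          # assumed defects per cm^2
big_die = 4.2    # cm^2, one monolithic high-end die (assumed)
small_die = 1.1  # cm^2, one quarter-size building-block die (assumed)

y_big = math.exp(-D * big_die)
y_small = math.exp(-D * small_die)

print(f"monolithic die yield:   {y_big:.1%}")
print(f"small die yield:        {y_small:.1%}")

# Four known-good small dies can be tested and culled before packaging,
# whereas the monolithic part needs all four "quadrants" to be clean at
# once, which is what exp(-D * 4 * small_die) expresses:
print(f"all-four-clean yield:   {math.exp(-D * 4 * small_die):.1%}")
```

The per-die yield gap (roughly 64% vs 17% with these assumed numbers) is the core of the argument, on top of the bin-split flexibility mentioned above.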


Cooling:

Another source of heat is that a large number of well-connected but separate chips will be driving a lot of signal I/Os, and those drivers generate heat of their own.

The only examples I have to go by are the POWER MCMs, but those can burn nearly a kilowatt of power.

The cost of MCM packaging is another consideration. It's not cheap to make special packaging, and those can get complex and expensive.
 
psurge> Clovertown doesn't have a die-to-die connection; it relies on the "Netburst" FSB for that.

I know - I mean it as an example of a product where multiple chips are used in a single package (not of die-to-die interconnects). I was wondering whether, if one combines that sort of setup with a die-to-die interconnect, some or much of the latency and power penalty of communicating across a PCB is eliminated...
 
RV610, RV630 and R600 are 3 separate designs, even if they include the same modular parts. You need to create 3 mask sets, and you need 3 different wafers...

Now, if you just design each part of a GPU and then interconnect those parts to build an entry, performance or enthusiast GPU, or even a pure compute part, you have different wafers with different chips used across your entire line of products, thus reducing costs.

Yeah, that's exactly what we're discussing here: whether the reduction in tape-out costs is greater than the increase in the other costs required to make this approach work. And of course this isn't happening in a vacuum - you still need to maintain a certain level of performance in order to be able to price these things competitively. It doesn't help if your costs are lower than the competition's if you can't compete on performance. I can see this happening down the road when process technology advancements slow down, but I just don't see the impetus for it now. What would G80 be at 65nm? Somewhere near the size of G71?
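
For scale, a quick ideal-shrink estimate. It assumes perfect area scaling with the square of the node ratio and the commonly cited die sizes (~484 mm^2 for G80 at 90nm, ~196 mm^2 for G71), all of which are rough assumptions:

```python
# Ideal optical shrink of G80 from 90 nm to 65 nm.
g80_area_90nm = 484.0                  # mm^2, assumed
shrink = (65.0 / 90.0) ** 2            # ideal area scaling factor (~0.52)
g80_area_65nm = g80_area_90nm * shrink

print(f"ideal 65 nm G80: ~{g80_area_65nm:.0f} mm^2")   # ~250 mm^2
# Still noticeably bigger than G71 (~196 mm^2), and real shrinks rarely
# hit the ideal factor because pads and analogue blocks don't scale.
```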
 
Well, G80 has already taken a baby step toward multi-chip GPUs by breaking some functionality out into the NVIO chip. So it wouldn't surprise me if they have a multi-chip/multi-core GPU in the works. When it'll happen, though, is the big unknown.

Regards,
SB
 
I dunno, this multi-chip thing doesn't sound good to me. ATI has had a knack of late for introducing expensive, unneeded features well before they're relevant. Who's to say this isn't another?
 
But isn't it already separated to an extent? At least the diagram would imply that the SPUs and ROPs are four separate entities.
Yes, sure. But I'm not sure R600 is that much different from, say, R580, R520, G80 or G70. After all, those also made it possible to disable major blocks for cheaper versions, so there were definitely separate entities there as well, right?

On the other hand, I'd also think that full-speed texturing from non-local memory would require significantly more latency tolerance than would otherwise be the case, so that would adversely affect performance/mm2.
Yes, just think back to Bob's comment in a previous thread: it can take around 100 clock cycles just to cross a die and back. With a multi-die arrangement, this alone would almost double, even if the individual dies were smaller than one big monolithic die (because you won't have a square form factor anymore).
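
A rough, purely illustrative sketch of why the round trip roughly doubles. Every number below (mm covered per cycle on a pipelined wire, die width, package hop distance, SerDes overhead) is an assumption, not Bob's figure:

```python
# Compare a cross-die round trip with one that also hops to another die.
mm_per_cycle = 1.0    # assumed reach of a repeatered on-chip wire per cycle
die_width_mm = 20.0   # assumed width of a large monolithic die
package_hop_mm = 30.0 # assumed extra distance for a die-to-die hop
serdes_cycles = 20    # assumed serialise/deserialise + sync cost per crossing

round_trip_monolithic = 2 * die_width_mm / mm_per_cycle
round_trip_multi_die = round_trip_monolithic \
    + 2 * (package_hop_mm / mm_per_cycle + serdes_cycles)

print(round_trip_monolithic)   # ~40 cycles of pure wire delay
print(round_trip_multi_die)    # ~140 cycles once the extra hop is added
```

The extra cycles have to be hidden by even more latency tolerance (more threads in flight, bigger buffers), which is the performance/mm^2 worry raised in the quote above.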

Is chip-to-chip latency significantly reduced by placing multiple chips into a single package?
My guess would be that the latency is more in the logic that's required to prepare the data to send and to recover it on receive than in the delay on the wires: in the former case, you're talking clock cycles at a time; in the latter, it's just wave propagation.
That said, I assume there are some aspects that are easier if you have tight control over placing the chips on the same substrate (like Xenos).

Is it possible to build a compute tile (where in this case a compute tile is an entire GPU) where each tile connects to the top/bottom/left/right tile on the same wafer? That way, maybe one could actually cut different-sized dies out of a wafer - you'd cut, say, 2x2 tiles for high-end dies, 1x1 for low-end, 1x2 for midrange.
The major advantage would be the ability to have massive buses between tiles. I think it could work in theory, but there's probably a big pile of practical complications: you'd need a bunch of strong drivers to cross the 150um die scribe, and you'd have to deal with power distribution, testability methodology, layout methodology, ESD issues, etc., plus much more that I don't know about.

But some of you old timers know that I've been expecting this kind of thing that Inq is suggesting re R700 to become common for two years or more.
My bet would be on 'more'. We're currently not even at 65nm for the high-end.
 
you'd need a bunch of strong drivers to cross the 150um die scribe, and you'd have to deal with power distribution, testability methodology, layout methodology, ESD issues, etc., plus much more that I don't know about.

Hmm, well, I'm well out of my depth, but that actually sounds like an interesting idea. Instead of offset-stacking them, would it make sense to build two "chips" -- a compute chip and an interconnect one -- and then stagger/stack them:

...........===.........===
======..======..======

(please ignore alignment-dots :( )

Layout of a ring bus probably scales a tad further than layout of a crossbar, which is kind of an interesting, random thought....

-Dave
 
Hmm, well, I'm well out of my depth, but that actually sounds like an interesting idea. Instead of offset-stacking them, would it make sense to build two "chips" -- a compute chip and an interconnect one -- and then stagger/stack them:

...........===.........===
======..======..======
(BACKSTAGE, the scream of the packaging expert, who now has to invent a completely new packaging technique.)

In a word: no. The top of the die is already 'welded' upside down onto the substrate; your connection die would have nothing to connect to. In fact, the substrate is already the agent that can do what your connect chip is supposed to do.

Layout of a ring bus probably scales a tad further than layout of a crossbar, which is kind of an interesting, random thought....
For some it's windmills, for others it's the ring bus: I'm always more than willing to reopen the debates on my favorite topic, but I'm not convinced others would agree, so let's leave it at that. ;)
 
The major advantage would be the ability to have massive buses between tiles. I think it could work in theory, but there's probably a big pile of practical complications: you'd need a bunch of strong drivers to cross the 150um die scribe, and you'd have to deal with power distribution, testability methodology, layout methodology, ESD issues, etc., plus much more that I don't know about.

Like a 1024-bit ring bus link, you mean? :) Thanks for indulging a software guy with some gory HW details, BTW. Assuming for the sake of pipe dreaming that the issues you mention were surmountable, how hard would it be to decently test a tile before cutting out the final dies? I was thinking that ideally you might want to cut up the wafer based on knowledge of whether individual tiles were functional or not, e.g. if you had 1 defective tile in a 2x2 grid, you'd turn that into a 1x1 and a 1x2 die or 3 1x1 dies, rather than having a SKU for 2x2 tile dies with 1 defective tile.

The other possible advantage I was thinking of was that you should be able to react to changing demand more quickly: if nobody wants your high-end die because the competition just released the MegaBlaster 10000, you don't suddenly have a bunch of useless high-end wafers on your hands (or still making their way through the fab) - just cut them up differently to make mid-range dies...
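
As a toy illustration of that defect-aware cutting idea: given a 2x2 block of tiles with a known defect map, salvage the largest clean dies you can. Purely a sketch of the logic; as the reply below notes, a real saw can't actually cut blocks out individually like this:

```python
# Decide what dies to salvage from one 2x2 block of tiles.
def partition_2x2(defects):
    """defects: set of (row, col) positions of bad tiles in a 2x2 block."""
    if not defects:
        return ["2x2"]
    if len(defects) == 1:
        # The row without the defect survives as a 1x2 die,
        # the good tile next to the defect becomes a 1x1.
        return ["1x2", "1x1"]
    # More than one defect: salvage each remaining good tile as a 1x1.
    good = 4 - len(defects)
    return ["1x1"] * good

print(partition_2x2(set()))             # ['2x2']
print(partition_2x2({(0, 1)}))          # ['1x2', '1x1']
print(partition_2x2({(0, 0), (1, 1)}))  # ['1x1', '1x1']
```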
 
Assuming for the sake of pipe dreaming that the issues you mention were surmountable, how hard would it be to decently test a tile before cutting out the final dies?
Normally, dies are tested after cutting, but before packaging. Cutting by itself can put some mechanical stress on the die and result in failures. But there's no reason why you can't test before cutting. Your only problem would be that you may have larger fallout in after-packaging testing, which will increase overall cost.

Much more of a problem: you can't cut out individual dies because you're using a circular saw that cuts across the whole wafer. (Basically, you first attach the wafer to a plastic that's stretched flat on a hollow ring. The saw will cut deep enough to separate the dies, but it doesn't cut through the plastic. Individual dies are separated only 150um from each other, so the saw is extremely precise.)

I was thinking that ideally you might want to cut up the wafer based on knowledge of whether individual tiles were functional or not, e.g. if you had 1 defective tile in a 2x2 grid, you'd turn that into a 1x1 and a 1x2 die or 3 1x1 dies, rather than having a SKU for 2x2 tile dies with 1 defective tile.
Your only option would be to first test all the dies and then make some trade-off to decide how many 2x2 SKUs with defective pieces are acceptable, and modify the cutting process accordingly. This would require a major change in the existing flow (right now, there is simply no control link between cutting and testing.)

The other possible advantage I was thinking of was that you should be able to react to changing demand more quickly: if nobody wants your high-end die because the competition just released the MegaBlaster 10000, you don't suddenly have a bunch of useless high-end wafers on your hands (or still making their way through the fab) - just cut them up differently to make mid-range dies...
I'm sure both competitors also read TheInq religiously. They should be able to figure out that a competitor is about to release a new high-end killer at least a month or two up front, no? ;)

It's a neat idea, but I'm afraid it isn't going to happen anytime soon.
 
A multi-chip architecture seems like it would need to be based on at least two "base chips".

The baby chip would have only 1 ring bus port (due to limited die size), so it could talk to at most one neighbour. This would result in 1-chip and 2-chip GPUs. It wouldn't be a ring bus as such, just an interconnect. A dedicated port for PCI Express would be needed on each chip. It might be simplest to make each chip have an 8-lane port so that the two, together, share the 16-lane PCI Express bus to the CPU. Each chip might have a 32-bit or a 64-bit port to memory.

The midrange chip would have enough area to support 2 ring bus ports. Memory would be a 64-bit port. This chip would form the basis of all the mainstream and better GPUs.
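
A small sketch of how such a two-base-chip line-up might tally up. The chip parameters follow this post where stated (ports, memory widths); the product mixes and the four-chip ring are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    ring_ports: int
    mem_bits: int

BABY = Chip("baby", ring_ports=1, mem_bits=64)   # could also be 32-bit
MID  = Chip("mid",  ring_ports=2, mem_bits=64)

PRODUCTS = {
    "entry":       [BABY],
    "mainstream":  [BABY, BABY],          # point-to-point link, no ring
    "performance": [MID, MID],
    "enthusiast":  [MID, MID, MID, MID],  # closed ring of four (assumed)
}

for name, chips in PRODUCTS.items():
    links = sum(c.ring_ports for c in chips) // 2   # each link joins two ports
    print(f"{name:12s} {len(chips)} chips, "
          f"{sum(c.mem_bits for c in chips)}-bit aggregate memory, "
          f"{links} inter-chip link(s)")
```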

Jawed
 
Yes, just think back to Bob's comment in a previous thread: it can take around 100 clock cycles just to cross a die and back. With a multi-die arrangement, this alone would almost double, even if the individual dies were smaller than one big monolithic die (because you won't have a square form factor anymore).
That's a round trip with a loaded ring bus, right?
There are examples of more straightforward signal passage crossing wide dies in tens of clock cycles (possibly less) on chips with much higher clocks.
 