AMD: R7xx Speculation

The low-end part needs full-speed video-decode and should be able to deal with a full-width PCIe connection.
With PCI Express 2.0 providing 2x the bandwidth per lane, and with the inherently low performance of a $50 part, do you think full 16x lane PCI Express is a priority for a single chip?

I don't.

Jawed
 
I can see why a multi-die solution has certain merits, but if we assume for a moment that the data presented here is anywhere close to reality, the result will overall be a disappointment.

It cannot be disputed that a multi-die solution has a number of inefficiencies that can only be overcome, if at all, by throwing a lot of hardware at it:
  • Memory controller area can only scale in lockstep with the amount of shader power
  • Inter-die communication will eat up a bunch of die area (and power)
  • Inter-die access latency will require much larger FIFOs across the board
  • In the case of an MCM: a much more complex package without tangible benefits for heat removal
  • Cheap (one-die) solutions will have to carry the overhead of multi-die monsters

The arguments in favor are mostly better yields and lower development costs, because you only need to develop one die.

At a time when gross margins are at a record high even though die sizes are too (!), I don't understand the fascination with bad yields. By now, the recipe to combat this should be well known: intra-unit redundancy (like the ALU redundancy in R600, redundant memory columns and/or rows, etc.) and inter-unit redundancy (like the TCPs and memory controllers in G80).
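Just to illustrate why that recipe pays off, here's a minimal sketch assuming a Poisson defect model and a completely made-up floorplan (none of these are real R600/G80 numbers):

```python
# Minimal sketch of the redundancy argument, assuming a Poisson defect
# model and an invented floorplan: N identical units plus one spare, so
# the die survives any single bad unit. Illustrative numbers only.
from math import exp

def block_yield(area_mm2, d0=0.3):      # d0: assumed killer defects per cm^2
    return exp(-d0 * area_mm2 / 100.0)

n_units, unit_area, rest_area = 16, 12.0, 100.0   # hypothetical floorplan
p = block_yield(unit_area)                        # one unit comes out clean

no_spare  = block_yield(rest_area) * p ** n_units
one_spare = block_yield(rest_area) * (p ** (n_units + 1)
            + (n_units + 1) * (1 - p) * p ** n_units)   # build N+1, need N good

print(f"no spare unit:  {no_spare:.1%}")
print(f"one spare unit: {one_spare:.1%}")
```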

Let's take the imaginary number of 4 dies and assume (hope!) that the solution AMD is shooting for is not based on a cooperative software scheme like Crossfire, which requires careful per-game tuning and has inherent inefficiencies.
It's not unreasonable that the total overhead to make all that work efficiently will be 10% compared to a single-die solution. 30mm2 is a lot of area to work with to add additional redundancy on a die! Even better: use 5mm2 on the redundancy and the remaining 25mm2 for additional performance. ;)
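To put rough numbers on that 10%, assuming (purely for illustration) that the 30mm2 corresponds to a ~300mm2 single-die equivalent:

```python
# Back-of-envelope version of the 10% argument above. The ~300mm2
# single-die equivalent is an assumption, not a known R7xx figure.
monolithic_mm2 = 300.0
overhead_frac  = 0.10
num_dies       = 4

overhead_mm2 = monolithic_mm2 * overhead_frac          # ~30mm2 of "glue"
per_die_mm2  = (monolithic_mm2 + overhead_mm2) / num_dies

print(f"inter-die overhead: {overhead_mm2:.0f} mm2")
print(f"area per die:       {per_die_mm2:.1f} mm2")    # ~82.5mm2 each
```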

Even if AMD goes for the multi-die solution, I can't believe they would use the same chip for the low-end part. There must be tons of elements within a GPU that are sized either for performance or for area efficiency.

Multi-chip solutions may become a necessity longer term to manage excessive heat, but I don't think we're at that point yet; and when we do get there, the inter-chip interface will be even more complex, since a multi-die package won't solve anything in that case.

With all that, I don't see why I should be excited about multi-die?
 
It's not clear to me why multichip should really help with heat.

Multi-board sort of does because you have two coolers dissipating a given amount of heat.
An MCM would just have 4 chips with 1/4 the total output each under one cooler, same as one chip 4x as large.
Actually, it would likely be worse with the MCM, depending on the power draw of the MCM interconnect.

I don't see any particular reason to be excited about multichip, either.

I think it's mostly a judgement call on the manufacturability of chips at future nodes, where design costs, the expense of wafer starts, defects, reliability, and variability are expected to worsen significantly.

It certainly won't help performance, just allow a given amount of performance to be feasible for production.
 
It cannot be disputed that a multi-die solution has a number of inefficiencies that can only be overcome, if at all, by throwing a lot of hardware at it:
  • Memory controller area can only scale in lockstep with the amount of shader power
That's probably what you want though, isn't it? Memory system performance should scale with GPU capability. Obviously there are tweaks within performance categories, e.g. between DDR2 and GDDR3 memory speeds.

  • Inter-die communication will eat up a bunch of die area (and power)
  • Inter-die access latency will require much larger FIFOs across the board
Both of these issues also arise with a monolithic, 400mm2+ GPU. I agree that the "wastage" on multi-chip is higher than with monolithic. It'd be interesting to draw curves for both types and see where they meet - I presume they would meet. 600mm2 for the monolithic GPU? 800mm2?... With leakier, finer processes presumably the priority is to keep die size down (or "transistor count" down), so perhaps a collection of smaller dies is an advantage?
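A crude way to sketch those curves, assuming a Poisson yield model with a made-up defect density and a 10% area tax for the inter-die glue:

```python
# Illustrative only: a classic Poisson yield model with an assumed defect
# density, ignoring redundancy, binning and test/packaging costs. The 10%
# factor is the assumed area tax for inter-die interfaces.
from math import exp

D0 = 0.3                                   # assumed killer defects per cm^2

def die_yield(area_mm2):
    """Fraction of dies with zero killer defects."""
    return exp(-D0 * area_mm2 / 100.0)

for total_mm2 in (300, 400, 600, 800):
    small_mm2 = total_mm2 * 1.10 / 4       # one of four dies, plus glue
    print(f"{total_mm2:3d} mm2 equivalent: monolithic {die_yield(total_mm2):5.1%}, "
          f"per small die {die_yield(small_mm2):5.1%}")

# With known-good-die testing you only throw away the small defective dies,
# which is where the multi-die yield argument comes from; redundancy on a
# monolithic die pulls the curves back together.
```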

What about chip stacking? Does a multi-chip solution have an eye on the future where stacking becomes possible, with the temptation of huge inter-chip bandwidth?


Let's take the imaginary number of 4 dies and assume (hope!) that the solution AMD is shooting for is not based on a cooperative software scheme like Crossfire, which requires careful per-game tuning and has inherent inefficiencies.
It's not unreasonable that the total overhead to make all that work efficiently will be 10% compared to a single-die solution. 30mm2 is a lot of area to work with to add additional redundancy on a die! Even better: use 5mm2 on the redundancy and the remaining 25mm2 for additional performance. ;)
Supposedly ATI already has across-the-die fine-grained redundancy - not just in ALUs/RF but in all the major functional blocks. So ATI doesn't have any more to gain, they've reached the knee on that redundancy graph.

Even if AMD goes for the multi-die solution, I can't believe they would use the same chip for the low-end part. There must be tons of elements within a GPU that are sized either for performance or for area efficiency.
The same argument applies to the entire history of GPUs as far as I can tell. e.g. I'm convinced that the amount of texturing latency-hiding in the older GPUs - created by having "long pipelines" - was fixed at some "average worst case" and then not scaled as it was implemented across the entire range of GPUs based on that architecture, from the smallest to the biggest, no matter the ratios of ALU:memory clocks, or pipeline-length:typical-texturing latency.
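A toy sizing calculation for that point (the latency and clock numbers below are assumptions for illustration, not measurements):

```python
# Toy sizing for the "long pipeline" latency hiding described above: at one
# pixel issued per clock per pipe, the pipeline needs roughly as many pixels
# in flight as the texture-fetch latency measured in core clocks.

def pixels_in_flight(miss_latency_ns, core_clock_mhz):
    return miss_latency_ns * core_clock_mhz / 1000.0   # latency in clocks

for clock_mhz in (300, 500, 700):          # hypothetical clocks across a range
    n = pixels_in_flight(400, clock_mhz)   # assume ~400 ns worst-ish case miss
    print(f"{clock_mhz} MHz core: ~{n:.0f} pixels in flight per pipe")

# A pipeline length fixed for one clock:latency ratio ends up oversized or
# undersized when the same design is reused across the whole product range.
```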

I think it's similar to the way that GPUs have used "off the shelf" libraries, rather than being "fully custom" - whatever that means. GPUs evolve every 6 months and they have to be implementable at more than one fab with potentially differing libraries, etc. All that stuff has forced GPU design to be "wasteful" in certain parameters - the payoff has been the huge sales (far higher than x86 CPUs in the same time frame).

So, multi-die appears to be a different kind of "wasteful compromise", in the same tradition.

With all that, I don't see why I should be excited about multi-die?
It's technically interesting, yeah. But I hate the prospect of the software minefield. I think that's by far the greatest demerit of multi-chip, making all this chip level stuff pale in comparison.

Jawed
 
I think it's similar to the way that GPUs have used "off the shelf" libraries, rather than being "fully custom" - whatever that means. GPUs evolve every 6 months and they have to be implementable at more than one fab with potentially differing libraries, etc. All that stuff has forced GPU design to be "wasteful" in certain parameters - the payoff has been the huge sales (far higher than x86 CPUs in the same time frame).
Hubba-wa? Can I have some of what you're smoking or can you back up that comparison/statistic in some meaningful way? :???:
 
Is there any chance that AMD/ATI may be pursuing stacked-chip technology in the near future, considering the multi-core concept in this discussion?
 
That's probably what you want though, isn't it? Memory system performance should scale with GPU capability. Obviously there are tweaks within performance categories, e.g. between DDR2 and GDDR3 memory speeds.

I'm not so sure, to be honest. The performance numbers of R600 vs RV670 and G80 vs G92 show us that memory bandwidth by itself is only part of the equation, right? Dave pointed out that latency-related changes were a big factor in the improved performance of RV670.

Let's assume that the single die has a 128-bit memory bus. In this case, your multi-die solution will suffer from the large inter-die latencies, and your single die (which, at 300M transistors, is still a pretty fast part) may be bandwidth starved. Put larger MCs on there and your 4-die solution will have unused MCs (assuming larger than 512 bits isn't very practical).

There's also the thing about redundancy: with a 128-bit bus per die, your granularity options are limited to 1 out of 4 at best, which is quite low. A monolithic chip with a 512-bit bus can have 1 out of 8 granularity or even 1 out of 16 with 32-bit controllers.

GDDR5 with 5 Gbps per pin may change equations here, of course.

I'm just not sure memory bandwidth needs to increase linearly with the other parameters. Could be more or less, I don't know, but going multi-die restricts those options.
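For reference, the raw arithmetic behind that trade-off (per-pin rates are nominal: roughly 2 Gb/s for mainstream GDDR3, the 5 Gb/s quoted for GDDR5):

```python
# GB/s = (bus width in bits / 8) * per-pin data rate in Gb/s.
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8.0 * gbps_per_pin

for bus_bits, rate in ((128, 2.0), (256, 2.0), (512, 2.0), (128, 5.0)):
    print(f"{bus_bits:3d}-bit @ {rate:.1f} Gb/s/pin: "
          f"{bandwidth_gb_s(bus_bits, rate):5.1f} GB/s")

# A 128-bit GDDR5 interface out-delivers a 256-bit GDDR3 one, which is why
# the right per-die bus width is such an open question.
```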

Both of these issues also arise with a monolithic, 400mm2+ GPU.
Well, no: a monolithic chip will never have inter-die communication overhead. ;)
There is really a world of difference between getting a signal across a die and stepping out of bounds. Of the parameters available, you have to assume that at least 1 of them will get worse by an order of magnitude.

With leakier, finer processes presumably the priority is to keep die size down (or "transistor count" down), so perhaps a collection of smaller dies is an advantage?
If the reference point is equal performance, your combination of smaller dies will still be larger, right? So aggregate leakage will be too. What then have you really gained?

What about chip stacking? Does a multi-chip solution have an eye on the future where stacking becomes possible, with the temptation of huge inter-chip bandwidth?
I think the problem there will be power and heat.
Flip-chip at least provides you with a way to have an equal distribution of power to different parts of the chip. I don't think there's a way to combine die stacking with flip chip (but then I'm not a packaging expert.)
As for heat: you really want some kind of metal in close contact with your die. Here also, die stacking would prevent this.

Supposedly ATI already has across-the-die fine-grained redundancy - not just in ALUs/RF but in all the major functional blocks. So ATI doesn't have any more to gain, they've reached the knee on that redundancy graph.
We've discussed the fine-grained redundancy before. Other than for ALUs and memories, I consider it unlikely... (One reason is because I wouldn't know how to do it, which, I admit, is a really weak argument! ;))

The larger the number of interchangeable identical blocks, the better your ability to tune yield vs performance vs price. There's a world of difference between 1/4 vs 1/8 granularity. In some perverse way, a larger die can be much more flexible in this way than multiple small ones: if the rumors for the 8800GT are true, it was decided late in the game to go from 96 to 112 SPs. You can't decide at the last moment to add another die.
R600 has shown little evidence of unit granularity. (If they could, don't you think we'd have seen a 448-bit version of R600?) So I don't buy that there's nothing more to gain.
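To illustrate the granularity point with made-up unit counts (a G92-flavoured 8-cluster die vs a hypothetical 4-die part):

```python
# Illustrative unit counts only: an 8-cluster, 128-SP single die vs a
# hypothetical 4-die part with 32 SPs per die.
clusters, sps_per_cluster = 8, 16
single_die = [n * sps_per_cluster for n in range(clusters, 4, -1)]
print("single die, disable clusters:", single_die)      # [128, 112, 96, 80]

dies, sps_per_die = 4, 32
multi_die = [n * sps_per_die for n in range(dies, 0, -1)]
print("multi-die, whole-die steps:  ", multi_die)        # [128, 96, 64, 32]
```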

The same argument applies to the entire history of GPUs as far as I can tell. e.g. I'm convinced that the amount of texturing latency-hiding in the older GPUs - created by having "long pipelines" - was fixed at some "average worst case" and then not scaled as it was implemented across the entire range of GPUs based on that architecture, from the smallest to the biggest, no matter the ratios of ALU:memory clocks, or pipeline-length:typical-texturing latency.

Didn't RV610 and RV630 have some differences compared to R600 in terms of shader and texture ratios? RV610 doesn't even have an almighty ring bus! There are differences in video decoders etc. Maybe some caches or other memories are differently sized?

It's technically interesting, yeah. But I hate the prospect of the software minefield. I think that's by far the greatest demerit of multi-chip, making all this chip level stuff pale in comparison.
Ha, yes, I forgot that: it does excite me, but indeed mostly because of technical curiosity.
 
Ok, thx for the response silent_guy. I seem to recall though that AMD was planning a completely new chip design for 2009. I recall whispers of 10 watts a core as the aim - wouldn't that be sufficient to keep heat under control?

IBM's new method, which uses current semiconductor manufacturing technology, involves creating tiny holes called "through-silicon-vias" that are etched all the way through a chip and then injected with tungsten to create wires. "This allows us to move 3-D chips from the lab," said Lisa Su, vice president, semiconductor research and development center.

Dave Lammers, director of research for WeSRCH.com, a Web site for semiconductor engineers based in Austin, Texas, said, "Honestly, this is historic. For the first time, we're exploiting the vertical dimension."

Mr. Lammers said there has been considerable discussion about 3-D semiconductors, but no manufacturer had previously said it was ready to start selling chips. He said microprocessor titans Intel Corp. and Advanced Micro Devices Inc. are both believed to be working on 3-D chip technology for future microprocessors, which he predicted will appear in 2009 or 2010.

http://online.wsj.com/public/article/SB117633745391667111-s93RVktuIlHP3noMcTeax6owiNM_20070419.html

http://www.physorg.com/news95575580.html
 
Interesting that some are worried about power/heat of R700 when AMD/ATi stated that RV670 was just the start of the power consumption decrease.
 
What are the chances of a special-function die holding eDRAM and the ROPs, with a cluster of much smaller dies containing primarily texture engines and shaders? Kind of the same way Xenos works, but with the roles reversed: the daughter die acting as the parent die instead, containing the I/O, UVD, the majority of the inter-chip communication, the memory controller, possibly eDRAM, and the render back-ends.
 
With PCI Express 2.0 providing 2x the bandwidth per lane, and with the inherently low performance of a $50 part, do you think full 16x lane PCI Express is a priority for a single chip?

I don't.

Jawed

I think it is if this low-end chip is also the master for your high-end chip. Does dealing with an x16 link split across multiple chips, or splitting a command stream across multiple x8/x4 links really make much sense? And how much does a full x16 link cost in the scheme of things?
 
Although Intel presented their 80-core 3D IC chip at ISSCC earlier this year, I still doubt they can make a commercial chip with 3D IC by 2009. 3D IC design methodology is not going to reduce power dissipation. Instead, the high power density is the biggest problem designers are fighting against.

In case people are confused: a 3D IC stacks transistors in 3D space. It has nothing to do with 3D graphics.

Ok, thx for the response silent_guy. I seem to recall though that AMD was planning a completely new chip design for 2009. I recall whispers of 10 watts a core as the aim - wouldn't that be sufficient to keep heat under control?



http://online.wsj.com/public/article/SB117633745391667111-s93RVktuIlHP3noMcTeax6owiNM_20070419.html

http://www.physorg.com/news95575580.html
 
Hi;

I'm wondering a little bit about the eDRAM discussion here. It seems everyone assumes that eDRAM can only be used to contain the Z-buffer and the backbuffer/framebuffer.

Why is it not possible to use eDRAM to contain the vertex buffer? Something like the old idea to have a "delay stream" between vertex shaders and pixel shaders, or maybe even a pool of unprocessed vertices. This could allow the efficient distribution of vertices between the 2-4 "sub-processors". Or is something like that not necessary/possible?

Manfred

Link: http://citeseer.ist.psu.edu/aila03delay.html [Oops... Hybrid is now part of Nvidia!]
 
I'm not so sure, to be honest. The performance numbers of R600 vs RV670 and G80 vs G92 show us that memory bandwidth by itself is only part of the equation, right? Dave pointed out that latency-related changes were a big factor in the improved performance of RV670.
G80 v G92 is easy to deal with: G92 is bandwidth-starved at 2560x1600.

As for R600 v anything, the fact is R600's bandwidth is a clear misdirection for the performance it's delivering. It appears there's corner cases where 105.6GB/s is paying off, but one look at RV670 shows this as an irrelevance.

So, R600 is a useless baseline. It's far more sensible to compare R580 and RV670. 64GB/s to 72GB/s. Although I could argue that R580 was squandering its 64GB/s and performed basically the same at 49.6GB/s (X1950XTX v X1900XTX - about 5% improvement, though about 10%+ with HDR).

From generation to generation I expect GPUs to be more bandwidth efficient (7600GT is a great example in comparison with 6800U) but within the same family the architecture scales very closely with bandwidth.
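For the record, those bandwidth figures fall straight out of bus width and effective memory clock:

```python
# GB/s = (bus width / 8) * effective data rate; clocks are the published
# effective rates for the reference boards mentioned above.
boards = {
    "X1900XTX (R580)":  (256, 1.55),
    "X1950XTX (R580+)": (256, 2.00),
    "HD3870 (RV670)":   (256, 2.25),
    "HD2900XT (R600)":  (512, 1.65),
}
for name, (bus_bits, eff_ghz) in boards.items():
    print(f"{name:18s}: {bus_bits / 8 * eff_ghz:6.1f} GB/s")
```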

Let's assume that the single die has a 128-bit memory bus. In this case, your multi-die solution will suffer from the large inter-die latencies, and your single die (which, at 300M transistors, is still a pretty fast part) may be bandwidth starved. Put larger MCs on there and your 4-die solution will have unused MCs (assuming larger than 512 bits isn't very practical).
300M transistors is smaller than RV630 - a part that has "too much" bandwidth (much like R600 though not to quite the same extreme).

Why is larger than 512 bits impractical? The traditional problem with 512 bits was getting that many connections to a single die. In effect a power versus data battle in terms of pad count/density. I admit the main printed circuit board goes up in complexity, but it seems to me that with multi-chip this complexity is on a gentler ramp than a single-die.

There's also the thing about redundancy: with a 128-bit bus per die, your granularity options are limited to 1 out of 4 at best, which is quite low. A monolithic chip with a 512-bit bus can have 1 out of 8 granularity or even 1 out of 16 with 32-bit controllers.
A 128-bit bus is not monolithic - you could deliver SKUs with a 64-bit bus if you want, using only half. One peculiar example of this is RV570/560, two SKUs derived from the same die which is either configured as 256-bit or 128-bit.

But I do wonder if there'll be much calling for such a narrow bus. In theory the performance of a GPU such as RV610 (soon to be RV620) will be falling off the bottom of the chart in a couple of years. With Fusion in the picture too, the concept of a discrete GPU with a 64-bit bus sounds like it has a limited life. Though the continued cost reduction associated with GDDR3 etc. (i.e. more bandwidth per pin) might make 64-bit live for a long time. Hmm...

Though, ahem, maybe R700 is based on 64-bit MC chips...

GDDR5 with 5 Gbps per pin may change equations here, of course.
Ha, yeah.

Well, no: a monolithic chip will never have inter-die communication overhead. ;)
There is really a world of difference between getting a signal across a die and stepping out of bounds. Of the parameters available, you have to assume that at least 1 of them will get worse by an order of magnitude.

If the reference point is equal performance, your combination of smaller dies will still be larger, right? So aggregate leakage will be too. What then have you really gained?
Supposedly one of the benefits of the ring-bus was the removal of hotspots. Seemingly smaller chips are easier to clock higher, which I presume is a hotspot issue. Though I don't have a good understanding of the effects of hotspots on large dies...

I think the problem there will be power and heat.
Flip-chip at least provides you with a way to have an equal distribution of power to different parts of the chip. I don't think there's a way to combine die stacking with flip chip (but then I'm not a packaging expert.)
As for heat: you really want some kind of metal in close contact with your die. Here also, die stacking would prevent this.
I'm expecting something like micro-channel liquid cooling to play a part.

We've discussed the fine-grained redundancy before. Other than for ALUs and memories, I consider it unlikely... (One reason is because I wouldn't know how to do it, which, I admit, is a really weak argument! ;))
I can't find the quote where one of the ATI engineers asserts widespread fine-grained redundancy :cry: Fine-grained ALU-only redundancy is so 2005...

The larger the number of interchangeable identical blocks, the better your ability to tune yield vs performance vs price. There's a world of difference between 1/4 vs 1/8 granularity. In some perverse way, a larger die can be much more flexible in this way than multiple small ones: if the rumors for the 8800GT are true, it was decided late in the game to go from 96 to 112 SPs. You can't decide at the last moment to add another die.
Hence those beasts called HD2600XTx2.

R600 has shown little evidence of unit granularity. (If they could, don't you think we'd have seen a 448-bit version of R600?) So I don't buy that there's nothing more to gain.
I think the widespread fine-grained redundancy makes that moot. Cut-down R600s were/are in very, very limited supply. It's either not worth testing/packaging/manufacturing all the combinations (because it was such a runt of a GPU) or the damn thing yielded practically every die as either fully functional or completely bust.

Didn't RV610 and RV630 have some differences compared to R600 in terms of shader and texture ratios? RV610 doesn't even have an almighty ring bus! There are differences in video decoders etc. Maybe some caches or other memories are differently sized?
Yes, RV610 is 2:1 and RV630 is 3:1. RV610 has no L2 cache. RV610 is also incapable of 8xMSAA.

It does concern me that R6xx is a costly "base architecture" - 390M transistors for RV630 is sort of ludicrous. Yet a vast number of those transistors are caught up in D3D10-specific functionality that you won't find in, say, RV570 (which has 330M transistors and is often faster - though tends to be slower in newer games). And what are the finer processes for, if not to add features and make newer games run faster?

But, I suspect a wodge of R6xx only really works when it goes multi-chip (the virtual memory stuff, the cache architecture, the ring bus). I think it's a bit like R5xx's ring bus, being asymmetric, it wasn't really the full story. We're apparently now looking at the logical conclusion to the ring bus story: a ring bus that encompasses multiple chips.

Jawed
 
I think it is if this low-end chip is also the master for your high-end chip.
I consider R700 to be a fully distributed GPU. "Master" only relates to a single chip that delivers the completed front buffer for display on your monitor(s). This does mean that the "master" GPU also controls the "gathering" of MSAA data when SuperAA is being performed.

There's a load of patent documents that relate to sending commands/data to multiple GPUs (i.e. CrossFire). I haven't spotted anything, so far, that makes one GPU "control" the others. The closest I've seen is how the northbridge (which could be a GPU, too) is where commands are distributed from. Of course I may be misinterpreting the patent documents...

Does dealing with an x16 link split across multiple chips, or splitting a command stream across multiple x8/x4 links really make much sense? And how much does a full x16 link cost in the scheme of things?
If you look at a die shot of Xenos you'll see that the area labelled "PCI Express" is pretty small - but I don't know how many lanes it is.

[Xenos die shot: b3d35.jpg - ignore the 1, 2, 3 annotations.]

I think it's worth pointing out that ATI GPUs have historically shown very good performance with restricted PCI Express lanes - i.e. high end GPUs don't lose much performance with only 8 lanes. Though I haven't seen these tests with R600 or later.

Also, I think it's worth thinking of the PCI Express connection as being the route twixt CPU RAM and GPU RAM. In this case you want all your MCs to join in when moving data in either direction. So it seems to me that spreading the PCI Express lanes equally amongst the chips where there is an equal spread of MCs is prolly beneficial.
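Rough per-direction numbers for splitting lanes across chips (PCIe 1.x is about 250 MB/s per lane after encoding overhead, PCIe 2.0 doubles that):

```python
# Per-direction PCIe bandwidth: 1.x is ~250 MB/s per lane after 8b/10b
# overhead, 2.0 is ~500 MB/s per lane.
per_lane_mb = {"PCIe 1.x": 250, "PCIe 2.0": 500}

for gen, mb in per_lane_mb.items():
    for lanes in (4, 8, 16):
        print(f"{gen} x{lanes:2d}: {mb * lanes / 1000:4.1f} GB/s each way")

# Four chips with x4 2.0 links each match one x16 link in aggregate, as long
# as host traffic can actually be spread evenly across the links.
```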

I don't know what's a good lane count per chip - and what the effect of two R7xx boards, CrossFired, would be on the way lanes are used when sharing data between the boards (apart from the fact that there will prolly be a CrossFire connector, which presumably has some kind of path to at least one of the chips on each R7xx board).

Jawed
 
I think that's a very small number of lanes. What it does is communicate with the SiS southbridge AFAIK, so it's <=4x. Heck, it might even be 1x! I'm not sure how much more expensive 16x would be though (does all the control logic have to be duplicated?)
 
Considering we're talking about multiple chips, what's the possibility that half of the chips would be on one side of the board and the other half on the other side? I know memory can be done this way in certain configurations.

Then again, thinking about it: since the chips are most likely on a package, that might be awkward with package - board - package.

NM, probably a bad idea. :p

Regards,
SB
 