AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
    Well, based on my limited understanding of the factors in play here, I don't think we're at the point where there's a cost or scaling benefit to going this route. The primary advantages seem to be reduced tape-out costs and possibly improved yields. However, isn't there an accompanying significant increase in R&D and manufacturing cost and I/O complexity, and a reduction in performance scalability?

    Also does it even make sense at the lower end of the market? What real benefit is there if any of doing say a 2xRV610 midrange card instead of a single RV630? Wouldn't the former actually end up being more costly? You can amortize tape-out costs over the life of the chip but I don't see that happening with the increased BOM costs for this multi-chip design.

    Am I missing something obvious here?
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Has there been any analysis of CrossFire scaling in R600? I get the impression that CF scales much better in R600 than it does in R5xx (when it works, that is).

    For what it's worth as I posted earlier in this thread,

    http://forum.beyond3d.com/showpost.php?p=1007968&postcount=15

    R700's multiple dies could be just like individual unified shader units in R600. Bear in mind that the R600 architectural diagrams do not necessarily show the correct count of units, apart from ALUs, TUs and RBEs. There's quite a few things that actually appear multiple times and so would be ripe for partitioning, one per die - if indeed R700 is a multi-die GPU.

    This, then, would not even be a matter of CF across multiple dies. It would simply be "R600 split across dies", perhaps with a PCI-Express-interface, Command-Overseer, AVIVO, CF-interfacing die to act as the GPU's interface to the outside world (or another multi-chip GPU).

    Jawed
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
    CF does seem to be scaling better with R600 than R580 but for the most part it has just caught up to SLI. A quick glance at hothardware's or bit-tech's dual-GPU numbers would confirm that.

    What happens to caching, texture/framebuffer access, inter-chip communication etc?
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The R600 diagram doesn't show it, but I suspect R600 splits its texture units equally around the ring-stops, just as R5xx does.

    So, each TU has its own vertex cache, its own L1 cache and, I guess, a portion of L2.

    Similarly, R600's RBEs are distributed. Remember that R300 introduced the concept of screen-space tiling (which is the basis of supertiling in CF). This tiling means that there isn't one, but 4 hierarchical-Z buffers, 4 rasterisers, 4 RBEs, etc. in R600. Each tile has its own ring stop. Each tile can run independently of the others, sharing data as necessary. e.g. texels will sometimes be needed by more than 1 TU (which is where L2-cache-to-L2-cache sharing comes in, if indeed that's a capability of R600), or AA resolve will need to share AA samples across screen-space boundaries (which would be where read-write memory-caches would share with each other).
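    The tile-to-unit mapping described above can be sketched in a few lines of Python. The 32-pixel tile size and 2x2 checkerboard pattern are illustrative assumptions, not R600's actual parameters.

```python
# Toy sketch of screen-space supertiling: each fixed-size screen tile is
# owned by one of four independent back-end pipelines (hierarchical-Z,
# rasteriser, RBE). Tile size and pattern are assumptions for illustration.
TILE = 32          # assumed tile edge in pixels
NUM_UNITS = 4      # four independent tile pipelines, as described above

def owner(x, y):
    """Return which of the 4 units owns the pixel at (x, y)."""
    tx, ty = x // TILE, y // TILE
    # 2x2 checkerboard repeated across the screen, so horizontally and
    # vertically adjacent tiles always belong to different units.
    return (tx % 2) + 2 * (ty % 2)

# Pixels in the same tile share an owner; adjacent tiles differ.
assert owner(0, 0) == owner(31, 31)
assert owner(0, 0) != owner(32, 0)
```

    The checkerboard keeps neighbouring tiles on different units, which spreads rendering load evenly even when the scene's work is concentrated in one region of the screen.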

    Inter-chip communication? Do you mean the ring bus?

    It seems to me that the architecture in R600 is just a few snips away from the basic split approach. It's interesting to ponder what aspects of R600 aren't ready for this yet (well, I can't think of any, but there must be something)... A deeper understanding of the buses in R600 would help.

    What's got me dubious is the physical implementation of the ring bus. Each chip would need 2 ring-bus ports, and each port would need to be in the region of 50GB/s in each direction, I suspect.
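    Rough pin-count arithmetic shows what such a port implies. The 50GB/s-per-direction figure comes from the post above; the 2Gb/s per-pin signalling rate is an illustrative assumption.

```python
# Back-of-envelope pin count for two bidirectional 50 GB/s ring-bus ports.
# The per-pin signalling rate is an illustrative assumption, not a spec.
GBPS_PER_PIN = 2.0            # assumed Gb/s per signal pin
PORT_BW_GB = 50.0             # GB/s, each direction, per port (from the post)

pins_per_direction = PORT_BW_GB * 8 / GBPS_PER_PIN   # 400 Gb/s / 2 Gb/s = 200
pins_total = pins_per_direction * 2 * 2              # 2 directions, 2 ports

print(f"Signal pins per port, per direction: {pins_per_direction:.0f}")  # 200
print(f"Data pins for two bidirectional ports: {pins_total:.0f}")        # 800
```

    Several hundred data pins just for inter-die traffic, before memory and PCI Express, is what makes the physical implementation look daunting.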

    Jawed
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
    Yeah just wondering what kind of inter-chip communication would be necessary for this kind of setup and what amount of bandwidth is practical between them. The comparison to SLI/CF doesn't help IMO because those methods don't require the chips to share resources or communicate directly.
     
  6. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    Well, then I wouldn't expect to see the scaling point chosen where BOM is going to be a major constraint on your ability to offer a competitive part at a competitive price. It just seems to me the top-to-bottom scope of both performance and price is stretching and stretching and stretching. Even today they aren't really scaling down from a single high-end GPU, because the highest-performance part is actually multi-chip already and has been for over 2 years. So what should be the factors that decide where you make that break between single-chip and multi-chip when you already know that multi-chip will be your ultimate top end? Certainly yield and tape-out are factors. BOM is too, but your flexibility on BOM is a bit different at different price points as you move up the food chain.

    All of which is a long-winded way of saying I'd never expect to see "40 x RV610" as the ultimate performance high-end. :lol: On the other hand, it is not totally obvious to me that R600/G80 is the best place to make that break either.
     
  7. PSU-failure

    Newcomer

    Joined:
    May 3, 2007
    Messages:
    249
    Likes Received:
    0
    RV610, RV630 and R600 are 3 separate designs, even if they include the same modular parts. You need to create 3 masks, you need 3 different wafers...

    Now, if you instead design each part of a GPU separately and interconnect those parts to build an entry-level, performance, or enthusiast GPU (or even a pure compute part), you have different wafers with different chips, used across your entire line of products, thus reducing costs.
     
  8. nicolasb

    Regular

    Joined:
    Oct 21, 2006
    Messages:
    421
    Likes Received:
    4
    As well as having to design only one chip instead of three, might there be yield advantages, too? If each individual chip is comparatively small and simple, there's less to go wrong. Or, if something does go wrong, you lose a smaller percentage of the wafer than you do if you have to throw away a large chip.

    What would be the cooling implications? Is it easier to cool 4 small chips or one large one (assuming total heat output is similar)? Could smaller, simpler chips be individually clocked higher without going into thermal overload?
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Yield-wise, smaller chips can go through some validation prior to being packaged, so it's possible to cull out bad chips that would have led to an entire larger die being discarded.

    It's also possible to get better bin splits. Smaller geometries have a larger problem with component variation. There was a nifty map somewhere where someone used a color plot to show the clock speeds of various chips on a wafer.

    Larger chips are more likely to have areas with transistors that can't hit high clocks, or leak.
    Smaller chips can be mixed and matched.
    Separate chips do face higher communication latency, so it may be worth having a larger, lower-clocked chip if communication delay cuts performance down too far.
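    The yield argument can be made concrete with the standard Poisson defect model, Y = exp(-A*D). The die areas and defect density below are illustrative assumptions, not real process data.

```python
import math

# Standard Poisson yield model: Y = exp(-A * D), with die area A in cm^2
# and defect density D in defects/cm^2. All figures are illustrative.
D = 0.5                          # assumed defects per cm^2

def yield_rate(area_mm2):
    """Probability that a die of the given area has zero defects."""
    return math.exp(-(area_mm2 / 100.0) * D)

big = yield_rate(400)            # one monolithic 400 mm^2 die
small = yield_rate(100)          # one 100 mm^2 die

print(f"400 mm^2 die yield: {big:.1%}")
print(f"100 mm^2 die yield: {small:.1%}")
print(f"4 x 100 mm^2, all four good: {small ** 4:.1%}")
```

    Note the punchline: the chance of four small dies all being good equals the monolithic yield, so the win isn't in the raw defect statistics. It comes from testing and culling the small dies individually before packaging, so most of the wafer's good silicon remains usable.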


    Cooling:

    Another source of heat is that a large number of well-connected but separate chips will be driving a lot of signal I/Os, which generate heat.

    The only example I have to go by is the POWER MCMs, but those can burn nearly a kilowatt of power.

    The cost of MCM packaging is another consideration: special packaging isn't cheap to make, and the substrates can get complex and expensive.
     
  10. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    I know - I meant it as an example of a product where multiple chips are used in a single package (not of die-to-die interconnects). I was wondering whether, if one combines that sort of setup with a die-to-die interconnect, some or much of the latency and power penalties of communicating across a PCB are eliminated...
     
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
    Yeah that's exactly what we're discussing here. Whether the reduction in tape-out costs is greater than the increase in other costs required to make this approach work. And of course this isn't happening in a vacuum - you still need to maintain a certain level of performance in order to be able to price these things competitively. It doesn't help if your costs are lower than the competition if you can't compete on performance. I can see this happening down the road when process technology advancements slow down but I just don't see the impetus for it now. What would G80 be at 65nm? Somewhere near the size of G71?
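    The "G80 at 65nm" question can be answered roughly with ideal area scaling. The die sizes below are the commonly cited approximate figures (~484mm^2 for G80 and ~196mm^2 for G71, both at 90nm), and real shrinks rarely achieve the ideal factor, so treat this as a lower bound on the shrunk die's size.

```python
# Back-of-envelope for "what would G80 be at 65 nm?", assuming ideal
# area scaling with feature size squared. Die sizes are approximate,
# commonly cited figures, not official numbers.
G80_AREA_90NM = 484.0   # mm^2 at 90 nm, approximate
G71_AREA_90NM = 196.0   # mm^2 at 90 nm, approximate

shrink = (65.0 / 90.0) ** 2            # ideal area scaling factor, ~0.52
g80_at_65 = G80_AREA_90NM * shrink     # ~252 mm^2

print(f"Ideal G80 die at 65 nm: ~{g80_at_65:.0f} mm^2")
print(f"G71 at 90 nm: {G71_AREA_90NM:.0f} mm^2")
```

    So even with a perfect shrink, a 65nm G80 would land somewhat above G71's size, which is in the same ballpark as the guess in the post above.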
     
  12. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,047
    Likes Received:
    4,998
    Well, G80 has already taken a baby step towards multi-chip GPU packages by breaking out some functionality into the NVIO chip. So it wouldn't surprise me if they have a multi-chip/multi-core GPU in the works. When it'll happen, though, is the big unknown.

    Regards,
    SB
     
  13. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    12,322
    Likes Received:
    1,120
    I dunno, this multi-chip thing doesn't sound good to me. ATI has had a knack of late for introducing expensive, unneeded features well before they're relevant. Who's to say this isn't another?
     
  14. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Yes, sure. But I'm not sure if R600 is that much different from, say, R580, R520, G80 or G70. After all, those also made it possible to disable major blocks for cheaper versions, so there were definitely also separate entities, right?

    Yes, just think back to Bob's comment in a previous thread: it can take around 100 clock cycles just to cross a die and come back. With a multi-die arrangement, this alone would almost double, even if the individual dies were smaller than a big monolithic die (because you won't have a square form factor anymore).

    My guess would be that the latency is more in the logic required to prepare the data to send and to recover the data on receive than in the delay on the wires: in the former case, you're talking clock cycles at a time; in the latter, it's just wave propagation.
    That said, I assume there will be some aspects that are easier if you have tight control about placing them on the same substrate (like Xenos.)

    The major advantage would be the ability to have massive buses between tiles. It could work in theory, but there's probably a big pile of practical complications: you'd need a bunch of strong drivers to cross the 150um die scribe, plus power distribution, testability methodology, layout methodology, ESD issues, etc., and much more that I don't know about.

    My bet would be on 'more'. We're currently not even at 65nm for the high-end.
     
  15. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Hmm, well, I'm well out of my depth, but, actually that sounds like an interesting idea. Instead of offset-stacking them, would it make sense to build two "chips" -- a compute chip, and an interconnect one, and then stagger/stack them:

    ...........===.........===
    ======..======..======

    (please ignore alignment-dots :( )

    Layout of a ring bus probably scales a tad further than layout of a crossbar, which is kind of an interesting, random thought....

    -Dave
     
  16. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    (BACKSTAGE, the scream of the packaging expert, who now has to invent a completely new packaging technique.)

    In a word: no. The top of the die is already 'welded' upside down onto the substrate. Your connection die would have nothing to connect to. In fact, the substrate is already the agent that can do what your connect chip is supposed to do.

    For some it's windmills, for others it's the ring bus: I'm always more than willing to reopen the debates on my favorite topic, but I'm not convinced others would agree, so let's leave it at that. :wink:
     
  17. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Like a 1024 bit ring bus link you mean? :). Thanks for indulging a software guy with some gory HW details BTW. Assuming for the sake of pipe dreaming that the issues you mention were surmountable, how hard would it be to decently test a tile before cutting out the final dies? I was thinking that ideally you might want to cut up the wafer based on knowledge of whether individual tiles were functional or not, e.g. if you had 1 defective tile in a 2x2 grid, you'd turn that into a 1x1 and a 1x2 die or 3 1x1 dies, rather than having a SKU for 2x2 tile dies with 1 defective tile.

    The other possible advantage I was thinking of was that you should be able to react to changing demand more quickly: if nobody wants your high-end die because the competition just released the MegaBlaster 10000, you don't suddenly have a bunch of useless high-end wafers on your hands (or still making their way through the fab) - just cut them up differently to make mid-range dies...
     
  18. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Normally, dies are tested after cutting, but before packaging. Cutting by itself can put some mechanical stress on the die and result in failures. But there's no reason why you can't test before cutting. Your only problem would be that you may have larger fallout in after-packaging testing, which will increase overall cost.

    Much more of a problem: you can't cut out individual dies because you're using a circular saw that cuts across the whole wafer. (Basically, you first attach the wafer to a plastic that's stretched flat on a hollow ring. The saw will cut deep enough to separate the dies, but it doesn't cut through the plastic. Individual dies are separated only 150um from each other, so the saw is extremely precise.)

    Your only option would be to first test all the dies, then make some trade-off to decide how many 2x2 SKUs with defective pieces are acceptable, and modify the cutting process accordingly. This would require a major change in the existing flow (right now, there is simply no control link between cutting and testing).
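    That trade-off can be sketched numerically: given a per-tile yield, how many 2x2 groups come out fully functional, versus usable if one defective tile is acceptable? The 90% per-tile yield below is an illustrative assumption.

```python
# Sketch of the 2x2-group trade-off: if every 2x2 group of tiles must be
# cut as a unit, what fraction of groups is sellable? Per-tile yield is
# an illustrative assumption, not real process data.
tile_yield = 0.9                 # assumed probability that one tile is good

all_good = tile_yield ** 4       # 2x2 group with zero defective tiles
# Add the four ways of having exactly one bad tile out of four.
at_most_one_bad = all_good + 4 * (1 - tile_yield) * tile_yield ** 3

print(f"2x2 groups with no defects: {all_good:.1%}")               # 65.6%
print(f"groups usable if 1 bad tile is OK: {at_most_one_bad:.1%}") # 94.8%
```

    Accepting a single defective tile per group (as a cut-down SKU) recovers a large chunk of otherwise wasted groups, which is exactly why the trade-off is tempting despite the flow changes it would require.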

    I'm sure both competitors also read TheInq religiously. They should be able to figure out that a competitor is about to release a new high-end killer at least a month or two up front, no? :wink:

    It's a neat idea, but I'm afraid it isn't going to happen anytime soon.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    A multi-chip architecture seems like it would need to be based on at least two "base chips".

    The baby chip would have only 1 ring bus port (due to limited die size), so it could talk to at most one neighbour. This would result in 1-chip and 2-chip GPUs. It wouldn't be a ring bus as such, just an interconnect. A dedicated port for PCI Express would be needed on each chip. It might be simplest to make each chip have an 8-lane port so that the two, together, share the 16-lane PCI Express bus to the CPU. Each chip might have a 32-bit or a 64-bit port to memory.

    The midrange chip would have enough area to support 2 ring bus ports. Memory would be a 64-bit port. This chip would form the basis of all the mainstream and better GPUs.

    Jawed
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    That's a round trip with a loaded ring bus, right?
    There are examples of more straightforward signal passage crossing wide dies in tens of clock cycles (possibly less) on chips with much higher clocks.
     