Multi Die GPUs

Discussion in 'Architecture and Products' started by Digidi, Sep 17, 2020.

  1. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    312
    Likes Received:
    155
    Hello, I wanted to start a conversation about the multi-die concepts that will arrive over the next few years: what their advantages and disadvantages are, and what is possible and what isn't.

    I asked myself why they always build a "complete" small GPU die as a chiplet and combine these small chiplets into a big GPU. Why don't they build a frontend die with the rasterizer etc., a shader die with the shaders, and a backend die with the ROPs?

    If you split it into frontend, shader, and backend dies, you would be much more flexible about what each workload needs. For example, a gaming GPU could be made of 1 frontend die, 2 shader dies, and 1 backend die, while a CAD GPU could be made of 2 frontend dies, 1 shader die, and 2 backend dies.
     
    #1 Digidi, Sep 17, 2020
    Last edited: Sep 17, 2020
    milk, sonen and Man from Atlantis like this.
  2. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    837
    Likes Received:
    139
    Location:
    'Zona
    Power efficiency is part of the answer.
    Moving data around on a big die is still going to be much cheaper than going off die, especially going die-die-die.
    Now true 3d stacking and some sort of TSV like interconnect/fabric could be interesting down the road, but leads to other complications.
     
    sonen and Digidi like this.
  3. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    312
    Likes Received:
    155
    Thank you for your answer. But my question is: when they make a chiplet design with multiple chiplets, why do they always make each chiplet a complete GPU on its own? Why don't they divide the chiplets the way the functional units are divided?

    Why build a chiplet which includes all three main stages (frontend, shader, backend)? Why not build one chiplet which contains only the frontend, another which contains only the shaders, and another which contains only the backend?
     
    #3 Digidi, Sep 17, 2020
    Last edited: Sep 17, 2020
  4. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    248
    Likes Received:
    231
    AMD's CPU division went the "chiplet" way because they needed to create high-volume 64-core CPU SKUs. AMD desperately needed that core count to chase Intel's server CPUs. The resulting die size is 8 × 74 mm² + 416 mm² = 1008 mm². Even minus the interconnect, it would be a huge monolithic die.
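
    As a quick sanity check of that area figure, here is a back-of-the-envelope tally in Python (a sketch; the ~74 mm² per CCD and ~416 mm² I/O die are the commonly cited Rome die sizes):

        # Rome die-area tally (assumed: ~74 mm^2 per Zen 2 CCD, ~416 mm^2 for the 14nm IOD)
        ccd_area_mm2 = 74
        iod_area_mm2 = 416
        num_ccds = 8
        print(num_ccds * ccd_area_mm2 + iod_area_mm2)  # -> 1008 mm^2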

    Apparently, they couldn't afford to make separate masks for different dies, so they reuse the chiplets in low-end servers, workstations, and high-end desktops. But this comes at a price.

    Despite being 7nm (+ 12nm), their low-end client CPUs are less power-efficient than Intel's older 14nm-based Skylake parts; the signal has to go off-die and back in. The mobile-oriented Renoir die was also made as a monolith because of power. This might improve with very advanced packaging methods, but those are very expensive.

    So there is no silver bullet, only trade-offs... Looking at the size and scalability of pure compute-oriented cards like the V100 or A100, those might be the ones that need splitting into several dies. But even for AMD's next-gen Arcturus there seem to be no chiplets in sight.
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,864
    Likes Received:
    324
    Location:
    Taiwan
    External connections are expensive, slow, and power hungry, so you want to minimize them. If you make one chip doing the frontend, another doing the shading, and another doing the backend, the interconnect requirements between these chips will likely be much larger. It's better to keep the interconnections internal and use external connections only for the absolutely necessary components (namely the PCI Express bus and the video RAM).
    The original Voodoo Graphics and Voodoo 2 cards were designed this way, with rasterizer and texture units on different chips. Later products integrated the texture unit into the rasterizer, so it was no longer a separate design.
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,881
    Likes Received:
    1,101
    Location:
    New York
    As others have said, those chiplets need to talk to each other, and moving data between them is slow and eats power. A big chip that yields well will always be better than multiple chiplets, so I expect we won't see GPU chiplets for quite a while.
     
  7. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,165
    Likes Received:
    2,685
    Location:
    Germany
    Once upon a time, there was a company called 3DLabs that basically did that. They had a card called the Wildcat Realizm 800, which combined a Vertex/Scalability Unit (VSU) with two Visual Processing Units (VPUs) and separate memory pools. Very interesting, and very dead by now. It was a very impressive card at the time that excelled in CAD applications. Anandtech's article is still up.

    But who knows, maybe one day we'll see a comeback of the tech.
     
    Silenti, Digidi and BRiT like this.
  8. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,319
    Likes Received:
    519
    While we are going down memory lane, there "was" also the Rampage + Sage design, which the RTX 3090 may perhaps finally surpass in performance.

    Has it really been 18 years? I feel so f-ing old.
     
    DavidGraham likes this.
  9. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    282
    Before discussing "why not chiplets", it is better to first understand the type and sheer volume of data being moved on-chip between different GPU subsystems. Let's look at two prominent numbers from two hot paths in the RDNA shader array architecture:

    1. CU export bus, targeting "frontend" (parameter caches), "backend" (ROP) and Global Data Share
    At least one SIMD32 VGPR per clock (a guess based on GDS bandwidth). That gives 32 * 32-bit/clk = 128 B/clk (unidirectional).

    2. GL1 to L2 cache partitions
    Quad-banked, 64 B/clk per direction (in/out) per bank, so 256 B/clk per direction, or 512 B/clk in total.

    Now assuming the clock is fixed at 2 GHz, for each shader array:

    1. The export bus moves at most 256 GB/s (1 VGPR/clk) to its destinations.
    2. GL1 can ingress and egress 512 GB/s simultaneously, i.e. ~1 TB/s in total.

    Now compare these figures to the on-package SerDes I/O specs of MCM-based chiplets. One on-package link of Zeppelin is around 42 GB/s (bidirectional; DDR4-2666 assumed). Using that as a basis, you need 6 or 24 of them to satisfy (1) or (2) respectively. You could also raise the clock to go narrower, but you can only go so far with single-ended I/O without the power use scaling exponentially.
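
    To make that arithmetic explicit, here is a small Python sketch (the bus widths, the 2 GHz clock, and the 42 GB/s Zeppelin link figure are the assumptions from above):

        CLK_HZ = 2e9                           # assumed fixed 2 GHz shader clock

        # 1. CU export bus: one SIMD32 VGPR per clock = 32 lanes * 4 bytes
        export_GBps = 32 * 4 * CLK_HZ / 1e9    # 256 GB/s, unidirectional

        # 2. GL1 <-> L2: 4 banks * 64 B/clk per direction, both directions
        gl1_GBps = 4 * 64 * 2 * CLK_HZ / 1e9   # 1024 GB/s (~1 TB/s) in total

        LINK_GBps = 42                         # one Zeppelin on-package SerDes link
        print(round(export_GBps / LINK_GBps))  # ~6 links for the export bus
        print(round(gl1_GBps / LINK_GBps))     # ~24 links for GL1 <-> L2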

    You can probably now see why some consider speculation about CUs being scattered across chiplets "like Zen" far-fetched, in the context of maintaining uniform memory access and the perception of being one "BFGPU".

    Silicon interposers (2.5D packaging) could be the saviour, since they can deliver the wire and bump density needed for GPU buses. On the other hand, AMD spoke publicly about having considered silicon interposers for Zen 2 Rome, only to drop the idea because of the reticle limit. So that's a data point to consider.

    This is not to say it is impossible: the rumour mill recently suggested that Navi 3X might be a multi-die design comprising a "GCD" and an "MCD", where MCD might stand for "Memory Complex Die". Although on-package SerDes I/O cannot sustain the sheer volume of data GPUs move on-chip, it can make sense as a replacement for the conventional GDDR memory buses, in concert with HBM (stacked atop the MCD). This would give the power-efficiency gain of HBM, albeit a smaller one due to the SerDes I/O, though it could still be a sizable win, since on-package I/O (versus off-package GDDR or SerDes) is easier to drive given the shorter wires.
     
    #9 pTmdfx, Sep 17, 2020
    Last edited: Sep 17, 2020
    Alexko, xpea, Geeforcer and 2 others like this.
  10. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    312
    Likes Received:
    155
    Is data transfer such an issue? If you put, for example, an HBM cache between two chiplets (say a frontend chiplet and a shader chiplet) and connect both of them to the HBM stack via an HBM interface, I think you can reach the transfer rates you need. For example, HBM delivers 460 GB/s per stack.
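
    For scale, that per-stack figure can be compared against the on-chip traffic estimated earlier in the thread (a sketch; the numbers are the ones quoted above):

        hbm_stack_GBps = 460    # quoted HBM bandwidth per stack
        export_bus_GBps = 256   # CU export bus estimate, per shader array
        gl1_l2_GBps = 1024      # GL1 <-> L2 total estimate, per shader array

        # One stack could cover the export path, but not the GL1 <-> L2 traffic
        print(hbm_stack_GBps >= export_bus_GBps)  # True
        print(hbm_stack_GBps >= gl1_l2_GBps)      # False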
     
  11. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,326
    Likes Received:
    435
    Location:
    Australia
    I don't think the first multi-die GPUs will be a scaling solution; rather, they will target something like 2 chips. Getting two ~300mm² chips with a high degree of performance scaling, without too much die area dedicated to the connections to the other GPU, would allow a single chip to serve both the mid-range and the high end.
    Given that TSMC has an EMIB-like product coming, I wonder how suitable it would be to have two chips "joined" at the frontend and maybe also at the L2s?
     
  12. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    28
    Likes Received:
    58
    It is. Take a look at a datasheet for some production HBM and see how many pins/bumps there are for data transfer. You're right conceptually: sticking some HBM in the middle between two chiplets and dual-porting it seems like a straightforward way of solving a couple of the problems (power consumption and bandwidth), but good luck routing that even on an interposer. There are also other problems that are quite difficult to solve, like cache coherency between the two compute chiplets.

    I did see that Intel's EMIB solution was on the order of 0.3pJ/bit versus Infinity Fabric between chiplets being 2.0pJ/bit, so an EMIB-type link between two chiplets seems like a better plan, although still guaranteed to be significantly more power hungry than the same communication happening on-die.
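
    Those energy-per-bit figures translate directly into link power; a quick sketch (the ~1 TB/s cross-die traffic figure is only an illustrative assumption, borrowed from the estimates earlier in the thread):

        def link_power_w(pj_per_bit, GBps):
            """Link power = energy per bit * bit rate."""
            return pj_per_bit * 1e-12 * GBps * 1e9 * 8

        traffic_GBps = 1024                     # assumed cross-die traffic
        print(link_power_w(0.3, traffic_GBps))  # EMIB-class link: ~2.5 W
        print(link_power_w(2.0, traffic_GBps))  # IF chiplet link: ~16.4 W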

    It doesn't solve the cache coherency problem, but I think it would probably be easier to take a narrow and fast memory like GDDR6X, relax the spec a bit to allow somewhat longer traces, and dual-port that. Stick your GDDR memory array in between the two compute dies and let both access it.
     


    #12 T2098, Sep 18, 2020
    Last edited: Sep 18, 2020