Multi-Die GPUs

Digidi

Hello, I wanted to start a conversation about the multi-die concepts that will arrive in the next few years: what the advantages and disadvantages are, and what is possible and what isn't.

I asked myself why they always build a "complete" small GPU die as a chiplet and combine these small chiplets into a big GPU. Why don't they build a frontend die with the rasterizer etc., a shader die with the shaders, and a backend die with the ROPs?

If you cut it down to frontend, shader, and backend dies, you would be much more flexible about what's needed for the workload. For example, a gaming GPU could be made of 1 frontend die, 2 shader dies, and 1 backend die, while a CAD GPU could be made of 2 frontend dies, 1 shader die, and 2 backend dies.
 
Power efficiency is part of the answer.
Moving data around on a big die is still going to be much cheaper than going off-die, especially going die-to-die-to-die.
True 3D stacking and some sort of TSV-like interconnect/fabric could be interesting down the road, but that leads to other complications.
 
Thank you for your answer. But my question is: when they make a chiplet design with multiple chiplets, why do they always make a chiplet that is a complete GPU on its own? Why don't they divide the chiplets the way the functional units are divided?

Why do they build a chiplet which includes all 3 main stages (frontend, shader, backend)? Why won't they build a chiplet which only contains the frontend, another chiplet which contains only the shaders, and another chiplet which only contains the backend?
 
AMD's CPU division went the "chiplet" way because they needed to create high-volume 64-core CPU SKUs. AMD desperately needed that core count for chasing Intel's server CPUs. The resulting total die size is 8 × 72mm² + 416mm² = 1008mm². Even minus the interconnect, it'd be a huge monolithic die.

Apparently, they couldn't afford to make separate masks for different dies. So they reuse the chiplets in low-end servers, workstations, and high-end desktops. But this comes at a price.

Despite being 7nm (+ 12nm), their low-end client CPUs are less power-efficient than Intel's older 14nm-based Skylake parts, because the signals have to go off-die and back in. The mobile-oriented Renoir die was also made as a monolith due to power. This might improve with very advanced packaging methods, but those are very expensive.

So there is no silver bullet - there are trade-offs... Seeing the size and scalability of pure compute-oriented cards like the V100 or A100, those might be the ones that need splitting into several dies. But even for AMD's next-gen Arcturus there seem to be no chiplets in sight.
 
Thank you for your answer. But my question is: when they make a chiplet design with multiple chiplets, why do they always make a chiplet that is a complete GPU on its own? Why don't they divide the chiplets the way the functional units are divided?

Why do they build a chiplet which includes all 3 main stages (frontend, shader, backend)? Why won't they build a chiplet which only contains the frontend, another chiplet which contains only the shaders, and another chiplet which only contains the backend?

External connections are expensive, slow, and power hungry, so you want to minimize them. If you make one chip doing the frontend, another chip doing the shading, and another chip doing the backend, it's likely that the interconnect requirements between these chips will be much larger. It's better to keep the interconnections internal, and use external connections only for the absolutely necessary components (namely the PCI Express bus and the video RAM).
The original Voodoo Graphics and Voodoo 2 cards were designed this way, with the rasterizer and texture units on different chips. Later products integrated the texture unit into the rasterizer and thus no longer used a separated design.
 
Why won't they build a chiplet which only contains the frontend, another chiplet which contains only the shaders, and another chiplet which only contains the backend?

As others have said, those chiplets need to talk to each other, and moving data between them is slow and eats power. A high-yielding big chip will always be better than multiple chiplets, so I expect we won't see GPU chiplets for quite a while.
 
For example, a gaming GPU could be made of 1 frontend die, 2 shader dies, and 1 backend die, while a CAD GPU could be made of 2 frontend dies, 1 shader die, and 2 backend dies.
Once upon a time, there was a company called 3DLabs who basically did that. They had a card called the Wildcat Realizm 800, which combined a Vertex/Scalability Unit (VSU) with two Visual Processing Units (VPUs) and different memory pools. Very interesting, and very dead by now. It was a very impressive card at the time that excelled in CAD applications. Anandtech's article is still up.

But who knows, maybe one day we'll see a comeback of the tech.
 
While we are going down memory lane, there "was" also the Rampage + Sage design, which the RTX 3090 may perhaps finally surpass in performance.

Has it really been 18 years? I feel so f-ing old.
 
Before discussing "why not chiplets", it is better to first understand the type and sheer volume of data being moved on-chip between different GPU subsystems. Let's look at two prominent numbers from two hot paths in the RDNA shader array architecture:

1. CU export bus, targeting the "frontend" (parameter caches), the "backend" (ROPs), and the Global Data Share
At least one SIMD32 VGPR per clock (a guess based on GDS bandwidth). That gives 32 × 32-bit/clk = 128 B/clk (unidirectional).

2. GL1 to L2 cache partitions
Quad-banked, 64 B/clk per direction (in/out) per bank, so 256 B/clk per direction, or 512 B/clk in total.

Now assuming the clock is fixed at 2 GHz, for each shader array:

1. The export bus moves at most 256 GB/s (1 VGPR/clk) to its destinations.
2. GL1 can ingress and egress 512 GB/s simultaneously, i.e., 1 TB/s in total.

Now compare these figures to the on-package SerDes I/O specs of MCM-based chiplets. One on-package link of Zeppelin is around 42 GB/s (bidirectional; DDR4-2666 assumed). Using that as a basis, you would need 6 or 24 of them to satisfy (1) or (2) respectively. You could also bump the clock to go narrower, but you can only go so far with single-ended I/O before the power use scales exponentially.
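
To put numbers on that, here's a quick back-of-the-envelope in Python, using the same assumptions as above (2 GHz clock, ~42 GB/s per Zeppelin-style on-package link); the figures are rough estimates, not vendor specs:

# Per-shader-array bandwidth vs. on-package SerDes links (rough estimates).
CLOCK_HZ = 2e9  # assumed fixed GPU clock

# 1. CU export bus: one SIMD32 VGPR per clock = 32 lanes * 32 bits = 128 B/clk
export_bytes_per_clk = 32 * 32 // 8                 # 128 B/clk, unidirectional
export_gbs = export_bytes_per_clk * CLOCK_HZ / 1e9  # 256 GB/s

# 2. GL1 <-> L2: quad-banked, 64 B/clk per direction per bank
gl1_bytes_per_clk = 4 * 64 * 2                      # 512 B/clk, both directions
gl1_gbs = gl1_bytes_per_clk * CLOCK_HZ / 1e9        # 1024 GB/s, i.e. ~1 TB/s

LINK_GBS = 42  # one Zeppelin-style on-package link, GB/s (DDR4-2666 assumed)
print(f"export bus: {export_gbs:.0f} GB/s -> {export_gbs / LINK_GBS:.0f} links")
print(f"GL1 <-> L2: {gl1_gbs:.0f} GB/s -> {gl1_gbs / LINK_GBS:.0f} links")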

You can probably now see why some consider speculation about CUs being scattered across chiplets "like Zen" far-fetched, at least in the context of maintaining uniform memory access and the perception of being one "BFGPU".

Silicon interposers (2.5D packaging) could be the saviour, since they can deliver the wire & bump density needed for GPU buses. On the other hand, AMD spoke publicly about having considered silicon interposers for Zen 2 Rome and then dropping the idea because of the reticle limit. So that's a data point to consider.

This is not to say it is impossible - the rumour mill recently suggested that Navi 3X might be a multi-die design comprising a "GCD" and an "MCD", where MCD might stand for "Memory Complex Die". Although on-package SerDes I/O cannot sustain the sheer volume of data GPUs move on-chip, it can make sense as a replacement for the conventional GDDR memory buses, in concert with HBM deployment (stacked atop the MCD). This can give the power-efficiency gain of HBM, albeit a smaller one due to the SerDes I/O, though it could still be a sizable win since on-package I/O (vs. off-package GDDR or SerDes) is easier to drive given the shorter wires.
 
1. The export bus moves at most 256 GB/s (1 VGPR/clk) to its destinations.
2. GL1 can ingress and egress 512 GB/s simultaneously, i.e., 1 TB/s in total.

Now compare these figures to the on-package SerDes I/O specs of MCM-based chiplets. One on-package link of Zeppelin is around 42 GB/s (bidirectional; DDR4-2666 assumed). Using that as a basis, you would need 6 or 24 of them to satisfy (1) or (2) respectively. You could also bump the clock to go narrower, but you can only go so far with single-ended I/O before the power use scales exponentially.

Is data transfer such an issue? If you make, for example, an HBM cache between two chiplets (say a frontend chiplet and a shader chiplet) and connect both of them via an HBM interface to an HBM stack, I think you can reach the transfer rates you need. For example, HBM has 460 GB/s per stack.
 
I don't think the first multi-die GPUs will be a scaling solution, but rather will just target something like 2 chips. Getting 2 ~300mm² chips with a high degree of performance scaling, without too much die area dedicated to the connections to the other GPU, would allow a single chip to serve both the mid-range and the high end.
Given that TSMC has an EMIB-like product coming, I wonder how suitable it would be to have two chips "joined" at the frontend and maybe also at the L2s?
 
Is data transfer such an issue? If you make, for example, an HBM cache between two chiplets (say a frontend chiplet and a shader chiplet) and connect both of them via an HBM interface to an HBM stack, I think you can reach the transfer rates you need. For example, HBM has 460 GB/s per stack.

It is. Take a look at a datasheet for some production HBM and see how many pins/bumps there are for data transfer. You're right in that sticking some HBM in the center between two chiplets and dual-porting it seems, conceptually, a straightforward way of solving at least some of the problems (power consumption and bandwidth), but good luck routing that even on an interposer. There are also more problems that are quite difficult to solve, like cache coherency between the two compute chiplets.
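
For a rough sense of scale, here's a small sketch using HBM2 figures (1024 data bits per stack, i.e. 8 independent 128-bit channels, per the JEDEC spec; the per-pin rate is an assumed HBM2E-class speed):

# Why routing an HBM stack is hard: the interface is extremely wide.
data_bits = 8 * 128   # 1024 data signals per stack (command/address are extra)
pin_rate_gbps = 3.6   # assumed Gb/s per pin (HBM2E-class speed)
stack_gbs = data_bits * pin_rate_gbps / 8
print(f"{data_bits} data bumps -> {stack_gbs:.0f} GB/s per stack")  # ~460 GB/s

Dual-porting a stack between two chiplets would roughly double the number of signals to route.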

I did see that Intel's EMIB solution was on the order of 0.3 pJ/bit versus Infinity Fabric between chiplets at 2.0 pJ/bit, so an EMIB-type link between two chiplets seems like a better plan, although it is still guaranteed to be significantly more power-hungry than the same communication happening on-die.
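
Those energy-per-bit figures turn into watts quickly at GPU-class bandwidths. A small sketch, using an illustrative 512 GB/s of die-to-die traffic (the bandwidth number is an assumption; the pJ/bit figures are the ones quoted above):

# Interconnect power = energy per bit * bit rate.
BW_GBS = 512                         # illustrative die-to-die bandwidth, GB/s
bits_per_s = BW_GBS * 1e9 * 8
for name, pj_per_bit in (("EMIB", 0.3), ("Infinity Fabric", 2.0)):
    watts = pj_per_bit * 1e-12 * bits_per_s
    print(f"{name}: {watts:.1f} W at {BW_GBS} GB/s")  # ~1.2 W vs ~8.2 W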

It doesn't solve the cache coherency problem, but I think it would probably be easier to take a narrow and fast memory like GDDR6X, relax the spec a bit to allow for slightly longer traces, and dual-port that. Stick your GDDR memory array in between the two compute dies and let both dies access it.
 
