NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Contingencies are fine but you don’t design your fundamental system architecture around foundry capacity. A sensible contingency is to tape out both a monolithic chip and chiplets. That way if your chiplets go bust you have a backup. However you wouldn’t choose to skip chiplets completely because of wafer supply. On the contrary that would be even more reason to embrace chiplets.

    So you don’t believe the rumors that Lovelace is on TSMC 5nm?

    If chiplets make too much business sense and Nvidia’s chiplet arch is ready then they would make chiplets. However you’re also saying that they would choose to not make chiplets which presumably would not make business sense. So which is it?

    If all the stars are aligned to bring chiplets to market it would be beyond silly to just sit on the tech and not bring it to market due to something as fleeting as wafer capacity. There are so many disadvantages to this including lost opportunities to refine the architecture based on real world experience and massive strategic risk to competitive advantage. So at the risk of repeating myself if we don’t see Nvidia gaming chiplets it’s because they decided it actually doesn’t make the best business sense or their tech simply isn’t ready.
     
  2. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    Is AMD really going through the L3 cache instead of L2? Doesn't sound very efficient.
    With chiplets the workload has to be scheduled evenly every time. Rewriting into and moving data around in the L3 cache is an efficiency killer. Nvidia hasn't even solved that problem in the HPC/DL market, and they have been working on automatic workload scheduling via a programming model for 10 years or so.

    With UE5 Nanite, BVH for hardware RT and Direct3D I/O, VRAM usage will only go up. So an L3 cache won't help AMD, which makes scheduling an even harder part of a chiplet design.
     
    PSman1700, DavidGraham and xpea like this.
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Of course, and that's not what I meant (i.e. not skipping chiplets completely, but going chiplet with only one product line so you have a fallback [not for tech, but for cash] just in case). When you don't have another horse in the race, you don't bet your company on a single technology you've never run a pipe cleaner for. Unless you're really, really desperate and it's either succeed or be bankrupt within another quarter anyway. That's why it makes all the sense in the world to go chiplets with one branch and, while you think you still can, go monolithic with the other one. That's what AMD did, and that's what Intel is kind of doing (in a way). They've all dipped their toes in MCM with HBM, though.
     
    PSman1700, Lightman, pharma and 2 others like this.
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Well yeah, by definition everything can't fit in L2 on RDNA 2 and a lot of traffic hits L3.

    Yeah, presumably scheduling traffic doesn't go through the cache hierarchy, so it will be interesting to see where the PCIe host link sits and how scheduling is managed across GCDs. I imagine there would be a dedicated IF link between dies for this traffic that doesn't go through the MCDs. The question is whether there will be some sort of host I/O scheduler die that controls multiple GCDs, or whether it will be set up like the old-school Crossfire days with one GCD serving as "master".
     
  5. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    And that doesn't look very efficient. Navi22 has 3MB of L2 and 96MB of L3 cache, and yet it loses to a 3070 with just 4MB of L2 cache in every workload. A 3070 without Tensor and RT cores would be about as big as Navi22.

    And that is the huge problem with a chiplet design: the additional overhead makes it less competitive when the competition can stay well within the limits of the process. So the successor of GA104 can be twice as efficient and easily beat AMD's chiplet competition.
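    To make the tradeoff concrete, here's a back-of-envelope sketch of how an extra cache level changes effective bandwidth. Every hit rate and bandwidth figure in it is a made-up illustrative number, not measured Navi22 or GA104 data:

```python
# Back-of-envelope model of effective memory bandwidth through a cache
# hierarchy. All numbers below are illustrative assumptions, not
# measured figures for any real GPU.

def effective_bandwidth(hit_rates, level_bandwidths):
    """Average bandwidth seen by a request stream.

    hit_rates[i]        -- fraction of the remaining traffic served by level i
    level_bandwidths[i] -- bandwidth of level i in GB/s
    The last level (VRAM) absorbs whatever misses everything above it.
    """
    remaining = 1.0
    bw = 0.0
    for hit, level_bw in zip(hit_rates, level_bandwidths[:-1]):
        bw += remaining * hit * level_bw
        remaining *= (1.0 - hit)
    bw += remaining * level_bandwidths[-1]
    return bw

# Hypothetical "big L3" design: modest L2, large L3 in front of slow VRAM.
big_l3 = effective_bandwidth([0.50, 0.60], [4000, 2000, 512])
# Hypothetical "L2 only" design: higher L2 hit rate, no L3, faster VRAM.
big_l2 = effective_bandwidth([0.65], [4000, 760])

print(f"big-L3 design:  {big_l3:.0f} GB/s effective")
print(f"L2-only design: {big_l2:.0f} GB/s effective")
```

    The point of the sketch is only that a big L3 is a bandwidth amplifier bought with extra area and an extra hop, not a free win over a design that serves more traffic from L2.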
     
    #145 troyan, Aug 1, 2021
    Last edited: Aug 1, 2021
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Of course chiplets aren't as efficient as monolithic dies. The entire point of chiplets is to raise peak performance beyond what is possible with a single die.
     
    DegustatoR likes this.
  7. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    I don’t see how chiplets are feasible with the extent of frame data reuse. How will latency and synchronization not kill performance and frametimes?
     
  8. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
    All chiplets share the same VRAM, where the previous frame lives? So no problem at all?
    I'm not that worried. A workgroup does not communicate with other workgroups, and of course a workgroup runs on only one chiplet. Synchronization work, like distributing workgroups over many chiplets and signaling when they are all done so the dispatch has finished, should cost very little work and bandwidth compared to the actual work itself? Also, cache synchronization is more relaxed on GPUs, if that were even a problem?
    I guess it will work pretty well, and the workarounds from earlier speculation don't seem necessary. I also see no hurdle for RT, as it's just reading memory with no communication or sync. Maybe atomics to VRAM become more expensive, but fancy visibility-buffer ideas will still keep their advantages, I guess.
    Though I know very little about HW details compared to other guys here.
     
  9. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,400
    Likes Received:
    1,845
    Location:
    France
    The VRAM will be unified, yes?

    Now, I guess multiple chips will introduce some latency here and there, but my guess is it will be negligible versus the raw power gain. And they've put a lot of work into this; there's a reason it wasn't done before (without duplicating VRAM and relying on an SLI/AFR solution).
     
    pharma likes this.
  10. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    But what if chiplet A needs information from chiplet B? How do you manage cache locality? How do you manage scheduling?
     
  11. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    ?
    It's just more SEs.
    The same way we do on A100.
    Kinda yuck but it can't be helped.
     
  12. Dictator

    Regular

    Joined:
    Feb 11, 2011
    Messages:
    681
    Likes Received:
    3,969
    I imagine just by being inefficient? AFR-style doubling? Paying for a bit of inefficiency for overall higher performance?
     
    DegustatoR, pharma and PSman1700 like this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    This isn’t like SLI where each GPU has its own memory and has to copy data to the other GPU. Both chiplets share the same VRAM and there is no data that belongs to only one chiplet.

    Think about what happens today when an SM or WGP writes new data like populating the g-buffer. How do the other SMs and WGPs get access to the newly written data?
     
    Silent_Buddha and Lightman like this.
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    SEs use GDS.
     
  15. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Well not anymore.
    MI200 onwards RIP for that rudiment.
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    GDS is an application-controlled scratchpad for shared data within a kernel. It doesn't solve the general problem of keeping data coherent across caches on the chip.

    A100 actually is an interesting case study, as it essentially has 2 independent L2 caches on a single die. So there is some coherence mechanism keeping them in sync, with the benefit of really fast on-chip links. Chiplets will have to solve the same problem, just using slower on-package communication.
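    As a toy illustration of what such a coherence mechanism has to do, here's a minimal write-invalidate sketch with two cache partitions in front of one shared backing store. It's purely illustrative; real GPU protocols are far more involved than this:

```python
# Toy write-invalidate coherence between two cache partitions sharing
# one backing store (think: two L2 halves on one die, or two chiplets
# over a package link). Illustrative only.

class L2Partition:
    def __init__(self, name, backing):
        self.name = name
        self.backing = backing     # shared "VRAM" dict
        self.peers = []            # other partitions to invalidate on write
        self.lines = {}            # addr -> cached value

    def read(self, addr):
        if addr not in self.lines:                 # miss: fill from VRAM
            self.lines[addr] = self.backing.get(addr)
        return self.lines[addr]

    def write(self, addr, value):
        for peer in self.peers:                    # invalidate stale copies
            peer.lines.pop(addr, None)
        self.lines[addr] = value
        self.backing[addr] = value                 # write-through, for simplicity

vram = {}
a = L2Partition("L2-A", vram)
b = L2Partition("L2-B", vram)
a.peers.append(b)
b.peers.append(a)

a.write(0x100, "gbuffer-v1")
assert b.read(0x100) == "gbuffer-v1"   # B pulls the fresh data from VRAM
b.write(0x100, "gbuffer-v2")           # B's write invalidates A's copy
assert a.read(0x100) == "gbuffer-v2"   # A re-fetches instead of using a stale line
```

    The cost difference between the on-die and on-package case is exactly those invalidate/fill messages crossing a slower link.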
     
  17. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    798
    Likes Received:
    1,624
    The graphics pipeline does a lot of work reshuffling across GPC partitions: geometry redistribution for workload balancing, screen-space partitioning for better locality, etc.
    As of now, the large crossbar solves all of these distribution issues. If that weren't the case, everyone would simply partition this complex chip-level network into smaller, more efficient and manageable pieces like NVIDIA did in GA100 (luckily, partitioning is OK for compute), which is already essentially 2 GPUs on a single die.
    When we break the crossbar into 2 pieces, there will be obvious and expected efficiency losses for workload-balancing operations, especially in graphics, since work will never be distributed perfectly evenly across 2 chips.
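    A made-up example of that balancing problem: per-tile cost varies wildly across the screen, so a coarse left/right split between two chips leaves one of them idle, while finer-grained interleaved distribution (roughly what a full crossbar enables) balances much better. The tile costs are invented for illustration:

```python
# Illustration of why a coarse screen split across two chips rarely
# balances: shading cost is not uniform across the screen.
# Tile costs below are made up for the example.

def imbalance(cost_a, cost_b):
    """Fraction of the busier chip's frame time the other chip sits idle
    (0.0 = perfect balance)."""
    return abs(cost_a - cost_b) / max(cost_a, cost_b)

# One row of 8 screen tiles; the expensive content sits on the left.
tile_costs = [9, 8, 7, 3, 2, 2, 1, 1]

# Naive left/right split: chip 0 gets the whole expensive half.
left, right = sum(tile_costs[:4]), sum(tile_costs[4:])

# Interleaved (checkerboard-style) split: tiles alternate between chips.
even, odd = sum(tile_costs[0::2]), sum(tile_costs[1::2])

print(f"left/right:  {left} vs {right}, imbalance {imbalance(left, right):.2f}")
print(f"interleaved: {even} vs {odd}, imbalance {imbalance(even, odd):.2f}")
```

    The catch, of course, is that the finer the interleaving, the more redistribution traffic has to cross the inter-chip link, which is the efficiency loss being discussed.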
     
    BRiT and DegustatoR like this.
  18. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    Inefficiency wasn't the only problem, frametimes were hugely impacted.
     
    CarstenS likes this.
  19. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    It's not even particularly slower since SoIC is hella fancy.
    Still has funny downsides, too.
    Either way it's just ripping the bandaids off.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Frametimes were inconsistent with AFR because the GPUs were working on different frames in parallel, and it was difficult to ensure that finished frames were delivered to the monitor at consistent intervals. This is not a problem for chiplets, as all of the chiplets will cooperate on rendering a single frame at a time and therefore deliver consistent performance frame to frame.
     