NVidia Ada Speculation, Rumours and Discussion

Frametimes were inconsistent using AFR because the different GPUs were working on different frames in parallel, and it was difficult to ensure frames were delivered to the monitor at consistent intervals. This is not a problem for chiplets, as all of the chiplets will be cooperating on a single frame at a time and will therefore deliver consistent performance frame to frame.
I suppose I’m skeptical that a multi chiplet GPU will work consistently across the breadth of games without developers having to account for it. It’s quite a complex problem to solve.
 
Well not anymore.
MI200 onwards RIP for that rudiment.
Then your first reply ("it's just more SEs") is basically worthless. So thanks for that.
edit: Oh wait, did you just say MI200? Wrong thread maybe, again?

Then they would also have to rewrite Shader Exports (vGPR -> Prim/RB/GDS) and some atomics to be done elsewhere.
 
I suppose I’m skeptical that a multi chiplet GPU will work consistently across the breadth of games without developers having to account for it. It’s quite a complex problem to solve.

From a software perspective it’s still one big GPU. The game engine or graphics api won’t know that there are multiple dies just like games don’t know or care that there are multiple GPCs or SMs today. Yes it’s a complex problem but it’s a hardware problem. So really the main question is performance. And of course power consumption.
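FWIW that's already how it looks from, say, Vulkan today: the whole card enumerates as a single VkPhysicalDevice and nothing in the API tells you how many GPCs, SMs or dies sit behind it (explicit multi-GPU/SLI setups were the exception, showing up as separate devices). A minimal sketch using only standard Vulkan calls:
Code:
// Minimal sketch: a chiplet-based card would still enumerate as one
// VkPhysicalDevice; die/GPC/SM topology simply isn't part of the API.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkInstanceCreateInfo ici{};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> gpus(count);
    vkEnumeratePhysicalDevices(instance, &count, gpus.data());

    for (VkPhysicalDevice gpu : gpus) {
        VkPhysicalDeviceProperties props{};
        vkGetPhysicalDeviceProperties(gpu, &props);
        std::printf("%s\n", props.deviceName);  // one line per adapter, nothing about dies
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}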
 
I suppose I’m skeptical that a multi chiplet GPU will work consistently across the breadth of games without developers having to account for it. It’s quite a complex problem to solve.

And that's why RDNA3 being that "on paper" is a big deal :D

On paper PowerVR B series can do that too if I believe the slides.
 
Yes it’s a complex problem but it’s a hardware problem.
It's not just a HW problem. With the current API model, all draw calls must be executed in order. Current single-die GPUs already have a hard time keeping execution units busy during geometry processing, and that's why all that work distribution in HW is needed (besides work-expansion handling); when moving to multi-die it can only get worse, and probably by a lot.
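To make the ordering point concrete, here's a minimal sketch (plain Vulkan, nothing vendor-specific): with blending on, the second draw has to composite over the first, so however the hardware spreads the work across SEs/GPCs (or dies), it still has to merge results back in submission order.
Code:
// Sketch of the ordering constraint: with blending enabled, draw B must
// composite over draw A's pixels, so whatever distributes these draws across
// SEs/GPCs/dies still has to merge results back in submission order.
#include <vulkan/vulkan.h>

void record_two_blended_draws(VkCommandBuffer cmd, VkPipeline blendedPipeline) {
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, blendedPipeline);
    vkCmdDraw(cmd, 3, 1, /*firstVertex=*/0, 0);  // draw A
    vkCmdDraw(cmd, 3, 1, /*firstVertex=*/3, 0);  // draw B: must appear on top of A
}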
 
It's not just a HW problem. With the current API model, all draw calls must be executed in order. Current single-die GPUs already have a hard time keeping execution units busy during geometry processing, and that's why all that work distribution in HW is needed (besides work-expansion handling); when moving to multi-die it can only get worse, and probably by a lot.
Let's say global scheduling fails to distribute the draw-call work over 2 chiplets, and chiplet 1 finishes earlier. (This could be minimized by using more fine-grained queues on both chiplets, and more frequent pulling of smaller pieces of work from the top-level queue.)
This would be no problem if we have async work around, and the more such async work we have, the better it can fill the gap.
Also, even if it's impossible to guarantee all draws / dispatches finish at the exact same time, we are still almost twice as fast due to having twice the power.
But we may want to give up on things like guaranteed draw order of each individual triangle, which we rarely need and already have extensions to turn off.
Am I missing something? (I'm just bad with the graphics pipeline)

I imagine it like a job system, where work is split into small chunks and multiple threads pull the next chunk once they're done, with negligible overhead. We can easily do multi-threaded rendering this way on the CPU.
We also divide the framebuffer into tiles to avoid hazards, which is no problem either, and dGPUs already have that 'binned rasterizer' stuff which relates to tiles? (Triangle order can be preserved this way too, like mobiles do with the same APIs.)
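Something like this toy sketch is what I have in mind (chiplets played by CPU threads, made-up tile counts, stubbed per-tile work; it only shows the scheduling shape, not how a real GPU front-end works):
Code:
// Toy model of the tile/job-queue idea: "chiplets" (CPU threads here) pull the
// next tile from a shared counter as soon as they finish the previous one.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr uint32_t kTilesX = 30, kTilesY = 17;            // e.g. ~4K split into 128x128 tiles

void shade_tile(uint32_t tile) { (void)tile; }            // stand-in for the real work

void render_frame(unsigned numChiplets) {
    std::atomic<uint32_t> nextTile{0};                    // the shared top-level queue
    std::vector<std::thread> chiplets;
    for (unsigned c = 0; c < numChiplets; ++c)
        chiplets.emplace_back([&nextTile] {
            // A fast chiplet never idles waiting for a slow one; it just pulls more tiles.
            for (uint32_t t; (t = nextTile.fetch_add(1)) < kTilesX * kTilesY; )
                shade_tile(t);
        });
    for (auto& t : chiplets) t.join();                    // one completion point per frame
}

int main() { render_frame(2); }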
 
No, it's literally that. Work distribution is neither a novel problem, nor are modern GPUs small from a big-macroblock POV.
Literally, it's GPCs, not SEs. And work distribution: Yeah, tell me more Cpt. Obvious.
NV does 7 GPCs for GA102 and it's only ever going to bloat.
A100 has already been at 8 before, 4 of which are talking to a 20 MByte L2 partition each.

If we're talking MCM, it's not an intra-chip problem anymore, so the question is a real one: Will there be an additional synchronization stage inside the GPC/SM chiplet? How much data will have to be synchronized between chiplets? For graphics, there's a reason why ROPs for quite some time are part of the GPCs and RBs have been moved into the SE as well. And maybe this can be discussed with more than half-sentence one-liners.
 
Literally, it's GPCs, not SEs
Those things serve the same purpose.
4 of which are talking to a 20 MByte L2 partition each.
Actually it's a bit more funky than that.
Consult the Citadel paper if it's out.
it's not an intra-chip problem anymore
It more or less is, given the packaging involved.
It's a very straightforward solution, if a tad on the complex/expensive side.
why ROPs for quite some time are part of the GPCs
Wut.
They've moved in there with Ampere.
AMD moved them one step up the chain with Navi, even.
And maybe this can be discussed with more than half-sentence one-liners.
That would be less cute and funny and you wouldn't be able to apply for a capeshit writer position thus no.
 
Maybe someone is missing the fact that inter-die communication in a 3D-stacked configuration can easily go up to some TB/s. With V-Cache on Zen 3, they added 64 MBytes of cache at basically zero penalty. I expect the GPU side to have some more latency, but bandwidth (for graphics or other data) to be a non-issue.
 
And a 3090 has nearly 1 TB/s of off-chip bandwidth. Bandwidth is not the problem. Having multiple chips means synchronisation work between them. Every memory movement will cost efficiency.
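(Spelling out the "nearly 1 TB/s": 384-bit GDDR6X at 19.5 Gbit/s per pin on the 3090.)
Code:
// The "nearly 1 TB/s" figure, spelled out.
constexpr double bus_bits   = 384.0;                      // 3090 memory bus width
constexpr double gbit_pin   = 19.5;                       // GDDR6X data rate per pin
constexpr double gbytes_sec = bus_bits * gbit_pin / 8.0;  // = 936 GB/s
static_assert(gbytes_sec == 936.0, "936 GB/s");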
 
It's not just a HW problem. With the current API model, all draw calls must be executed in order. Current single-die GPUs already have a hard time keeping execution units busy during geometry processing, and that's why all that work distribution in HW is needed (besides work-expansion handling); when moving to multi-die it can only get worse, and probably by a lot.

Sure, but that's a separate problem which, as you said, also applies to existing single-chip solutions. Yes, it will get worse on chiplets, but it will get worse on wider monolithic chips too.
 
And a 3090 has nearly 1 TB/s of off-chip bandwidth. Bandwidth is not the problem. Having multiple chips means synchronisation work between them. Every memory movement will cost efficiency.

Lol. If the issue is synchronization between SEs/GCDs, it would be quite trivial to put a dedicated synchronization channel in the 3D stack, as a connection through the stack is almost the same as on-chip (and THAT is one of the reasons 3D stacking by means of TSVs is very important).
 
I think it's fair to expect MCM GPUs not to scale as well as a monolithic GPU would, and there will certainly be cases, older games especially, where they may not scale at all.
Thus saying that it's a purely h/w problem is a bit disingenuous since, while it may be somewhat helped with h/w, the more obvious option is to target such an architecture with s/w.
So while the general direction of all HP designs going into the multi-chip era is forced by the end of physical scaling, it's hardly clear-cut that the first MCM GPU attempts will actually fare better than the last big single-die ones.
 
Lol. If the issue is synchronization between SEs/GCDs, it would be quite trivial to put a dedicated synchronization channel in the 3D stack, as a connection through the stack is almost the same as on-chip (and THAT is one of the reasons 3D stacking by means of TSVs is very important).

How is the connection through the stack identical to an L2 cache connection? Were that the case, an L2 cache would be unnecessary.
 
I think it's fair to expect MCM GPUs not to scale as well as a monolithic GPU would, and there will certainly be cases, older games especially, where they may not scale at all.
Thus saying that it's a purely h/w problem is a bit disingenuous since, while it may be somewhat helped with h/w, the more obvious option is to target such an architecture with s/w.
So while the general direction of all HP designs going into the multi-chip era is forced by the end of physical scaling, it's hardly clear-cut that the first MCM GPU attempts will actually fare better than the last big single-die ones.

There's no need to explicitly expose the multi-chip nature of future GPUs to the application. The workload is so massively parallelizable that it would be a failure of hardware and software design if developers had to concern themselves with how many chiplets are on a card. Improvements to APIs to remove unnecessary serialization will help, but as mentioned above this is not a problem unique to chiplet solutions.

Why would older games not scale? They're still spitting out lots of pixels that are being written to a shared framebuffer. None of that changes with chiplets.
 
How is the connection through the stack identical to an L2 cache connection? Were that the case, an L2 cache would be unnecessary.

Where did I mention the L2 cache? I said that in a 3D stack you can easily add not only the connection from the compute die to the cache die but also synchronization signals that can propagate through the entire stack, because the 3D stack is basically equivalent to a very big IC with some latency penalties at the interconnection points. I wonder why I even try to explain trivial concepts to someone who thinks a +30% gain in perf/W on the same process node is bad. It's clear that either 1) you are not understanding what a 3D stack is, and thus not understanding that every single signal can be passed through the TSVs and not only memory interface ones, or 2) you are trying to spread FUD. As it's quite evident from your post history that the second is the more probable one, let me put you on ignore so my eyes aren't hurt by your FUD spreading.
 
This would be no problem if we have async work around, and the more such async work we have, the better it can fill the gap.
That's not a solution. AMD moved in exactly the opposite direction with RDNA for a reason: it had failed with GCN at finding async workloads on PC. Straightforward ports from consoles simply didn't have enough work to fill PC GPUs.
GCN was bad at scaling beyond 4096 ALUs and there is nothing magic in RDNA.
Feeding two ~8000-ALU chips would require having at least 32,000 work items in flight per cycle (assuming 2 wavefronts are enough to cover at least instruction latencies and there is perfect prefetching to cover memory stalls).
Now, I can imagine there will be plenty of draws in modern games with vertex counts way below 32,000 (that was the whole idea behind DX12 after all), which would require tons of bookkeeping and work balancing in order to keep executing them in parallel with other draw calls (and there will unavoidably be serialization points since all draws must finish in a serial order).
Feeding such a machine would be hard, distributing work across chips won't be easy, and finding async work is not easy either; these are all unsolved problems.
If you had ever dealt with performance modelling, you would intuitively know that such a config would have crappy scaling in graphics workloads (way below linear), but let's hope I am wrong and all the problems will be magically mitigated (even though that's not the case today and everything points to things only getting worse with inter-chip work distribution).
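(Spelling out where the 32,000 figure comes from, using only the assumptions above:)
Code:
// Where the 32,000 figure comes from, given the stated assumptions.
constexpr int chips           = 2;
constexpr int alus_per_chip   = 8000;   // ~8k ALUs per die
constexpr int waves_to_hide   = 2;      // 2 wavefronts assumed enough to hide latency
constexpr int items_in_flight = chips * alus_per_chip * waves_to_hide;
static_assert(items_in_flight == 32000, "32,000 work items in flight");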

But we may want to give up on things like guaranteed draw order of each individual triangle, which we rarely need and already have extensions to turn off.
This would require either new APIs, completely compute-based pipelines, or a mixed software-hardware TBDR-like approach (which has tons of other drawbacks).
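For reference, the existing opt-out being alluded to is presumably something like VK_AMD_rasterization_order, which only relaxes per-primitive ordering within the current pipeline model rather than changing the draw-level model:
Code:
// VK_AMD_rasterization_order: lets a pipeline declare that primitive order
// doesn't need to be preserved; the draw-level ordering model stays as-is.
#include <vulkan/vulkan.h>

const VkPipelineRasterizationStateRasterizationOrderAMD relaxedOrder{
    VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_RASTERIZATION_ORDER_AMD,
    nullptr,                              // chained into VkPipelineRasterizationStateCreateInfo::pNext
    VK_RASTERIZATION_ORDER_RELAXED_AMD    // vs. VK_RASTERIZATION_ORDER_STRICT_AMD (the default)
};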
 
NVIDIA Lovelace vs AMD RDNA 3, what has not been told about their GPUs (techunwrapped.com)

There is another rumor that speaks of a change in how NVIDIA organizes its GPUs in the next generation, where the minimum unit will be the SM and the sub-cores will disappear, so the SM will have a single general scheduler instead of one per sub-core; in that respect it will look much more like AMD's architecture, where the lowest-level cache is shared equally by the whole SM.

The next point is the TPC; this does not undergo changes, except that this time it groups 3 SMs instead of 2. But it is with the appearance of the Cluster Processor Core (CPC) that things get interesting. Each CPC would have 3 TPCs inside it, and therefore we are talking about 18 SMs per GPC, or 6 per CPC. The particularity of the CPCs? Apparently each one is assigned a new L1 data and instruction cache. The number of CPCs per GPC? According to the rumors there would be three in total, but the number of CPCs per GPC could be as variable as the number of TPCs is today; we do not know this last detail.

But the article speculates that Lovelace has been confused with Hopper, and then seemingly goes wild: "Hence, we think that the configuration of 144 Shader Units could correspond to Hopper and not Lovelace, since it is said that Lovelace will be a multi-chip GPU."

Separately, there's noise in the Twitterverse that Lovelace will have much higher clocks, similar to RDNA 2/3. That's one way of using fewer SMs, which is the central bone of contention the article focuses on.
 