Nvidia's 3000 Series RTX GPU [3090s with different memory capacity]

Do you know of a chiplet-based gaming GPU available right now? This isn't a proven thing; it's very possible that there will be issues and performance lost to the design.
Well, you could find an old enough XB360 used, I suppose :D
But that wasn't the point really; both were in for something new again, as Hopper and Ada are too close on schedule to really gain any experience from the former to build the latter better.
 
Well, you could find an old enough XB360 used, I suppose :D
But that wasn't the point really; both were in for something new again, as Hopper and Ada are too close on schedule to really gain any experience from the former to build the latter better.
Why would Lovelace need anything from Hopper prior to Hopper's launch? It's not like the two are being made by different people or companies.
 
Why would Lovelace need anything from Hopper prior to Hopper's launch? It's not like the two are being made by different people or companies.
Ask trinibwoy, he's the one who said "Nvidia already knows how to make a massive 5nm die", which I was answering.
Edit: they're quite surely built by different people, though; it's unlikely the same team would be designing two separate architectures at once.
 
But was the very first Voodoo a chiplet?
Was it on the same substrate? No.
Voodoo had multi-chip packages; still a nice joke though :D
I would consider Xenos to be a chiplet GPU - it can't function without the EDRAM die where the ROP logic resides, and both dies are integrated on the same substrate, so it looks like chiplets to me.
 
Well, you could find an old enough XB360 used, I suppose :D
But that wasn't the point really; both were in for something new again, as Hopper and Ada are too close on schedule to really gain any experience from the former to build the latter better.

Hopper's tapeout was >6 months prior to Lovelace's. That is enough time for nVidia to optimize the process.

And why do people believe that AMD will come out first? They will refresh their lineup next week...
 
Not sure what you mean by that or how it would even be relevant
-"knows how to build massive 5nm die" : any experiences gained from Hopper are too late to have real effect on Ada, they're schedules are far too close for that

Meaning Hopper is proof they know how to make a big honking 5nm chip, so manufacturing shouldn't be the reason Ada would miss the usual Q3 launch window.

And to nitpick, they even have experience building "chiplet GPUs" thanks to the XB360, even though it was as simple as it gets, with just the memory stuff on another die.

Does AMD have experience writing software for a multi-chiplet GPU? The MI250X is still two GPUs on a stick.
 
Meaning Hopper is proof they know how to make a big honking 5nm chip, so manufacturing shouldn't be the reason Ada would miss the usual Q3 launch window.
Ah, I misunderstood it then.

Does AMD have experience writing software for a multi-chiplet GPU? The MI250X is still two GPUs on a stick.
For XB360, yes. Not the exact same thing, but still counts.
 
The Mobility Radeon had MCM packaging back in 2001. I don't know if this goes back even further.
http://hw-museum.cz/vga/371/ati-mobility-radeon

Nvidia also had similar space-saving designs.
http://hw-museum.cz/vga/366/nvidia-geforce-go-6600-npb


Maybe chiplets aren't so novel.
MCM, yeah, but it's not relevant in this context - memories have always been separate; in those cases they just routed the connections to memory, which would normally go through the PCB, through the packaging instead. Kaby Lake-G did this recently too, just like every other HBM GPU.
 
Is there an example of processing being load balanced and synchronized over multiple chips?
At the driver level, that's old news too - tiling (static) and SFR (dynamic), that is.
I can't remember an example of such work balancing being done at the hw level, transparently to the driver, which is supposedly what the chiplet architecture is.
My thought is that there have to be some SW-level optimizations anyway to keep data local to chiplets.
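Roughly what that driver-level SFR looks like as host code - a toy CUDA sketch, all kernel and buffer names made up; GPU 0 takes the top of the frame, GPU 1 the bottom, and a real driver would move the split line each frame based on per-GPU timings:

Code:
#include <cuda_runtime.h>

// Placeholder "shading" kernel; a real driver would be replaying the app's work.
__global__ void shadePixels(uchar4* out, int width, int yStart, int yEnd)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = yStart + blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= yEnd) return;
    out[y * width + x] = make_uchar4(x & 255, y & 255, 0, 255); // stand-in shading
}

// fb0/fb1 are framebuffer copies resident on device 0 and device 1 respectively.
void renderFrameSFR(uchar4* fb0, uchar4* fb1, int width, int height, int split)
{
    dim3 block(16, 16);
    cudaSetDevice(0);                                   // top slice on GPU 0
    dim3 grid0((width + 15) / 16, (split + 15) / 16);
    shadePixels<<<grid0, block>>>(fb0, width, 0, split);
    cudaSetDevice(1);                                   // bottom slice on GPU 1
    dim3 grid1((width + 15) / 16, (height - split + 15) / 16);
    shadePixels<<<grid1, block>>>(fb1, width, split, height);
    // "Dynamic" SFR = adjust 'split' next frame from each GPU's measured time.
}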
 
At the driver level, that's old news too - tiling (static) and SFR (dynamic), that is.
I can't remember an example of such work balancing being done at the hw level, transparently to the driver, which is supposedly what the chiplet architecture is.
My thought is that there have to be some SW-level optimizations anyway to keep data local to chiplets.

Ironically it was easier to do so in the olden days (e.g. Voodoo 2 SLI).
Today, with deferred rendering, render targets, screen space reflections, etc., it's much harder to load balance between multiple chips, even with a reasonable amount of bandwidth between them.
In a way, ray tracing might be able to help, because ray tracing is reasonably easy to parallelize (you can distribute the data structure over multiple chips and let them calculate different random dots on the screen) and many render target tricks can be replaced with ray tracing (shadows, reflections, etc.), but since HW ray tracing is still in its infancy, it's probably going to be years before that happens.
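To give an idea of why, here's a toy kernel (names invented, BVH elided) - each chip/device traces every Nth pixel, and since no pixel depends on another, the devices never have to synchronize mid-frame:

Code:
#include <cuda_runtime.h>

// Launched once per device with that device's id; 'out' would be a per-device
// slice or a unified buffer, and each device holds a copy of the BVH.
__global__ void traceInterleaved(float4* out, int numPixels,
                                 int deviceId, int deviceCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pixel = i * deviceCount + deviceId;   // strided pixel assignment
    if (pixel >= numPixels) return;
    // traceRay(bvh, pixel) would go here; any pixel can be shaded independently.
    out[pixel] = make_float4(0.f, 0.f, 0.f, 1.f); // placeholder result
}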
 
Today, with deferred rendering, render targets, screen space reflections, etc., it's much harder to load balance between multiple chips, even with a reasonable amount of bandwidth between them.
I guess most of the difficulty comes down to keeping local copies of frame resources in each per-GPU memory pool with AFR.
This might be fixed somewhat by moving to Split Frame Rendering or Tiled Checkerboard rendering, though these suffer from duplicated geometry processing.
The chiplet architecture should have a unified memory pool (and hopefully not NUMA). I wonder how efficient it will be with geometry processing, though.
The traditional DX11/DX12-style pipeline with tessellation has a number of work distribution and balancing steps, which should somehow be handled on the chiplet architecture without duplicating the work.
The challenge with the chiplet architecture is mostly in the geometry processing space; pixel processing should be much easier and more uniform (and can be handled with SFR or tiled rendering in the same way on the chiplet architecture).
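To illustrate that duplicated geometry processing, a toy sketch (tile size and helper names are invented): with a checkerboard split, any triangle whose screen-space bounds touch tiles owned by more than one chip has to be processed by all of them.

Code:
#include <cstdint>

const int TILE = 32; // tile size in pixels, arbitrary

// Checkerboard ownership: neighbouring tiles land on different chips.
int tileOwner(int tileX, int tileY, int numChips)
{
    return (tileX + tileY) % numChips;
}

// Bitmask of chips that must run geometry work for a triangle's screen bounds.
uint32_t chipsTouchedBy(int minX, int minY, int maxX, int maxY, int numChips)
{
    uint32_t mask = 0;
    for (int ty = minY / TILE; ty <= maxY / TILE; ++ty)
        for (int tx = minX / TILE; tx <= maxX / TILE; ++tx)
            mask |= 1u << tileOwner(tx, ty, numChips);
    return mask; // more than one bit set = duplicated geometry work
}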

In a way, ray tracing might be able to help, because ray tracing is reasonably easy to parallelize (you can distribute the data structure over multiple chips and let them calculate different random dots on the screen) and many render target tricks can be replaced with ray tracing (shadows, reflections, etc.), but since HW ray tracing is still in its infancy, it's probably going to be years before that happens.
Agreed, RT is an interesting direction; there are no ordering requirements with a BVH, so geometry processing can be easily distributed across many chips without super-high bandwidth requirements.
 
I guess most of the difficulty comes down to keeping local copies of frame resources in each per-GPU memory pool with AFR.
This might be fixed somewhat by moving to Split Frame Rendering or Tiled Checkerboard rendering, though these suffer from duplicated geometry processing.
The chiplet architecture should have a unified memory pool (and hopefully not NUMA). I wonder how efficient it will be with geometry processing, though.
The traditional DX11/DX12-style pipeline with tessellation has a number of work distribution and balancing steps, which should somehow be handled on the chiplet architecture without duplicating the work.
The challenge with the chiplet architecture is mostly in the geometry processing space; pixel processing should be much easier and more uniform (and can be handled with SFR or tiled rendering in the same way on the chiplet architecture).

I’m probably wrong, but I’ve always assumed chiplet GPU architectures would require a central work distributor, either on a separate die or on one of the compute dies. Something has to manage the incoming instruction stream from the CPU. Compute APIs like CUDA natively support multiple devices, and work distribution happens on the CPU, but graphics APIs have no such concept. Even in the case of DXR kernels, the work distribution has to happen on the GPU. It can theoretically be handled by the driver, but that would likely be very chatty and very slow.
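For comparison, this is what "work distribution happens on the CPU" means in CUDA - a minimal sketch (the kernel is a stand-in): the host explicitly picks a device per chunk of work, and graphics APIs simply have no equivalent of this.

Code:
#include <cuda_runtime.h>

__global__ void processChunk(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // stand-in for real work
}

// deviceBuffers[d] is assumed to be allocated on device d beforehand.
void launchAcrossDevices(float* const* deviceBuffers, int chunkSize)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);        // the CPU, not the GPU, decides who runs what
        int threads = 256;
        int blocks = (chunkSize + threads - 1) / threads;
        processChunk<<<blocks, threads>>>(deviceBuffers[d], chunkSize);
    }
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize(); // wait for all devices to finish
    }
}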

A reasonable setup would be 3 dies: I/O, video, the graphics and compute queues, the memory controllers, and L2 on one die, with 2 compute dies hosting the shader engines / GPCs. The big question is L2 cache latency if that traffic goes off-chip.
 
but graphics APIs have no such concept. Even in the case of DXR kernels, the work distribution has to happen on the GPU. It can theoretically be handled by the driver, but that would likely be very chatty and very slow.
That's not the only issue with graphics APIs; another one is the strict ordering requirements: objects must be processed in the order they were submitted, which would limit parallelism significantly without synchronizing the work between chiplets.
There is little doubt the front end must be reworked to account for chiplets. I guess the new front end must be very smart to distribute work across multiple chiplets in an efficient manner -- it will have to batch more draw calls and track more state than ever to keep utilization high.
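A toy model of what such a front end has to juggle (host-side sketch, all types invented): batches can go round-robin to chiplets to keep them busy, but each one needs a sequence number so the back end can still retire results in submission order.

Code:
#include <cstdint>
#include <vector>

struct DrawBatch {
    uint64_t firstSeq;  // sequence number of the first draw in the batch
    int      chiplet;   // which chiplet executes it
    // ...captured state, draw parameters...
};

// Hand out fixed-size batches of draws round-robin across chiplets.
// Execution can overlap freely; retirement must follow firstSeq order.
std::vector<DrawBatch> distributeDraws(uint64_t drawCount, uint64_t batchSize,
                                       int numChiplets)
{
    std::vector<DrawBatch> batches;
    for (uint64_t seq = 0; seq < drawCount; seq += batchSize)
        batches.push_back({ seq, int((seq / batchSize) % numChiplets) });
    return batches;
}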
 