AMD: RDNA 3 Speculation, Rumours and Discussion

It's suggesting a separate L3 on each compute chiplet that is accessible from the other chiplets, which makes me think this was done before the big LLC was decided on: that way you can have a separate giant LLC while making each compute chiplet smaller.

It's also got a memory bus on each compute chiplet like RDNA1 has, whereas I'd assume they'd stick closer to RDNA2 and have a more unified memory bus like Zen.


Each chiplet seems to be able to function in a fully autonomous way. They all have their own memory PHY, set of fixed function blocks (probably video codecs?), and they communicate with each other through the L3.

To me this looks a bit like first-gen Zen but in a GPU. It also means there could be a lot of wasted space, since a bunch of replicated fixed-function blocks would sit useless on all chiplets but one. Though if they could get the video codecs to work in parallel it would be awesome (especially if more chiplets = higher-resolution or higher-framerate encoding).
 
Won't you waste space with MCM solutions? I mean, you still need the transistors for compute functions etc., but now you also need more external buses so the multiple parts of the chip can communicate with each other? I'm sure I'm missing something here. Or is it "just" about yield?
 
Won't you waste space with MCM solutions? I mean, you still need the transistors for compute functions etc., but now you also need more external buses so the multiple parts of the chip can communicate with each other? I'm sure I'm missing something here. Or is it "just" about yield?
In the end, with Ryzen, it was about going way past the conventional reticle limit in terms of transistor count. It also helped yield.

HBX will cost power, and power is going to be a pain point for huge transistor counts. Look at Threadripper 3990X versus any Ryzen.

One thing that's not clear is the effect of Infinity Cache on power usage. We don't have a good "performance equivalent" estimator and probably never will. Some of the huge transistor count that chiplets allow for might simply go to increasing the size of Infinity Cache.
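As a back-of-the-envelope illustration (all numbers made up for the sake of argument, not AMD's): a big on-die cache raises effective bandwidth roughly in proportion to its hit rate, and every hit also avoids the power cost of going off-package to DRAM.

```cpp
#include <cstdio>

// Toy model of effective bandwidth with a large last-level cache.
// The figures are assumptions for illustration only: 512 GB/s for a
// 256-bit GDDR6 bus at 16 Gbps, and 2000 GB/s of aggregate on-die
// cache bandwidth.
int main() {
    const double dram_bw  = 512.0;   // GB/s to external memory
    const double cache_bw = 2000.0;  // GB/s out of the on-die cache
    for (int h = 0; h <= 80; h += 20) {
        const double hit = h / 100.0;
        const double effective = hit * cache_bw + (1.0 - hit) * dram_bw;
        std::printf("hit rate %2d%% -> ~%4.0f GB/s effective\n", h, effective);
    }
    // Power follows the same shape: an on-die access costs a few pJ/bit,
    // an off-package DRAM access costs tens of pJ/bit, so every point of
    // hit rate saves I/O power as well as adding bandwidth.
    return 0;
}
```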
 
Won't you waste space with MCM solutions? I mean, you still need the transistors for compute functions etc., but now you also need more external buses so the multiple parts of the chip can communicate with each other? I'm sure I'm missing something here. Or is it "just" about yield?
Not just yield but also versatility on using the same chip for a multitude of SKUs and performance ranges, the economy of scale achieved by it, and the money saved by developing only one chip.
 
COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)

This has been floating around for a while. I can't work out from this whether there's any meaningful gain possible in a physical device.

It feels like I've been talking about this for 10 years:

AMD: R7xx Speculation

Conditional Routing was floating around back in 2004. Honestly, I don't feel motivated to compare this document and the Conditional Routing paper. I'll pay more attention when I see it in hardware.
 
Maybe more recent developments in rendering have made such an investment in dynamic/analytical re-sorting more feasible to explore and worth spending die space on?
 
Ray tracing you mean? :LOL:

Yes I would tend to agree.

Ironically, DXR 1.0 was set up to give the hardware/driver an opportunity to sort threads before execution. Inline raytracing with DXR 1.1 doesn't offer that luxury.

DirectX Raytracing (DXR) Functional Spec

The scheduling portions of execution are hard-wired, or at least implemented in an opaque way that can be customized for the hardware. This would typically employ strategies like sorting work to maximize coherence across threads.
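To make that concrete, here's a hypothetical sketch (all names invented; the real mechanism is opaque and vendor-specific) of the kind of reordering the spec leaves room for: buffer the hit records produced by traversal, then group them by the shader they invoke before packing waves. With DXR 1.1 inline ray queries the calling shader consumes the hit immediately, so there is no buffered record stream left for the driver to reorder.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical hit record: which closest-hit shader a ray selected,
// plus enough state to resume that ray later.
struct HitRecord {
    uint32_t shaderId;  // shader table index chosen by this hit
    uint32_t rayIndex;  // identifies the originating ray/pixel
};

// Group buffered hits by shader so that a wave built from a contiguous
// slice of the vector executes one shader with minimal divergence.
void sortForCoherence(std::vector<HitRecord>& records) {
    std::stable_sort(records.begin(), records.end(),
                     [](const HitRecord& a, const HitRecord& b) {
                         return a.shaderId < b.shaderId;
                     });
}
```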
 
Not sure how ironic this is. If this sorting to limit divergence can be applied to any task (not just RT), it would be really interesting in general.
It might also be a way to increase RT performance without fixed-function traversal units, and with no hardware dependency on the classical RT algorithm - still flexible.

But I'm just dreaming... my method of reading patents has become 'just look at the images and their descriptions, but ignore the text' :)
 
Not sure how ironic this is. If this sorting to limit divergence can be applied to any task (not just RT), it would be really interesting in general.
It might also be a way to increase RT performance without fixed-function traversal units, and with no hardware dependency on the classical RT algorithm - still flexible.

But I'm just dreaming... my method of reading patents has become 'just look at the images and their descriptions, but ignore the text' :)

The patent isn't ironic. General-purpose sorting for coherence is sure to benefit both software and "hardware-assisted" RT.
 
Ironically, DXR 1.0 was set up to give the hardware/driver an opportunity to sort threads before execution. Inline raytracing with DXR 1.1 doesn't offer that luxury.
What kind of coherency-sorting would that be? What kind of scale, memory usage and latency characteristics would it have? Would the sorting occur entirely on-die? Would it be a driver-originated CPU process?

Are we talking about categories of ray which would steer the sorting process? Or is it related to the originating triangles? Screen-space derived using a space-filling curve (e.g. Morton curve)?

If the driver or hardware is doing sorting then it's always better than sorting coded by the developer?

Sorting of work items, which the patent document refers to, is a more general kind of sorting, for any kind of shader (e.g. vertex shader). For ray tracing on AMD it would be applicable to every loop iteration as a workgroup traverses sub-trees of the BVH, since each step requires a decision: follow BVH children, report the result of a triangle hit, or abandon the sub-tree. These decisions cause work-group divergence.
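A rough CPU-side sketch of that loop (hypothetical, not AMD's implementation) shows where the divergence comes from: every iteration ends in a data-dependent decision, so lanes executing in lockstep quickly want different branches.

```cpp
// Per-ray BVH traversal with an explicit stack. Each iteration takes
// one of several paths, and SIMD lanes disagree on which.
struct Node {
    bool isLeaf;
    int  child[2];            // valid when !isLeaf
    int  firstTri, numTris;   // valid when isLeaf
};

void traverse(const Node* nodes, int root) {
    int stack[64];
    int sp = 0;
    stack[sp++] = root;
    while (sp > 0) {                       // lanes exit at different times
        const Node& n = nodes[stack[--sp]];
        if (n.isLeaf) {
            // decision A: intersect triangles, maybe report a hit
        } else {
            // decision B: descend; in a real traversal only the children
            // whose boxes the ray hits get pushed (0, 1 or 2 of them)
            stack[sp++] = n.child[0];
            stack[sp++] = n.child[1];
        }
        // decision C: a lane whose stack empties sits idle while the rest
        // of the wave keeps traversing -- the divergence that sorting of
        // work items is meant to squeeze out
    }
}
```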

Coherence of work item execution (keeping work group divergence to a minimum) is a different topic than finding coherent sets of BVH queries to issue in parallel. In the latter case you might have 1000 rays that share 10,000 BVH nodes or you might have 1000 rays that share 10,000,000 BVH nodes. Preferably you would want to issue queries for the second set with some sorting and binning, e.g. issuing sets of 10,000 BVH-localities. Of course the problem is to work out those BVH-localities in less work than just running without sorting/binning.
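To make the binning idea concrete, here's a sketch of one possible approach (my own illustration, not from the patent): give each ray a Morton code from its origin and sort on it, so that contiguous runs of rays tend to touch the same BVH nodes. Whether the sort costs less than the divergence it removes is exactly the open question.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Spread the low 10 bits of v so there are two zero bits between each
// (standard trick for building 30-bit 3D Morton codes).
static uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// x, y, z are ray-origin coordinates normalized to [0,1).
static uint32_t morton3D(float x, float y, float z) {
    auto q = [](float f) {
        return static_cast<uint32_t>(std::min(std::max(f * 1024.0f, 0.0f), 1023.0f));
    };
    return (expandBits(q(x)) << 2) | (expandBits(q(y)) << 1) | expandBits(q(z));
}

struct Ray { float ox, oy, oz; uint32_t id; };

// After this sort, slices of the vector approximate "BVH-localities":
// queries issued together are likely to walk overlapping sub-trees.
void binRays(std::vector<Ray>& rays) {
    std::sort(rays.begin(), rays.end(), [](const Ray& a, const Ray& b) {
        return morton3D(a.ox, a.oy, a.oz) < morton3D(b.ox, b.oy, b.oz);
    });
}
```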
 
I could imagine the compiler simply looking for branches in code and adding 'sorting barriers': workgroups go idle at the barrier and the SIMDs continue with another one, while a separate unit performs the sorting/reordering on the idle tasks concurrently?
If so, ray tracing may have been listed just as an example in the patent.

Edit: I probably have this wrong, as reordering threads this way would break their association with LDS memory. But 'single-threaded' shader stages like pixel or ray shaders might work.
 
What kind of coherency-sorting would that be? What kind of scale, memory usage and latency characteristics would it have? Would the sorting occur entirely on-die? Would it be a driver-originated CPU process? Are we talking about categories of ray which would steer the sorting process? Or is it related to the originating triangles? Screen-space derived using a space-filling curve (e.g. Morton curve)?

Those would be good questions for the IHV driver teams. It's anybody's guess whether there's any sorting at all taking place on DXR 1.0 implementations.

If the driver or hardware is doing sorting then it's always better than sorting coded by the developer?

All else equal? Probably, since the driver knows a lot more about the hardware than developers do, especially at runtime.

Of course the problem is to work out those BVH-localities in less work than just running without sorting/binning.

Yup, that's why it's the stuff of dreams.
 
All else equal? Probably, since the driver knows a lot more about the hardware than developers do, especially at runtime.

All else equal? Probably not, since the developer knows a lot more about the data coherency and utilized resources than driver programmers do, especially at compile time. ;°P
 
Tbh it sounds like someone just adding Navi31 specs, which are practically the same as Navi21's, to the recent multi-chiplet patent.
 
There will be key architectural improvements over RDNA 2.

They include:
Zen 3-like cache
Scalability
Improved front end
Much better ray tracing performance
Greatly upgraded geometry capabilities, based on their work for the PS5
Some actual answer to DLSS (hardware? software?)
TSMC 5nm, likely a late Q2 2022 release
 