AMD: RDNA 3 Speculation, Rumours and Discussion

Digidi · Jun 15, 2021

Split Frame Rendering for chiplet Design?

https://www.freepatentsonline.com/10922868.html

Jawed · Jun 15, 2021

Digidi said:
Split Frame Rendering for chiplet Design?
https://www.freepatentsonline.com/10922868.html

So the diagrams in this document are interesting because the output of Hull Shader and Tessellator are synchronisation/data-transfer workloads between two APDs (accelerated processing devices). This is the "gnarly" part of a multi-chiplet architecture because down-stream function blocks determine which chiplet will use which chunks of output produced by HS or TS. The routing of the work is "late", when screen-space is used to determine workload apportionment.
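To make the "late" routing concrete, here is a minimal Python sketch. It is not taken from the patent: the 64-pixel tile size, the checkerboard ownership of tiles and every function name are assumptions chosen for illustration. The point is only that the destination chiplet(s) of a triangle cannot be known until its post-HS/TS screen-space position is known.

```python
TILE = 64          # assumed screen tile size in pixels (hypothetical)
NUM_CHIPLETS = 2   # two APDs, as in the patent diagrams

def tile_owner(tx, ty):
    """Checkerboard ownership of screen tiles (an assumption, not the patent's scheme)."""
    return (tx + ty) % NUM_CHIPLETS

def chiplets_for_triangle(verts):
    """Return the chiplets whose tiles overlap the triangle's bounding box.

    verts is a list of (x, y) screen-space positions, assumed non-negative.
    """
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    tx0, tx1 = int(min(xs)) // TILE, int(max(xs)) // TILE
    ty0, ty1 = int(min(ys)) // TILE, int(max(ys)) // TILE
    owners = set()
    for ty in range(ty0, ty1 + 1):
        for tx in range(tx0, tx1 + 1):
            owners.add(tile_owner(tx, ty))
    return owners

def route(triangles):
    """Append each triangle to the FIFO of every chiplet that needs it."""
    fifos = {c: [] for c in range(NUM_CHIPLETS)}
    for tri in triangles:
        for c in sorted(chiplets_for_triangle(tri)):
            fifos[c].append(tri)
    return fifos
```

A triangle that straddles a tile boundary lands in both FIFOs, which is exactly the data-transfer cost being discussed.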

The synchronisation/data-transfer tasks require queues ("FIFO" in the document), which is where an L3 (Infinity Cache) would come in. The document locates the FIFOs within each APD, but if there's a monster chunk of L3, let's say 512MB, shared by two chiplets, that would seem to be a preferable place. Placed in L3, these buffers would not waste die space when tessellation is not being used, which they would if they were dedicated memory blocks within each chiplet.

AMD always struggled with stream out functionality of the geometry shader, compared with NVidia. NVidia better-handled SO with on-chip buffers (cache) whereas AMD always decided to use off-chip memory (AMD's drivers over time messed about with GS parameters that tried to avoid the worst problems associated with the volume of data produced by GS). Similarly, tessellation has always caused AMD problems because on-die buffering and work distribution were very limited. In the end SO buffering is effectively no different from the FIFOs that are required to support HS and TS work-distribution. So whether we're talking about a single die or chiplets, fast, close, memory is a key part of the solution.

So if AMD is to solve the FIFO problem properly, it will need to use a decent chunk of on-package memory.

Similarly the "tile-binned" rasterisation in:

https://www.freepatentsonline.com/10957094.html

would seem to depend upon an L3. We've seen from NVidia's tile-binned rasterisation that the count of triangles/vertices that can be cached on die varies according to the count of attributes associated with each vertex (and the per-pixel count of bytes defined by the format of the render target). I don't think we've ever really seen a performance degradation analysis for NVidia in games according to the per-vertex/-pixel data load, but as time has gone by it appears NVidia has substantially increased the size of on-die buffers to support tile-binned rasterisation.
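As a back-of-envelope illustration of why the batch size varies with the per-vertex data load, here is a small Python sketch. The buffer size and the 16 bytes per attribute (one float4) are assumptions for the arithmetic, not figures for any real GPU.

```python
def vertices_that_fit(buffer_bytes, attributes_per_vertex, bytes_per_attribute=16):
    """How many vertices an on-die buffer can hold for a given attribute count.

    bytes_per_attribute defaults to 16 (one float4), which is an assumption.
    """
    per_vertex = attributes_per_vertex * bytes_per_attribute
    return buffer_bytes // per_vertex

# Hypothetical 256 KiB binning buffer
lean = vertices_that_fit(256 * 1024, attributes_per_vertex=4)
fat = vertices_that_fit(256 * 1024, attributes_per_vertex=16)
```

Quadrupling the attribute count cuts the number of vertices per batch to a quarter, so a fixed-size on-die buffer bins far less geometry when vertices are "fat".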

It seems to me that a monster Infinity Cache lies at the heart of these algorithms for AMD. Well, I imagine that comes across as "stating the obvious", but NVidia has been using a reasonably large on-die cache for quite a long time so it's time AMD caught up.

In theory, with RDNA 2, Infinity Cache is already making tessellation work better. But I don't remember seeing any analysis.

Stupid question time: I can't find a speculation thread for NVidia GPUs after Ampere. Is there one?

Deleted member 90741 · Jun 26, 2021

A couple of new patents, presumably for plumbing operations across multiple dies

https://www.freepatentsonline.com/y2021/0192672.html

20210192672: CROSS GPU SCHEDULING OF DEPENDENT PROCESSES

A primary processing unit includes queues configured to store commands prior to execution in corresponding pipelines. The primary processing unit also includes a first table configured to store entries indicating dependencies between commands that are to be executed on different ones of a plurality of processing units that include the primary processing unit and one or more secondary processing units. The primary processing unit also includes a scheduler configured to release commands in response to resolution of the dependencies. In some cases, a first one of the secondary processing units schedules the first command for execution in response to resolution of a dependency on a second command executing in a second one of the secondary processing units. The second one of the secondary processing units notifies the primary processing unit in response to completing execution of the second command.
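The abstract can be read as a dependency table plus a release step. The Python sketch below is one loose interpretation under that assumption; the class, method and command names are all hypothetical and nothing here reflects the actual hardware interface.

```python
from collections import defaultdict

class PrimaryScheduler:
    """Toy model: the primary unit tracks cross-unit dependencies in a table."""

    def __init__(self):
        self.unresolved = {}                 # command -> (unit, set of pending deps)
        self.dependents = defaultdict(list)  # command -> commands waiting on it
        self.completed = set()
        self.released = []                   # (command, unit) in release order

    def submit(self, cmd, unit, deps=()):
        """Queue a command for a unit; release it at once if nothing is pending."""
        pending = {d for d in deps if d not in self.completed}
        if not pending:
            self.released.append((cmd, unit))
            return
        self.unresolved[cmd] = (unit, pending)
        for d in pending:
            self.dependents[d].append(cmd)

    def notify_complete(self, cmd):
        """Called when a (possibly secondary) unit reports that cmd has finished."""
        self.completed.add(cmd)
        for waiter in self.dependents.pop(cmd, []):
            unit, pending = self.unresolved[waiter]
            pending.discard(cmd)
            if not pending:
                del self.unresolved[waiter]
                self.released.append((waiter, unit))
```

For example, if command "A" on one secondary unit depends on "B" on another, "A" is only released after `notify_complete("B")` is called.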

https://www.freepatentsonline.com/y2021/0191890.html

20210191890: SYSTEM DIRECT MEMORY ACCESS ENGINE OFFLOAD

Systems, devices, and methods for direct memory access. A system direct memory access (SDMA) device disposed on a processor die sends a message which includes physical addresses of a source buffer and a destination buffer, and a size of a data transfer, to a data fabric device. The data fabric device sends an instruction which includes the physical addresses of the source and destination buffer, and the size of the data transfer, to first agent devices. Each of the first agent devices reads a portion of the source buffer from a memory device at the physical address of the source buffer. Each of the first agent devices sends the portion of the source buffer to one of second agent devices. Each of the second agent devices writes the portion of the source buffer to the destination buffer.
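A rough Python sketch of the split described in the abstract, assuming the data fabric simply divides the transfer into near-equal contiguous portions, one per agent pair. The even split and the function names are my assumptions; the abstract does not say how portions are chosen.

```python
def split_transfer(src_addr, dst_addr, size, num_agents):
    """Divide one copy into per-agent (src, dst, length) portions."""
    base, extra = divmod(size, num_agents)
    chunks = []
    offset = 0
    for i in range(num_agents):
        length = base + (1 if i < extra else 0)
        if length:
            chunks.append((src_addr + offset, dst_addr + offset, length))
        offset += length
    return chunks

def run_transfer(memory, src_addr, dst_addr, size, num_agents):
    """Simulate the copy on a bytearray standing in for physical memory."""
    for s, d, n in split_transfer(src_addr, dst_addr, size, num_agents):
        portion = bytes(memory[s:s + n])   # "first agent" reads its portion
        memory[d:d + n] = portion          # "second agent" writes it
```

The SDMA engine only issues the one message (addresses plus size); the per-portion work happens in the agents, which is the offload.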

Deleted member 90741 · Jul 9, 2021

Another patent application for concurrent traversal of the BVH tree
https://www.freepatentsonline.com/y2021/0209832.html

BOUNDING VOLUME HIERARCHY TRAVERSAL

Abstract

A technique for performing ray tracing operations is provided. The technique includes initiating bounding volume hierarchy traversal for a ray against geometry represented by a bounding volume hierarchy; identifying multiple nodes of the bounding volume hierarchy for concurrent intersection tests; and performing operations for the concurrent intersection tests concurrently.

Jawed · Jul 9, 2021

It seems to me that this model of concurrent traversal is possible on RDNA2.

The limitation appears to be the size of the "collection" of nodes to be queried as the BVH is traversed: the collection might be limited to 1024 nodes, for example. The collection is merely a set of IDs - the nodes themselves, as they are retrieved, can be analysed and then discarded once all the decisions for every ray have been derived.

The document is quite explicit in saying that a massive count of parallel ray queries is preferable, since they will, in aggregate, hide the latency of BVH fetch.

As a developer with the opportunity to write a custom traversal kernel, it appears that it's possible to jointly use concurrency and multiple queries per work item.
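A minimal Python sketch of what I mean by a "collection" of node IDs, in 2D for brevity. The node layout, the batch width of 4 and the 1024 cap are illustrative assumptions, and a leaf "hit" here only means the ray hit the leaf's box, not a triangle.

```python
def ray_hits_box(origin, direction, box):
    """Slab test against box = ((minx, miny), (maxx, maxy))."""
    tmin, tmax = 0.0, float("inf")
    for axis in range(2):
        o, d = origin[axis], direction[axis]
        lo, hi = box[0][axis], box[1][axis]
        if d == 0:
            if o < lo or o > hi:
                return False
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax

def traverse(nodes, origin, direction, width=4, max_collection=1024):
    """Walk the BVH, testing up to `width` nodes per step."""
    collection = [0]   # node IDs only; node 0 is assumed to be the root
    hits = []
    while collection:
        batch, collection = collection[:width], collection[width:]
        # These tests are independent, so hardware could run them concurrently.
        results = [ray_hits_box(origin, direction, nodes[i]["box"]) for i in batch]
        for node_id, hit in zip(batch, results):
            if not hit:
                continue
            node = nodes[node_id]
            if "prim" in node:
                hits.append(node["prim"])
            else:
                collection.extend(node["children"])
        if len(collection) > max_collection:
            raise OverflowError("collection of node IDs exceeded its limit")
    return hits
```

Only the IDs are kept between steps; the node data is fetched, tested and dropped, which is why the size of the collection is the limit that matters.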

Deleted member 13524 · Jul 23, 2021

Some stuff from known leakers has been appearing about RDNA3, most of them confirming what @Bondrewd has been saying.
Wccftech made a compilation of the tweets, but I'll leave them here.

https://twitter.com/x/status/1415977882526961670

https://twitter.com/x/status/1416108295362781184

https://twitter.com/x/status/1418110594868146178

https://twitter.com/x/status/1418130096959885319

https://twitter.com/x/status/1411989878598819842

Navi 31 really does look like a monster on the larger SKU.
Are we looking at chiplets with 90CUs or more? 256MB of Infinity Cache per chiplet?

Or maybe it's the same 128MB per chiplet, but with an optional 128MB V-cache underneath. The 120CU SKU has no V-cache, but the fully enabled 180CU version does.

And in the middle of it all, 256bit GDDR6 sounds almost inadequate.. except of course for the massive cache amounts.

Regardless, these are exciting times ahead!

Bondrewd · Jul 23, 2021

ToTTenTranz said:
Are we looking at chiplets with 90CUs or more?

120 per GCD, but they're 30WGP and you should count them as such.

ToTTenTranz said:
256MB of Infinity Cache per chiplet?

No, those are discrete blobs attached to the MCDs.

ToTTenTranz said:
And in the middle of it all, 256bit GDDR6 sounds almost inadequate

Well technically yes but magic abound here.

Deleted member 13524 · Jul 23, 2021

Bondrewd said:
120 per GCD, but they're 30WGP and you should count them as such.

Each WGP has 4 CUs in RDNA3?
So the 2*GCD part has 180 CUs but could actually go up to 240?

Bondrewd said:
No, those are discrete blobs attached to the MCDs.

So there's no LLC in the graphics core dies?

PSman1700 · Jul 23, 2021

Sounds like RDNA3 will be true monsters indeed.

Leoneazzurro5 · Jul 23, 2021

ToTTenTranz said:
Some stuff from known leakers has been appearing about RDNA3, most of them confirming what @Bondrewd has been saying.
Wccftech made a compilation of the tweets, but I'll leave them here.

https://twitter.com/x/status/1415977882526961670

https://twitter.com/x/status/1416108295362781184

https://twitter.com/x/status/1418110594868146178

https://twitter.com/x/status/1418130096959885319

https://twitter.com/x/status/1411989878598819842

Navi 31 really does look like a monster on the larger SKU.
Are we looking at chiplets with 90CUs or more? 256MB of Infinity Cache per chiplet?

Or maybe it's the same 128MB per chiplet, but with an optional 128MB V-cache underneath. The 120CU SKU has no V-cache, but the fully enabled 180CU version does.

And in the middle of it all, 256bit GDDR6 sounds almost inadequate.. except of course for the massive cache amounts.

Regardless, these are exciting times ahead!

If RDNA3 follows the patent we saw some time ago, IC is on the package, and acts also as a high-bandwidth interconnect between the modules (which are seen as a single GPU, load-balancing signals should be passed in the same way). Of course there would be quite some cache on the dies, too, but it would not be "LLC".

Lurkmass · Jul 23, 2021

A small bus width with a large LLC can make for a reasonable HW design. Mobile GPUs proved that we can optimize deferred renderers by storing a small slice of the g-buffer in tile memory, but there's a penalty to be paid for doing full screen passes since it will flush this memory. With a large LLC we can store our entire g-buffer in this memory and we don't have to worry about this penalty, which is incidentally compatible with the way IMR GPUs operate ...

We are possibly so close to bringing back EQAA/MSAA for deferred renderers or we could afford to store more parameters in the g-buffer to enable more complex materials/shaders if these rumors hold ... (register pressure could very well be a thing of the past depending on the layout of the g-buffer)
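For a feel of the numbers, here is a quick Python estimate of g-buffer footprint against a cache budget. The 4K resolution, 16 bytes per pixel and the 512 MiB budget are assumptions picked for the arithmetic, not a claim about any real renderer or about RDNA3.

```python
def gbuffer_bytes(width, height, bytes_per_pixel, samples=1):
    """Total g-buffer size, storing every sample of every pixel."""
    return width * height * bytes_per_pixel * samples

def fits(size_bytes, cache_mib):
    """True if the buffer fits in a cache of cache_mib MiB."""
    return size_bytes <= cache_mib * 1024 * 1024

# Hypothetical 4K target with 16 bytes of g-buffer data per pixel
plain = gbuffer_bytes(3840, 2160, 16)              # 132,710,400 bytes
msaa4 = gbuffer_bytes(3840, 2160, 16, samples=4)   # 530,841,600 bytes
```

Under these assumptions the non-MSAA g-buffer needs roughly 127 MiB, and the 4x version only just squeezes under 512 MiB, so fatter g-buffers or MSAA depend heavily on how large the LLC really is.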

Bondrewd · Jul 23, 2021

ToTTenTranz said:
Each WGP has 4 CUs in RDNA3?

There's no "CU" anymore.
Just WGP.

ToTTenTranz said:
So the 2*GCD part has 180 CUs but could actually go up to 240?

240 the old ways?
I think.

ToTTenTranz said:
So there's no LLC in the graphics core dies?

None of, yes.

iroboto · Jul 23, 2021

Bondrewd said:
There's no "CU" anymore.
Just WGP.

240 the old ways?
I think.

None of, yes.

What is GCD acronym?

Deleted member 13524 · Jul 23, 2021

iroboto said:
What is GCD acronym?

Graphics Compute Die?
Graphics Core Die?

iroboto · Jul 23, 2021

ToTTenTranz said:
Graphics Compute Die?
Graphics Core Die?

so the official term for a gpu chiplet then

Leoneazzurro5 · Jul 23, 2021

ToTTenTranz said:
Graphics Compute Die?
Graphics Core Die?

According to AMD CCD is "Core Complex Die" so GCD should be "Graphic(s) Complex Die"

https://www.amd.com/system/files/documents/ryzen-master-quick-reference-guide.pdf

iroboto · Jul 23, 2021

Leoneazzurro5 said:
According to AMD CCD is "Core Complex Die" so GCD should be "Graphic(s) Complex Die"

exciting times really. I recall the day chiplets came to CPUs and shortly after leadership positions reversed on price/performance against Intel.
I really have super high expectations here for something similar given their past history, combined with 3D stacked cache, it's going to have significant price/performance.
Curious to see if it can come to APU form factors in the future.

Bondrewd · Jul 23, 2021

iroboto said:
it's going to have significant price/performance.

nnnnnnnnnope.
Not in a thousand years.
MCP GPUs are a win more setup.
You pay $2500 and you get more!

iroboto said:
Curious to see if it can come to APU form factors in the future.

Yeah, later (think late'23 timeline and all).

Deleted member 90741 · Jul 23, 2021

ToTTenTranz said:
Some stuff from known leakers has been appearing about RDNA3, most of them confirming what @Bondrewd has been saying.

Greymon55 is basically Broly_X1.
Broly_X1 had to delete his posts for certain reasons, but that guy nailed everything.
While it is exciting for outsiders to get a sneak peek, as someone who also works with a lot of NDA tech I find it concerning too.

Deleted member 90741 · Jul 23, 2021

Leoneazzurro5 said:
If RDNA3 follows the patent we saw some time ago, IC is on the package, and acts also as a high-bandwidth interconnect between the modules (which are seen as a single GPU, load-balancing signals should be passed in the same way). Of course there would be quite some cache on the dies, too, but it would not be "LLC".

Bondrewd said:
No, those are discrete blobs attached to the MCDs.

It should be on MCD.
I imagine AMD would take the best of both worlds. N5P GCD for absolute logic density and performance and N6 MCD with HD/SRAM optimized libraries for lower cost per MB IC.
N5P SRAM density gain over N7/N6 is very mediocre.
512MB SRAM on N6 with optimized libraries would only be 280-300mm2 (figures estimated from WikiChip data, behind paywall). On N5 it would be hardly any better, around 250+mm2, but much costlier.
But all those logic blocks can scale very high almost 1.48x with N5P (assuming AMD goes with N5P for GPUs else 1.85x on plain N5)
I suppose 2x GCD + 1x MCD would be closing in around 1000mm2 or maybe even more. Will cost a pretty penny.
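The arithmetic behind those guesses, as a short Python sketch. It only rearranges the figures in this post (512MB in 280-300mm2, about 1000mm2 total); treating the MCD as nothing but that SRAM is a simplifying assumption, and the function names are just for illustration.

```python
def mm2_per_mb(area_mm2, capacity_mb):
    """Implied SRAM area cost per MB."""
    return area_mm2 / capacity_mb

def implied_gcd_area(total_mm2, mcd_mm2, num_gcd=2):
    """Area left for each GCD once the MCD is subtracted from the total."""
    return (total_mm2 - mcd_mm2) / num_gcd

low = mm2_per_mb(280, 512)             # about 0.55 mm2 per MB
high = mm2_per_mb(300, 512)            # about 0.59 mm2 per MB
per_gcd = implied_gcd_area(1000, 290)  # 355 mm2 per GCD at the midpoint
```

So a 1000mm2 total with a ~290mm2 MCD would put each GCD at roughly 355mm2, if the guesses above hold.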

I don't know if @Bondrewd can give a hint if N5 or N5P
