AMD: RDNA 3 Speculation, Rumours and Discussion

Split Frame Rendering for a chiplet design? o_O
https://www.freepatentsonline.com/10922868.html
So the diagrams in this document are interesting because the outputs of the Hull Shader (HS) and Tessellator (TS) become synchronisation/data-transfer workloads between two APDs (accelerated processing devices). This is the "gnarly" part of a multi-chiplet architecture, because downstream function blocks determine which chiplet will consume which chunks of output produced by HS or TS. The routing of the work is "late": screen-space position is used to determine workload apportionment.

The synchronisation/data-transfer tasks require queues ("FIFOs" in the document), which is where an L3 (Infinity Cache) would come in. The document locates the FIFOs within each APD, but if there's a monster chunk of L3, let's say 512MB, shared by two chiplets, that would seem to be a preferable place: dedicated memory blocks within each chiplet would simply waste die space whenever tessellation is not being used, whereas L3 capacity can be reused.
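
To make the routing concrete, here's a toy C++ sketch of what screen-space apportionment into per-chiplet FIFOs might look like. The two-way split, the FIFO placement and all the names are mine, not the patent's:

```cpp
#include <cstdint>
#include <queue>

// Toy model: route post-tessellation primitives to the chiplet that owns
// their screen-space region. The 2-chiplet left/right split and all names
// are illustrative, not from the patent.

struct Primitive {
    float    min_x, max_x;  // screen-space bounding box of the primitive
    uint32_t payload_id;    // handle to the HS/TS output data
};

constexpr int   kNumChiplets = 2;
constexpr float kScreenWidth = 3840.0f;

// One FIFO per chiplet; the patent puts these inside each APD, but a big
// shared L3 could back them instead.
std::queue<Primitive> chiplet_fifo[kNumChiplets];

void route_primitive(const Primitive& prim) {
    const float split = kScreenWidth / kNumChiplets;  // chiplet 0 owns x < split
    // A primitive straddling the split goes into both FIFOs, which is
    // exactly the synchronisation/data-transfer cost discussed above.
    if (prim.min_x <  split) chiplet_fifo[0].push(prim);
    if (prim.max_x >= split) chiplet_fifo[1].push(prim);
}
```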

AMD has always struggled with the stream-out (SO) functionality of the geometry shader, compared with NVidia. NVidia handled SO better with on-chip buffers (cache), whereas AMD always chose to use off-chip memory (over time, AMD's drivers fiddled with GS parameters to try to avoid the worst problems associated with the volume of data GS produces). Similarly, tessellation has always caused AMD problems, because on-die buffering and work distribution were very limited. In the end, SO buffering is effectively no different from the FIFOs required to support HS and TS work distribution. So whether we're talking about a single die or chiplets, fast, close memory is a key part of the solution.

So if AMD is to solve the FIFO problem properly, it will need to use a decent chunk of on-package memory.

Similarly, the "tile-binned" rasterisation in:

https://www.freepatentsonline.com/10957094.html

would seem to depend upon an L3. We've seen from NVidia's tile-binned rasterisation that the count of triangles/vertices that can be cached on die varies with the count of attributes associated with each vertex (and with the per-pixel byte count defined by the render target's format). I don't think we've ever seen a proper performance-degradation analysis for NVidia in games as a function of per-vertex/per-pixel data load, but over time NVidia appears to have substantially increased the size of the on-die buffers supporting tile-binned rasterisation.
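
To get a feel for the numbers, a quick back-of-envelope in C++ (the buffer size and attribute widths are my assumptions, not NVidia's real figures):

```cpp
#include <cstdio>

// Back-of-envelope: how many vertices fit in an on-die binning buffer
// before the rasteriser must flush a tile batch. All figures are assumed
// for illustration.
int main() {
    const int buffer_bytes        = 2 * 1024 * 1024; // assumed 2MB bin buffer
    const int bytes_per_attribute = 16;              // one float4 per attribute
    for (int attrs = 1; attrs <= 8; ++attrs) {
        // position (float4) plus the interpolated attributes
        int bytes_per_vertex = 16 + attrs * bytes_per_attribute;
        printf("%d attributes -> ~%d vertices per batch\n",
               attrs, buffer_bytes / bytes_per_vertex);
    }
    return 0;
}
```

The point being that every attribute you add shrinks the batch, which is presumably why NVidia kept growing those buffers.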

It seems to me that a monster Infinity Cache lies at the heart of these algorithms for AMD. I imagine that comes across as "stating the obvious", but NVidia has been using a reasonably large on-die cache for quite a long time, so it's time AMD caught up.

In theory, Infinity Cache should already be making tessellation work better on RDNA 2, but I don't remember seeing any analysis.

Stupid question time: I can't find a speculation thread for NVidia GPUs after Ampere. Is there one?
 
A couple of new patent applications, presumably for plumbing operations across multiple dies:


https://www.freepatentsonline.com/y2021/0192672.html

20210192672: CROSS GPU SCHEDULING OF DEPENDENT PROCESSES
A primary processing unit includes queues configured to store commands prior to execution in corresponding pipelines. The primary processing unit also includes a first table configured to store entries indicating dependencies between commands that are to be executed on different ones of a plurality of processing units that include the primary processing unit and one or more secondary processing units. The primary processing unit also includes a scheduler configured to release commands in response to resolution of the dependencies. In some cases, a first one of the secondary processing units schedules the first command for execution in response to resolution of a dependency on a second command executing in a second one of the secondary processing units. The second one of the secondary processing units notifies the primary processing unit in response to completing execution of the second command.
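
For illustration, here's a rough C++ sketch of the dependency table and release mechanism the abstract describes. Every type and name below is invented; the patent doesn't give an interface:

```cpp
#include <cstdint>
#include <vector>

// Invented sketch of the primary unit's dependency table: entries pair a
// held-back command with the prerequisite running on a secondary unit;
// a completion notification releases whatever depended on it.

using CommandId = uint64_t;

struct Dependency {
    CommandId waiter;        // command queued on the primary unit
    CommandId prerequisite;  // command executing on a secondary unit
    int       secondary_gpu; // which secondary unit runs the prerequisite
};

class PrimaryScheduler {
    std::vector<Dependency> table_;
public:
    void add_dependency(const Dependency& d) { table_.push_back(d); }

    // Called when a secondary unit notifies completion of `done`; returns
    // the commands that can now be released into their pipelines.
    std::vector<CommandId> on_complete(CommandId done) {
        std::vector<CommandId> released;
        for (auto it = table_.begin(); it != table_.end();) {
            if (it->prerequisite == done) {
                released.push_back(it->waiter);
                it = table_.erase(it);
            } else {
                ++it;
            }
        }
        return released;
    }
};
```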


https://www.freepatentsonline.com/y2021/0191890.html

20210191890: SYSTEM DIRECT MEMORY ACCESS ENGINE OFFLOAD
Systems, devices, and methods for direct memory access. A system direct memory access (SDMA) device disposed on a processor die sends a message which includes physical addresses of a source buffer and a destination buffer, and a size of a data transfer, to a data fabric device. The data fabric device sends an instruction which includes the physical addresses of the source and destination buffer, and the size of the data transfer, to first agent devices. Each of the first agent devices reads a portion of the source buffer from a memory device at the physical address of the source buffer. Each of the first agent devices sends the portion of the source buffer to one of second agent devices. Each of the second agent devices writes the portion of the source buffer to the destination buffer.
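
Modelled very loosely in C++ (the message layout and the memcpy "agents" are stand-ins for the fabric hardware, not AMD's actual interface):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Loose model of the offload: the SDMA engine emits one message, the data
// fabric splits the transfer across agents, and each agent moves its slice.

struct SdmaMessage {
    uintptr_t src;   // physical address of the source buffer
    uintptr_t dst;   // physical address of the destination buffer
    size_t    size;  // bytes to transfer
};

// Each "agent" is modelled as a plain memcpy of its portion; in hardware
// these reads/writes happen in parallel across the fabric.
void fabric_dispatch(const SdmaMessage& msg, int num_agents) {
    const size_t chunk = msg.size / num_agents;
    for (int i = 0; i < num_agents; ++i) {
        const size_t off = size_t(i) * chunk;
        const size_t len = (i == num_agents - 1) ? msg.size - off : chunk;
        std::memcpy(reinterpret_cast<void*>(msg.dst + off),
                    reinterpret_cast<const void*>(msg.src + off), len);
    }
}
```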

 
Another patent application for concurrent traversal of the BVH tree
https://www.freepatentsonline.com/y2021/0209832.html

BOUNDING VOLUME HIERARCHY TRAVERSAL

Abstract

A technique for performing ray tracing operations is provided. The technique includes initiating bounding volume hierarchy traversal for a ray against geometry represented by a bounding volume hierarchy; identifying multiple nodes of the bounding volume hierarchy for concurrent intersection tests; and performing operations for the concurrent intersection tests concurrently.



 
It seems to me that this model of concurrent traversal is possible on RDNA2.

The limitation appears to be the size of the "collection" of nodes to be queried as the BVH is traversed: the collection might be limited to 1024 nodes, for example. The collection is merely a set of IDs - the nodes themselves, as they are retrieved, can be analysed and then discarded once all the decisions for every ray have been derived.

The document is quite explicit in saying that a massive count of parallel ray queries is preferable, since they will, in aggregate, hide the latency of BVH fetch.

For a developer with the opportunity to write a custom traversal kernel, it appears possible to combine concurrency with multiple queries per work item.
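
Something like this toy loop, where the 1024-entry limit, the 4-wide batching and all the helper stubs are illustrative guesses rather than anything from the document:

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of the traversal described above: keep a bounded
// "collection" of node IDs, fetch and test several per iteration, and push
// intersected children back. A real kernel would run per-wavefront with
// hardware intersection tests.

struct Ray  { float origin[3], dir[3]; };
struct Node {
    bool     is_leaf;
    int      num_children;
    uint32_t child[4];        // up to 4 children for a wide BVH
};

std::vector<Node> bvh;        // flat node array; index == node ID

Node fetch_node(uint32_t id) { return bvh[id]; }          // stands in for the latency-bound BVH fetch
bool intersect(const Ray&, const Node&) { return true; }  // placeholder box/triangle test
void record_hit(const Ray&, uint32_t) {}                  // placeholder leaf-hit handling

void traverse(const Ray& ray, uint32_t root) {
    std::vector<uint32_t> collection{root};   // the bounded set of IDs
    collection.reserve(1024);                 // e.g. a 1024-node limit
    while (!collection.empty()) {
        // Pull up to 4 nodes and test them "concurrently" (serially here;
        // in hardware these would be parallel intersection tests).
        uint32_t batch[4];
        int n = 0;
        while (n < 4 && !collection.empty()) {
            batch[n++] = collection.back();
            collection.pop_back();
        }
        for (int i = 0; i < n; ++i) {
            Node node = fetch_node(batch[i]);  // analysed, then discarded
            if (!intersect(ray, node)) continue;
            if (node.is_leaf) { record_hit(ray, batch[i]); continue; }
            for (int c = 0; c < node.num_children; ++c)
                collection.push_back(node.child[c]);
        }
    }
}
```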
 
Some stuff from known leakers has been appearing about RDNA3, most of it confirming what @Bondrewd has been saying.
Wccftech made a compilation of the tweets, but I'll leave them here.

[tweet embeds]

Navi 31 really does look like a monster on the larger SKU.
Are we looking at chiplets with 90CUs or more? 256MB of Infinity Cache per chiplet?

Or maybe it's the same 128MB per chiplet, but with an optional 128MB V-cache underneath. The 120CU SKU has no V-cache, but the fully enabled 180CU version does.

And in the middle of it all, 256-bit GDDR6 sounds almost inadequate... except of course for the massive cache amounts.
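
Quick arithmetic on why it sounds inadequate, assuming 16Gbps modules:

```cpp
#include <cstdio>

// Raw bandwidth of a 256-bit GDDR6 bus; 16Gbps per pin is an assumption.
int main() {
    const double gbps_per_pin = 16.0;
    const int    bus_bits     = 256;
    const double gb_per_s = bus_bits * gbps_per_pin / 8.0; // bits -> bytes
    printf("256-bit @ %.0fGbps = %.0f GB/s\n", gbps_per_pin, gb_per_s);
    return 0;
}
```

That's 512GB/s feeding a rumored 120+ CU part, hence the reliance on the cache.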

Regardless, these are exciting times ahead!
 
If RDNA3 follows the patent we saw some time ago, the IC is on the package and also acts as a high-bandwidth interconnect between the modules (which are seen as a single GPU; load-balancing signals would be passed in the same way). Of course there would be quite some cache on the dies too, but it would not be the "LLC".
 
A small bus width with a large LLC can make for a reasonable HW design. Mobile GPUs proved that we can optimize deferred renderers by storing a small slice of the g-buffer in tile memory, but there's a penalty to be paid for full-screen passes, since those flush that memory. With a large LLC we can keep the entire g-buffer resident and stop worrying about that penalty, which is incidentally compatible with the way IMR GPUs operate ...

If these rumors hold, we are possibly very close to bringing back EQAA/MSAA for deferred renderers, or we could afford to store more parameters in the g-buffer to enable more complex materials/shaders ... (register pressure could very well be a thing of the past, depending on the layout of the g-buffer)
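
Rough sizing for that claim (the bytes-per-pixel range spanning typical deferred layouts is my assumption):

```cpp
#include <cstdio>

// How big is a 4K g-buffer for various layouts? If even the fat end fits
// in a 512MB LLC, full-screen passes stop being a flush hazard.
int main() {
    const long long pixels = 3840LL * 2160LL;           // 4K render target
    for (int bpp = 16; bpp <= 64; bpp *= 2) {           // g-buffer bytes/pixel
        double mb = pixels * (double)bpp / (1024.0 * 1024.0);
        printf("%2d B/pixel -> ~%.0f MB g-buffer\n", bpp, mb);
    }
    return 0;
}
```

Even a fat 64B/pixel layout lands around 506MB, and a more typical 16-32B/pixel g-buffer fits comfortably in the rumored 256-512MB.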
 
According to AMD, CCD is "Core Complex Die", so GCD should be "Graphics Complex Die".
Exciting times, really. I recall when chiplets came to CPUs: shortly afterwards, the price/performance leadership flipped against Intel.
Given that history, I have super high expectations for something similar here; combined with 3D-stacked cache, it should deliver significant price/performance.
Curious to see whether it can come to APU form factors in the future.
 
Some stuff from known leakers has been appearing about RDNA3, most of it confirming what @Bondrewd has been saying.
Greymon55 is basically Broly_X1.
Broly_X1 had to delete his posts for certain reasons, but that guy nailed everything.
While it's exciting for outsiders to get a sneak peek, as someone who also works with a lot of NDA'd tech, it is also a little concerning.
 
If RDNA3 follows the patent we saw some time ago, the IC is on the package and also acts as a high-bandwidth interconnect between the modules (which are seen as a single GPU; load-balancing signals would be passed in the same way). Of course there would be quite some cache on the dies too, but it would not be the "LLC".
No, those are discrete blobs attached to the MCDs.
It should be on MCD.
I imagine AMD would take the best of both worlds: an N5P GCD for absolute logic density and performance, and an N6 MCD with HD/SRAM-optimized libraries for lower cost per MB of IC.
N5P's SRAM density gain over N7/N6 is very mediocre.
512MB of SRAM on N6 with optimized libraries would only be 280-300mm² (figures estimated from wikichip data, behind a paywall). On N5 it's hardly any better, around 250+ mm², but much costlier.
But all the logic blocks scale much better: almost 1.48x with N5P (assuming AMD goes with N5P for GPUs; otherwise 1.85x on plain N5).
I suppose 2x GCD + 1x MCD would be closing in on around 1000mm² in total, or maybe even more. It will cost a pretty penny.
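
A quick sanity check on those figures, using ballpark assumptions for an N7-class HD bitcell and array efficiency rather than the paywalled wikichip data:

```cpp
#include <cstdio>

// 512MB of SRAM: raw bitcell area divided by an assumed array efficiency
// (bitcells as a fraction of total macro area, the rest being periphery).
int main() {
    const double bits       = 512.0 * 1024 * 1024 * 8;  // 512MB in bits
    const double cell_um2   = 0.027;                    // assumed HD bitcell
    const double efficiency = 0.40;                     // assumed macro efficiency
    const double mm2 = bits * cell_um2 / 1e6 / efficiency;
    printf("~%.0f mm^2 for 512MB\n", mm2);              // ~290 mm^2
    return 0;
}
```

Which lands right in that 280-300mm² window.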

I don't know if @Bondrewd can give a hint whether it's N5 or N5P.
 