AMD Execution Thread [2023]

The “cancelled” high end RDNA 4 gaming chips were supposedly using multiple compute dies. Given how radical an idea that is, you would expect the design to go through the wringer in the lab before making it onto a product roadmap. If the rumors are true it would be interesting to know whether it’s a design or manufacturing issue. The latter is more understandable.
My guess would be that it's a cost issue first and foremost, both in how much it would take to make it work and then how much it would actually cost to produce in comparison to "simpler" single chip options / competition.
 
The “cancelled” high end RDNA 4 gaming chips were supposedly using multiple compute dies. Given how radical an idea that is, you would expect the design to go through the wringer in the lab before making it onto a product roadmap. If the rumors are true it would be interesting to know whether it’s a design or manufacturing issue. The latter is more understandable.
Probably several issues but I think a leading one is the fact that it just doesn't work. Performance doesn’t scale.
 
It bears reminding that Bondrewd, despite posting as if he's an authority and has direct inside sources, admits that he doesn't really know what he purports to know when it comes down to it, and on multiple occasions has struggled (or disappeared) when it comes time to explain why he was very wrong about what he said would happen.

He's not an authority, but he's very good at pretending he is through very confidently worded posts.

Ah, I had thought Bondrewd was in some way affiliated with AMD due to the confidence and authority with which he posts. Though to be fair he didn't actually state that so perhaps that's my bad.

That could only happen if the RX 8600 had an ALU count comparable to the RX 6900/7900 - I'm not really sure that's possible for a 200 mm2 die even on a 3 nm node.

Sorry I didn't really phrase that question well. I was really asking if architecturally these RDNA4 parts will show massive increases in RT performance compared to their RDNA3 counterparts. E.g. 7600 -> 8600.

AMD raster performance is good, and FSR while limited does at least offer something to minimise the impact of not having a true DLSS 2 equivalent. But RT performance has barely improved from RDNA2 to RDNA3, while Nvidia have leapt further ahead every generation. AMD started late at RT, with the worst performing RT solution, and haven't really progressed since.

My gut feeling at this point is that the cancellation of top and high end RDNA4 parts would mean that AMD RT is still stuck in the mud. I can't see anyone spending high end $$$ in 2024/5 on a GPU that only has low end RT.
 
My guess would be that it's a cost issue first and foremost, both in how much it would take to make it work and then how much it would actually cost to produce in comparison to "simpler" single chip options / competition.
Yeah, probably Nvidia
Ah, I had thought Bondrewd was in some way affiliated with AMD due to the confidence and authority with which he posts. Though to be fair he didn't actually state that so perhaps that's my bad.



Sorry I didn't really phrase that question well. I was really asking if architecturally these RDNA4 parts will show massive increases in RT performance compared to their RDNA3 counterparts. E.g. 7600 -> 8600.

AMD raster performance is good, and FSR while limited does at least offer something to minimise the impact of not having a true DLSS 2 equivalent. But RT performance has barely improved from RDNA2 to RDNA3, while Nvidia have leapt further ahead every generation. AMD started late at RT, with the worst performing RT solution, and haven't really progressed since.

My gut feeling at this point is that the cancellation of top and high end RDNA4 parts would mean that AMD RT is still stuck in the mud. I can't see anyone spending high end $$$ in 2024/5 on a GPU that only has low end RT.
Or - since they recently projected massive gains in HPC/AI GPGPU sales - maybe they cancelled the lower-margin CoWoS capacity for RDNA4. Who knows? In COVID times the substrate was the limiting factor; now it's CoWoS.
 
So what?
It's all SoIC, there's no passive slab or RDL d2d links or anything.
Litho-defined packages like InFO or CoWoS are no gusto for long bois and their capacity for >1x reticle sheets is limited so AMD sidestepped the issue.

AMD will use InFO extensively for client Zen6 parts but those are client sub-reticle things so no caveats there.
 
It's all SoIC, there's no passive slab or RDL d2d links or anything.
Litho-defined packages like InFO or CoWoS are no gusto for long bois and their capacity for >1x reticle sheets is limited so AMD sidestepped the issue.

AMD will use InFO extensively for client Zen6 parts but those are client sub-reticle things so no caveats there.
ah, RDNA4 is CoW_L.
 
I don't think I'd believe those CU counts even for RDNA5 unless they had massive arch changes in between, but a massively complex die/chiplet approach fits the bill.
Yeah, I was just trying to make sense of all the leaks and conjectures, but got a bit lost in the process ;) I've corrected the table, and I will probably update it again when more details on Navi 32, MI300X, and RDNA4 are available.


The point is, if top-end Navi4x would really feature massive 200+ CU counts, then lower-end variants wouldn't stick to CU counts from the Polaris/Vega era either. I assume Navi 43/44 would be used in RX 8400 / RX 8500 and RX 8600 / RX 8700 models, which would perform on the scale of RX 7600 / RX 7700 (or RX 6600 / RX 6700) and RX 7800 / RX 7900 (or RX 6800 / RX 6900) respectively. This way it still makes sense for AMD to release these lower-end RDNA4 parts in 2024-2025, even if the high-end parts are cancelled.


By my estimation, such an uplift would be very possible with the TSMC N3 process, where transistor density should reach 224 MTr/mm2, a roughly 60% improvement compared to TSMC N5 at 138 MTr/mm2.

For Navi 44 and Navi 43 designed as monolithic dies under the NAVI4M project using the N3 process, I think AMD could fit
  • 36 CUs on a 107 mm2 die area, or 48 CUs on a 140 mm2 die area; and
  • 72 CUs on a 207 mm2 die area, or 96 CUs on a 270 mm2 die area.
As for the conjectured 200+ CU all-chiplet design under the NAVI4C project, Navi 42 and Navi 41 could have up to
  • 144 CUs on a 390 mm2 die area, or 176 CUs on a 470 mm2 die area; and
  • 192 CUs on a 515 mm2 die area, or 256 CUs on a 675 mm2 die area.


Now the higher die area and CU counts probably look too optimistic, and these densities are not guaranteed in mass production, at least not initially. And if they used the TSMC N4 process instead, its transistor density increase is fairly modest at 144 MTr/mm2 (over 138 MTr/mm2 for N5, 114 MTr/mm2 for N6, 91 MTr/mm2 for N7), so performance uplifts would be moderate as well, with Navi 44/43 performing similarly to RX 6500, RX 6600/7600, RX 6700/7700, and RX 6800/7800.


FYI I tried to estimate transistor budget and die area for compute units only, filtering out memory controllers and associated Infinity cache from the total, across the range from Polaris/Vega to RDNA3, and in my conjecture, 1 CU in each architecture took:
  • ~156 Mn transistors in Polaris/Vega,
  • ~225 Mn in RDNA1,
  • ~300 Mn in RDNA2, and
  • ~400 Mn in RDNA3.

Therefore I allocated ~500 Mn transistors per 1 CU for RDNA4 in my calculations of estimated die sizes, to account for possible architectural improvements.


So at least I tried my best to make these conjectures of die sizes and CU counts look more plausible with my Excel-sheet estimates of transistor densities - and if AMD does not reach these CU counts in real products, hopefully they could compensate by ramping up frequencies to 3.5 GHz and beyond.
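To make the arithmetic behind these figures explicit, here is a tiny sketch of the estimate. The ~500 Mn transistors per CU and the peak density figures are my own assumptions from above, and the quoted die sizes add memory controllers, Infinity Cache and other uncore on top of the compute-only area this computes:

```cpp
// Back-of-the-envelope die-area sketch for the CU arrays only, using the
// assumptions above (~500 Mn transistors per RDNA4 CU, peak N3/N4 densities).
// Memory PHYs, Infinity Cache and other uncore are NOT included here; the die
// sizes quoted above add those on top.
#include <cstdio>

int main() {
    const double tr_per_cu  = 500e6;   // assumed RDNA4 transistor budget per CU
    const double n3_density = 224e6;   // transistors per mm2 (peak N3 figure)
    const double n4_density = 144e6;   // transistors per mm2 (peak N4 figure)

    const int cu_counts[] = {36, 48, 72, 96, 144, 176, 192, 256};
    for (int cus : cu_counts) {
        double mm2_n3 = cus * tr_per_cu / n3_density;
        double mm2_n4 = cus * tr_per_cu / n4_density;
        std::printf("%3d CUs: ~%4.0f mm2 of CU area on N3, ~%4.0f mm2 on N4\n",
                    cus, mm2_n3, mm2_n4);
    }
    return 0;
}
```

For example, 96 CUs come out to roughly 215 mm2 of CU area on N3 under these assumptions, and the ~270 mm2 figure above is reached once the uncore is added back.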
 
multiple RDNA5 designs considered:

halo variant:

9 SEDs, each with 3 Shader Arrays of 5 WGPs up to 4 Shader Arrays of 5 WGPs
135-180 WGPs total (9 × 3 × 5 up to 9 × 4 × 5)
GDDR7
L2 cache in each SED
clock target ~4GHz
450W
 
Sure, 270 to 360 CUs at 4 GHz, for 270 to 360 TFLOPS! Looks plausible by 2027. Why stop at 360 though, ramp it all up to 550 CUs and 550 TFLOPS! This is exactly what the 'sources' in my head have been telling me all along - or maybe it's my Excel sheet with a few dozen rows filled down from that TFLOPS formula? I'm looking at you, Paul Eccleston of RedGamingTech, you funny English bloke!

Also don't hesitate to price the cards at US$2500, $3500 and $5500, because AI companies would be snatching them up by the thousands at a mere $10 per 1 TFLOPS - some of them are already placing direct orders with AMD for hundreds of RX 7900 XTX gaming cards...

 
Also don't hesitate to price the cards at US$2500, $3500 and $5500, because AI companies would be snatching them up by the thousands at a mere $10 per 1 TFLOPS - some of them are already placing direct orders with AMD for hundreds of RX 7900 XTX gaming cards
FYI, RDNA5 will be used as a data center offering too (for real).
 
That might not be so awful, so long as these mainstream parts finally have massively improved RT capabilities beyond RDNA2. AMD are going to be taking RT seriously for RDNA4 ... right?
if architecturally these RDNA4 parts will show massive increases in RT performance compared to their RDNA3 counterparts..

There was a long discussion thread about RDNA2 raytracing implementation.


AMD implements ray testing and BVH traversal with shader programs - unlike the NVidia approach, which uses dedicated hardware for traversal and is aided by dedicated 'Tensor' cores (for matrix multiply-accumulate operations) to perform AI denoising, with no impact on shader performance.

There is a special shader instruction, IMAGE_BVH_INTERSECT_RAY, which tests bounding box coordinates against ray coordinates. This instruction takes a 64-bit pointer to the coordinates of 5 bounding boxes (1 parent node and 4 child nodes) in the BVH tree, or alternately 3 triangle vertex coordinates, which are stored in local video memory as a texture resource; it then loads these coordinates, tests each bounding box or triangle against the ray coordinates in VGPR registers (ray origin, direction and inverse direction), and returns a texture descriptor and up to 4 pointers to child BVH nodes.


This is both memory bandwidth intensive and computationally complex, because it involves:

1) a memory access operation which loads up to 4 memory pointers (64-bit int) to child nodes and up to 30 FP32 (i.e. 32-bit single precision floating point) XYZ vertex coordinates (i.e. 2 XYZ vertices for each of the 5 axis-aligned bounding boxes (AABB)) - so it's 4×8 + (5×2×3)×4 = 32 + 120 = 152 bytes for BVH testing, and up to 36 bytes for triangle testing (9 FP32 values for 3 XYZ vertices).

2) a series of floating point matrix multiply-accumulate operations on these coordinates, to solve a system of linear equations and find the intersection of a line (defined by the ray origin and direction points) with each side of a bounding box (6 sides in each of the 5 bounding boxes) - probably a dozen compares and a few dozen multiply-accumulate operations (sketched below).
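For illustration, here is roughly what the line-box part of that work boils down to per bounding box when written as a standard slab test on the CPU; the struct names and layout are mine, not the actual RDNA node format:

```cpp
// One ray vs. one axis-aligned bounding box via the standard slab test -
// roughly the work the box half of IMAGE_BVH_INTERSECT_RAY performs for each
// of the child boxes it tests. Layout and names are illustrative only.
#include <algorithm>

struct Ray {
    float origin[3];
    float inv_dir[3];   // 1/direction, precomputed once per ray (as in the VGPR inputs above)
    float t_max;
};

struct AABB {
    float lo[3];        // min corner
    float hi[3];        // max corner -> 6 FP32 = 24 bytes per box
};

// Returns true if the ray segment [0, t_max] overlaps the box.
bool intersect_aabb(const Ray& r, const AABB& b) {
    float t_enter = 0.0f, t_exit = r.t_max;
    for (int axis = 0; axis < 3; ++axis) {
        // Distances to the two slab planes on this axis: 2 subs + 2 muls.
        float t0 = (b.lo[axis] - r.origin[axis]) * r.inv_dir[axis];
        float t1 = (b.hi[axis] - r.origin[axis]) * r.inv_dir[axis];
        if (t0 > t1) std::swap(t0, t1);
        t_enter = std::max(t_enter, t0);   // latest slab entry
        t_exit  = std::min(t_exit,  t1);   // earliest slab exit
    }
    return t_enter <= t_exit;              // a handful of compares per box
}
```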

RDNA3 also adds the DS_BVH_STACK_RTN_B32 instruction, which uses LDS memory to store a small per-thread stack of memory pointers to BVH nodes (8 to 64 entries).


Therefore RT performance can be increased by

1) using high-bandwidth video memory and larger caches with improved fetching logic and lower latencies, and

2) improving the speed of matrix operations, either using general-purpose matrix multiply-accumulate instructions (such as the WaveMMA blocks first introduced with RDNA3, which AMD calls 'AI blocks' due to support for INT8 and BF16 formats), or a dedicated fixed-function solver block which calculates the solution to the line-box intersection from coordinates loaded into its registers.



FYI AMD already uses dedicated 'Ray Accelerator' blocks for BVH testing, implemented alongside TMU texture fetching logic; RDNA2 could already make 4 ray-to-BVH intersection tests per clock, and RDNA3 further improves intersection tests by sorting the child BVH nodes in the process - so I guess it's mostly about general improvements like higher memory bandwidth, smaller cache latencies, larger caches, more register memories, and better scheduling across different execution blocks.

More details:


Suggestions for deeper architectural improvements include dedicated BVH tree building/traversal hardware (i.e. the NVidia and Intel approach) with dedicated stack memory for ray/BVH coordinates; AMD has a patent, US20230206543, on a 'hardware' (fixed-function) traversal engine describing these techniques.



In 2022 Sony also received a similar patent, WO2022040481 System and Method for Accelerated Ray Tracing with Asynchronous Operation and Ray Transformation, which describes fixed-function raytracing units for BVH traversal; this could be the implementation used in a future PlayStation 5.



A recent AMD paper on using neural networks for BVH traversal:

Neural Intersection Function. S. Fujieda, C. C. Kao, T. Harada


PS. I replied a few days ago, then realized I had vastly underestimated the complexity required for intersection testing, so I've made a new post and removed the old one.
 
It seems like AMD implements ray testing and BVH traversal with shader blocks - unlike NVidia approach which uses dedicated hardware for traversal, though aided by dedicated 'Tensor' cores (for matrix multiply-accumulate operations).
Ray traversal is in h/w on both (ray accelerator is that h/w on the AMD side), hit evaluation / intersection testing is in s/w on AMD and in h/w on Nv (and Intel). Tensor cores are not used for RT in any capacity on any GPU. I doubt that WMMAs are used in RDNA either since you'll need to go in FP16 or lower for that and it's just not enough precision.
 
Ray traversal is in h/w on both (ray accelerator is that h/w on the AMD side), hit evaluation / intersection testing is in s/w on AMD and in h/w on Nv (and Intel).
RDNA2/3 uses driver-generated shaders for BVH building, BVH traversal and ray-BVH intersection testing. These shaders are scheduled on the WGPs to run in parallel with your application's ray generation and ray hit/miss shaders.

https://gpuopen.com/learn/rdna-performance-guide/#ray-tracing
Our traversal system runs on the RDNA WGP and uses LDS extensively.
  • Avoid using too much groupshared memory in your ray tracing shaders.
  • Use an 8×4 threadgroup size to optimally use the RDNA WGP.

The BVH structure format is fixed with up to 4 child nodes. There is a single specialized instruction for ray-BVH intersection, IMAGE_BVH_INTERSECT_RAY, which can test a ray against four bounding boxes in a single clock - so this part is 'hardware', i.e. it uses a specialized fixed-function block for intersection testing within the TMUs, instead of general-purpose ALUs. There are no specialized shader instructions for other raytracing tasks.
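To picture how those pieces fit together, here is a minimal CPU-side sketch of a stack-based traversal over a 4-wide BVH in the spirit of the description above - the node layout, the stack size and the stubbed box test are all my own illustrative assumptions, not AMD's actual driver-generated shader or node format:

```cpp
// Sketch of a short-stack traversal over a 4-wide BVH: the four box tests per
// node stand in for one IMAGE_BVH_INTERSECT_RAY issue, and the small array
// stands in for the per-thread LDS stack. Purely illustrative.
#include <cstdint>

struct Ray { float origin[3], inv_dir[3], t_max; };

struct Bvh4Node {
    float    box_lo[4][3];   // AABB min corner for each child slot
    float    box_hi[4][3];   // AABB max corner for each child slot
    uint32_t child[4];       // child node index, or triangle range for leaves
    bool     is_leaf[4];
    bool     valid[4];       // a node may have fewer than 4 children
};

// Stub for the per-child box test (see the slab test sketched earlier in the thread).
static bool hit_child_box(const Ray&, const Bvh4Node&, int) { return true; }

// Stub for the leaf path: ray/triangle tests, any-hit / closest-hit handling, etc.
static void intersect_leaf(const Ray&, uint32_t /*tri_range*/) {}

void traverse(const Ray& ray, const Bvh4Node* nodes, uint32_t root) {
    uint32_t stack[64];                 // per-thread short stack (LDS on the GPU)
    int top = 0;
    stack[top++] = root;

    while (top > 0) {
        const Bvh4Node& node = nodes[stack[--top]];
        for (int s = 0; s < 4; ++s) {   // the h/w tests all four boxes at once
            if (!node.valid[s] || !hit_child_box(ray, node, s))
                continue;
            if (node.is_leaf[s])
                intersect_leaf(ray, node.child[s]);
            else if (top < 64)
                stack[top++] = node.child[s];
        }
    }
}
```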


BTW there is a Radeon Raytracing Analyzer (RRA) tool which can profile shader code and visualize the BVH by plugging directly into the graphics driver:
https://gpuopen.com/radeon-raytracing-analyzer/


Tensor cores are not used for RT in any capacity on any GPU.

Hmm.... I recall Nvidia did advertise AI denoising with Tensor cores at the time of Volta (Titan V) and Turing (RTX 2000 series):

The OptiX AI denoiser is a neural network based denoiser designed to process rendered images to remove Monte Carlo noise. The OptiX denoiser can be especially important for interactive previews in order to give artists a good sense for what their final images might look like.
The OptiX AI denoising technology, combined with the new NVIDIA Tensor Cores in the Quadro GV100, delivers 3x the performance of previous-generation GPUs and enables fluid interactivity in complex scenes.
https://developer.nvidia.com/optix-denoiser
https://developer.nvidia.com/blog/the-nvidia-optix-sdk-release-7-2/

NVIDIA Turing GPU Architecture WP-09183-001_v01
Appendix D Ray Tracing Overview

Currently, NVIDIA is making use of both AI-based and non-AI-based algorithms for denoising, choosing whatever is best for a particular application. In the future we expect AI-based denoising to continue to improve and replace non-AI-based methods, repeating the trend that has been seen in many other image-related applications for AI.
https://www.nvidia.com/en-us/geforce/news/geforce-rtx-20-series-turing-architecture-whitepaper/
 
The BVH structure format is fixed with up to 4 child nodes. There is one single specialized IMAGE_BVH_INTERSECT_RAY instruction for ray-BVH intersection, which can process ray origin/direction against four bounding boxes in one single clock. There are no specialized shader instructions for other raytracing tasks
Ray testing is done in h/w, the rest is running on SIMDs. RA is a part of WGP too btw so saying that something is "done on WGP" doesn't automatically mean that it's done by a shader.

Hmm.... I recall Nvidia did advertise AI denoising with Tensor cores at the time of Volta (Titan V) and Turing (RTX 2000 series):
Denoising is not a part of ray tracing from h/w and API point of view and can be done in many ways. Using ML for this is obviously not very universal as AMD h/w lacks the capability (or it's too slow to be useful).
 
1) a memory access operation which loads up to 120 FP32 (i.e. 32-bit single precision floating point) XYZ coordinates from local video memory (i.e. 8 points in each of the 5 bounding boxes), so it's anything from 36 bytes for triangle testing and from 192 to 480 bytes for BVH testing, and
This should be wrong.
Acceleration structures usually use axis aligned bounding boxes, so you need only a min and max value per dimension to define its range, giving 6 floats for a box in 3D.

The alternative is oriented bounding boxes. You could define this with 8 corners, but that's not common. You would need to calculate planes by taking cross products, and because it's just a rotated box, you could store only 4 corners to calculate the 6 bounding planes from that.
But I have never seen this in practice. More likely, people use a 3x3 (or 4x4 / 4x3) matrix to store the inverse orientation, use this to transform rays or points into the local space of the box, and use an AABB as above to define the box range in its local space. This would be 9+6 floats at minimum.
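A small sketch of the two representations described above; field names and layout are mine, just to make the float counts concrete:

```cpp
// AABB vs. OBB storage as described above; names are illustrative only.
struct Vec3 { float x, y, z; };

// Axis-aligned box: min and max per dimension, 6 floats.
struct AABB {
    Vec3 min_corner;
    Vec3 max_corner;
};

// Oriented box: inverse orientation plus an AABB expressed in the box's local
// frame - 9 + 6 floats. Rays get rotated into that frame once, then the usual
// AABB slab test applies unchanged.
struct OBB {
    float inv_rot[3][3];   // world -> local rotation (the inverse orientation)
    AABB  local_bounds;    // extent of the box in its own frame
};

struct Ray { Vec3 origin, dir; };

static Vec3 rotate(const float m[3][3], const Vec3& v) {
    return { m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z,
             m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z,
             m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z };
}

// Transform a ray into the OBB's local frame so a plain AABB test can be reused.
static Ray to_local(const OBB& box, const Ray& r) {
    return { rotate(box.inv_rot, r.origin), rotate(box.inv_rot, r.dir) };
}
```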

OBBs can be a much tighter fit. Imagine a tessellated cube model at some 45-degree orientation. If the OBBs can have the same orientation as the cube, our boxes will fit tightly and will have much smaller surface area and volume than is possible with AABBs.
But this is not common for raytracing because of the additional storage and BW cost.
Though, I do remember a physics engine dev switching from AABB to OBB, and he got a noticeable speedup.

But I can build up some reasonable speculation:

When NV did research on ray reordering (long before RTX), which tries to intersect batches of rays against a batch of BVH nodes to reduce bandwidth, they used the term 'treelet' for that batch of BVH nodes.
A treelet is a sub-branch of the BVH - basically a mini BVH, cutting away the path to the root and the children below. Maybe spanning just 3 or 4 levels. Geometrically this should represent a patch of surface covering a small spatial range.
It is speculated they use treelets in HW, at least to achieve compression. Because the treelet is small spatially, we can give it a bounding box and store coordinates relative to that with fewer bits, like 16.
(EDIT: We can compress indices too. We want all nodes of a treelet in contiguous memory, and we'll have less than 256 nodes in total, so 8-bit indices (instead of 32 or even 64 bits) with a single base index per treelet should do.)
This may be the reason why NV needs less memory for BVH than AMD does. (Although it depends on the branching factor too (number of children per node). We know AMD uses 4, Intel uses 6, but we don't know what NV does.)
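To make that idea concrete, here is a purely speculative sketch of what such a compressed treelet layout could look like; all field sizes and names are my own guesses, not anything NV has documented:

```cpp
// Speculative treelet compression sketch: child boxes quantized to 16 bits
// relative to the treelet's own bounds, child links as 8-bit offsets from a
// per-treelet base index. Nothing here is a documented NV format.
#include <cstdint>

struct TreeletHeader {
    float    bounds_lo[3];   // treelet AABB in full FP32...
    float    bounds_hi[3];   // ...defines the quantization range for its nodes
    uint32_t base_node;      // all nodes of the treelet are contiguous from here
};

struct CompressedChild {
    uint16_t lo[3];          // child box min, quantized into the treelet bounds
    uint16_t hi[3];          // child box max
    uint8_t  child_offset;   // < 256 nodes per treelet, so 8 bits suffice
    uint8_t  flags;          // leaf / valid bits
};  // 14 bytes, vs. 6 x 4 + 4 = 28 bytes for an uncompressed FP32 child slot

// Decode one quantized coordinate back to world space.
inline float dequantize(uint16_t q, float lo, float hi) {
    return lo + (hi - lo) * (static_cast<float>(q) / 65535.0f);
}
```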

Given this opportunity, it becomes attractive to use OBBs instead of AABBs for the treelets. We have few treelets, so the storage cost of the additional matrix is low.
We then rotate the rays only once into the local orientation of the treelet, which is low cost too. And due to the tight fit, we will skip over many boxes we otherwise would have to traverse.
When building the BVH, we can find a good orientation for a cluster of geometry easily by doing a singular value decomposition on surface normals. SVD for 3x3 is pretty fast.
We should get the best of both worlds.

I would not be surprised if NV is already doing this, and others might follow. It's an obvious optimization, and the additional complexity seems low enough for fixed-function HW.
But I also think that things like this are the main hurdle to allowing access, modification, or streaming of BVH data structures through some general API.
That's why I have little hope DXR will add such a BVH API while IHVs may still want to modify their data structures for newer architectures.
It would be much easier if IHVs released specifications for their data structures, which software devs would then have to implement per chip, if they really need such BVH access.
 