> maybe they should buy a couple of Nvidia AI hardware to do hard work for them
Why would they bother when they have their own matrix crunchers too?
> What halo?
Halo refers to the highest-end "super model".
> What halo?
Classical many-thousand-dollar halo parts a la Titan Z or 295X2 of yore, just without the caveats.
> The "cancelled" high-end RDNA 4 gaming chips were supposedly using multiple compute dies. Given how radical an idea it is, you would expect the design to go through the wringer in the lab before making it onto a product roadmap. If the rumors are true it would be interesting to know whether it's a design or a manufacturing issue. The latter is more understandable.
My guess would be that it's a cost issue first and foremost: both how much it would take to make it work and how much it would actually cost to produce in comparison to "simpler" single-chip options / competition.
> The "cancelled" high-end RDNA 4 gaming chips were supposedly using multiple compute dies. Given how radical an idea it is, you would expect the design to go through the wringer in the lab before making it onto a product roadmap. If the rumors are true it would be interesting to know whether it's a design or a manufacturing issue. The latter is more understandable.
Probably several issues, but I think a leading one is the fact that it just doesn't work: performance doesn't scale.
It bears repeating that Bondrewd, despite posting as if he's an authority with direct inside sources, admits that he doesn't really know what he purports to know when it comes down to it, and on multiple occasions has struggled (or disappeared) when it came time to explain why he was very wrong about what he said would happen.
He's not an authority, but he's very good at pretending he is through very confidently worded posts.
That could only happen if the RX 8600 had an ALU count comparable to the RX 6900/7900 - I'm not really sure that's possible for a 200 mm² die even on a 3 nm node.
> My guess would be that it's a cost issue first and foremost, both in how much it would take to make it work and then how much it would actually cost to produce in comparison to "simpler" single chip options / competition.
Yeah, probably Nvidia.
> Ah, I had thought Bondrewd was in some way affiliated with AMD due to the confidence and authority with which he posts. Though to be fair he didn't actually state that, so perhaps that's my bad.
Or - since they recently projected massive gains in HPC/AI GPGPU sales - maybe they canceled lower-margin CoWoS capacity for RDNA4. Who knows? In COVID times the substrate was the limiting factor; now it's CoWoS.
Sorry, I didn't really phrase that question well. I was really asking whether architecturally these RDNA4 parts will show massive increases in RT performance compared to their RDNA3 counterparts, e.g. 7600 -> 8600.
AMD raster performance is good, and FSR while limited does at least offer something to minimise the impact of not having a true DLSS 2 equivalent. But RT performance has barely improved from RDNA2 to RDNA3, while Nvidia have leapt further ahead every generation. AMD started late at RT, with the worst performing RT solution, and haven't really progressed since.
My gut feeling at this point is that the cancellation of top and high end RDNA4 parts would mean that AMD RT is still stuck in the mud. I can't see anyone spending high end $$$ in 2024/5 on a GPU that only has low end RT.
> maybe they canceled lower margin Cowos capacity for RDNA4
RDNA4 wasn't CoWoS.
> RDNA4 wasn't CoWoS.
So what?
> so what ?
It's all SoIC; there's no passive slab or RDL d2d links or anything.
> It's all SoIC, there's no passive slab or RDL d2d links or anything.
Ah, RDNA4 is CoW_L.
Litho-defined packages like InFO or CoWoS are no gusto for long bois, and their capacity for >1x reticle sheets is limited, so AMD sidestepped the issue.
AMD will use InFO extensively for client Zen6 parts but those are client sub-reticle things so no caveats there.
> I don't think I'd believe those CU counts even for RDNA5 unless they had massive arch changes in between, but a massively complex die/chiplet approach fits the bill.
Yeah, I was just trying to make sense of all the leaks and conjectures, but got a bit lost in the process. I've corrected the table, and I will probably update it again when more details on Navi 32, MI300X, and RDNA4 are available.
> Also don't hesitate to price the cards at US$2500, $3500 and $5500, because AI companies would be snatching them up by the thousands at a mere $10 per 1 TFLOPS - some of them are already making direct orders with AMD for hundreds of RX 7900 XTX gaming cards.
FYI, RDNA5 will be used as a data center offering too (for real).
> if architecturally these RDNA4 parts will show massive increases in RT performance compared to their RDNA3 counterparts
That might not be so awful, so long as these mainstream parts finally have massively improved RT capabilities beyond RDNA2. AMD are going to be taking RT seriously for RDNA4 ... right?
> It seems like AMD implements ray testing and BVH traversal with shader blocks - unlike NVidia's approach, which uses dedicated hardware for traversal, though aided by dedicated 'Tensor' cores (for matrix multiply-accumulate operations).
Ray traversal is in h/w on both (the ray accelerator is that h/w on the AMD side); hit evaluation / intersection testing is in s/w on AMD and in h/w on Nv (and Intel). Tensor cores are not used for RT in any capacity on any GPU. I doubt that WMMAs are used in RDNA either, since you'd need to go to FP16 or lower for that and it's just not enough precision.
> Ray traversal is in h/w on both (ray accelerator is that h/w on the AMD side), hit evaluation / intersection testing is in s/w on AMD and in h/w on Nv (and Intel).
RDNA2/3 uses driver-generated shaders for BVH building, BVH traversal, and ray-BVH intersection testing. These shaders are scheduled on the WGPs to run in parallel with your application's ray generation and ray hit/miss shaders.
Our traversal system runs on the RDNA WGP and uses LDS extensively.
- Avoid using too much groupshared memory in your ray tracing shaders.
- Use an 8×4 threadgroup size to optimally use the RDNA WGP.
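As a rough illustration of what such driver-generated traversal shaders compute, here is a minimal scalar CPU sketch of stack-based BVH traversal in Python. The node layout, names, and slab test are illustrative assumptions, not AMD's actual format; on RDNA the equivalent loop runs across WGP SIMD lanes with the traversal stack held in LDS, which is why the guide warns about groupshared memory pressure.

```python
# Hypothetical, simplified sketch of shader-based BVH traversal.
# Node layout and names are illustrative, not AMD's real format.
from dataclasses import dataclass, field

@dataclass
class Node:
    bounds: tuple                                  # ((minx,miny,minz),(maxx,maxy,maxz))
    children: list = field(default_factory=list)   # empty => leaf node
    tri_ids: list = field(default_factory=list)    # triangle ids stored in a leaf

def ray_hits_box(origin, inv_dir, bounds):
    """Slab test: does the ray intersect the axis-aligned box?"""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        t0 = (bounds[0][axis] - origin[axis]) * inv_dir[axis]
        t1 = (bounds[1][axis] - origin[axis]) * inv_dir[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        tmin, tmax = max(tmin, t0), min(tmax, t1)
    return tmin <= tmax

def traverse(root, origin, direction):
    """Stack-based traversal; on RDNA the stack would live in LDS."""
    inv_dir = tuple(1.0 / d if d != 0 else float("inf") for d in direction)
    hits, stack = [], [root]
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, inv_dir, node.bounds):
            continue                       # prune this subtree
        if node.children:
            stack.extend(node.children)    # interior node: push children
        else:
            hits.extend(node.tri_ids)      # leaf: collect candidate triangles
    return hits
```

The pruning step is where the dedicated box-intersection hardware helps; everything else in this loop stays on the SIMDs.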
> Tensor cores are not used for RT in any capacity on any GPU.
The OptiX AI denoiser is a neural network based denoiser designed to process rendered images to remove Monte Carlo noise. The OptiX denoiser can be especially important for interactive previews in order to give artists a good sense for what their final images might look like.
https://developer.nvidia.com/optix-denoiser

The OptiX AI denoising technology, combined with the new NVIDIA Tensor Cores in the Quadro GV100, delivers 3x the performance of previous-generation GPUs and enables fluid interactivity in complex scenes.
https://www.nvidia.com/en-us/geforce/news/geforce-rtx-20-series-turing-architecture-whitepaper/

NVIDIA Turing GPU Architecture WP-09183-001_v01
Appendix D Ray Tracing Overview
Currently, NVIDIA is making use of both AI-based and non-AI-based algorithms for denoising, choosing whatever is best for a particular application. In the future we expect AI-based denoising to continue to improve and replace non-AI-based methods, repeating the trend that has been seen in many other image-related applications for AI.
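For contrast with the learned denoisers the whitepaper describes, a non-AI denoiser can be as simple as a spatial smoothing pass. The following Python sketch is purely illustrative (real production filters are edge-aware and use auxiliary buffers such as normals and albedo); it just applies a 3x3 box filter to a noisy grayscale image:

```python
# Minimal non-AI denoiser sketch: a 3x3 box (mean) filter.
# Purely illustrative of the "non-AI-based algorithms" the
# whitepaper mentions; not any vendor's actual filter.

def box_filter_3x3(img):
    """img: 2D list of floats. Returns the 3x3 mean-filtered image;
    border pixels average only their in-bounds neighbors."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx]
                        n += 1
            out[y][x] = acc / n     # mean of the valid neighborhood
    return out
```

A single bright "firefly" sample gets spread over its neighborhood, which is exactly the blur-versus-detail trade-off that learned denoisers try to improve on.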
> The BVH structure format is fixed, with up to 4 child nodes. There is a single specialized IMAGE_BVH_INTERSECT_RAY instruction for ray-BVH intersection, which can process a ray origin/direction against four bounding boxes in a single clock. There are no specialized shader instructions for other raytracing tasks.
Ray testing is done in h/w; the rest runs on SIMDs. The RA is a part of the WGP too, btw, so saying that something is "done on the WGP" doesn't automatically mean that it's done by a shader.
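To make the job of a 4-wide box-intersect instruction concrete, here is a hypothetical Python model of testing one ray against the four child bounding boxes of a BVH4 node at once, returning the hit children ordered near-to-far so traversal can visit the closest subtree first. The function name and return format are assumptions for illustration only; they do not reflect the actual ISA semantics of IMAGE_BVH_INTERSECT_RAY.

```python
# Illustrative model (not the real ISA semantics) of a 4-wide
# ray vs. AABB intersection: one ray against the four children
# of a BVH4 node, returning hits sorted by entry distance.

def intersect_ray_4boxes(origin, direction, boxes):
    """boxes: list of 4 AABBs as ((minx,miny,minz),(maxx,maxy,maxz)).
    Returns [(t_entry, box_index)] for hit boxes, nearest first."""
    inv = tuple(1.0 / d if d != 0 else float("inf") for d in direction)
    hits = []
    for i, (lo, hi) in enumerate(boxes):
        tmin, tmax = 0.0, float("inf")
        for a in range(3):                 # slab test per axis
            t0 = (lo[a] - origin[a]) * inv[a]
            t1 = (hi[a] - origin[a]) * inv[a]
            if t0 > t1:
                t0, t1 = t1, t0
            tmin, tmax = max(tmin, t0), min(tmax, t1)
        if tmin <= tmax:                   # ray enters before it exits
            hits.append((tmin, i))
    hits.sort()                            # nearest child first
    return hits
```

In hardware all four slab tests happen in parallel in one operation; the surrounding traversal loop (deciding which children to push and in what order) is what runs as shader code on RDNA.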
> Hmm.... I recall Nvidia did advertise AI denoising with Tensor cores at the time of Volta (Titan V) and Turing (RTX 2000 series):
Denoising is not a part of ray tracing from the h/w and API point of view, and it can be done in many ways. Using ML for this is obviously not very universal, as AMD h/w lacks the capability (or it's too slow to be useful).
> 1) a memory access operation which loads up to 120 FP32 (i.e. 32-bit single precision floating point) XYZ coordinates from local video memory (i.e. 8 points in each of the 5 bounding boxes), so it's anything from 36 bytes for triangle testing and from 192 to 480 bytes for BVH testing, and
This should be wrong.