Preordered
Me too. Missed the special edition, but it was a little pricey anyway.
Preordered
> Then what after exhaustively applying all sorts of data compression/compaction schemes? At that point you just have an economically expensive race to see who will first get to integrating a vertically stacked memory architecture whilst dedicating ever more master compute die area to caches or register files ...

For the past 20+ years, GPUs and software have consistently evolved toward higher math density algorithms. It is no different with RT and AI this time. As an example, per-pixel ReSTIR is more computationally intensive than brute-force global illumination because of the additional calculations done per pixel, yet it achieves significantly higher quality results with the same number of samples. On-device inference for neural compressed textures takes time as well, but it compresses textures another 8–10 times beyond what is possible with simple block compression.
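To put rough numbers on that compression claim: block compression such as BC7 stores 16 bytes per 4×4 block, i.e. 1 byte per texel, and the post above claims a further 8–10x from neural compression. A back-of-the-envelope sketch; the 4K texture size is an illustrative assumption, and the neural rates simply take the claimed ratio at face value:

```cpp
#include <cstdio>
#include <initializer_list>

// Back-of-the-envelope bytes-per-texel for one 4096x4096 RGBA texture:
// uncompressed vs BC7 block compression vs the claimed 8-10x further
// gain from neural compression. Rates are illustrative, not measured.
int main() {
    const double texels = 4096.0 * 4096.0;
    const double bc7 = 1.0;  // BC7: 16 bytes per 4x4 block = 1 B/texel
    printf("RGBA8: %.1f MB, BC7: %.1f MB\n",
           texels * 4.0 / 1e6, texels * bc7 / 1e6);
    for (double gain : {8.0, 10.0}) {  // claimed neural gain over BC7
        printf("neural @ %gx over BC7: %.2f B/texel, %.1f MB\n",
               gain, bc7 / gain, texels * bc7 / gain / 1e6);
    }
    return 0;
}
```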
> What are the odds that there is an XDNA block in the Pro now?

Did something change in the hardware news?
> In that case, it would've been lovely to get some Infinity Cache in the Pro for exactly this reason. Especially for 700 bloody quid.

More cache a la Infinity Cache isn't necessarily better here. If you look at this Chips and Cheese article, the L1 hit rate is really low, as is the Infinity Cache hit rate (top bar).
> So Cerny's "custom hardware for machine learning" is just some tweak to the CUs? I'd consider that false advertising.

From Chips and Cheese's look at RDNA 4 in LLVM; the article dates back to January 2024:
Sparsity
Moving to lower precision data formats is one way to scale matrix multiplication performance beyond what process node and memory bandwidth improvements alone would allow. Specialized handling for sparse matrices is another way to dramatically improve performance. Matrices with a lot of zero elements are known as sparse matrices. Multiplying sparse matrices can involve a lot less math because any multiplication involving zero can be skipped. Storage and bandwidth consumption can be reduced too because the matrix can be stored in a compressed format.
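To make both points concrete, here is a minimal scalar sketch of compressed sparse storage and a multiply loop that never issues a multiplication by zero. This is generic CSR-style illustration code, not AMD's hardware format:

```cpp
#include <cstdio>
#include <vector>

// Minimal CSR (compressed sparse row) storage: only nonzeros are kept,
// so both the math and the bytes scale with the nonzero count.
struct Csr {
    int rows, cols;
    std::vector<double> val;  // nonzero values
    std::vector<int>    col;  // column index of each nonzero
    std::vector<int>    ptr;  // ptr[r]..ptr[r+1] spans row r's nonzeros
};

// y = A * x: every multiply below involves a stored nonzero; the zero
// entries of A are never touched, which is where the savings come from.
std::vector<double> spmv(const Csr& a, const std::vector<double>& x) {
    std::vector<double> y(a.rows, 0.0);
    for (int r = 0; r < a.rows; ++r)
        for (int i = a.ptr[r]; i < a.ptr[r + 1]; ++i)
            y[r] += a.val[i] * x[a.col[i]];
    return y;
}

int main() {
    // 2x3 matrix [[1 0 2], [0 3 0]] stored as 3 nonzeros instead of 6 values.
    Csr a{2, 3, {1.0, 2.0, 3.0}, {0, 2, 1}, {0, 2, 3}};
    auto y = spmv(a, {1.0, 1.0, 1.0});
    printf("y = [%g, %g]\n", y[0], y[1]);  // expect [3, 3]
    return 0;
}
```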
RDNA 4 introduces new SWMMAC (Sparse Wave Matrix Multiply Accumulate) instructions to take advantage of sparsity. SWMMAC similarly does a C += A * B operation, but A is a sparse matrix stored in half of B’s size. A sparsity index is passed as a fourth parameter to help interpret A as a full size matrix. My interpretation of this is that the dimensions in the instruction mnemonic refer to stored matrix sizes. Thus a 16x16x32 SWMMAC instruction actually multiplies a 32×16 sparse matrix with a 16×32 dense one, producing a 32×32 result.
If I guessed right, SWMMAC instructions would be the same as their WMMA siblings, but produce a result matrix twice as long in each dimension.
| Instruction | Multiplied Matrices (A and B) Format | Matrix Sizes | Result/Accumulate Format |
|---|---|---|---|
| V_SWMMAC_F32_16X16X32_F16 | FP16 | A: 16×16 stored / 32×16 actual; B: 16×32; result: 32×32 | FP32 |
| V_SWMMAC_F32_16X16X32_BF16 | BF16 | | FP32 |
| V_SWMMAC_F16_16X16X32_F16 | FP16 | | FP16 |
| V_SWMMAC_BF16_16X16X32_BF16 | BF16 | | BF16 |
| V_SWMMAC_I32_16X16X32_IU8 | INT8 | | INT32 |
| V_SWMMAC_I32_16X16X32_IU4 | INT4 | | INT32 |
| V_SWMMAC_I32_16X16X64_IU4 | INT4 | A: 16×16 stored / 32×16 actual; B: 16×64; result: 32×64 | INT32 |
| V_SWMMAC_F32_16X16X32_FP8_FP8 | FP8 | | FP32 |
| V_SWMMAC_F32_16X16X32_FP8_BF8 | FP8 and BF8 | | FP32 |
| V_SWMMAC_F32_16X16X32_BF8_FP8 | BF8 and FP8 | | FP32 |
| V_SWMMAC_F32_16X16X32_BF8_BF8 | BF8 | | FP32 |
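The LLVM patches don't spell out how the sparsity index encodes A's layout, so the following is a hedged scalar sketch under one plausible reading: a 2:4-style structured scheme like Nvidia's, where A stores half of each row plus per-element position indices. All names and the 2-of-4 grouping are assumptions for illustration only:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical emulation of a structured-sparse C += A * B in the spirit of
// SWMMAC: A is stored at half width (2 values kept per group of 4 logical
// columns), and a sparsity index records where each stored value sits.
// The 2:4 grouping mirrors Nvidia's scheme and is NOT a confirmed RDNA 4 detail.
void swmmac_emu(int M, int N, int K,               // K = logical (full) inner dim
                const std::vector<float>& a_vals,  // M x K/2 stored values of A
                const std::vector<uint8_t>& a_idx, // position (0..3) within each group
                const std::vector<float>& b,       // K x N dense matrix
                std::vector<float>& c) {           // M x N accumulator
    for (int m = 0; m < M; ++m)
        for (int s = 0; s < K / 2; ++s) {
            int k = (s / 2) * 4 + a_idx[m * (K / 2) + s]; // expand to logical column
            float av = a_vals[m * (K / 2) + s];
            for (int n = 0; n < N; ++n)                   // implied zeros of A are skipped
                c[m * N + n] += av * b[k * N + n];
        }
}

int main() {
    // One row, K=4: logical A row is [5 0 0 7], stored as {5, 7} + indices {0, 3}.
    std::vector<float> b = {1, 2, 3, 4};  // K x N with N = 1
    std::vector<float> c = {0};
    swmmac_emu(1, 1, 4, {5, 7}, {0, 3}, b, c);
    printf("c = %g (expect 5*1 + 7*4 = 33)\n", c[0]);
    return 0;
}
```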
Of course there’s no way to infer performance changes from looking at LLVM code, but I wonder if AMD will invest in higher per-SIMD matrix multiplication performance in RDNA 4. RDNA 3’s WMMA instructions provide the same theoretical throughput as using dot product instructions.
Since SWMMAC takes a sparse matrix where only half the elements are stored, perhaps RDNA 4 can get a 2x performance increase from sparsity.

> [WMMA] instructions work over multiple cycles to compute the result matrix and internally use the DOT instructions
“RDNA 3” Instruction Set Architecture Reference Guide
Preordered
> I got one of the 30th Anniversary bundles

So now time to sell it for 2k and buy a normal Pro.
> I got one of the 30th Anniversary bundles

Congrats, they are like 10K on eBay.
> DLSS/XeSS and PSSR (probably) use int8.
> Why int8 is most likely:
> 1) XeSS, which provides comparable quality to DLSS on Intel Arc, uses int8
> https://www.tomshardware.com/news/intel-xess-technology-demo-and-overview
> 2) int8 is accelerated by all Nvidia GPUs from Turing onward
> 3) Alex Battaglia said DLSS uses int8
> Death Stranding PC: how next-gen AI upscaling beats native 4K (www.eurogamer.net)
> The int8 rate of RDNA 2/3 is the same, and it is very low compared to Turing:
> 2070 = 126 TOPS
> How to accelerate AI applications on RDNA 3 using WMMA (gpuopen.com)
> 7900 XT = 103 TOPS
> A DLSS-like NN is not possible on RDNA 3 GPUs at acceptable speed.
> Either Sony has made some other changes to the architecture besides the new RT blocks, or some kind of coprocessor or a separate ASIC is used in the GPU.

Thanks for the input; I hadn't seen that article, and there isn't much information about DLSS in the wild. If it is int8, then aren't the more relevant numbers the dp4a versions, not WMMA? So the RDNA 3 lineup would be ~80 to 240 dp4a TOPS? To be clear, I don't know that this is correct; I'm asking. I understand that not every op of a NN will be dp4a, but for a high-res image model I speculate they will dominate the runtime cost. We also already know from XeSS that you can get good quality and performance on existing RDNA 3 hardware. I'm not at all trying to suggest my interpretation is correct; I just don't see the numbers showing that a small model by 2024 standards would be too slow to be useful on modern hardware. The cited article states 2.5 ms of overhead for the 2060 with ~115 TOPS int8, so that would be ~3.6 ms at the dp4a rate of the 7600, i.e. on the order of 1 ms extra, or even 2 ms to be conservative. I don't see how that wouldn't provide a useful uplift at 4K output. And of course, again, we do see useful uplifts with XeSS.
> 4) FP16 is too high, a huge blow to memory

Why? The leak states that the model and buffers take 250 MB. This superficially seems similar to both DLSS and XeSS for the reasons I noted above. Is my inference about absolute memory read time per frame incorrect? Are modern games using far less than that per frame? I would have thought it would be much more.
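For what it's worth, that "absolute memory read time" inference can be sanity-checked with simple bytes-over-bandwidth math. The 250 MB figure is from the leak; the ~576 GB/s figure is the commonly reported PS5 Pro memory bandwidth; and the result ignores caching and contention, so it is only a rough bound:

```cpp
#include <cstdio>

// Sanity check: how long does it take just to stream N bytes from VRAM?
// t = bytes / bandwidth. Ignores caches, contention, and reuse, so this
// is only a rough bound on the memory-read-time-per-frame question.
int main() {
    const double model_bytes = 250e6;  // leaked model + buffers size
    const double bw = 576e9;           // reported PS5 Pro bandwidth, bytes/s
    printf("one full read: %.2f ms\n", model_bytes / bw * 1e3);  // ~0.43 ms
    return 0;
}
```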
> Why?

Modern GPUs do not have infinite L0/L1/register cache. Therefore, the lowest possible precision is used for quantization, depending on the neural network.

> aren't the more relevant numbers the dp4a versions, not WMMA?

You are comparing the incomparable. A DP4A NN has worse quality and is not comparable to DLSS.

> So the RDNA 3 lineup would be ~80 to 240 dp4a TOPS?

These numbers are not correct. RDNA 3 has not received any improvement in the execution of these instructions; only bf16 and fp16 with matrix multiplications.
> I got one of the 30th Anniversary bundles

Lucky. I tried to snag a 30th Anniversary bundle, but no dice. I have 3 Pros preordered to replace my PS5s: one for the living room, one for the racing setup, and one for the bedroom, but I might cancel one and keep 2. We shall see.
> Lucky. I tried to snag a 30th Anniversary bundle, but no dice. I have 3 Pros preordered to replace my PS5s. [...]

Damn. That's a lot of Pro.
> Modern GPUs do not have infinite L0/L1/register cache. Therefore, the lowest possible precision is used for quantization, depending on the neural network.
> You are comparing the incomparable. A DP4A NN has worse quality and is not comparable to DLSS.
> These numbers are not correct. RDNA 3 has not received any improvement in the execution of these instructions; only bf16 and fp16 with matrix multiplications.

Thanks for the response and careful explanation. I don't want this to get off topic, so I'll just reiterate that my goal is to understand the compute requirements for an upscaler of DLSS quality, to see how much performance that could leave for other quality options on the PS5 Pro. I made comparisons to RDNA 3 and a DP4A NN not to equate them, but as a worst-case reference to build from, because we don't yet know the specifics of the PS5 Pro other than that it has custom hardware that should (I hope) perform better than RDNA 3 and DP4A.
> For the past 20+ years, GPUs and software have consistently evolved toward higher math density algorithms. [...]

Nvidia would think otherwise, since their most optimized implementation of ReSTIR involves using SER to spill arguments to the L2 cache in order to reorder threads. There's absolutely no math involved in that part of the process; it's ALL MEMORY operations!
> When it comes to LLMs, the idea is similar: maximizing data reuse in caches and registers is key to achieving higher math density. This can be achieved by batching multiple prompts and processing them in parallel, or by performing multi-token or speculative predictions at single-prompt granularity, leveraging additional parallelism within a single request. In the case of the new class of models, Large Reasoning Models, there is inherently infinite parallelism, as they can launch an infinite number of prompts over the same weights at each reasoning step, enabling them to evaluate a large number of hypotheses in a single step.

All of this is pure drivel coming from you, since we still can't run the simplest of LLMs on many NPUs. None of these optimizations whitewashes away the underlying fact that we have a memory problem ...
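Setting the rhetoric aside, the data-reuse claim in the quote can be made concrete: for y = W * X, each fetched weight is reused once per batch element, so FLOPs per byte grow roughly linearly with batch size until compute saturates. A minimal sketch with an assumed square FP16 layer; the sizes are illustrative only:

```cpp
#include <cstdio>
#include <initializer_list>

// Arithmetic intensity (FLOPs per byte) of y = W * X for an NxN FP16 weight
// matrix and a batch of B activation vectors. Weights dominate traffic at
// small B, so intensity scales ~linearly with batch size: the data-reuse
// effect the quoted post describes. Layer width is an assumed example value.
int main() {
    const double n = 4096.0;                         // layer width (assumed)
    for (double b : {1.0, 8.0, 64.0}) {
        double flops = 2.0 * n * n * b;              // multiply-accumulate count
        double bytes = 2.0 * (n * n + 2.0 * n * b);  // FP16 weights + in/out activations
        printf("batch %3.0f: %6.1f FLOPs/byte\n", b, flops / bytes);
    }
    return 0;
}
```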
> Damn. That's a lot of Pro.

I currently have a 4080 Super and have decided not to upgrade to the 5090, as I can't see one PC exclusive worth upgrading for... There's no Cyberpunk equivalent on the horizon to get me hyped, so I'm just redirecting that money. Once I sell my other PS5s, it should be like $1000-$1500 CAD out of pocket, depending on whether I keep the 3rd Pro. Then again, when the 5090 comes out I might change my mind, but I'm sure scalpers will remove that option from the table.
> 3070 Ti: 174 int8 TOPS (1.77 * 48 * 2048)
> PS5 Pro: 68 int8 TOPS at RDNA 3 WMMA rates (2.2 * 60 * 512)
>
> If DLSS takes ~1 ms on a 3070 Ti, then a DLSS-sized network would take 2.6 ms* on the CUs of the PS5 Pro, which seems like a good trade-off when the difference between 1080p and 4K frame time is on the order of 10 ms. If DLSS takes 2 ms on the 3070 Ti, it would take 5.2 ms on the PS5 Pro CUs, which now seems like a bad trade-off.
>
> *Is this a naive extrapolation where lower-level performance differences between dedicated tensor cores and CUs could manifest?
>
> In general, it seems like modern general-purpose compute hardware is closer to being a good trade-off than the "conventional wisdom" believes, which I read as implying some obvious showstopper. That is, the conversation is entirely about the rate being much faster, without consideration of the absolute time taken or how that time relates to the upscaling "window". But, for the point of this thread, I'm suggesting that the cost of the upscaler is probably not what is taking away other possible improvements (even in my worst-case construction), because that seemed to be a theme of some of the earlier comments, which is why I went down this rabbit hole in the first place.

Not quite how this works. You're assuming bandwidth is unlimited in these scenarios and that it's a pure compute play when you're doing these calculations.
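For reference, the quoted estimate is just peak-rate arithmetic plus linear scaling, sketched below using the post's own per-unit op rates (not official figures). As the reply notes, it assumes a purely compute-bound workload with unlimited bandwidth:

```cpp
#include <cstdio>
#include <initializer_list>

// Reproduce the quoted post's arithmetic: peak int8 TOPS = clock (GHz) *
// unit count * int8 ops per unit per clock, then scale a reference frame
// time linearly by the TOPS ratio. This assumes the workload is purely
// compute-bound, which is precisely what the reply above disputes.
int main() {
    double tops_3070ti = 1.77 * 48 * 2048 / 1000.0;  // ~174 int8 TOPS
    double tops_ps5pro = 2.20 * 60 * 512 / 1000.0;   // ~68 int8 TOPS (RDNA 3 WMMA rate)
    for (double ref_ms : {1.0, 2.0}) {               // assumed DLSS cost on the 3070 Ti
        double est = ref_ms * tops_3070ti / tops_ps5pro;
        printf("%.0f ms on 3070 Ti -> ~%.1f ms on PS5 Pro CUs\n", ref_ms, est);
    }
    return 0;
}
```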