PS5 Pro *spawn

There was never any magic solution to RT or AI other than "more memory performance".
Not true at all. Memory has never been the sole or primary solution for any of the mentioned workloads, and there are countless ways to optimize performance through specialized logic or better-designed software. This can be achieved by using more compact data structures, brute-forcing cache sizes, compressing data, quantizing data, performing multi-token or speculative predictions for LLMs, batching work, fusing kernels, and more. There are myriad optimizations available that can dramatically improve performance in both RT and AI without relying on faster memory. If memory were the only factor, it would be the component costing thousands of dollars, not GPUs.
 
@Bold Then what happens after exhaustively applying all sorts of data compression/compaction schemes? At that point you just have an economically expensive race to see who will be first to integrate a vertically stacked memory architecture while dedicating ever more master compute die area to caches or register files ...

All RT or AI mostly exposes is who will hit the limits of their memory system first, which isn't interesting, so best wishes to consoles banking on something like HBM becoming mainstream/widespread at semi-reasonable prices ...
 
So Cerny's "custom hardware for machine learning" is just some tweak to the CUs? I'd consider that false advertising.
I mean, it may not appear to be a big customization, but I think it's a pretty significant tweak.
If XSX had dual-issue support combined with sparsity, it too would be close to 300 TOPS and able to run these ML upscaling models. These customizations alone set the 5Pro apart from XSX, let alone the RT silicon differences.

I think it's a big get. If MS could release an ML model on XSX today, I'm sure they would have by now.
The question is whether compute is the bottleneck here for XSX, and I think it is; bandwidth is roughly equal to the 5Pro's, so the only reason there isn't a model is likely that it cannot compute it in 2 ms or less.
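As a rough illustration of where a figure in that ~300 TOPS ballpark can come from, here is a back-of-the-envelope sketch that stacks the usual rate multipliers on top of a base FP32 rate. The CU count, clock and multiplier chain are assumptions for illustration, not confirmed specs, and the exact published number depends on how int8 and sparsity are counted.

```python
# Back-of-the-envelope sketch: building an int8 TOPS figure from a base FP32
# rate via format/feature multipliers. All inputs are assumptions, not specs.

CUS = 60                 # assumed active compute units
LANES_FP32 = 64          # FP32 lanes per CU per cycle (single issue)
FLOPS_PER_MAC = 2        # one multiply-accumulate counts as 2 ops
CLOCK_GHZ = 2.18         # assumed GPU clock

fp32_tflops = CUS * LANES_FP32 * FLOPS_PER_MAC * CLOCK_GHZ / 1e3
fp16_dual   = fp32_tflops * 2 * 2      # packed FP16, then dual issue
int8_dense  = fp16_dual * 2            # int8 at twice the FP16 rate
int8_sparse = int8_dense * 2           # 2:4 structured sparsity

print(f"FP32: {fp32_tflops:.1f} TFLOPS")           # ~16.7
print(f"FP16 dual issue: {fp16_dual:.1f} TFLOPS")  # ~67
print(f"INT8 dense: {int8_dense:.0f} TOPS, sparse: {int8_sparse:.0f} TOPS")  # ~134 / ~268
```

Under these assumptions the chain lands around 268 sparse int8 TOPS, i.e. the same order of magnitude as the quoted ~300 figure, while dropping the dual-issue and sparsity multipliers leaves roughly a quarter of that, which is the gap being argued about here.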
 
RDNA3.5 has plenty of other ML improvements beyond sparsity and dual issue. The GPU L0 and L1 caches are also a lot better than on XSX (which were already worse than PS5's). And don't forget that you get some additional memory contention on that bandwidth when using the CPU heavily on XSX, and the CPU is now being pushed at 60fps in those latest games. Also, if I am not mistaken, XSX has no native int8 instructions, which were only added in RDNA4.

So yes, that means ML performance on XSX is a long, long way off.
 
XSX has native hardware-level INT8 and INT4. But no one knows when or how it will be used...
 
Also, if I am not mistaken, XSX has no native int8 instructions, which were only added in RDNA4.
Series consoles all support DP4a, as full RDNA2 does, so native int8 is supported.

I'm still not understanding this point you keep bringing up about L0 and L1 being worse on XSX.

L0 is built into every dual CU. Each dual CU should have its own L0.
L1 is shared, but I haven't found any information anywhere on the size of that L1.
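For reference, DP4a (the instruction mentioned above) computes a four-element int8 dot product accumulated into a 32-bit integer, one per SIMD lane per cycle. A minimal scalar sketch of its semantics, just to show why it maps well onto quantized inference:

```python
# Scalar sketch of what a single DP4A instruction computes: the dot product of
# four packed signed 8-bit values from each operand, added to an int32
# accumulator. Hardware does this per lane per cycle; this only shows the math.

def dp4a(a_bytes, b_bytes, acc):
    """a_bytes, b_bytes: four signed 8-bit ints each; acc: int32 accumulator."""
    assert len(a_bytes) == len(b_bytes) == 4
    return acc + sum(a * b for a, b in zip(a_bytes, b_bytes))

# One step of an int8 matrix-multiply / convolution inner loop:
acc = dp4a([1, -2, 3, 4], [5, 6, -7, 8], acc=0)
print(acc)  # 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4
```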
 
Digital Foundry about F1 24
Codemasters are continually pushing visual features on the PC version, despite the yearly release cadence, and there is therefore a lot of rendering technology that can be used for the PS5 Pro. The headline here is that the PS5 Pro has enough grunt to deliver a 4K 60Hz quality mode with multiple RT effects - DDGI (dynamic diffuse global illumination, previously seen in the PS5 version), plus RTAO, RT transparency and RT opaque reflections. In the right circumstances, this is an almost generational leap in image quality.
So it looks like RT is indeed improved a lot. It's also the second game with an 8K mode (upscaled from 4K internal). Btw, GT7 is 1440p internal during races (1296p during replays).
 
As far as I understand it, aren't tensor cores part of Nvidia's SMs? This is my own non-expert, possibly incorrect, out-of-domain understanding, but I think stuff like FlashAttention works so well because it's designed so that tensor cores make use of the incredibly fast local shared memory of the SMs. So, could there be an advantage to being part of the CU?

I'm not clear on why there is concern over memory bandwidth for an image model, because it can be "memory access parallel" (again, non-expert, I don't know if there is a better term for this) for each output pixel, and with 4K and 8K outputs that's a lot of parallelization. In fact, the leak says 2 ms of frametime but also that the memory requirement is 250 MB. This lines up with the DLSS/XeSS DLL sizes as well as the fact that VRAM doesn't blow up when running the upscaling models. At the supposed 576 GB/s memory bandwidth of the Pro, reading the model (if 250 MB) only takes 0.4 ms, right?
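A quick sanity check of that last number, under the stated assumptions (250 MB of weights, the quoted 576 GB/s peak, and ignoring activations, cache reuse and contention, so this is a lower bound rather than a measurement):

```python
# Time to stream an assumed ~250 MB model once at the Pro's quoted ~576 GB/s.
# Ignores activations, reuse and contention; a lower bound, not a measurement.

model_bytes = 250e6      # assumed model footprint from the leak
bandwidth_bps = 576e9    # quoted peak memory bandwidth

read_time_ms = model_bytes / bandwidth_bps * 1e3
print(f"{read_time_ms:.2f} ms per full pass over the weights")  # ~0.43 ms
```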

We also know that the performance requirement of DLSS upscaling hasn't increased, even though the overall compute capability of GPUs has, and it is of course still useful on Turing, which has 52 (2060) to 108 (2080 Ti) tensor core 16b TFLOPS. That actually matches well with the general compute of the RDNA3 line-up of 43 (7600) to 123 (7900 XTX) with the dual-issue FP16 numbers. Of course, because RDNA3 is overall faster than Turing, taking the same time to upscale would see less of an overall uplift. However, PSSR is probably always going to be used in a "4K performance mode", which gives the biggest "window" for uplift, whereas many people with RTX GPUs will be running at lower output resolutions and higher quality upscaling settings.

To get an idea of that impact, we can compare the PS5 Pro to the 3070 Ti, which has a similar average performance uplift over the PS5-analog 6700 in the TPU charts (40%), and which compares well to the 7700 XT, a PS5 Pro "compute" analog. The 3070 Ti has 87 tensor TFLOPS (608 GB/s memory bandwidth), compared to the 67 of the PS5 Pro's FP16 numbers, which is a 30% uplift. If we look at a recent GPU performance roundup (TPU's 4070S review), the 3070 Ti has an average FPS at 1080p and 4K of 117 (8.5 ms) and 51 (19.6 ms), respectively. For the 7700 XT, the numbers are 125 (8 ms) and 51 (19.6 ms). So the 3070 Ti, which absolutely benefits from DLSS, has similar performance to the 7700 XT (though it remains to be seen whether that's comparable to the PS5 Pro) but only 30% more FLOPS for upscaling, which must take up only a fraction of the "saved" ~11 ms from rendering at a lower resolution.

Thus, even if PSSR is running on the CUs at the dual-issue rate, it should still not be too far off the performance of DLSS, with the big assumption that the model is a similar size (and using these naive back-of-the-envelope estimates; obviously I am interested in hearing feedback with other takes on this). Hardware concurrency is an issue we don't have information on for the PS5 Pro, but isn't that ultimately limited by the "pipeline concurrency", i.e. all inputs are needed to start the model inference and at some point the finished output frame is needed for final postprocessing?
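Putting that frame-time argument into numbers, using the FPS figures quoted above (the 1080p-vs-4K gap is the budget an upscaler has to fit into for a performance mode to pay off):

```python
# Frame-time version of the comparison above, from the quoted TPU averages.
# The 4K-minus-1080p gap is the rough budget available to an upscaler.

def frame_time_ms(fps):
    return 1000.0 / fps

for gpu, fps_1080p, fps_4k in [("3070 Ti", 117, 51), ("7700 XT", 125, 51)]:
    t_1080 = frame_time_ms(fps_1080p)
    t_4k = frame_time_ms(fps_4k)
    print(f"{gpu}: {t_1080:.1f} ms @1080p, {t_4k:.1f} ms @4K, "
          f"~{t_4k - t_1080:.1f} ms freed by rendering at 1080p")
```

That is roughly 11 ms of headroom on either card against a ~2 ms upscaling cost, which is the point being made.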
 
Digital Foundry
The good news here is that the performance in Dragon's Dogma 2 seems to have been improved, with the PS5 Pro's frame-rates in the 50s in city areas that are in the 30s to low 40s on base PS5. That means that the PS5 Pro is likely to be within the console's VRR window even at 60Hz, which should reduce judder and make for a smoother-feeling experience.
Plus PSSR, so a nice improvement considering the CPU is almost the same.
 
What else can tensor-like cores do in a gaming console outside upscaling? If after 2 ms they're just sitting there waiting on the next frame, then you're probably better off putting the time and transistor resources needed to build and connect a new block into just altering the CUs. It's not as sexy, but then it's a device to play games and not a machine learning blade system.
Why does Nvidia keep tensor cores in their GPUs?

And we needn't be surprised about custom hardware. Strix Point already has an NPU.
 
From Digital Foundry

So it looks like RT is indeed improved a lot. It's also the second game with an 8K mode (upscaled from 4K internal). Btw, GT7 is 1440p internal during races (1296p during replays).

Digital Foundry

Plus PSSR, so a nice improvement considering the CPU is almost the same.
GT7 has 1440p RT with PSSR, and that should be comparable with 4K?

Dragon's Dogma is more interesting because it is CPU-limited, so how can it improve so much in the city?
 
Digital Foundry

Plus PSSR, so a nice improvement considering the CPU is almost the same.
Once they've patched the buggy RTAO (an improvement over their AO solution on PS5), this could be the game showing the most dramatic improvements, with better framerates (now in the VRR window), RTAO and better IQ. Isn't it paradoxical, when this game was supposed to be the only one not to have noticeable framerate improvements on the Pro?
 
On that subject, if I am not mistaken, Tensor Cores also share resources with shaders on Nvidia cards, am I right?
According to NVIDIA, concurrent CUDA and Tensor operations are not supported on Volta (V100, Titan V); they became supported with Turing (RTX 2000), but with limitations based on workload and resource availability. The concurrency was limited by scheduling bottlenecks and resource contention between floating-point and Tensor operations.

Ampere (RTX 3000) improves on this through better scheduling algorithms, better load balancing and enhanced caching, and Ada (RTX 4000) improved on it even further with an enhanced scheduler and better concurrency management. So concurrency is improving substantially with each generation.

The main benefit of the tensor block remains that it churns through the machine learning work very, very quickly, taking 1.5 ms or less to finish processing, where the regular shader block would have taken significantly more.

There are two very informative threads about this on the NVIDIA dev forums.

 
Most of the games seem to choose between PSSR and settings improvements. I'm curious about Guerilla's new reconstruction tech. I wonder if they are doing anything interesting or if it's just the standard tech tailored specifically to their art.
 
Why does Nvidia keep tensor cores in their GPUs?

And we needn't be surprised about custom hardware. Strix Point already has an NPU.
Because they're vital for high-performance machine learning applications outside gaming.
I seem to remember their adoption in game upscaling was along the lines of Nvidia thinking, "we've got these cores, what can we use them for in a gaming system?"
 
Given the results of Guerilla Games' non-PSSR upscaling in H:FW, not having dedicated ML upscaling hardware is actually a plus, as it affords devs more flexibility. There's clearly still potential in non-ML upscaling, and maybe a 'soft ML' solution combining ML stuff with GG's or others' algorithms would be the ideal? Probably not a lot of devs are investigating upscaling given existing off-the-shelf solutions, but perhaps GG, Insomniac, et al. can put their heads together and develop the art in another direction?

That also means the ML hardware is just a generic resource and not tied to PSSR, so devs could use it elsewhere, although in the spirit of compatibility probably not a lot besides perhaps some ML particle/visual effects.
 
tensor core 16b TFLOPS
DLSS/XeSS and PSSR (probably) use int8.
Why int8 is most likely:
1) XeSS, which provides comparable quality to DLSS on Intel Arc, uses int8
https://www.tomshardware.com/news/intel-xess-technology-demo-and-overview

2) Int8 is accelerated by all Nvidia GPUs from Turing onwards

3) Alex Battaglia said DLSS uses int8

4) FP16 is higher precision than needed and a big hit to memory


The int8 rate of RDNA2/3 is the same, and it is very low compared to Turing:
2070 = 126 TOPS
7900 XT = 103 TOPS
A DLSS-like NN is not possible on an RDNA3 GPU at acceptable speed.
Either Sony has made some other changes to the architecture besides the new RT blocks, or some kind of coprocessor or a separate ASIC is used in the GPU.
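To put point 4 into rough numbers: for a hypothetical upscaling network whose int8 footprint lands near the ~250 MB figure mentioned upthread, the weight format directly doubles or halves both the footprint and the time to stream the weights. A sketch under those assumptions, not measured data:

```python
# Rough sketch of the memory cost of weight formats for a hypothetical
# upscaling network. Parameter count is chosen so the int8 footprint lands
# near the ~250 MB figure mentioned upthread; bandwidth is the quoted peak.

params = 250e6               # hypothetical parameter count
bandwidth_bps = 576e9        # quoted peak memory bandwidth

for fmt, bytes_per_weight in [("int8", 1), ("fp16", 2)]:
    footprint_mb = params * bytes_per_weight / 1e6
    stream_ms = params * bytes_per_weight / bandwidth_bps * 1e3
    print(f"{fmt}: ~{footprint_mb:.0f} MB of weights, ~{stream_ms:.2f} ms to stream once")
```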
 
Given the results of Guerilla Games' non-PSSR upscaling in H:FW, not having dedicated ML upscaling hardware is actually a plus, as it affords devs more flexibility. There's clearly still potential in non-ML upscaling, and maybe a 'soft ML' solution combining ML stuff with GG's or others' algorithms would be the ideal? Probably not a lot of devs are investigating upscaling given existing off-the-shelf solutions, but perhaps GG, Insomniac, et al. can put their heads together and develop the art in another direction?

That also means the ML hardware is just a generic resource and not tied to PSSR, so devs could use it elsewhere, although in the spirit of compatibility probably not a lot besides perhaps some ML particle/visual effects.

The main benefit of ML is its ability to use lower input resolutions and still get good results.

There has been some really impressive non-ML-based upscaling in games over the years (checkerboarding in Days Gone is a noteworthy example), but it requires a much higher resolution input than an ML-based upscaler would use.

So they are useful, but will require more performance.
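As a rough illustration of that input-resolution point (the resolutions below are typical examples, not measured values for any specific game), here is a sketch of how many pixels actually get shaded for different routes to a 4K output:

```python
# Shaded-pixel cost of different routes to a 4K output. Input resolutions are
# illustrative examples, not measured values for any particular game.

target = 3840 * 2160

modes = {
    "native 4K":                    3840 * 2160,
    "checkerboard (half of 4K)":    3840 * 2160 // 2,
    "ML performance (1080p input)": 1920 * 1080,
}

for name, shaded in modes.items():
    print(f"{name}: {shaded / 1e6:.1f}M pixels shaded ({shaded / target:.0%} of output)")
```

The ML path shades roughly a quarter of the output pixels versus half for checkerboarding, which is where the extra performance comes from, assuming the reconstruction holds up.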
 