As far as I understand it, aren't tensor cores part of Nvidia's SMs? This is my own non-expert, possibly incorrect, out-of-domain understanding, but I think stuff like FlashAttention works so well because it's designed so that the tensor cores make use of the SMs' incredibly fast local shared memory. So couldn't there be an advantage to being part of the CU?
I'm not clear on why there is concern over memory bandwidth for an image model, because it can be "memory access parallel" (again, non-expert, I don't know if there is a better term for this) for each output pixel, and with 4k and 8k outputs that's a lot of parallelization. In fact, the leak says 2 ms of frametime but also that the memory requirement is 250 MB. That lines up with the DLSS/XeSS DLL sizes, as well as the fact that VRAM doesn't blow up when running the upscaling models. At the supposed 576 GB/s memory bandwidth of the Pro, reading the model (if it's 250 MB) only takes ~0.4 ms, right?
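A quick sanity check on that number in Python, taking the leaked 250 MB model size and the rumored 576 GB/s bandwidth at face value (both unconfirmed):

```python
# Back-of-envelope: time to stream the model weights once from VRAM.
# 250 MB model size (leaked) and 576 GB/s bandwidth (rumored); both unconfirmed.
model_bytes = 250e6    # ~250 MB of weights
bandwidth_bps = 576e9  # bytes per second
read_time_ms = model_bytes / bandwidth_bps * 1e3
print(f"one full weight read: {read_time_ms:.2f} ms")  # ~0.43 ms
```

So even reading every weight once per frame is a small fraction of the leaked 2 ms budget, which is why the bandwidth concern doesn't seem obvious to me.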
We also know that the performance cost of DLSS upscaling hasn't increased, even though the overall compute capability of GPUs has, and it is of course still useful on Turing, which offers 52 (2060) to 108 (2080 Ti) tensor-core FP16 TFLOPS. That range actually matches the general compute of the RDNA3 line-up, 43 (7600) to 123 (7900 XTX) TFLOPS, using the dual-issue FP16 numbers. Of course, because RDNA3 is overall faster than Turing, spending the same time on upscaling would translate into a smaller relative uplift. However, PSSR is probably always going to be used in a "4k performance mode," which gives the biggest "window" for uplift, whereas many people with RTX GPUs will be running at lower output resolutions and higher-quality upscaling settings.
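To make that overlap concrete, here's a small sketch using those spec-sheet peaks. Caveat: the RDNA3 figures assume perfect dual-issue, which real shader code rarely sustains, so treat them as upper bounds, while tensor-core peaks are generally easier to approach in practice:

```python
# Peak FP16 throughput (TFLOPS) from spec listings: Turing tensor cores
# vs RDNA3 dual-issue shaders. RDNA3 numbers assume perfect dual-issue.
turing_tensor = {"RTX 2060": 52, "RTX 2080 Ti": 108}
rdna3_dual_issue = {"RX 7600": 43, "RX 7900 XTX": 123}

lo = max(min(turing_tensor.values()), min(rdna3_dual_issue.values()))
hi = min(max(turing_tensor.values()), max(rdna3_dual_issue.values()))
print(f"overlapping FP16 range: {lo}-{hi} TFLOPS")  # 52-108
```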
To get an idea of that impact, we can compare the PS5 Pro to the 3070 Ti, which has a similar average performance uplift (~40% in the TPU charts) over the 6700, the PS5's closest PC analog, and which compares well to the 7700 XT, the PS5 Pro's "compute" analog. The 3070 Ti has 87 tensor TFLOPS (and 608 GB/s memory bandwidth), compared to the PS5 Pro's 67 FP16 TFLOPS, a 30% uplift.

If we look at a recent GPU performance roundup (TPU's 4070 Super review), the 3070 Ti averages 117 FPS at 1080p (8.5 ms) and 51 FPS at 4k (19.6 ms). For the 7700 XT, the numbers are 125 (8 ms) and 51 (19.6 ms).

So the 3070 Ti, which absolutely benefits from DLSS, performs similarly to the 7700 XT (though whether that holds against the PS5 Pro remains to be seen) while only having 30% more FLOPS for upscaling, which can only take up a fraction of the ~11 ms "saved" by rendering at the lower resolution. Thus, even if PSSR runs on the CUs at the dual-issue rate, it shouldn't be too far off the performance of DLSS, with the big assumption that the model is a similar size (and these are naive back-of-the-envelope estimates, so obviously I'm interested in hearing feedback and other takes on this). Hardware concurrency is something we don't have information on for the PS5 Pro, but isn't that ultimately limited by "pipeline concurrency", i.e. all inputs are needed before model inference can start, and at some point the finished output frame is needed for final post-processing?
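Here's the arithmetic behind those frametime and budget figures, plus the naive inverse-TFLOPS scaling of the leaked 2 ms PSSR cost onto the 3070 Ti; the scaling step is my assumption, not a measured DLSS number:

```python
def frametime_ms(fps: float) -> float:
    """Convert average FPS to average frame time in milliseconds."""
    return 1000.0 / fps

# TechPowerUp 4070 Super review averages (figures quoted above).
fps = {
    "3070 Ti": {"1080p": 117, "4k": 51},
    "7700 XT": {"1080p": 125, "4k": 51},
}

for gpu, res in fps.items():
    saved = frametime_ms(res["4k"]) - frametime_ms(res["1080p"])
    print(f"{gpu}: {frametime_ms(res['1080p']):.1f} ms at 1080p, "
          f"{frametime_ms(res['4k']):.1f} ms at 4k, ~{saved:.1f} ms saved")

# Naive scaling: assume inference time is inversely proportional to peak
# FP16 throughput (87 vs 67 TFLOPS), starting from the leaked 2 ms figure.
pssr_ms = 2.0
dlss_equiv_ms = pssr_ms * 67 / 87
print(f"naive DLSS-equivalent cost on 3070 Ti: {dlss_equiv_ms:.1f} ms")  # ~1.5 ms
```

Either way, a ~1.5-2 ms upscale pass is small next to the ~11 ms freed up by dropping from 4k to 1080p rendering, which is the core of my argument.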