PlayStation 5 [PS5] [Release: November 12, 2020]

Why is there an assumption that the PS5's RDNA2 GPU lacks the same 4xINT8 / 8xINT4 throughput capabilities that have been present in all RDNA GPUs since Navi 14, which released over a year before both consoles?
Microsoft claimed they had custom optimizations for ML loads ("special hardware support for this specific scenario"), not that they invented 4xINT8 / 8xINT4 mixed dot products for RDNA ALUs that had already been inside released GPUs a year earlier. We're probably looking at hardware support for custom ML instructions.


Furthermore, are the faster INT8 and INT4 throughputs even useful for inference-based image upscaling? AFAIK INT4 and INT8 are useful for weight values in inference, but how often are those going to appear when calculating pixel color values?
Pixel shader calculations can go down to FP16 precision in some cases, but in most cases they use FP32, and I doubt the framebuffer holds pixel values at less than 24 bits. If they downgraded the pixel color values to 8 bits, it would probably look really bad.
Furthermore, I don't know of any document or official statement about DLSS 1/2 that suggests it's using INT8 throughput. For all I know it's using FP16 Tensor FLOPs with FP32 accumulate (the highest possible precision on the Tensor cores), which still has enormous throughput on Turing and Ampere alike.
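
For context, the claimed packed-math rates are just straight multiples of the FP32 rate, so the arithmetic is easy to sanity-check. A rough sketch in Python, using the Series X's publicly quoted 12.15 TFLOPS FP32 figure as the example input (the INT8/INT4 results line up with the ~49 / ~97 TOPS numbers Microsoft has quoted):

# Rough packed-math throughput arithmetic (illustrative numbers only).
# Packed math multiplies the FP32 rate: 2x for FP16, 4x for INT8, 8x for INT4.
fp32_tflops = 12.15  # Series X's publicly quoted FP32 rate, used as an example

rates = {
    "FP16 (2x packed)": fp32_tflops * 2,
    "INT8 (4x packed)": fp32_tflops * 4,
    "INT4 (8x packed)": fp32_tflops * 8,
}
for name, value in rates.items():
    print(f"{name}: {value:.1f} trillion ops/s")
# INT8 -> ~48.6 TOPS, INT4 -> ~97.2 TOPS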





This seems to be for object identification when using a 3D stereo camera. I can't tell if it's for PSVR2; it could be for some AR games like the Playroom on PS4.
It's not officially confirmed but rather deduced (Navi 10 Lite, a tweet from a Sony worker about lacking some ML capabilities, David Cage on the XSX's advantage in ML, though the only ML-related thing MS talks about is INT8/4), so yeah, maybe my statement was too declarative. As for the importance of INT8, Nvidia's DLSS 2 seems to use it:
https://www.pcgameshardware.de/Nvid...warrior-5-DLSS-Test-Control-DLSS-Test-1346257
Nvidia's developer tool Nsight allows you to analyze games and their frames in order to expose performance hogs. The possibilities go far beyond seeing, for example, which effect requires which computing time. The tool also shows how well the GPU is being used and what resources are being used. Anyone who chases a DLSS 2.0 game through Nsight will see several things here: DLSS 2.0 is applied at the end of a frame and causes data traffic on the tensor cores (INT8 operations).
 
Again: 4xINT8 / 8xINT4 was already in RDNA1 graphics cards per AMD's own whitepaper on RDNA1, just not on Navi 10.

The Series X's production definitely doesn't predate Navi 14's graphics cards, and neither does the PS5's.



The author seems to be conflating Tensor Core utilization with INT8 throughput.
Is Nsight able to identify the type of operations being done on the tensor cores? I thought it just measured utilization through the traffic on the caches and registers, like the CPU utilization graphs on Windows/Linux.
 
The author seems to be conflating Tensor Core utilization with INT8 throughput.
Is Nsight able to identify the type of operations being done on the tensor cores? I thought it just measured utilization through the traffic on the caches and registers, like the CPU utilization graphs on Windows/Linux.
I don't have deeper knowledge, unfortunately.
 
MS have designed the XSX for a dual purpose: a gaming machine and enterprise servers. This is most probably why they wanted INT4 and INT8 in the XSX: they needed it for those enterprise servers.

we’re actually designing our next-gen silicon in such a way that it works great for playing games in the cloud, and also works very well for machine learning and other non-entertainment workloads. As a company like Microsoft, we can dual-purpose the silicon that we’re putting in. We have a consumer use for that silicon, and we have enterprise use for those blades as well.

https://twinfinite.net/2018/12/phil-spencer-buy-ea-next-gen/
 
There is a high possibility that the entire DLSS network is int8. Not saying it is, but we can encode entire networks to int8 now.
There have been significant advances in the last 12 months. It seems like the ideal fit for this type of work.
nVidia themselves bragged that DLSS 2.0 is twice as fast as 1.0, which would fit perfectly into the narrative that it was fp16 in 1.0 and int8 in 2.0. But when it comes to the consoles: if the consoles support int8 with the implied speed gains, and a suitable upscaler can be made to run in int8, it probably should. When comparing nVidia's GPUs with AMD's, it's important to remember that DLSS is handled by the tensor cores. On AMD, you are using compute to do it. There is always going to be some sort of performance penalty for doing a complex upscale, but with AMD you are using the same logic you use to render the graphics to begin with.
 
nVidia themselves bragged that DLSS 2.0 is twice as fast as 1.0, which would fit perfectly into the narrative that it was fp16 in 1.0 and int8 in 2.0. But when it comes to the consoles: if the consoles support int8 with the implied speed gains, and a suitable upscaler can be made to run in int8, it probably should. When comparing nVidia's GPUs with AMD's, it's important to remember that DLSS is handled by the tensor cores. On AMD, you are using compute to do it. There is always going to be some sort of performance penalty for doing a complex upscale, but with AMD you are using the same logic you use to render the graphics to begin with.
Yeah, maybe. I can't tell you because I haven't seen quantization benchmarks on AMD cards before. The libraries we use are all CUDA-based, so when Nvidia showcased their int8 quantization-aware training techniques, it started showing up in TensorFlow and PySpark a couple of months later. But you can run int8 quantization on anything (CPUs, etc.). I just haven't seen someone go as far as trying to benchmark it.

I'm still waiting for the day I see AMD cards supported by these libraries, but right now it's hard for me to estimate anything. I can only assume that if AMD releases a DL competitor for AI upscaling, something similar to int8 quantization will be leveraged to improve speeds. The loss is very little and completely worth the speed benefit.
 
nVidia themselves bragged that DLSS 2.0 is twice as fast as 1.0, which would fit perfectly into the narrative that it was fp16 in 1.0 and int8 in 2.0. But when it comes to the consoles: if the consoles support int8 with the implied speed gains, and a suitable upscaler can be made to run in int8, it probably should. When comparing nVidia's GPUs with AMD's, it's important to remember that DLSS is handled by the tensor cores. On AMD, you are using compute to do it. There is always going to be some sort of performance penalty for doing a complex upscale, but with AMD you are using the same logic you use to render the graphics to begin with.
On the other hand, DLSS 1.9, calculated on CUDA cores, was faster, though far from the same quality.
 
But the PS5 doesn't have a hardware ML implementation, right?
So I suppose this is a custom software solution?

It depends on what's doing the processing, IMO. We could read this as the PS5 being the thing doing it, but there's nothing there that precludes, for example, a headset performing inside-out tracking via onboard ML hardware.

There is a high possibility that the entire DLSS network is int8. Not saying it is, but we can encode entire networks to int8 now.
There have been significant advances in the last 12 months. It seems like the ideal fit for this type of work.

Do you happen to know if there's any mobile hardware that seems like a good candidate for onboard ML-based image processing?

I know the Oculus Quest manages inside-out tracking with an off-the-shelf SoC (Snapdragon 835, IIRC). I've seen Google make a fuss about ML processing for some line of Pixel phones. So mobile hardware for ML does exist, but smartphone-level SoCs strike me as a bit excessive for only performing inside-out tracking.

So, after all that waffle, I suppose I'm asking if there are any suitable ML ASICs out there?
 
Do you happen to know if there's any mobile hardware that seems like a good candidate for onboard ML-based image processing?

I know the Oculus Quest manages inside-out tracking with an off-the-shelf SoC (Snapdragon 835, IIRC). I've seen Google make a fuss about ML processing for some line of Pixel phones. So mobile hardware for ML does exist, but smartphone-level SoCs strike me as a bit excessive for only performing inside-out tracking.

So, after all that waffle, I suppose I'm asking if there are any suitable ML ASICs out there?
A ton of IoT/edge devices run very thin AIs. Int8 quantization, and I suspect eventually some form of int4 quantization-aware training (in the future?), are designed specifically for these low-powered devices (night vision, tracking objects on low-powered devices, sensors, etc.).

Things that use reprogrammable AI models will likely go with an FPGA? Sounds costly, though. I'm not sure how commonly 'generic' AI processors are available for edge/IoT devices, but phones and laptops will run them.

The PS5 has more than enough power to do this with FP16. The requirements on console are not, how to say, _ultra_ requirements like PC benchmarks are. They can get away with a bunch of stuff and make AI work on it.

It would be ideal to have lower-precision native support to get more performance, but even if it doesn't, it can still run int8 quantization and benefit from the bandwidth reduction.

We should recognize that you can't just have tensor cores. Compute is still a critical component here.
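
To illustrate that bandwidth point with a minimal sketch (my own illustration, not anything Sony or AMD has described): even without fast INT8 ALUs you can store a network's weights as int8 plus a per-tensor scale, dequantize to FP32/FP16 at use time, and do the math at normal precision. You move roughly 4x less weight data for a small accuracy cost:

import numpy as np

# Hypothetical layer weights in FP32.
w_fp32 = np.random.randn(512, 512).astype(np.float32)

# Symmetric per-tensor quantization: store int8 values plus one FP32 scale.
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

# At inference time, dequantize and run the math at full precision.
w_dequant = w_int8.astype(np.float32) * scale
x = np.random.randn(512).astype(np.float32)
y = w_dequant @ x

print("weight bytes, fp32:", w_fp32.nbytes)   # 1,048,576
print("weight bytes, int8:", w_int8.nbytes)   # 262,144 -> ~4x less data to fetch
print("max dequantization error:", np.abs(w_fp32 - w_dequant).max())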
 
I really wonder if we are ever going to find out about those differences, deviations and confirmations of what's in and what's not, plus whether they will have any significant impact.

Literally don't count on it, unless a game dev with access to and experience on both platforms decides they don't care about working in the industry anymore and reveals it all on some burner account.

So...maybe in 2023.

MS have designed the XSX for a dual purpose: a gaming machine and enterprise servers. This is most probably why they wanted INT4 and INT8 in the XSX: they needed it for those enterprise servers.



https://twinfinite.net/2018/12/phil-spencer-buy-ea-next-gen/

Yep; in addition to the Azure cloud purpose of serving xCloud streaming of multiple One S instances from a single APU fitted in a blade, they're probably also going to use Series X APUs for raw data compute tasks, and that's where the lower-precision math has more immediate benefits.

Sony is probably fitting some Azure racks with PS5 SoCs (including some things a few patents suggest), but they don't have a global cloud network and aren't involved in certain ML/AI/science/medical tech business fields the way Microsoft is, so there was less need for them to go further than FP16 in terms of hardware support.

Again: 4xINT8 / 8xINT4 was already in RDNA1 graphics cards per AMD's own whitepaper on RDNA1, just not on Navi 10.

The Series X's production definitely doesn't predate Navi 14's graphics cards, and neither does the PS5's.

IIRC that just extended to having support for INT8/INT4 if the hardware designer wished to add it in at the silicon level, right? I.e. not all RDNA 1 (or RDNA 2, for that matter) designs need, or will have, that silicon present to support that type of math at the hardware level?

Either that, or Microsoft added a few other customizations to extend hardware support for INT8/INT4 on their APU.
 
There is a high possibility that the entire DLSS network is int8. Not saying it is, but we can encode entire networks to int8 now.
There have been significant advances in the last 12 months. It seems like the ideal fit for this type of work.

What advances suggest you can successfully use 8-bit variables to calculate color values represented in at least 24 bits?

For the record, I'm completely open to that possibility. I've just yet to see any research or official statement claiming that you can do ML inference to calculate pixel color values using 8-bit variables in a substantial portion of the neural network.

I'm aware of the advances in neural networks and how variable precision has been going down for higher performance / lower power consumption, but in image processing that's mostly about object detection for, e.g., autonomous vehicles (where there's no image output, only boxes and coordinates) and in industrial processes for pattern recognition and predictive maintenance using low-precision values like temperature, vibration, power consumption, etc.


From what I've been studying about ML in industry applications, I still don't see how you get decent 24/32-bit color values out of 8-bit variables, regardless of using ML or any other method.

Again: "I don't see how" != "I don't think it's possible". Perhaps what DLLS2 does is e.g. applying a 8bit-per-pixel filter to a 32bit-per-pixel picture.
 
IIRC that just extended to having support for INT8/INT4 if the hardware designer wished to add it in at the silicon level, right? I.e. not all RDNA 1 (or RDNA 2, for that matter) designs need, or will have, that silicon present to support that type of math at the hardware level?

All of them have INT8/4 support... even if it works by promoting the variables to FP16.
AFAIR from Locuza's and Reddit's analyses of the open source drivers, all the RDNA GPUs have rapid packed math for INT8 and INT4, except for the very first, Navi 10.
That's Apple's exclusive Navi 12, then Navi 14, Navi 21, Navi 22 and Navi 23.
With both Series SoCs having it too, why would Sony very specifically ask for that feature to be excluded?
They had the very first AMD GPU with FP16 RPM, but now asked to leave out this "RDNA 1.1" functionality? In 2016-2017, with deep neural networks blowing up everywhere, Sony somehow thought ML wouldn't be a big thing?

I get that we don't have any statement from Sony claiming their GPU does higher-rate INT8 and INT4 (public information on the PS5's SoC is pretty scarce compared to the competition anyway), but assuming it doesn't seems odd to me.
 
All of them have INT8/4 support... even if it works by promoting the variables to FP16.
AFAIR from Locuza's and Reddit's analyses of the open source drivers, all the RDNA GPUs have rapid packed math for INT8 and INT4, except for the very first, Navi 10.
That's Apple's exclusive Navi 12, then Navi 14, Navi 21, Navi 22 and Navi 23.
With both Series SoCs having it too, why would Sony very specifically ask for that feature to be excluded?
They had the very first AMD GPU with FP16 RPM, but now asked to leave out this "RDNA 1.1" functionality? In 2016-2017, with deep neural networks blowing up everywhere, Sony somehow thought ML wouldn't be a big thing?

I get that we don't have any statement from Sony claiming their GPU does higher-rate INT8 and INT4 (public information on the PS5's SoC is pretty scarce compared to the competition anyway), but assuming it doesn't seems odd to me.
IMHO Sony's GPU was designed based on Navi 10, and while RT, for example, was essential to add, it's possible they just didn't see the value in INT8/4 (as you yourself wrote that you didn't see a use for such low precision in image reconstruction).
 
What advances suggest you can successfully use 8-bit variables to calculate color values represented in at least 24 bits?

For the record, I'm completely open to that possibility. I've just yet to see any research or official statement claiming that you can do ML inference to calculate pixel color values using 8-bit variables in a substantial portion of the neural network.

I'm aware of the advances in neural networks and how variable precision has been going down for higher performance / lower power consumption, but in image processing that's mostly about object detection for, e.g., autonomous vehicles (where there's no image output, only boxes and coordinates) and in industrial processes for pattern recognition and predictive maintenance using low-precision values like temperature, vibration, power consumption, etc.


From what I've been studying about ML in industry applications, I still don't see how you get decent 24/32-bit color values out of 8-bit variables, regardless of using ML or any other method.

Again: "I don't see how" != "I don't think it's possible". Perhaps what DLLS2 does is e.g. applying a 8bit-per-pixel filter to a 32bit-per-pixel picture.
No worries. I was also careful not to suggest that they did implement this technique. It's just something I noticed Nvidia presenting on and then found in TensorFlow a couple of months later; it somewhat aligns with the DLSS 2.0 timing.
I can't fully answer any real technical questions about the technique; QAT is new to me, and we're already having trouble keeping up with the pace, honestly. None of the state-of-the-art things are useful for us because, quite frankly, we could use anything and it would be more helpful than where the company is today. We're only using BERT now, and that's been around for an easy 1+ years.

You will see a slide that shows:
float32 -> int8
followed by:
All calculations are done in int8 and accumulated to 32 bits, then rescaled back to 8 bits.

The int8-to-32-bit accumulate and the rescale back to 8 bits will be faster with mixed-precision hardware support. That particular step is what this feature is designed to accelerate.
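
A minimal NumPy sketch of that pipeline (my own illustration of the generic scheme, not NVIDIA's actual DLSS code): activations and weights are quantized to int8 with FP32 scales, the dot products accumulate in int32, and the result is rescaled back to int8 for the next layer. It also shows why the output isn't limited to 8 bits of effective precision, because the real-valued result is the int32 accumulator times the product of the scales:

import numpy as np

def quantize(x, num_bits=8):
    # Symmetric quantization: float tensor -> int8 values plus one float scale.
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Hypothetical activations and weights for one layer.
a_fp32 = np.random.randn(64).astype(np.float32)
w_fp32 = np.random.randn(32, 64).astype(np.float32)

a_q, a_scale = quantize(a_fp32)
w_q, w_scale = quantize(w_fp32)

# int8 x int8 multiplies accumulated in int32 -- the part the dot-product ops accelerate.
acc_i32 = w_q.astype(np.int32) @ a_q.astype(np.int32)

# The real-valued result is acc * (a_scale * w_scale); rescale back to int8 for the next layer.
y_fp32 = acc_i32 * (a_scale * w_scale)
y_q, y_scale = quantize(y_fp32)

print("reference fp32 result:", (w_fp32 @ a_fp32)[:4])
print("int8 pipeline result :", (y_q[:4] * y_scale))   # close, despite int8 storage throughout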

INT8 Quantization-Aware Training (QAT):

I think it's easiest to follow along with the actual API to see how to use it and what it can do. This video does a step-through from how it works to what you can do with it; a worthwhile watch from the start.

Reference documentation from nvidia:
Improving INT8 Accuracy Using Quantization Aware Training and the NVIDIA Transfer Learning Toolkit | NVIDIA Developer Blog

GTC 2020: Toward INT8 Inference: Deploying Quantization-Aware Trained Networks using TensorRT | NVIDIA Developer
^^ requires an Nvidia developer login; free, however.
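
For anyone who wants to try the workflow without the Nvidia developer login, the TensorFlow Model Optimization toolkit exposes roughly the same idea. A minimal sketch, assuming any small Keras model as a stand-in (the model and training data here are placeholders, not anything related to DLSS):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder float model; in practice this would be your own network.
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model with fake-quantization nodes so training "sees" int8 rounding.
q_aware_model = tfmot.quantization.keras.quantize_model(base_model)
q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# q_aware_model.fit(train_x, train_y, epochs=1)   # fine-tune on your own data

# Convert to a quantized TFLite model for int8 deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_int8_model = converter.convert()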
 
IIRC that just extended to having support for INT8/INT4 if the hardware designer wished to add it in at the silicon level, right? I.e. not all RDNA 1 (or RDNA 2, for that matter) designs need, or will have, that silicon present to support that type of math at the hardware level?

Either that, or Microsoft added a few other customizations to extend hardware support for INT8/INT4 on their APU.

Yeah, there's a difference between supporting a data type and the range of instructions you have to use with it. Looking at the RDNA 2 Shader ISA document (of which I claim to understand very little!!), you can see this near the top:

https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf

"Feature Changes in RDNA2 Devices

Dot product ALU operations added accelerate inferencing and deep-learning:
◦ V_DOT2_F32_F16 / V_DOT2C_F32_F16
◦ V_DOT2_I32_I16 / V_DOT2_U32_U16
◦ V_DOT4_I32_I8 / V_DOT4C_I32_I8
◦ V_DOT4_U32_U8
◦ V_DOT8_I32_I4
◦ V_DOT8_U32_U4"

As you can see, these are additions since RDNA1, specifically to "accelerate inferencing and deep-learning".

Perhaps these are the specific, ML-focused changes that MS requested (or there's some overlap). They're possibly what the PS5 is lacking.
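
For what it's worth, the semantics of those dot-product ops are simple to model in software. A rough functional sketch of what something like V_DOT4_I32_I8 computes (my illustration of the operation's result, not of how the hardware implements it):

import numpy as np

def v_dot4_i32_i8(a_bytes, b_bytes, c):
    # Functional model of a 4xINT8 dot product with INT32 accumulate:
    # result = c + sum(a[i] * b[i]) over four signed 8-bit lanes.
    a = np.asarray(a_bytes, dtype=np.int8).astype(np.int32)
    b = np.asarray(b_bytes, dtype=np.int8).astype(np.int32)
    return int(c) + int(np.dot(a, b))

# One instruction does four multiply-accumulates, which is where the "4x" rate comes from.
print(v_dot4_i32_i8([1, -2, 3, 4], [5, 6, -7, 8], c=0))   # 1*5 - 2*6 - 3*7 + 4*8 = 4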
 
Yeah, maybe. I can't tell you because I haven't seen quantization benchmarks on AMD cards before. The libraries we use are all CUDA-based, so when Nvidia showcased their int8 quantization-aware training techniques, it started showing up in TensorFlow and PySpark a couple of months later. But you can run int8 quantization on anything (CPUs, etc.). I just haven't seen someone go as far as trying to benchmark it.

I'm still waiting for the day I see AMD cards supported by these libraries, but right now it's hard for me to estimate anything. I can only assume that if AMD releases a DL competitor for AI upscaling, something similar to int8 quantization will be leveraged to improve speeds. The loss is very little and completely worth the speed benefit.
I re-read my post and I don't think I made my point clearly. On current nVidia hardware, DLSS is performed using the tensor cores, which are separate from the normal rendering pipeline. There is a performance penalty because the process of upscaling takes time, but if you are rendering a 720p image in 12 ms, then as long as the scaling takes 4 ms or less you can hit a 60 Hz target. If the same image takes 12 ms on current AMD hardware, once you start scaling there is the same scaling penalty (maybe more, if the int8 performance is lower on AMD hardware and the scaler is the same), but you are also using the same hardware you would be using to render the scene to begin with. That implies a performance penalty on the original scene render.

My expectation for a solution coming from AMD/Xbox/PS5, compared to DLSS, is that it will be a simpler, more universal form of upscaling that is slightly less performant, with "close enough" quality, but still worthwhile.
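
Spelling out the frame-budget arithmetic from the post above (the 12 ms / 4 ms / 6 ms figures are just the illustrative numbers used in this discussion, not measurements):

# Illustrative frame-time budgets; numbers come from the example above.
budget_60hz = 1000 / 60     # ~16.7 ms per frame
budget_30hz = 1000 / 30     # ~33.3 ms per frame

render_ms = 12.0            # low-resolution render time in the example
for upscale_ms in (4.0, 6.0):
    total = render_ms + upscale_ms
    print(f"render {render_ms} ms + upscale {upscale_ms} ms = {total} ms -> "
          f"60 Hz: {'fits' if total <= budget_60hz else 'misses'}, "
          f"30 Hz: {'fits' if total <= budget_30hz else 'misses'}")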
 
I re-read my post and I don't think I made my point clearly. On current nVidia hardware, DLSS is performed using the tensor cores, which are separate from the normal rendering pipeline. There is a performance penalty because the process of upscaling takes time, but if you are rendering a 720p image in 12 ms, then as long as the scaling takes 4 ms or less you can hit a 60 Hz target. If the same image takes 12 ms on current AMD hardware, once you start scaling there is the same scaling penalty (maybe more, if the int8 performance is lower on AMD hardware and the scaler is the same), but you are also using the same hardware you would be using to render the scene to begin with. That implies a performance penalty on the original scene render.

My expectation for a solution coming from AMD/Xbox/PS5, compared to DLSS, is that it will be a simpler, more universal form of upscaling that is slightly less performant, with "close enough" quality, but still worthwhile.
It ultimately depends on what it's trying to do, I guess.
DLSS does two things in particular. The first is that it supersamples the AA to 16x, IIRC. The second stage is to upscale to 4K. That combination is two separate networks, if I understand correctly, so the AMD version isn't necessarily required to do both; it could just do the AA or the upscale and save some time there.
The other thing to consider is that, yes, at 16.6 ms the AI upscale might be tight (let's say it takes 6 ms instead), but at something like 33.3 ms there's more than enough time for it to do both. So heavy graphical showcases will still be able to run at 30 fps, since upscaling is a smaller fraction of a much larger frame time.

The quality of the AA and the scaler will ultimately come down to the model and the technology (where in the engine it does its work). So you can use the same hardware, but you're unlikely to generate the same thing as Nvidia. Many have tried to replicate DLSS (Facebook and others) using the same hardware and have gotten nowhere close. So on that point alone, yes, I would agree that the output won't be the same. It will likely be different and have its own characteristics.
 