It's not officially confirmed but rather deduced (Navi 10 Lite, a tweet from a Sony worker about some missing ML capabilities, David Cage on the XSX's advantage in ML, though the only ML-related thing MS has talked about is INT8/INT4), so yeah, maybe my statement was too declarative. About the importance of INT8, Nvidia's DLSS 2 seems to use it:

Why is there an assumption that the PS5's RDNA2 GPU lacks the same 4xINT8 / 8xINT4 throughput capabilities that have been present in all RDNA GPUs since Navi 14, which released over a year before both consoles?
Microsoft claimed they had custom optimizations for ML loads ("special hardware support for this specific scenario"), not that they invented 4xINT8 / 8xINT4 mixed dot products for RDNA ALUs that had already shipped in GPUs a year earlier. We're probably looking at hardware support for custom ML instructions.
Furthermore, are the faster INT8 and INT4 throughputs even useful for inference-based image upscaling? AFAIK INT4 and INT8 are useful for weight values in inference, but how often are those going to appear when calculating pixel color values?
Pixel shader calculations can go down to FP16 precision in some cases, but in most cases they use FP32, and I doubt the framebuffer holds pixel values at less than 24-bit. If they downgraded the pixel color values to 8-bit, it would probably look really bad.
Furthermore, I don't know of any document or official statement about DLSS 1/2 that suggests it's using INT8 throughput. For all I know it's using FP16 Tensor FLOPs with FP32 accumulate (the highest possible precision on the Tensor cores), which still has enormous throughput on Turing and Ampere alike.
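To make the weights-versus-pixels distinction concrete, here is a minimal sketch (illustrative only, not taken from DLSS or any console SDK) of why int8 is attractive for weights: the stored operands are 8-bit, but products are accumulated in int32 and dequantized, so the result is not limited to 8-bit color precision:

```python
# Hypothetical sketch (not DLSS): int8 weights can still produce full-precision
# outputs. Weights get a per-tensor scale; products accumulate in int32 and are
# dequantized back to float at the end.
import numpy as np

def quantize_weights(w):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.max(np.abs(w)) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def int8_matvec(w_q, scale_w, x):
    """int8 weights times quantized activations; int32 accumulate, float result."""
    scale_x = np.max(np.abs(x)) / 127.0
    x_q = np.clip(np.round(x / scale_x), -127, 127).astype(np.int8)
    acc = w_q.astype(np.int32) @ x_q.astype(np.int32)    # 32-bit accumulator
    return acc.astype(np.float32) * (scale_w * scale_x)  # dequantize

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
w_q, s = quantize_weights(w)
print(w @ x)                   # float reference
print(int8_matvec(w_q, s, x))  # close to the reference despite int8 storage
```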
This seems to be for object identification when using a 3D stereo camera. I can't tell if it's for PSVR2, it could be for some AR games like the Playroom on PS4.
Nvidia's developer tool Nsight allows you to analyze games and their frames in order to expose performance hogs. The possibilities go far beyond seeing, for example, how much compute time each effect requires. The tool also shows how well the GPU is being utilized and which resources are being used. Anyone who runs a DLSS 2.0 game through Nsight will see several things: DLSS 2.0 is applied at the end of a frame and causes traffic on the tensor cores (INT8 operations).
The author seems to be conflating Tensor Core utilization with INT8 throughput.

About the importance of INT8, Nvidia's DLSS 2 seems to use it:
https://www.pcgameshardware.de/Nvid...warrior-5-DLSS-Test-Control-DLSS-Test-1346257
I don't have deeper knowledge, unfortunately.
Is Nsight able to identify the type of operations being done on the tensor cores? I thought it just measured utilization through the traffic on the caches and registers, like the CPU utilization graphs on Windows/Linux.
there is a high possibility that the entire DLSS is int8. Not saying it is, but we can encode entire networks to int8 now.
we’re actually designing our next-gen silicon in such a way that it works great for playing games in the cloud, and also works very well for machine learning and other non-entertainment workloads. As a company like Microsoft, we can dual-purpose the silicon that we’re putting in. We have a consumer use for that silicon, and we have enterprise use for those blades as well.
nVidia themselves bragged that DLSS 2.0 is twice as fast as 1.0, which would fit perfectly into the narrative that it was fp16 in 1.0 and int8 in 2.0. But when it comes to the consoles, if the consoles support int8 with the implied speed gains, and a suitable upscaler can be made to run in int8, it probably should. When comparing nVidia's GPUs with AMD's, it's important to remember that DLSS is handled by the tensor cores. On AMD, you are using compute to do it. There is always going to be some sort of performance penalty for doing a complex upscale, but with AMD you are using the same logic as you are using to render the graphics to begin with.
There have been significant advances in the last 12 months. It seems like the ideal fit for this type of work.
Yea, maybe. I can't tell you because I haven't seen quantization benchmarks on AMD cards before. The libraries we use are all CUDA based, so when Nvidia showcased their int8 quantization aware training techniques, it started showing up in tensorflow and pyspark a couple of months later. But you can run int8 quantization on anything (CPUs etc). I just haven't seen someone go as far as trying to benchmark it.
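For anyone who hasn't touched this, here is a minimal sketch of the kind of int8 workflow being described, using TensorFlow Lite's post-training full-integer quantization (a close relative of the quantization-aware training mentioned above; the model, shapes and calibration data below are made-up placeholders, not anything DLSS-related):

```python
# Sketch only: full-integer (int8) post-training quantization with TFLite.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(3, 3, padding="same"),
])

def representative_data():
    # Calibration samples so the converter can pick int8 scales/zero-points.
    for _ in range(100):
        yield [np.random.rand(1, 64, 64, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()   # weights and activations in int8
```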
On the other hand, DLSS 1.9, calculated on the CUDA cores, was faster, though far from the same quality.
DLSS 1.9 was a post-processing algorithm designed to approximate the results their AI research for DLSS 2.0 achieved; it didn't actually involve any neural networks or such.
But the PS5 doesn't have a hardware ML implementation, right?
So is this a custom software solution, I suppose?
there is a high possibility that the entire DLSS is int8. Not saying it is, but we can encode entire networks to int8 now.
There have been significant advances in the last 12 months. It seems like the ideal fit for this type of work.
A ton of IoT/edge devices run very thin AIs. The int8 quantization, and I suspect eventually a possible int4 quantization-aware training (future?), are designed specifically for these low-powered devices (like night vision, tracking objects with low-powered devices, sensors, etc.).

Do you happen to know if there's any mobile hardware that seems like a good candidate for onboard ML-based image processing?
I know the Oculus Quest manages inside-out tracking with an off-the-shelf SoC (Snapdragon 835 IIRC). I've seen Google make a fuss about ML processing for some line of Pixel phones. So mobile hardware for ML does exist, but smartphone-level SoCs strike me as a bit excessive for only performing inside-out tracking.
So, after all that waffle, I suppose I'm asking if there are any suitable ML ASICs out there?
I really wonder if we are ever going to find out about those differences and deviations, get confirmation of what's in and what's not, and learn whether they will have any significant impact.
MS designed the XSX for a dual purpose: as a gaming machine and for enterprise servers. This is most probably why they needed INT4 and INT8 in the XSX: they needed it for those enterprise servers.
https://twinfinite.net/2018/12/phil-spencer-buy-ea-next-gen/
Again: 4xINT8 / 8xINT4 was already in RDNA1 graphics cards per AMD's own whitepaper on RDNA1, just not on Navi 10.
The Series X's production definitely doesn't precede Navi 14 graphics cards, and neither does the PS5's.
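For anyone unsure what "4xINT8" actually buys you, here is a rough, purely illustrative emulation of the kind of packed dot-product operation AMD documents for RDNA, where four int8 multiplies are summed into a 32-bit accumulator per operation:

```python
# Rough illustration of a "4x INT8" mixed dot product: four int8 pairs
# multiplied and summed into an int32 accumulator in one operation.
# This is just an emulation of the idea, not hardware behavior.
import numpy as np

def dot4_i32_i8(a_bytes, b_bytes, acc=0):
    """Emulate a 4-wide int8 dot product with int32 accumulate."""
    a = np.asarray(a_bytes, dtype=np.int8).astype(np.int32)
    b = np.asarray(b_bytes, dtype=np.int8).astype(np.int32)
    return acc + int(np.dot(a, b))   # four multiply-adds per "instruction"

# A 64-element int8 dot product then takes 16 such operations instead of 64
# separate multiply-adds, which is where the 4x throughput claim comes from.
a = np.random.randint(-128, 128, 64, dtype=np.int8)
b = np.random.randint(-128, 128, 64, dtype=np.int8)
acc = 0
for i in range(0, 64, 4):
    acc = dot4_i32_i8(a[i:i + 4], b[i:i + 4], acc)
print(acc, int(a.astype(np.int32) @ b.astype(np.int32)))  # should match
```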
IIRC that just extended to having support for INT8/INT4 if the hardware design wished to add that in at the silicon level, right? I.e., not all RDNA 1 (or RDNA 2, for that matter) designs need to have, or will have, that silicon present to support that type of math at the hardware level?
IMHO Sony's GPU was planned out based on Navi 10, and while, for example, RT was essential to add, it's possible they just didn't see value in INT8/INT4 (as, for example, you wrote that you didn't see a use for such low precision in image reconstruction).

All of them have INT8/4 support... even if they work by promoting the variables to FP16.
AFAIR from Locuza's and the Reddit analyses of the open-source drivers, all the RDNA GPUs have rapid packed math on INT8 and INT4, except for the very first Navi 10.
That's Apple's exclusive Navi 12, then Navi 14, Navi 21, Navi 22 and Navi 23.
With both Series SoCs having that too, why would Sony very specifically ask for that feature to be excluded?
They had the very first AMD GPU with FP16 RPM, but now asked to leave out this "RDNA 1.1" functionality? In 2016-2017, with deep neural networks blowing up everywhere, Sony somehow thought ML wouldn't be a big thing?
I get that we don't have any statement from Sony claiming their GPU does higher-rate INT8 and INT4 (public information on the PS5's SoC is pretty scarce compared to the competition anyway), but assuming it doesn't have it seems odd to me.
No worries. I was also careful not to suggest that they did implement this technique. It's just something I noticed Nvidia presenting on and then found in TensorFlow a couple of months later. It somewhat aligns with DLSS 2.0's timing.

What advances are suggesting you can successfully use 8-bit variables to calculate color values represented in at least 24 bits?
For the record, I'm completely open to that possibility. I'm just yet to see any research or official statement claiming that you can do ML inference to calculate pixel color values using 8-bit variables in a substantial portion of the neural network.
I'm aware of advances on neural networks and how variable precision has been going down for higher performance / lower power consumption, but on image processing that's mostly about object detection on e.g. autonomous vehicles (where there's no image output, only boxes and coordinates) and in industry processes for pattern recognition and predictive maintenance using low-precision values like temperature, vibration, power consumption, etc.
From what I've been studying about ML in industry applications, I still don't see how you get decent 24/32-bit color values out of 8-bit variables, regardless of using ML or any other method.
Again: "I don't see how" != "I don't think it's possible". Perhaps what DLLS2 does is e.g. applying a 8bit-per-pixel filter to a 32bit-per-pixel picture.
IIRC that just extended to having support for INT8/INT4 if the hardware design wished to add that in at the silicon level, right? I.e., not all RDNA 1 (or RDNA 2, for that matter) designs need to have, or will have, that silicon present to support that type of math at the hardware level?
Either that, or Microsoft added a few other customizations of their own to extend hardware support for INT8/INT4 on their APU.
I re-read my post and I don't think I made my point clearly. On current nVidia hardware, DLSS is performed using the tensor cores, which are separate from the normal rendering pipeline. There is a performance penalty because the process of upscaling takes time, but if you are rendering a 720p image in 12 ms, then as long as the scaling takes 4 ms or less you can hit a 60 Hz target. If the same image takes 12 ms on current AMD hardware, once you start scaling there is the same scaling penalty (maybe more, if the INT8 performance is lower on AMD hardware and the scaler is the same), but you are also using the same hardware you would be using to render the scene to begin with. That would imply a performance penalty to the original scene render.
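A trivial sketch of the frame-budget arithmetic above (the numbers are the ones from the post, nothing measured):

```python
# Toy frame-budget check for the 12 ms render / 4 ms upscale example.
def fits_60hz(render_ms, upscale_ms, budget_ms=1000.0 / 60.0):
    """True if the base render plus the upscale pass fit in a 60 Hz frame."""
    return render_ms + upscale_ms <= budget_ms

print(fits_60hz(12.0, 4.0))   # True: 16.0 ms <= ~16.7 ms
print(fits_60hz(12.0, 6.0))   # False: the upscale pass blows the budget
```

On shared compute the upscale cost also competes with the base render itself, which is the point being made about AMD hardware.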
I'm still waiting for the day I see AMD cards supported by these libraries, but right now it's hard for me to estimate anything. I can only assume that if AMD releases a DL competitor for AI upscaling, something similar to int8 quantization will be leveraged to improve speeds. The loss is very little and completely worth the speed benefit.
It ultimately depends on what it's trying to do, I guess.
My expectation for a solution from AMD/Xbox/PS5, compared to DLSS, is that it will be a simpler, more universal form of upscaling that is slightly less performant, with "close enough" quality, but still worthwhile.