Intel XeSS upscaling

BRiT

DF Article @ https://www.eurogamer.net/digitalfo...vs-dlss-the-digital-foundry-technology-review

Intel's XeSS tested in depth vs DLSS - the Digital Foundry technology review

A strong start for a new, more open AI upscaler.

Intel's debut Arc graphics cards are arriving soon and ahead of the launch, Digital Foundry was granted exclusive access to XeSS - the firm's promising upscaling technology, based on machine learning. This test - and indeed our recent interview with Intel - came about in the wake of our AMD FSR 2.0 vs DLSS vs native rendering analysis, where we devised a gauntlet of image quality scenarios to really put these new technologies through their paces. We suggested to Intel that we'd love to put XeSS to the test in a similar way and the company answered the challenge, providing us with pre-release builds and a top-of-the-line Arc A770 GPU to test them on.

XeSS is exciting stuff. It's what I consider to be a second generation upscaler. First-gen efforts, such as checkerboarding, DLSS 1.0 and various temporal super samplers attempted to make half resolution images look like full resolution images, which they achieved to various degrees of quality. Second generation upscalers such as DLSS 2.x, FSR 2.x and Epic's Temporal Super Resolution aim to reconstruct from quarter resolution. So in the case of 4K, the aim is to make a native-like image from just a 1080p base pixel count. XeSS takes its place alongside these technologies.
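As a quick check of the "quarter resolution" arithmetic in the excerpt above, here is a minimal sketch. The function name is made up for illustration; only the 4K and 1080p dimensions come from the article's example.

```python
def pixel_count(width: int, height: int) -> int:
    """Total number of pixels in a frame of the given dimensions."""
    return width * height


native_4k = pixel_count(3840, 2160)   # 4K output target
base_1080p = pixel_count(1920, 1080)  # internal render resolution

# Fraction of the output pixels that is actually rendered each frame;
# the upscaler has to reconstruct the rest.
rendered_fraction = base_1080p / native_4k
print(rendered_fraction)  # 0.25
```

So a 1080p base supplies exactly one quarter of the pixels in a 4K frame, which is where the "quarter resolution" label comes from.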

 
Awesome stuff. I can't wait to see how it stacks up to FSR 2.0. Not as good as DLSS but a solid effort nonetheless.
It doesn't matter how it stacks up to FSR 2.0.

Intel had the glorious idea to disable the DP4a path for cross-vendor GPUs, meaning it will run much slower. We all know how much FSR 2.0 adds to the rendering pipeline already, especially on weaker GPUs. This will be much worse, rendering it unusable on cross-vendor GPUs that do not have as much compute performance as a 3070.

FSR 2.0 remains the much better option for AMD hardware, and DLSS of course for Nvidia hardware. XeSS has no place on any hardware other than Intel's own, and it's absolutely their fault.
 
Worth remembering that DLSS in SoTTR isn't exactly new.
2.3.2 SDK was released about a year ago.

Also a more interesting comparison would be between DP4a XeSS and FSR2 - as these are the ones which will be available to the vast majority of gamers.
Considering that DP4a XeSS is using a less complex reconstruction routine the results here may be interesting.

Intel had the glorious idea to disable the DP4a path for cross vendor GPUs.
When did this happen? Alex shows XeSS running on a 3070 just fine.
Much more interesting option would be to make XeSS run on Nv's tensor h/w of course but this is somewhat unlikely thus far.
 

They use a "standard" (less advanced) machine learning model, with Intel's integrated GPUs using a DP4a kernel and non-Intel GPUs using a kernel built on technologies enabled by DX12's Shader Model 6.4.

Yes, a 3070 from the Ampere generation, which has a ton of compute performance at its disposal anyway and is a fairly high-end GPU. And it STILL adds 3.8 ms to the rendering pipeline, which is a lot. We already see how FSR 2.0 runs much slower on lower-end hardware compared to high-end hardware, and this will be much worse here.

On GPUs like an RX 6600, RTX 2060 Super or lower, where I consider upscaling most useful, XeSS will run even slower than on the Ampere GPU. And it would run much faster if they were using DP4a.
 



I think this is a bit of a misunderstanding on DF's part - DP4a is what gets enabled in SM 6.4 on any h/w which supports such instructions. I doubt that they've created a separate version just for non-Intel h/w, and what version would that even be? An FP16 one? Or FP32? It would probably be so slow that even a 3090 would see a performance decrease from such upscaling.
 
Nope, Alex confirmed on Twitter.
Mind you, Alex is in contact with Intel so I don't think there's a misunderstanding from him here.

Also, do not let the seemingly small difference between 3.4 ms and 3.8 ms on the Arc A770 vs the 3070 fool you. It's a drastic difference: XeSS costs a lot on the 3070. Remember, the 3070 has MUCH more compute performance than the Arc GPU, and the latter has to run a more complex neural network as well, as the XMX path offers higher quality.

And yes, the SM 6.4 path will run at FP16, which is a lot slower than the lower-precision INT8 mode the DP4a path uses.
 
Nope, Alex confirmed on Twitter. https://twitter.com/i/web/status/1570060784717250560
Mind you, Alex is in contact with Intel so I don't think there's a misunderstanding from him here.
I do. Look at what SM 6.4 adds, it's basically just packed dot products, i.e. DP4a. There is literally no other reason to require it for XeSS than the DP4a support.
I'd give a 99.9% chance that non-Intel h/w runs the same DP4a XeSS code as Intel's non-XMX h/w.

Now Intel can of course run said DP4a code through their own ML API (OneAPI?) on their non-XMX h/w instead of using DX Compute. How much of a performance benefit that would bring is unknown though.

Also, do not let the seemingly small difference between 3.4 ms and 3.8 ms on the Arc A770 vs the 3070 fool you. It's a drastic difference: XeSS costs a lot on the 3070. Remember, the 3070 has MUCH more compute performance than the Arc GPU, and the latter has to run a more complex neural network as well, as the XMX path offers higher quality.
MM h/w will obviously run MM math several times faster than even DP4a would run on a really fast FP32 GPU so that's not surprising and in itself does not suggest that such GPU isn't running the DP4a path.
 
Then why do older architectures that lack DP4a/INT8 still support Shader Model 6.4? That makes no sense. Maybe DP4a is not a requirement for SM 6.4 support.

Also, don't you think Alex checked which path XeSS is using by simply comparing the performance of XeSS between his RX 5700 (DP4a incompatible) and 2060 Super (DP4a compatible)?

I seriously don't think it's a misunderstanding. I fully trust Alex here.

Yes, of course. If the XMX cores were not in use, it would run much slower than 3.4 ms. That much is clear. But you are right that just from that performance figure alone, it's hard to tell if the 3070 is using the DP4a path or not.
 
The DP4a path is only used on Intel iGPUs. Every other GPU uses SM 6.4, regardless of whether the GPU supports DP4a or not.

And I'd guess they'll be using one or both of these SM 6.4 operations:

uint32 dot4add_u8packed(uint32 a, uint32 b, uint32 acc);

int32 dot4add_i8packed(uint32 a, uint32 b, int32 acc);



These are (edit: packed) 4 element 8 bit dot product accumulate operations. How is this in essence not DP4a? And if you have hardware support to accelerate this why wouldn't you use it?

Edit: Surely DP4a instructions are in the hardware to accelerate exactly these kinds of SM 6.4 (or equivalent Vulkan) operations?
 
SM6.4 version is NOT DP4a.

On the 3070, the penalty of XeSS is 3.8ms, with DLSS it's 2.1ms, with A770 it's 3.4ms.

So DLSS is almost two times faster than XeSS even when XeSS is accelerated by the ML cores on the A770.
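For anyone who wants the ratios spelled out, a small sketch using the three frame-time costs quoted above. The 60 fps frame budget is my own assumption for context and is not a figure from the video.

```python
# Upscaler cost per frame in milliseconds, as quoted in this thread.
costs_ms = {
    "XeSS on RTX 3070": 3.8,
    "XeSS on Arc A770 (XMX)": 3.4,
    "DLSS on RTX 3070": 2.1,
}

# Assumed target: 60 fps leaves roughly 16.67 ms per frame.
FRAME_BUDGET_60FPS_MS = 1000 / 60


def cost_ratio(slower: str, faster: str) -> float:
    """How many times more expensive `slower` is than `faster`."""
    return costs_ms[slower] / costs_ms[faster]


def budget_share(name: str, budget_ms: float = FRAME_BUDGET_60FPS_MS) -> float:
    """Fraction of the frame budget spent on the upscaling pass."""
    return costs_ms[name] / budget_ms


print(round(cost_ratio("XeSS on RTX 3070", "DLSS on RTX 3070"), 2))        # 1.81
print(round(cost_ratio("XeSS on Arc A770 (XMX)", "DLSS on RTX 3070"), 2))  # 1.62
print(round(budget_share("XeSS on RTX 3070"), 2))                          # 0.23
```

So "almost two times" is about 1.8x on the 3070 and about 1.6x against the A770's XMX path.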
 
Then why do older architectures that lack DP4a/INT8 still support Shader Model 6.4? That makes no sense. Maybe DP4a is not a requirement for SM 6.4 support.
A GPU can still run the math in the absence of h/w support for packed DP intrinsics, but at 1/4 or 1/2 speed (i.e. with regular FP/INT32 math). Such GPUs won't get the speed benefit but they should still be able to run the code.

Also, don't you think Alex checked which path XeSS is using by simply comparing the performance of XeSS between his RX 5700 (DP4a incompatible) and 2060 Super (DP4a compatible)?
I dunno. There was nothing on this in his video.
It would also probably be a better option to test the same Pascal or Turing GPU with drivers with and without SM 6.4 support than to compare two completely different cards.
 
So regarding a possible misunderstanding - Intel reached out to us to make sure we were communicating the 3 kernels clearly. An older version of our video incorrectly lumped iGPUs and non-Intel GPUs together under DP4a.

So 2 Models; advanced and Standard

3 Kernels

XMX using advanced model
DP4a using standard model
SM 6.4 using standard model

This is how Intel communicated it to us.
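The breakdown above can be written down as a small lookup plus a selection rule. This is only a sketch of how this thread reads the split (XMX on Arc, DP4a on Intel h/w without XMX, SM 6.4 elsewhere). It is not Intel's actual dispatch logic, and the `Gpu` fields and `pick_kernel` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Kernel -> model, exactly as listed above.
KERNEL_MODEL = {
    "XMX": "advanced",
    "DP4a": "standard",
    "SM 6.4": "standard",
}


@dataclass
class Gpu:
    """Hypothetical capability record, for illustration only."""
    vendor: str
    has_xmx: bool = False
    supports_sm64: bool = False


def pick_kernel(gpu: Gpu) -> Optional[str]:
    """Assumed selection rule based on how the thread describes the split."""
    if gpu.vendor == "Intel":
        return "XMX" if gpu.has_xmx else "DP4a"
    if gpu.supports_sm64:
        return "SM 6.4"
    return None  # no supported path under these assumptions


arc = Gpu(vendor="Intel", has_xmx=True, supports_sm64=True)
rtx = Gpu(vendor="Nvidia", supports_sm64=True)
print(pick_kernel(arc), KERNEL_MODEL[pick_kernel(arc)])  # XMX advanced
print(pick_kernel(rtx), KERNEL_MODEL[pick_kernel(rtx)])  # SM 6.4 standard
```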
 
This doesn't make much sense to me.
But whatever, once it's out it will be easy to see which h/w runs what, I think.
 
Alex later confirmed it: the SM 6.4 path is not the same as the DP4a path, and it is slower.

Hey what do you know - Intel literally reached out to us again in the night for *further* clarification. I just tweeted about it.

So I think this means we will see varying levels of XeSS speed depending on the GPU generation and make... which is good.
 
It doesn't matter how it stacks up to FSR 2.0.

Intel had the glorious idea to disable the DP4a path for cross-vendor GPUs, meaning it will run much slower. We all know how much FSR 2.0 adds to the rendering pipeline already, especially on weaker GPUs. This will be much worse, rendering it unusable on cross-vendor GPUs that do not have as much compute performance as a 3070.

FSR 2.0 remains the much better option for AMD hardware, and DLSS of course for Nvidia hardware. XeSS has no place on any hardware other than Intel's own, and it's absolutely their fault.

I've seen this sentiment mentioned with regard to XeSS, but it seems to ignore the business reality of the market. None of these upscaling solutions are directly monetized, so their value to the vendor that develops them lies in how they push that vendor's hardware over competitors (or blunt a competitor's advantage). The same argument made against XeSS in this respect applies to both FSR and DLSS.

The reality going forward is that each vendor's own solution will be optimal for that vendor's users. AMD users will prefer FSR support. Intel users will prefer XeSS support. Nvidia users will prefer DLSS support. Therefore the most pragmatic option is actually to push for some common framework to streamline implementation of any number of solutions on the game developers' side.

As an aside, I feel that some people in the graphics hardware space (and the PC hardware space in general) have unrealistic expectations with respect to the vague concept of hardware vendor agnosticism.
 
Hey what do you know - Intel literally reached out to us again in the night for *further* clarification. I just tweeted about it.

So I think this means we will see varying levels of XeSS speed depending on the GPU generation and make... which is good.
Thank goodness! Now everything is perfectly clear and I can sleep peacefully today 🤣
 
Well done Intel, quality is here (FSR 2.0 is/was such a letdown for me). Now they have to sell cards so devs will implement it I guess...
 