AMD FSR antialiasing discussion

Since it's upscaling, it should be called subresolution.

Cheers
Yeah, just like DLSS would be DLUS, upsampling instead of supersampling.

Oh well, I meant a solution akin to DLSS/tensor cores.
Why do you bring tensor cores up on this? They're completely irrelevant to both DLSS and whatever AMD brings up.
Having matrix crunchers (NVIDIA's Tensor cores, AMD's Matrix cores) can make things run faster if the workload is a good fit for them, but it's not the only way to make 'em go and not necessarily the optimal way either.
 
Why do you bring tensor cores up on this? They're completely irrelevant to both DLSS and whatever AMD brings up.
Having matrix crunchers (NVIDIA's Tensor cores, AMD's Matrix cores) can make things run faster if the workload is a good fit for them, but it's not the only way to make 'em go and not necessarily the optimal way either.

It was in relation to the consoles being mentioned, not dGPUs from AMD.
 
It was in relation to the consoles being mentioned, not dGPUs from AMD.
Consoles, dGPUs, iGPUs, it doesn't matter which hardware it is. DLSS (current versions, doesn't apply to late 1.x versions) is running on tensor cores because the tensor cores were there anyway. It doesn't mean it's the optimal way to run even DLSS, let alone other scaling technologies.
 
DLSS (current versions, doesn't apply to late 1.x versions) is running on tensor cores because the tensor cores were there anyway. It doesn't mean it's the optimal way to run even DLSS, let alone other scaling technologies.

Aha, sounds very strange, NV must be blatantly lying in that case. But this isn't the thread for it.
FSR being cross-platform might make it less capable, though what I do support is cross-platform functionality (like ray tracing being available on basically everything).
 
Aha, sounds very strange, NV must be blatantly lying in that case. But this isn't the thread for it.
FSR being cross-platform might make it less capable, though what I do support is cross-platform functionality (like ray tracing being available on basically everything).
To clarify: it might run better on Tensor cores than on CUDA cores, especially since it frees the CUDA cores up to work on other things. But then CUDA cores don't have the 4:1 INT8 and 8:1 INT4 rates AMD has, so it might run even better on those if the precisions are enough (at least if you disregard that you're occupying units which could be doing something else). We don't know, because we don't really know how everything works under the hood. But Tensor cores weren't made for DLSS, and they're not required to run it either. Worth pointing out that Ampere's more capable and faster Tensor cores don't offer a proportional boost over Turing's Tensor cores in DLSS.

I'm not sure why being cross-platform would have anything to do with how capable FidelityFX SR will be (I don't really like "FSR", I link "FS" too strongly to "Flight Simulator"). I mean, even DLSS could be cross-platform if they wanted it to be. General cores can do matrix crunching just like tensor/matrix cores do, they're just slower at it, especially if they have to use full or half precision.
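
To make that concrete, here's a rough CUDA sketch (hypothetical kernels, names and sizes made up for illustration, nothing from any actual DLSS/FSR code) of the same 16x16 half-precision matrix multiply done two ways: a plain-ALU kernel that runs on basically any CUDA GPU, and a WMMA kernel that hands the tile to the tensor cores on sm_70+. Same math either way, the latter just gets through it faster.

Code:
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Plain-ALU version: one thread per output element, ordinary multiply-adds.
// Runs on any CUDA GPU, tensor cores or not.
__global__ void gemm16_general(const half *A, const half *B, float *C)
{
    int row = threadIdx.y, col = threadIdx.x;          // launched as one 16x16 block
    float acc = 0.0f;
    for (int k = 0; k < 16; ++k)
        acc += __half2float(A[row * 16 + k]) * __half2float(B[k * 16 + col]);
    C[row * 16 + col] = acc;
}

// Tensor-core version: one warp hands the whole 16x16x16 tile to the matrix
// units through the WMMA API. Needs sm_70 or newer, launched with 32 threads.
__global__ void gemm16_wmma(const half *A, const half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);                  // 16 = leading dimension
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);                        // C = A*B + C on the matrix units
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}

int main()
{
    half *A, *B; float *Cg, *Ct;
    cudaMallocManaged(&A, 256 * sizeof(half));
    cudaMallocManaged(&B, 256 * sizeof(half));
    cudaMallocManaged(&Cg, 256 * sizeof(float));
    cudaMallocManaged(&Ct, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) { A[i] = __float2half(1.0f); B[i] = __float2half(0.5f); }

    gemm16_general<<<1, dim3(16, 16)>>>(A, B, Cg);
    gemm16_wmma<<<1, 32>>>(A, B, Ct);
    cudaDeviceSynchronize();

    printf("general ALUs: %.1f  tensor cores: %.1f\n", Cg[0], Ct[0]);  // both print 8.0
    return 0;
}

Build with something like nvcc -arch=sm_70. On hardware without tensor cores only the first kernel is usable, which is kind of the whole point.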
 
I'm not sure why being cross-platform would have anything to do with how capable FidelityFX SR will be (I don't really like "FSR", I link "FS" too strongly to "Flight Simulator"). I mean, even DLSS could be cross-platform if they wanted it to be. General cores can do matrix crunching just like tensor/matrix cores do, they're just slower at it, especially if they have to use full or half precision.

Absolutely, DLSS could be multiplatform, but it would probably be less performant on non-RTX GPUs (or forward). Specialized hardware, like the decompression units in the consoles, is generally going to be more efficient but less flexible. Time will tell how all this is going to turn out. AMD has to offer something to counter DLSS in the PC space at least, so we can be 100% sure they will come up with a similar, albeit different, solution to the reconstruction (or upscaling?) problem sooner or later.
 
But TCs burn the shit out of regfile bandwidth (which is kinda the point).
It's just faster.

I think @Kaotik's point is that if the NN inference takes 1.5ms instead of 0.5ms when the whole process is saving 10ms from a rendered frame*, then it may not be worth spending die area on tensor cores instead of just adding more compute units that are more versatile.

*warning: completely made up numbers
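
Spelling that trade-off out with the same made-up numbers (every value and name below is illustrative, nothing measured):

Code:
#include <cstdio>

int main()
{
    // All made-up numbers, as per the footnote above.
    const float native_ms        = 26.0f;  // render the frame at native resolution
    const float low_res_ms       = 16.0f;  // render at the lower internal resolution (~10ms saved)
    const float infer_tensor_ms  = 0.5f;   // NN upscale pass on dedicated matrix units
    const float infer_general_ms = 1.5f;   // same pass on the general-purpose ALUs

    printf("net saving, tensor cores : %.1f ms\n", native_ms - (low_res_ms + infer_tensor_ms));   // 9.5 ms
    printf("net saving, general ALUs : %.1f ms\n", native_ms - (low_res_ms + infer_general_ms));  // 8.5 ms
    return 0;
}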
 
I think @Kaotik's point is that if the NN inference takes 1.5ms instead of 0.5ms when the whole process is saving 10ms from a rendered frame*, then it may not be worth spending die area on tensor cores instead of just adding more compute units that are more versatile.

*warning: completely made up numbers
I think if the goal is to use it just once per frame, and the approximate frame time is 16ms up to 33.33ms, going without matrix crunchers is probably sufficient. But if you're using it a lot, with a large network, or you want to get into the 100fps range, I think you're going to need to speed that part of the process up (rough math below).

Which is fine for the target and product line of AMD products.
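
Rough math on how big a slice a fixed upscaling pass takes out of different frame budgets, assuming a made-up 1.5ms cost on general ALUs:

Code:
#include <cstdio>

int main()
{
    const float upscale_ms   = 1.5f;                        // made-up cost of the pass on general ALUs
    const float budgets_ms[] = { 33.33f, 16.67f, 10.0f };   // 30, 60, 100 fps frame budgets

    for (float budget : budgets_ms)
        printf("%.2f ms frame: the pass eats %4.1f%% of the budget\n",
               budget, 100.0f * upscale_ms / budget);
    // ~4.5% at 30 fps, ~9% at 60 fps, 15% at 100 fps -- the faster you want to go,
    // the more that fixed chunk hurts.
    return 0;
}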
 
But then CUDA cores don't have the 4:1 INT8 and 8:1 INT4 rates AMD has
Are you sure about that?
dp4a and dp2a instructions have been in PTX since Pascal.
Also Tensor Cores in Ampere support INT4 and INT8 ops at 4x and 2x speeds relative to FP16. Though, you can do only a very limited set of things in graphics with such low precision, i.e. it would probably work for some masks, but it surely would not be enough for complex processing.
 
Are you sure about that?
dp4a and dp2a instructions have been in PTX since Pascal.
No, not 100% sure, but it would be strange that they don't mention any speedups for them in the whitepapers and such.
Also Tensor Cores in Ampere support INT4 and INT8 ops at 4x and 2x speeds relative to FP16. Though, you can do only a very limited set of things in graphics with such low precision, i.e. it would probably work for some masks, but it surely would not be enough for complex processing.
Yes, there's a reason why I specified CUDA cores, which should be apparent when you don't cut half the sentence out.
 
DP4A is explicitly listed in the ISA docs, so I tend to disagree.


And all other Pascal chips for inference, yes.
https://www.clear.rice.edu/comp422/resources/cuda/pdf/Pascal_Tuning_Guide.pdf

Page 3
However, compensating for reduced FP16 throughput, GP104 provides additional high-throughput INT8 support not available in GP100.

GP104 provides specialized instructions for two-way and four-way integer dot products.

Definitely not accelerated in GP100, and the acceleration appears to be GP104 exclusive.
 
https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8

"For such applications, the latest Pascal GPUs (GP102, GP104, and GP106) introduce new 8-bit integer 4-element vector dot product (DP4A) and 16-bit 2-element vector dot product (DP2A) instructions. "

"Keep in mind that DP2A and DP4A are available on Tesla, GeForce, and Quadro accelerators based on GP102, GP104, and GP106 GPUs, but not on the Tesla P100 (based on the GP100 GPU)."

Literally the first google search result (for "DP4A"). There's even discussion about it on this forum back from Pascal's launch period.
 
https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8

"For such applications, the latest Pascal GPUs (GP102, GP104, and GP106) introduce new 8-bit integer 4-element vector dot product (DP4A) and 16-bit 2-element vector dot product (DP2A) instructions. "

"Keep in mind that DP2A and DP4A are available on Tesla, GeForce, and Quadro accelerators based on GP102, GP104, and GP106 GPUs, but not on the Tesla P100 (based on the GP100 GPU)."

Literally the first google search result (for "DP4A"). There's even discussion about it on this forum back from Pascal's launch period.
Weird. I have a 1070 and my 16-bit floats are complete shit, 1/64 the performance. But you're telling me 8-bit performance will be good?
 
GP100 was for training, GP10x chips were meant for inference, hence the DP4A.
But I was talking about Turing and Ampere, where DP4A is still part of the ISA.
PTX is an abstraction over the hardware, not the bare-metal binary ISA of the underlying hardware. So the PTX compiler could emit an equivalent binary code sequence in order to “polyfill” DP2A/DP4A, even if DP2A/DP4A hardware acceleration ends up in only a couple of blessed variants in the family.

It might not be viable for all kinds of new/special instructions, but DP2A/DP4A at least should be decomposable into ordinary integer & bitwise ops with few caveats, though obviously taking more time to complete.
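
For what it's worth, here's a quick CUDA sketch of that decomposition (hypothetical demo kernel, not from any real codebase): the __dp4a intrinsic on one side, and the same four sign-extended byte multiplies plus adds done with ordinary integer and bitwise ops on the other.

Code:
#include <cstdio>

__global__ void dp4a_demo(int a, int b, int acc, int *out)
{
    // Hardware path: a single instruction on chips that accelerate it
    // (GP102/104/106, Turing, Ampere, ...; compile for sm_61 or newer).
    // Treats a and b as four packed signed 8-bit lanes.
    out[0] = __dp4a(a, b, acc);

    // "Polyfill" path: the same result from ordinary integer and bitwise ops,
    // roughly the kind of sequence a compiler could fall back to where no
    // fast path exists.
    int r = acc;
    for (int i = 0; i < 4; ++i) {
        int ai = (signed char)((a >> (8 * i)) & 0xff);   // sign-extend lane i of a
        int bi = (signed char)((b >> (8 * i)) & 0xff);   // sign-extend lane i of b
        r += ai * bi;
    }
    out[1] = r;
}

int main()
{
    int *out;
    cudaMallocManaged(&out, 2 * sizeof(int));
    // lanes of a: 1, 2, 3, 4  /  lanes of b: 10, 20, 30, 40  /  accumulator: 5
    dp4a_demo<<<1, 1>>>(0x04030201, 0x281E140A, 5, out);
    cudaDeviceSynchronize();
    printf("__dp4a: %d  polyfill: %d\n", out[0], out[1]);
    return 0;
}

Both paths come out the same (305 with those inputs); the difference is just how many instructions it takes to get there.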
 