AMD FSR upscaling

  • Thread starter Deleted member 90741
Since it's upscaling, it should be called subresolution.

Cheers
Yeah, just like DLSS would be DLUS, upsampling instead of supersampling.

Oh well, I meant a solution akin to DLSS/tensor cores.
Why bring tensor cores into this? They're completely irrelevant to both DLSS and whatever AMD brings out.
Having matrix crunchers (NVIDIA's Tensor cores, AMD's Matrix cores) can make things run faster if the workload suits them, but it's not the only way to make 'em go and not necessarily the optimal way either.
 
It was in relation to the consoles being mentioned, not dGPUs from AMD.
 
Consoles, dGPUs, iGPUs, it doesn't matter which hardware it is. DLSS (current versions; this doesn't apply to late 1.x versions) runs on tensor cores because the tensor cores were there anyway. That doesn't mean it's the optimal way to run even DLSS, let alone other scaling technologies.
 
Aha, sounds very strange, NV must be blatantly lying in that case. But this isn't the thread for it.
FSR being cross-platform might mean it's not as capable, though what I do support is cross-platform functionality (like ray tracing being available on basically everything).
 
To clarify: it might run better on tensor cores than on CUDA cores, especially since that frees up the CUDA cores to work on other things, but then CUDA cores don't have similar 4/1 INT8 and 8/1 INT4 support AMD has. It might run even better on those if the precision is sufficient (at least if you disregard that you're occupying units which could be doing something else). We don't know, because we don't really know how everything works under the hood. But tensor cores weren't made for DLSS and they're not required to run it either. Worth pointing out that Ampere's faster, more capable tensor cores don't offer a proportional boost in DLSS compared to Turing's tensor cores.

I'm not sure why being cross-platform would have anything to do with how capable FidelityFX SR will be (I don't really like "FSR", I think I link "FS" too strongly to "Flight Simulator"). I mean, even DLSS could be cross-platform if they wanted it to be. General cores can do matrix crunching just like tensor/matrix cores do, they're just slower at it, especially if they have to use full or half precision.
 
Absolutely, DLSS could be multiplatform but probably less performant on non-RTX GPUs (or forward). Specialized hardware, like the decompression units in consoles, is generally going to be more efficient but less flexible. Time will tell how all this is going to turn out. AMD has to offer something to counter DLSS in the PC space at least, so we can be 100% sure they will come up with a similar, albeit different, solution to the reconstruction (or upscaling?) problem sooner or later.
 
But TCs burn the shit out of regfile bandwidth (which is kinda the point).
It's just faster.

I think @Kaotik 's point is that if the NN inference takes 1.5ms instead of 0.5ms when the whole process is saving 10ms from a rendered frame*, then it may not be worth spending die area on tensor cores instead of just adding more compute units that are more versatile.

*warning: completely made up numbers
 
I think if the goal is to use it just once per frame, and the frame time is roughly 16 ms up to 33.33 ms, going without matrix crunchers is probably sufficient. But if you're using it a lot, running a large network, or you want to get into the 100 fps range, I think you're going to need to speed that part of the process up.

Which is fine for the target and product line of AMD products.
 
but then CUDA cores don't have similar 4/1 INT8 and 8/1 INT4 support AMD has
Are you sure about that?
dp4a and dp2a instructions have been in PTX since Pascal.
Also Tensor Cores in Ampere support INT4 and INT8 ops at 4x and 2x speeds relative to FP16. Though, you can do only a very limited set of things in graphics with such low precision, i.e. it would probably work for some masks, but it surely would not be enough for complex processing.
 
Are you sure about that?
dp4a and dp2a instructions have been in PTX since Pascal.
No, not 100% sure, but it would be strange for them not to mention any speedups for those in whitepapers and such.
Also Tensor Cores in Ampere support INT4 and INT8 ops at 4x and 2x speeds relative to FP16. Though, you can do only a very limited set of things in graphics with such low precision, i.e. it would probably work for some masks, but it surely would not be enough for complex processing.
Yes, there's a reason why I specified CUDA cores, which should be apparent when you don't cut half the sentence out.
 
DP4A is explicitly listed in the ISA docs, so I tend to disagree.


And all other Pascal chips for inference, yes.
https://www.clear.rice.edu/comp422/resources/cuda/pdf/Pascal_Tuning_Guide.pdf

Page 3
However, compensating for reduced FP16 throughput, GP104 provides additional high-throughput INT8 support not available in GP100.

GP104 provides specialized instructions for two-way and four-way integer dot products.

Definitely not accelerated in GP100, and the acceleration appears to be GP104 exclusive.
 
https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8

"For such applications, the latest Pascal GPUs (GP102, GP104, and GP106) introduce new 8-bit integer 4-element vector dot product (DP4A) and 16-bit 2-element vector dot product (DP2A) instructions. "

"Keep in mind that DP2A and DP4A are available on Tesla, GeForce, and Quadro accelerators based on GP102, GP104, and GP106 GPUs, but not on the Tesla P100 (based on the GP100 GPU)."

Literally the first google search result (for "DP4A"). There's even discussion about it on this forum back from Pascal's launch period.
 
Weird. I have a 1070 and my 16-bit floats are complete shit. 1/64 the performance. But you're telling me 8-bit performance will be good?
 
GP100 was for training, GP10X chips were meant for inference, hence the dp4a.
But I was talking about Turing and Ampere where DP4A is still part of ISA.
PTX is an abstraction over the hardware, not the bare metal binary ISA of the underlying hardware. So the PTX compiler could emit an equivalent binary code sequence in order to “polyfill” DP2A/DP4A, even if DP2A/DP4A hardware acceleration ends up only in a couple blessed variants in the family.

It might not be viable for all kinds of new/special instructions, but DP2A/DP4A at least should be decomposable into ordinary integer & bitwise ops with little to no caveat, though obviously taking more time to complete.
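
A rough C++ sketch of what such a decomposition could look like for the signed 8-bit case: shifts and masks to pull out the four lanes, then ordinary multiplies and adds into the 32-bit accumulator. The names here are hypothetical, this is not what any actual PTX compiler emits, and overflow of the accumulator is not modelled.

```cpp
#include <cstdint>

// Pack four small signed values into one 32-bit word, lane 0 in the low byte.
constexpr std::uint32_t pack_s8(int x0, int x1, int x2, int x3) {
    return (static_cast<std::uint32_t>(x0) & 0xFFu)
         | ((static_cast<std::uint32_t>(x1) & 0xFFu) << 8)
         | ((static_cast<std::uint32_t>(x2) & 0xFFu) << 16)
         | ((static_cast<std::uint32_t>(x3) & 0xFFu) << 24);
}

// Extract one 8-bit lane and sign-extend it using only integer ops.
constexpr std::int32_t lane_s8(std::uint32_t packed, int lane) {
    const std::int32_t v =
        static_cast<std::int32_t>((packed >> (8 * lane)) & 0xFFu);
    return v >= 128 ? v - 256 : v;
}

// Software stand-in for a signed DP4A: four 8-bit products summed into c.
// dp4a_emulated is a made-up name, not an NVIDIA API.
constexpr std::int32_t dp4a_emulated(std::uint32_t a, std::uint32_t b,
                                     std::int32_t c) {
    std::int32_t acc = c;
    for (int lane = 0; lane < 4; ++lane) {
        acc += lane_s8(a, lane) * lane_s8(b, lane);
    }
    return acc;
}
```

For example, `dp4a_emulated(pack_s8(1, 2, 3, 4), pack_s8(5, 6, 7, 8), 10)` gives 5 + 12 + 21 + 32 + 10 = 80. The fallback is a dozen or so shift, mask, multiply and add operations where the accelerated path is a single instruction, which is the "taking more time to complete" part.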
 