AMD FSR antialiasing discussion

Discussion in 'Architecture and Products' started by Deleted member 90741, May 20, 2021.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Yeah, just like DLSS would be DLUS, upsampling instead of supersampling.

    Why do you bring tensor cores up on this? They're completely irrelevant to both DLSS and whatever AMD brings up.
    Having matrix crunchers (NVIDIA's Tensor cores, AMD's Matrix cores) can make things run faster if the workload fits them, but they're not the only way to make 'em go and not necessarily the optimal way either.
     
    milk likes this.
  2. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
    It was in relation to the consoles being mentioned, not dGPUs from AMD.
     
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Consoles, dGPUs, iGPUs, it doesn't matter which hardware it is. DLSS (current versions, doesn't apply to late 1.x versions) is running on tensor cores because the tensor cores were there anyway. It doesn't mean it's the optimal way to run even DLSS, let alone other scaling technologies.
     
    milk likes this.
  4. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
    Aha, sounds very strange, NV must be blatantly lying in that case. But it isn't the thread for it.
    FSR being cross-platform might mean it's not as capable, though what I do support is cross-platform functionality (the way ray tracing is available on basically everything).
     
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    To clarify: It might run better on Tensors than CUDA cores, especially since it frees the CUDA cores up to work on other things. But then CUDA cores don't have the 4:1 INT8 and 8:1 INT4 rate support AMD has, so it might run even better on those if the precisions are sufficient (at least if you disregard that you're occupying units which could be doing something else). We don't know, because we don't really know how everything works under the hood. But Tensors weren't made for DLSS and they're not required to run it either. Worth pointing out that Ampere's more capable and faster Tensor cores don't offer a proportional boost over Turing's Tensor cores in DLSS.

    I'm not sure why being cross-platform would have anything to do with how capable FidelityFX SR will be (I don't really like "FSR", I think I link "FS" too strongly to "Flight Simulator"). I mean, even DLSS could be cross-platform if they wanted it to be. General cores can do matrix crunching just like tensor/matrix cores do, they're just slower at it, especially if they have to use full or half precision.
     
    #25 Kaotik, May 20, 2021
    Last edited: May 20, 2021
    milk and Deleted member 13524 like this.
  6. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
    Absolutely, DLSS could be multiplatform, but probably less performant on non-RTX GPUs. Specialized hardware, like the decompression units in consoles, is generally going to be more efficient but less flexible. Time will tell how all this is going to turn out. AMD has to offer something to counter DLSS tech in the PC space at least, therefore we can be 100% sure they will come with a similar, albeit different, solution to the reconstruction (or upscaling?) problem sooner or later.
     
  7. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    But TCs burn the shit out of regfile bandwidth (which is kinda the point).
    It's just faster.
     
    PSman1700 likes this.
  8. I think @Kaotik's point is that if the NN inference takes 1.5ms instead of 0.5ms while the whole process is saving 10ms of rendered frame time*, then it may not be worth spending die area on tensor cores instead of just adding more compute units, which are more versatile.

    *warning: completely made up numbers
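    The tradeoff above can be sketched as simple arithmetic; a minimal illustration, reusing the post's deliberately made-up numbers (nothing here is a measured DLSS figure):

    ```python
    # Back-of-envelope tradeoff from the post above.
    # All millisecond figures are made up, per the post's own disclaimer.

    def net_saving_ms(render_saving_ms: float, inference_ms: float) -> float:
        """Net frame-time saving from upscaling: time saved by rendering
        at a lower resolution, minus the cost of the reconstruction pass."""
        return render_saving_ms - inference_ms

    # NN inference on general-purpose ALUs (slower inference pass)...
    print(net_saving_ms(10.0, 1.5))  # 8.5 ms net saving per frame
    # ...versus on dedicated matrix units (faster inference pass).
    print(net_saving_ms(10.0, 0.5))  # 9.5 ms net saving per frame
    ```

    The gap between the two cases (1 ms here) is what the tensor-core die area is actually buying, which is the crux of whether it's worth it versus more general compute units.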
     
    Kaotik likes this.
  9. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    I think if the goal is to use it just once per frame, and the target frame time is in the 16ms to 33.33ms range, going without matrix crunchers is probably sufficient. But if you're using it a lot, with a large network, or you want to get into the 100fps range, I think you're going to need to speed that part of the process up.
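    The frame-rate sensitivity in that argument is easy to quantify: a fixed-cost pass eats a growing share of the budget as fps climbs. A rough sketch, using a placeholder 1.5 ms pass cost (not a measured figure for any real upscaler):

    ```python
    # Why a fixed-cost pass matters more at high frame rates:
    # the per-frame time budget shrinks as 1000 / fps.

    def budget_fraction(fps: float, pass_ms: float) -> float:
        """Share of the per-frame time budget a fixed-cost pass consumes."""
        frame_budget_ms = 1000.0 / fps
        return pass_ms / frame_budget_ms

    for fps in (30, 60, 100):
        print(f"{fps} fps: {budget_fraction(fps, 1.5):.1%} of the frame")
    # At 30 fps the pass is 4.5% of the budget; at 100 fps it is 15%.
    ```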

    Which is fine for the target and product line of AMD products.
     
    pharma and PSman1700 like this.
  10. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    798
    Likes Received:
    1,624
    Are you sure about that?
    dp4a and dp2a instructions have been in PTX since Pascal.
    Also Tensor Cores in Ampere support INT4 and INT8 ops at 4x and 2x speeds relative to FP16. Though, you can do only a very limited set of things in graphics with such low precision, i.e. it would probably work for some masks, but it surely would not be enough for complex processing.
     
    pharma and PSman1700 like this.
  11. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    No, not 100% sure, but it would be strange for them not to mention any speedups in whitepapers and such.
    Yes, there's a reason why I specified CUDA cores, which should be apparent when you don't cut half the sentence out.
     
  12. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Yeah.
    Made for GP102, literally.
    Lol.
     
  13. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    798
    Likes Received:
    1,624
    DP4A is explicitly listed in the ISA docs, so I tend to disagree.

    And all other Pascal chips for inference, yes.
     
    pharma and PSman1700 like this.
  14. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Eh.
    Pretty sure only GP102 did fancy dot product stuff.
     
  15. hughJ

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    861
    Likes Received:
    417
    Why spend time making "pretty sure" assertions when you could just google it and know the answer.
     
    DavidGraham, PSman1700 and DegustatoR like this.
  16. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    https://www.clear.rice.edu/comp422/resources/cuda/pdf/Pascal_Tuning_Guide.pdf

    Page 3
    Definitely not accelerated in GP100, and the acceleration appears to be GP104 exclusive.
     
    #36 pTmdfx, May 20, 2021
    Last edited: May 20, 2021
    T2098 and Deleted member 90741 like this.
  17. hughJ

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    861
    Likes Received:
    417
    https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8

    "For such applications, the latest Pascal GPUs (GP102, GP104, and GP106) introduce new 8-bit integer 4-element vector dot product (DP4A) and 16-bit 2-element vector dot product (DP2A) instructions. "

    "Keep in mind that DP2A and DP4A are available on Tesla, GeForce, and Quadro accelerators based on GP102, GP104, and GP106 GPUs, but not on the Tesla P100 (based on the GP100 GPU)."

    Literally the first google search result (for "DP4A"). There's even discussion about it on this forum back from Pascal's launch period.
     
    T2098, Krteq, DegustatoR and 2 others like this.
  18. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Weird. I have a 1070 and my 16-bit floats are complete shit, 1/64 the performance. But you're telling me 8-bit performance will be good?
     
    Deleted member 90741, pTmdfx and BRiT like this.
  19. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    798
    Likes Received:
    1,624
    GP100 was for training, GP10x chips were meant for inference, hence the dp4a.
    But I was talking about Turing and Ampere, where DP4A is still part of the ISA.
     
    T2098 likes this.
  20. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    PTX is an abstraction over the hardware, not the bare-metal binary ISA of the underlying hardware. So the PTX compiler could emit an equivalent binary code sequence in order to “polyfill” DP2A/DP4A, even if DP2A/DP4A hardware acceleration ends up in only a couple of blessed variants in the family.

    It might not be viable for all kinds of new/special instructions, but DP2A/DP4A at least should be decomposable into ordinary integer & bitwise ops with few to no caveats, though obviously taking more time to complete.
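    A scalar "polyfill" in the spirit described above, decomposing the 4-way signed INT8 dot-with-accumulate into ordinary integer and bitwise ops (this is illustrative only, not NVIDIA's actual compiler lowering):

    ```python
    # DP4A semantics: c + dot(a, b), where a and b each pack four signed
    # 8-bit lanes into a 32-bit word, accumulating at 32-bit precision.

    def as_signed8(byte: int) -> int:
        """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
        return byte - 256 if byte >= 128 else byte

    def dp4a(a: int, b: int, c: int) -> int:
        """Emulated DP4A built from shifts, masks, multiplies and adds."""
        acc = c
        for lane in range(4):
            acc += as_signed8((a >> (8 * lane)) & 0xFF) * \
                   as_signed8((b >> (8 * lane)) & 0xFF)
        return acc

    # Lanes of a = [1, 2, 3, 4], lanes of b = [5, 6, 7, 8]:
    a = (4 << 24) | (3 << 16) | (2 << 8) | 1
    b = (8 << 24) | (7 << 16) | (6 << 8) | 5
    print(dp4a(a, b, 0))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
    ```

    The hardware instruction does all of this in one cycle per SM issue; the point of the post is that the decomposed form produces the same result, just across more instructions.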
     
    #40 pTmdfx, May 21, 2021
    Last edited: May 21, 2021
    Krteq and Deleted member 90741 like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.