Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,134
    Likes Received:
    3,030
    Location:
    Finland
    But doesn't running FP16 on Tensor Cores mean that half of your FP16 FLOPS exist only for matrix multiplications? Or are they now versatile enough to run both fast matrix multiplications and other operations? And if so, why drag the regular units along at all?
     
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,604
    Likes Received:
    648
    Location:
    New York
    A tensor matrix multiplication is just a bunch of scalar FMAs that expect the operands in matrix form. My linear algebra is very rusty, but I assume this works by feeding the tensor cores with scalar FMA operands represented as sparse matrices.

    That gives you 32 FP16 FMAs and the other 32 FMAs run on the 16 regular ALUs.
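    A minimal sketch of that idea (my own illustration in hypothetical numpy code, not how the hardware is actually fed): N independent scalar FMAs can be packed into a single matrix multiply-accumulate by placing the operands on the diagonals, i.e. as sparse matrices.

    Code:
    import numpy as np

    # Toy mapping: N scalar FMAs d[i] = a[i]*b[i] + c[i] expressed as one
    # matrix multiply-accumulate D = A @ B + C with diagonal (sparse) operands,
    # which is the shape a tensor core expects.
    N = 16
    a, b, c = (np.random.rand(N).astype(np.float16) for _ in range(3))
    A, B, C = np.diag(a), np.diag(b), np.diag(c)

    # FP16 inputs, FP32 accumulation (mixed precision).
    D = A.astype(np.float32) @ B.astype(np.float32) + C.astype(np.float32)

    assert np.allclose(np.diag(D),
                       a.astype(np.float32) * b.astype(np.float32) + c.astype(np.float32))

    Packing 16 FMAs into a 16x16 multiply like this is obviously wasteful, which is the point of the question above: the tensor-core FLOPS only materialize when the work really is a dense matrix multiplication.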
     
  3. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    389
    Likes Received:
    337
  4. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    751
    Likes Received:
    320
    Why else would you accumulate at FP32 if you did not compute the addition operands at FP32?

    The mixed-precision FMA units for TF32 and FP16 are basically the same; the only difference is the handling of the exponent, which has 3 more bits for TF32 than for FP16.
    The cost of an FMA is in the number of mantissa bits, as the silicon area of an FMA multiplier is roughly proportional to the square of the mantissa width, which is the same for TF32 and FP16 (see the quick check after the list below).
    See also:
    "Smaller mantissa brings a number of other advantages such as reducing the multiplier power and physical silicon area."
    • float32: 24^2=576 (100%)
    • float16: 11^2=121 (21%)
    • bfloat16: 8^2=64 (11%)
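    A quick back-of-envelope check of those numbers (my own sketch; the mantissa width includes the implicit leading bit, and TF32 shares FP16's 10+1-bit mantissa, only the exponent differs):

    Code:
    # Toy check: treat multiplier area as roughly proportional to the square of
    # the mantissa width (explicit bits plus the implicit leading 1).
    mantissa_bits = {"float32": 24, "tf32": 11, "float16": 11, "bfloat16": 8}

    ref = mantissa_bits["float32"] ** 2
    for fmt, m in mantissa_bits.items():
        area = m ** 2
        print(f"{fmt:9}: {m}^2 = {area:3d}  (~{100 * area / ref:.0f}% of float32)")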
     
    #204 Voxilla, Jun 1, 2020
    Last edited: Jun 1, 2020
  5. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    751
    Likes Received:
    320
    Nvidia was smart enough to add BF16 to the A100. Have you really been ignorant of this fact?
     
  6. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    751
    Likes Received:
    320
  7. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,518
    Likes Received:
    878
    Location:
    France
    If it doesn't reduce the memory footprint, what's the goal? Faster transfers from point A to point B?
     
  8. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    499
    Likes Received:
    220
    Probably. It sounds like lossless delta compression, but for compute, so more bandwidth. Neat.
     
    BRiT and Rootax like this.
  9. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    127
    Likes Received:
    154
    Is this something that might help raytracing too? I often read that RT is very sensitive to cache performance; could this help, or is it only useful for compute?
     
    Man from Atlantis likes this.
  10. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    692
    Likes Received:
    441
    Location:
    Slovenia
    It sort of does reduce the footprint. It can keep data compressed in L2, so more cache is effectively available. It can't reduce the footprint in main memory, as you don't know in advance whether the output will be compressible or not.
     
  11. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,577
    Likes Received:
    2,296
    Ignorance is bliss, or is it inclusiveness? Including BF16 means supporting a currently established standard data format for clients to use; it would have been negligent not to. There are likely scenarios where BF16 performance is sufficient, and in those situations the ability to maintain the status quo is Nvidia's advantage.

    Having the ability to switch to the TF32 format and its extra precision covers scenarios where BF16 is lacking, and gains in performance, cost, or precision can be found using the new TF32. Once independent training and inference benchmark results appear, we will have a better idea about those scenarios.
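    To make the precision point concrete, here is a small sketch comparing the two bit layouts (my own illustration of the published formats, not a benchmark of either): both keep FP32's 8-bit exponent and hence its range, so the difference is purely mantissa precision.

    Code:
    # Illustrative comparison of BF16 vs TF32 bit layouts.
    formats = {
        "bfloat16": {"exp": 8, "mant": 7},   # FP32 range, ~8 bits of precision
        "tf32":     {"exp": 8, "mant": 10},  # FP32 range, FP16-class precision
    }
    for name, f in formats.items():
        eps = 2.0 ** -f["mant"]              # spacing of values just above 1.0
        max_exp = 2 ** (f["exp"] - 1) - 1    # IEEE-style unbiased max exponent
        print(f"{name:8}: ~{f['mant'] + 1} bits of precision, "
              f"eps={eps:.1e}, max normal ~2^{max_exp}")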
     
    A1xLLcqAgt0qc2RyMz0y likes this.
  12. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,577
    Likes Received:
    2,296
    DIVING DEEP INTO THE NVIDIA AMPERE GPU ARCHITECTURE
    May 28, 2020

    https://www.nextplatform.com/2020/05/28/diving-deep-into-the-nvidia-ampere-gpu-architecture/
     
  13. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,209
    Likes Received:
    5,634
    So data read from RAM is compressed before it's moved into L2, and then decompressed to L1?
     
  14. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    692
    Likes Received:
    441
    Location:
    Slovenia
    No. Data leaving the SMs can be compressed prior to being written to L2 or memory. Afterwards, if compute accesses that data again, it is read back into L2 in compressed form. So you save bandwidth on the way out to memory and on the way back in, and you effectively increase the available L2 capacity, as data is only decompressed when leaving L2 for L1.
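    A rough model of what that buys (my own toy numbers for the compressible fraction and compression ratio; the raw bandwidth and L2 size are the publicly quoted A100 figures, used here only for illustration):

    Code:
    # Toy model: a fraction of the traffic compresses at some ratio. Both the
    # write-out and the read-back shrink, and the L2 effectively holds more
    # because lines stay compressed until they leave L2 for L1.
    def effective(raw_bw_gbs, l2_mb, compressible_frac, ratio):
        traffic = (1 - compressible_frac) + compressible_frac / ratio
        return raw_bw_gbs / traffic, l2_mb / traffic

    bw, l2 = effective(raw_bw_gbs=1555, l2_mb=40, compressible_frac=0.5, ratio=4.0)
    print(f"effective bandwidth ~{bw:.0f} GB/s, effective L2 ~{l2:.0f} MB")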
     
    Lightman, w0lfram, CeeGee and 8 others like this.
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,287
    Likes Received:
    3,546
  16. Krteq

    Joined:
    May 5, 2020
    Messages:
    6
    Likes Received:
    13
    That's a great joke/fake.

    All those fins and the fan have no purpose at all - there is no airflow possible, the fan blades are melted to a frame, etc.

    Just some joker with a 3D printer.
     


  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,134
    Likes Received:
    3,030
    Location:
    Finland
    What do you mean by "fan blades are melted to a frame"?

    The image suggests that the PCB is really short, with one fan cooling from the "front side" of the graphics card and a second from the back side. There are fins visible under the supposed back-side fan running lengthwise along the card, even though the ones next to the fan serve no clear purpose.

    Overall I'm really doubtful about it too, but...
     
    #217 Kaotik, Jun 6, 2020
    Last edited: Jun 6, 2020
    PSman1700 likes this.
  18. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    757
    Likes Received:
    88
    Someone photoshopped the back of the card without the foils.

    Cooling concept edited by me over the Chiphell image; the memory arrangement and PCB detail are just stand-ins, no confirmation of 256-bit memory.
     
    #218 Man from Atlantis, Jun 6, 2020
    Last edited: Jun 6, 2020
    Lightman, jayco, pharma and 1 other person like this.
  19. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,703
    Likes Received:
    903
    Why do I think of Doom 3 seeing the top image :p Interesting design for the cooling, unseen in traditional GPUs so far, with the two-fan setup, one on each side. Isn't airflow out the back of the GPU (out of a PC case) going to be limited that way?
     
  20. jayco

    Veteran Regular

    Joined:
    Nov 18, 2006
    Messages:
    1,355
    Likes Received:
    713
    One good thing about the pandemic is that I'm saving huge loads of money lol. Gonna need it.

    edit: mmm... if that airflow is correct, it's gonna shower the CPU/RAM with a lot of hot air.
     