Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Kaotik

    Kaotik Drunk Member Legend

    But doesn't running FP16 on Tensor Cores mean that half of your FP16 FLOPS exists only if it's matrix multiplications? Or are they now versatile enough to run fast matrix multiplications and other operations, but if so, why drag the regular units along at all?
     
  2. trinibwoy

    trinibwoy Meh Legend

    A tensor matrix multiplication is just a bunch of scalar FMAs that expect the operands in matrix form. My linear algebra is very rusty, but I assume this works by feeding the tensor cores scalar FMA operands packed into sparse matrices.

    That gives you 32 FP16 FMAs and the other 32 FMAs run on the 16 regular ALUs.
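The decomposition trinibwoy describes can be sketched in plain Python (a conceptual illustration only, not Nvidia's hardware datapath; 4x4x4 is one of the tile shapes tensor cores operate on):

```python
# Conceptual sketch: a 4x4x4 tensor-core-style MMA step (D = A @ B + C)
# decomposed into the 64 scalar fused multiply-adds it consists of.
# In hardware the multiplies are FP16 and the accumulation is FP32;
# plain Python floats stand in for both here.
def mma_4x4x4(A, B, C):
    D = [row[:] for row in C]                  # start from the accumulator C
    for i in range(4):
        for j in range(4):
            for k in range(4):
                D[i][j] += A[i][k] * B[k][j]   # one scalar FMA
    return D
```

Multiplying by the identity with a zero accumulator just returns B, which is an easy sanity check on the decomposition.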
     
  3. OlegSH

    OlegSH Regular

  4. Voxilla

    Voxilla Regular

    Why else would you accumulate at FP32 if you didn't compute the addition operands at FP32?

    The mixed-precision FMA units for TF32 and FP16 are basically the same. The only difference is the handling of the exponent, which has 3 more bits for TF32 than for FP16.
    The cost of an FMA is in the number of mantissa bits, as the silicon area of an FMA multiplier is roughly proportional to the square of the number of mantissa bits, which is the same for TF32 and FP16.
    See also :
    "Smaller mantissa brings a number of other advantages such as reducing the multiplier power and physical silicon area."
    • float32: 24^2=576 (100%)
    • float16: 11^2=121 (21%)
    • bfloat16: 8^2=64 (11%)
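The area rule of thumb above can be checked numerically (my own sketch; the mantissa widths include the implicit leading 1, and TF32's 10 stored mantissa bits come from Nvidia's published format description):

```python
# Sketch: multiplier silicon area taken as roughly proportional to the
# square of the mantissa width (stored bits + 1 implicit bit), per the
# rule of thumb in the post above.
MANTISSA_BITS = {
    "float32":  24,  # 23 stored + 1 implicit
    "tf32":     11,  # 10 stored + 1 implicit (same as float16)
    "float16":  11,  # 10 stored + 1 implicit
    "bfloat16":  8,  #  7 stored + 1 implicit
}

def relative_area(fmt, baseline="float32"):
    return MANTISSA_BITS[fmt] ** 2 / MANTISSA_BITS[baseline] ** 2

for fmt, bits in MANTISSA_BITS.items():
    print(f"{fmt}: {bits}^2 = {bits ** 2} ({relative_area(fmt):.0%})")
```

The output reproduces the 576/121/64 figures in the list above, and shows TF32 landing in the same area bucket as FP16.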
     
    Last edited: Jun 1, 2020
  5. Voxilla

    Voxilla Regular

    Nvidia was smart enough to add BF16 to the A100; have you really been ignorant of this fact?
     
  6. Voxilla

    Voxilla Regular

  7. Rootax

    Rootax Veteran

    If it doesn't reduce the memory footprint, what's the goal? Faster transfer from point A to point B?
     
  8. Frenetic Pony

    Frenetic Pony Regular

    Probably, sounds like lossless delta compression, but for compute, thus more bandwidth. Neat.
     
    BRiT and Rootax like this.
  9. Samwell

    Samwell Newcomer

    Is this something which might help raytracing too? I often read that RT is very cache-performance sensitive. Could this help, or is it only useful for compute?
     
    Man from Atlantis likes this.
  10. MDolenc

    MDolenc Regular

    It sort of does reduce footprint. It can keep data compressed in L2 so there is more cache available. Can't reduce footprint in main memory as you don't know in advance if output could be compressed or not.
     
  11. pharma

    pharma Veteran

    Ignorance is bliss, or is it inclusiveness? Including BF16 means supporting a currently established standard data format for clients to use; it would be negligence otherwise. There are likely scenarios where BF16 performance is sufficient, and in those situations the ability to maintain the status quo is Nvidia's advantage.

    Having the ability to switch to the TF32 format and its extra precision covers scenarios where BF16 is lacking, and gains in performance, cost or precision can be found using the new TF32. Once independent training and inference benchmark results appear we will have a better idea about those scenarios.
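The precision gap between the two formats can be illustrated by truncating an FP32 value to each format's mantissa width (my own sketch; real hardware rounds to nearest rather than truncating):

```python
import struct

# Sketch: emulate the mantissa precision of BF16 (7 stored bits) and
# TF32 (10 stored bits) by zeroing the trailing bits of an FP32 mantissa.
# This truncates toward zero; actual hardware uses round-to-nearest.
def truncate_mantissa(x, kept_bits):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    drop = 23 - kept_bits                 # FP32 stores 23 mantissa bits
    bits &= ~((1 << drop) - 1)            # clear the dropped low bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pi = 3.14159265
print(truncate_mantissa(pi, 7))    # BF16-like precision
print(truncate_mantissa(pi, 10))   # TF32-like precision
```

The TF32-like value stays closer to the input, since its three extra mantissa bits cut the representable step size by a factor of eight.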
     
    A1xLLcqAgt0qc2RyMz0y likes this.
  12. pharma

    pharma Veteran

    DIVING DEEP INTO THE NVIDIA AMPERE GPU ARCHITECTURE
    May 28, 2020

    https://www.nextplatform.com/2020/05/28/diving-deep-into-the-nvidia-ampere-gpu-architecture/
     
  13. Scott_Arm

    Scott_Arm Legend

    So data read from RAM is compressed before it's moved into L2, and then decompressed to L1?
     
  14. MDolenc

    MDolenc Regular

    No. Data leaving SMs can be compressed prior to being written to L2 or memory. Afterwards, if compute accesses that data again, it will be read into L2 in compressed form. So you save bandwidth on the way out of the GPU and on the way back in, and you increase the effective L2 capacity, as data is only decompressed when leaving L2 for L1.
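The behaviour described above is consistent with a delta-style lossless scheme. A toy sketch of the general idea (purely illustrative; Nvidia has not published the actual algorithm, and the function names are my own):

```python
# Toy lossless delta compression (illustrative only): a block is stored as
# (base, narrow deltas) when every delta fits in a signed byte, otherwise
# raw. Decompression always recovers the exact input, so the scheme is
# lossless; incompressible blocks simply gain nothing.
def compress(block):
    base = block[0]
    deltas = [v - base for v in block]
    if all(-128 <= d <= 127 for d in deltas):
        return ("delta", base, deltas)    # compact representation
    return ("raw", list(block))

def decompress(packed):
    if packed[0] == "delta":
        _, base, deltas = packed
        return [base + d for d in deltas]
    return packed[1]
```

A block like [1000, 1003, 999, 1001] takes the compact path, while widely spread values fall back to raw storage; either way the round trip is exact, which is what lets the hardware apply it transparently.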
     
    Lightman, w0lfram, CeeGee and 8 others like this.
  15. DavidGraham

    DavidGraham Veteran

  16. Krteq

    Krteq Newcomer

    That's a great joke/fake

    All those fins and the fan have no purpose at all: there is no airflow allowed, the fan blades are melted to the frame, etc.

    Just some joker with a 3D printer
     


  17. Kaotik

    Kaotik Drunk Member Legend

    What do you mean by "fan blades are melted to a frame"?

    The image claims the PCB is really short, with one fan cooling from the front side of the graphics card and a second from the backside. There are fins visible under the supposed backside fan running lengthwise along the card, even though the ones next to the fan serve no clear purpose.

    Overall really doubtful about it too, but...
     
    Last edited: Jun 6, 2020
    PSman1700 likes this.
  18. someone photoshopped the back of the card without foils
    [IMG]

    cooling concept edited by me over the Chiphell image; the memory arrangement and PCB detail are stand-ins, there is no confirmation of 256-bit memory
    [IMG]
     
    Last edited: Jun 6, 2020
  19. PSman1700

    PSman1700 Legend

    Why do I think of Doom 3 seeing the top image :p Interesting design for the cooling, unseen in traditional GPUs so far, with the two-fan setup, one on each side. Air out the back of the GPU (out of a PC case) is going to be limited that way?
     
  20. One good thing about the pandemic is that I'm saving huge loads of money lol. Gonna need it.

    edit: mmm... if that airflow is correct, it's gonna shower the CPU/RAM with a lot of hot air.
     