Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,442
    Likes Received:
    181
    Location:
    Chania
    On a side note, no wonder NVIDIA "complained" years ago about HBM not scaling the way IHVs would like. For those wondering about the lack of far more FP32 performance: within the current process/manufacturing/die-area and overall bandwidth constraints (among many others, of course), IHVs obviously try to find the best possible balance for each target market.

    Ampere isn't a mainstream consumer product, and I'd be very surprised if Turing's successor came with HBM.
     
  2. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    747
    Likes Received:
    318
    To explain the difference between FP32 and TF32 once more:
    TF32 does matrix multiplication with matrix values that are 19-bit numbers (1 sign bit, 8 exponent bits, 10 mantissa bits), i.e. with FP19 precision.
    Hence tensor cores running TF32 cannot do matrix multiplication at FP32 precision.

    For AI training, FP19 can be enough; even BF16 can be enough.
    Google's TPU2/3 are based on BF16, and the A100 also supports BF16.
    BF16 halves memory/cache storage and doubles effective bandwidth compared to TF32.
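    To make the bit layouts concrete, here is a small NumPy sketch of my own (pure truncation, whereas real hardware rounds, and obviously not NVIDIA code) showing which bits of an FP32 value each format keeps:

    ```python
    import numpy as np

    def truncate_to_tf32(x):
        """Keep 1 sign + 8 exponent + 10 mantissa bits (19 total); zero the other 13 mantissa bits."""
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFFE000)).view(np.float32)

    def truncate_to_bf16(x):
        """Keep 1 sign + 8 exponent + 7 mantissa bits (16 total); zero the other 16 mantissa bits."""
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    x = np.array([3.14159265], dtype=np.float32)
    print(x, truncate_to_tf32(x), truncate_to_bf16(x))
    # FP32 keeps 23 mantissa bits, TF32 keeps 10, BF16 keeps 7; all three share the same 8-bit exponent.
    ```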
     
    #162 Voxilla, May 27, 2020
    Last edited: May 27, 2020
  3. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    947
    Likes Received:
    46
    Location:
    LA, California
    Out of curiosity - are your HPC codes typically limited by ALU throughput, or is increasing memory (main/cache/shared) bandwidth per FLOP more important?
     
  4. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    404
    Likes Received:
    431
    You forgot to mention that TF32 operates on FP32 inputs, all internal accumulators are FP32, and the output is FP32. In training there is no difference between FP32 and TF32. That's why Nvidia replaced FP32 entirely with TF32.

    https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
    Sure, people have doubts now, but soon users will compare TF32 and FP32 and see that it gives exactly the same output with up to 10 times faster training (up to 20 times with sparsity).
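    For intuition, a rough NumPy emulation of that dataflow (my own sketch, not how the silicon actually works): FP32 in and out, the mantissas cut to 10 bits only for the multiply operands, accumulation left in FP32:

    ```python
    import numpy as np

    def tf32_truncate(x):
        """Drop the 13 mantissa bits a TF32 multiply ignores (truncation here; real hardware rounds)."""
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFFE000)).view(np.float32)

    def tf32_matmul(a, b):
        """FP32 arrays in, FP32 array out: only the multiply operands lose mantissa bits,
        the accumulation stays in FP32."""
        return tf32_truncate(a) @ tf32_truncate(b)

    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256)).astype(np.float32)
    b = rng.standard_normal((256, 256)).astype(np.float32)

    ref = a @ b               # plain FP32 matmul
    emu = tf32_matmul(a, b)   # emulated "TF32" matmul, same shapes and dtypes as the FP32 one
    print(np.abs(ref - emu).max() / np.abs(ref).max())   # roughly 1e-3, set by the 10-bit mantissa
    ```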
     
    #164 xpea, May 27, 2020
    Last edited: May 27, 2020
  5. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    747
    Likes Received:
    318
    The fact that TF32 stores FP19 values in 32 bits is indeed a waste of memory bandwidth and capacity.
    When those 32-bit values are loaded into the tensor cores, the first thing that happens is that the 13 unused bits are thrown away.
     
  6. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    388
    Likes Received:
    331
    Doesn't the addition of the bias matrix afterwards come at full FP32 precision (so that the output fully occupies 32 bits)? Not sure how much it helps with convergence in NNs, but I've heard it helps quite a bit.
     
    #166 OlegSH, May 27, 2020
    Last edited: May 27, 2020
  7. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    747
    Likes Received:
    318
    The main thing that matters IMHO is that the accumulation happens at FP32 precision; once that is done, the result can be reduced back to lower precision, optionally after adding the bias. Either the output is fed back into the tensor cores, which use only FP19, or a non-linearity like ReLU, sigmoid or tanh is applied, which needs only low-precision input and output.
    In any case, in principle there is no need to store the matrices back at 32 bits when the tensor cores only use 19 bits of each matrix value; storing them at 19 bits would be sufficient. Some hardware-based compression/decompression could have enabled that (or maybe this mysterious compute data compression can do it?).
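    As a toy illustration of the "storing them at 19 bits would be sufficient" point (a purely hypothetical packing scheme, not an actual hardware feature), the 19 surviving bits round-trip losslessly:

    ```python
    import numpy as np

    def pack_tf32(x):
        """Keep only the 19 bits a TF32 multiply uses, by shifting out the 13 dead mantissa bits."""
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return bits >> np.uint32(13)          # 19 meaningful bits per value

    def unpack_tf32(p):
        """Rebuild the FP32 bit pattern the tensor core would see (its low 13 bits are zero anyway)."""
        return (p << np.uint32(13)).view(np.float32)

    x = np.random.default_rng(1).standard_normal(4).astype(np.float32)
    tf32_view = unpack_tf32(pack_tf32(x))
    # tf32_view equals x with the low 13 mantissa bits zeroed, i.e. exactly the values a TF32 multiply
    # consumes, so 19 bits of storage per element would reproduce the TF32 result bit for bit.
    ```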
     
    #167 Voxilla, May 27, 2020
    Last edited: May 27, 2020
  8. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    388
    Likes Received:
    331
    A good question.
    HPC folks I know are quite positive about changes in Ampere because their tasks are bandwidth bound in many cases.
    They were also quite impressed by the new L2 cache architecture with cache residency control; they said they already have pipelines that can benefit greatly from on-chip producer-consumer queues in L2.
    Asynchronous barriers and copy instructions + new warp reduction ops were praised too.

    I don't know why folks here think that NVIDIA hasn't done a hell of a lot of profiling for Ampere to ensure that it's great at traditional compute.
    All the new features are likely based on feedback from devs, which is why most of the devs I know are very positive about the changes in Ampere.
     
    Alexko, pharma, nnunn and 1 other person like this.
  9. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    404
    Likes Received:
    431
    Frankly, I like this kind of waste when it provides 10 times the performance :lol:
    More seriously, it's obviously there for backward compatibility with FP32. The user doesn't have to change anything in their data to get an instant speed-up.
     
  10. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    747
    Likes Received:
    318
    I like the zero waste of BF16, which does 20 times better (obviously also compared to a non-tensor-core V100).
    And it's not just about speed: if you waste memory and your big model cannot fit in memory, you cannot even train it.
     
  11. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,532
    Likes Received:
    2,217
    The performance benchmarks can't come soon enough! There should be some very revealing benchmark comparisons since many are currently running "mature" training and inference models.
     
  12. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    404
    Likes Received:
    431
    Agree. Speed is one variable to consider. Accuracy is another one and BF16 doesn't provide enough precision for many networks...
     
  13. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    747
    Likes Received:
    318
    I already asked this before, without reply:
    please provide some reference papers showing that BF16 would not be sufficient.
     
  14. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    36
    Likes Received:
    30
    For us, it's increasing memory bandwidth.

    Lots of HPC jobs are famously "bandwidth limited". For these, bandwidth per FLOP determines performance. So while it's nice to know our new A100 cards will offer "19.5 Tflops of FP64", for us the problem is how to feed such feisty cores. (1.6 TB/s of HBM2 does help!)
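    A quick back-of-the-envelope using the figures quoted above (rounded, and taking the peak numbers at face value):

    ```python
    fp64_flops    = 19.5e12   # quoted peak FP64 rate, FLOP/s
    hbm_bytes_sec = 1.6e12    # quoted HBM2 bandwidth, bytes/s

    bytes_per_flop  = hbm_bytes_sec / fp64_flops   # ~0.08 bytes of DRAM traffic per FLOP at peak
    flops_per_value = 8.0 / bytes_per_flop         # ~100 FP64 ops per 8-byte operand streamed in
    print(f"{bytes_per_flop:.3f} bytes/FLOP, ~{flops_per_value:.0f} FLOPs per double loaded")
    # A kernel doing much fewer than ~100 FP64 ops per double it reads from HBM is bandwidth limited.
    ```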
     
    psurge, CarstenS and pharma like this.
  15. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,084
    Likes Received:
    945
    Location:
    Earth
    Maybe your experience is different from mine. I know quite a few people who work with DNNs. The gold standard is to implement training in FP32, then try to optimize to FP16 or even lower precision where possible. For some layers/networks this works out, for some it doesn't. A lot of the tensor accelerators do the multiplies in a lower-precision format, but input, accumulation and output are in a higher precision. TF32 multiplies at lower precision (19 bits), but accumulation, input and output are FP32. It's a pretty good compromise between quality and performance. If TF32 can replace FP32 in training, that is a very big boost. Since TF32's input/output is FP32, it's a drop-in replacement for FP32: from the network developer's/scientist's perspective it just works without any code changes (albeit the precision can be worse than FP32).

    Inference can often use lower precision than training. There the holy grail is to get to INT8 or even INT4 if possible. FP32 inference can serve as a reference, but it's very likely some lower-precision format is good enough.

    It's typical to see different hardware solutions for training and inference, as the requirements in the two cases differ enough that dedicated silicon gives an advantage (cost, power consumption).

    If one knows the use case exactly, the end result can be something like Google's TPU or what Tesla uses in their cars. On the other hand, if one is building a generic product for the datacenter that a wide variety of customers want to use, then flexibility is required to cover a reasonable number of different use cases. The more flexible you go, the less chance you have to create a small and optimal solution. I.e. there is space for very specific accelerators, very generic processors (CPU), and possibly something in between (GPU?).
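    For reference, the "FP32 first, then lower precision" workflow usually ends up looking like standard mixed-precision training. A minimal sketch with PyTorch's torch.cuda.amp, using a toy model and random data purely as placeholders:

    ```python
    import torch
    from torch import nn

    device = "cuda"
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler()   # loss scaling so FP16 gradients don't underflow

    for step in range(100):
        x = torch.randn(64, 512, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # matmuls/convs run in reduced precision, reductions in FP32
            loss = nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)             # unscales the gradients, then runs the FP32 optimizer step
        scaler.update()
    ```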
     
    #175 manux, May 27, 2020
    Last edited: May 27, 2020
    pharma and DavidGraham like this.
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    Interesting. Is that mainly main-memory bandwidth, or do things like the large and fast (7.2 TB/s) L2 cache for data reuse, as well as async copy for taking pressure off the L1 and register file, also help?
     
  17. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    747
    Likes Received:
    318
    That still doesn't answer the question with solid paper evidence showing BF16 would not be sufficient for training.
    It's quite strange how Nvidia promotes TF32 and doesn't talk about the benefits of BF16, just as your reply doesn't.
    I'm reading quite a few AI training papers, which happen to be mostly Google papers.
    They don't talk about GPUs there, but about TPUs, which are BF16.
    For example this quote: "All models are trained in Tensorflow [25] using the Lingvo [26] toolkit on 8x8 Tensor Processing Units (TPU) slices with a global batch size of 4,096."
    If training works with BF16 for Google, and they design their TPUs around BF16, that gives them a huge edge over people who are told to stick with FP32/TF32.
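    For what it's worth, opting into BF16 on the framework side is basically a one-liner these days. A minimal sketch with TensorFlow/Keras (assuming TF 2.4+ and a BF16-capable backend such as a TPU; the model itself is just a placeholder):

    ```python
    import tensorflow as tf

    # Compute in bfloat16, keep the variables (weights) in float32 for stable updates.
    tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),   # logits
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    # model.fit(...) now runs its matmuls in BF16 (on TPU, or on Ampere's BF16 tensor cores).
    ```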
     
  18. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,532
    Likes Received:
    2,217
    https://forum.beyond3d.com/posts/2126178/

    https://www.nextplatform.com/2020/05/14/nvidia-unifies-ai-compute-with-ampere-gpu/
     
  19. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,084
    Likes Received:
    945
    Location:
    Earth
    I wasn't trying to prove anything. I just wrote what I have seen happen in real life. Based on the people I know and have worked with, it's typical to implement FP32 training first, then optimize to lower precision and compare against the FP32 model. Sometimes the lower-precision optimizations work out, sometimes they don't. Inference, on the other hand, is a very different animal.

    A lot of DNN research/development is done by folks who are surprisingly computer illiterate. Those sciency folks just like to have high precision + Python and make things work for their papers. It's a whole other talent to take that research and optimize the hell out of it to make something production-worthy.
     
    #179 manux, May 27, 2020
    Last edited: May 27, 2020
  20. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,084
    Likes Received:
    945
    Location:
    Earth
    So I googled BERT FP32 vs FP16. BERT is all the rage nowadays. A little surprisingly, I found a data point showing what I was trying to share anecdotally. Unfortunately the blog post doesn't compare the accuracy of the FP32- and FP16-trained models. It would be interesting to know whether an FP16-trained network matches an FP32-trained one in accuracy, or whether there is some small loss.


    https://news.developer.nvidia.com/nvidia-achieves-4x-speedup-on-bert-neural-network/
     