AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,833
    Likes Received:
    2,663
    Maybe against the PCIe Volta, but the NVLink Volta has 15.7 TF FP32 and 7.8 TF FP64, while the MI60 has 14.7 TF FP32 and 7.4 TF FP64. That's why Volta still remains faster.

    On stage, David Wang specifically said that the card is not for consumers and is designed for enterprise. They didn't even price the SKU, as they intend to sell it to cloud providers directly.

    AMD can't actually allocate enough 7nm capacity for Vega 20, Epyc 2 and Ryzen 3000 combined. Those will get priority over Navi or any other consumer GPU.
     
    #5661 DavidGraham, Nov 7, 2018
    Last edited: Nov 7, 2018
  2. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,998
    Likes Received:
    1,685
    AMD Beats Intel, Nvidia to 7 nm
    https://www.eetimes.com/document.asp?_mc=RSS_EET_EDT&doc_id=1333944&page_number=2
     
    Heinrich4 and BRiT like this.
  3. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    159
    Likes Received:
    33
    Obviously, AMD is switching Vega over from GlobalFoundries to TSMC only once, which contradicts your whole post and its assumptions. AMD claims they are also getting an uptick in performance just from using TSMC's process and masks, before even moving anything to 7nm. Supposedly it has much to do with layering, thermals and tooling.



    I don't see such rebuttals as arguments against 7nm consumer gaming cards (with their own "gamer" masks).

    Again, such a spin would be a cheap, re-masked V20 for gaming. Given AMD's patents and the modularity of its GPU uArch, AMD could certainly and quite easily tape out a new Vega 20 derivative without all the far-fetched (transistor-, area- and wattage-robbing) machine learning and HBM2 aspects of Vega 20, and present the gaming community with a small 7nm die aimed at being the king of sub-4K gaming configs, with a good profit margin at sub-$500 prices.



    The question is not whether AMD is able to deliver such a chip shortly; the question is whether it makes a sensible business decision, and whether Dr. Su has the nerve to do it. That is what we are discussing when mentioning a 7nm gaming card (a new mask set before Navi), not a truncated card using the V20 with reduced HBM2 memory, etc.

    Vega & Zen moving forward will be at 7nm.

    So how big would an RX Vega 20 chip be if it were a bare-bones Vega 20 with all the "business" aspects masked out? How much power would it draw? How much would it cost? AMD already paid the upfront cost for early access to TSMC's 7nm node, so it would be relatively cheap for AMD to spin off a ~300mm² Vega 80 with GDDR6 at 275W. The other question is whether doing so would hinder TSMC's capacity to make the lucrative Vega 20 chips.
     
  4. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,254
    Likes Received:
    1,937
    Location:
    Finland
    At what precision and settings? ResNet-50 can be run at several precisions; for example, the MI25 vs MI60 comparison was done at FP16, while MI60 vs Tesla was done at FP32.

     
    Nemo, Ike Turner and Lightman like this.
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,833
    Likes Received:
    2,663
    Why would you restrict Volta to FP32 only when Tensor Cores are available?
     
    xpea and A1xLLcqAgt0qc2RyMz0y like this.
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,254
    Likes Received:
    1,937
    Location:
    Finland
    A couple of mistakes there: they didn't "add" hardware virtualization to the chip. It has 3rd-gen hardware virtualization, and Vega 10 (MI25) had hardware virtualization too, so it's not even new to Vega.
    Also, one virtual machine can spread work across a maximum of 8 GPUs, not more than 8 GPUs.

    To look better in the comparison, of course? I have no clue whether there's a real-world reason to use FP32 there, but if ResNet allows that precision, I'm pretty sure there's a reason for it too.
     
  7. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,073
    Likes Received:
    4,651
    A card with a single mini-DP output and no fan, because it's made for inserting into racks with standard airflow designs, is not meant for consumers?

    You don't say!
    :runaway:
     
  8. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    Code:
    v_dot2_f32_f16
    v_dot2_i32_i16
    v_dot2_u32_u16
    v_dot4_i32_i8
    v_dot4_u32_u8
    v_dot8_i32_i4
    v_dot8_u32_u4
    Why wouldn't you post FP16 performance numbers if you have just added dot product instructions to the V20 ISA?

    Assuming they run at the usual 1T latency, that means V20 matches the Tensor Cores in terms of features (accumulation into a higher-precision register), and it's only 50% behind in performance on the interesting FP16 matrix multiplication with an FP32 accumulator.

    Well, it depends on whether these instructions take the form "v_dot fp32(inout), fp16[2](in), fp16[2](in)" or the form "v_dot fp32(out), fp16[2](in), fp16[2](in)". That makes the difference between approaching Tensor Core performance at 1:2 or falling behind at 1:4 for vector length n, due to the need for a reduction of the partial sums.
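    To make that concrete, here is a rough sketch of the two candidate semantics per lane, written as CUDA-flavoured C. The function names and the half2 packing are purely illustrative, not real mnemonics:
    Code:
    #include <cuda_fp16.h>

    // Form A: accumulator passed in (fp32 inout) -- one instruction per fp16 pair,
    // 2 muls + 2 adds = 4 flops, and partial sums chain with no extra work.
    __device__ float dot2_acc(__half2 a, __half2 b, float acc) {
        return __low2float(a)  * __low2float(b)
             + __high2float(a) * __high2float(b)
             + acc;
    }

    // Form B: no accumulator (fp32 out) -- only a partial sum per instruction
    // (3 flops), so every pair needs an extra plain FP32 add to reduce.
    __device__ float dot2(__half2 a, __half2 b) {
        return __low2float(a)  * __low2float(b)
             + __high2float(a) * __high2float(b);
    }

    // Length-n dot product with form B: n/2 dot instructions plus ~n/2 adds,
    // which is where the 1:2 vs 1:4 gap above comes from.
    __device__ float dot_n(const __half2* a, const __half2* b, int n2) {
        float acc = 0.0f;
        for (int i = 0; i < n2; ++i)
            acc += dot2(a[i], b[i]);   // the extra FP32 add per partial sum
        return acc;
    }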
     
    #5668 Ext3h, Nov 7, 2018
    Last edited: Nov 8, 2018
  9. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,998
    Likes Received:
    1,685
    It will be interesting to see how well V20 scales, since no one does training or inference work with just a single GPU.
    Did they mention the batch size for the ResNet-50 benchmark? In past V100 vs TPU2 benchmarks the results went either way depending on batch size.

    We'll likely have to wait for independent reviews to get reliable results. Right now it's just marketing gibberish ...
     
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,254
    Likes Received:
    1,937
    Location:
    Finland
    To quote myself from a few posts back, it's 256.
    As for scaling: 2 cards @ ~1.99x, 4 cards @ ~3.98x, 8 cards @ ~7.64x.
     
    BRiT and pharma like this.
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,833
    Likes Received:
    2,663
    In one of the videos, the AMD representative said they would show lower performance if the comparison were made at FP16 against Volta's tensor cores.
     
  12. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    Which would be odd, given the instruction set extensions, and given that FP16 perf with the Tensor Cores is also just 2x FP32 perf. And at least for the simpler FP16 instructions, Vega 10 already hit that performance target, too.

    Either the instruction set extension is the less ideal of the two possible forms, some of the new instructions don't achieve full throughput, or there is some other unexpected bottleneck with FP16.
    Either way, it's really odd. It would still be interesting to know how much slower it actually is, not just a generic "we don't want to give numbers because they look worse". I know I'm repeating myself, but it shouldn't be by much.
     
    #5672 Ext3h, Nov 7, 2018
    Last edited: Nov 8, 2018
  13. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    112
    Likes Received:
    129
    Could you please explain how this should work for V20? I'm not really getting it. In my understanding, both Volta and V20 have double-rate FP16, both around 30 TFLOPS. V20 has 4x the INT8 rate and 8x the INT4 rate; that's it. Volta, on the other hand, has the additional TCs, which get it to 120 TFLOPS with the tensor cores. Or are AMD's and Nvidia's definitions of TOPS different?
    Nothing in AMD's presentation or the MI60 webpage indicates that it could reach tensor core performance. As Kaotik wrote, AMD's MI60 vs Tesla comparison was made using FP32. That way Volta only uses its standard FP32 rate and isn't using its tensor cores, as the TCs only support mixed precision and not pure FP32.
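    As a rough sanity check on those numbers, using the published 64 CUs at ~1.8 GHz for MI60 and 640 tensor cores at ~1.53 GHz for V100:

    $$
    \begin{aligned}
    \text{MI60 FP32:}&\quad 64\ \text{CUs} \times 64\ \text{lanes} \times 2\ \text{flops} \times 1.8\ \text{GHz} \approx 14.7\ \text{TFLOPS}\\
    \text{MI60 FP16 (packed):}&\quad 2 \times 14.7 \approx 29.5\ \text{TFLOPS}\\
    \text{V100 tensor:}&\quad 640\ \text{cores} \times 64\ \tfrac{\text{FMA}}{\text{clk}} \times 2\ \text{flops} \times 1.53\ \text{GHz} \approx 125\ \text{TFLOPS}
    \end{aligned}
    $$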
     
  14. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    You need to understand that tensor cores are special-function hardware that does half-precision matrix multiplication on 16x16 matrices.
    Tensor cores are specialized matrix multiply-and-accumulate units and don't follow the normal rules of the "ordinary" ALU pipelines. The key point is that they can run while saturating the register bandwidth of the processor, but a matrix multiplication has a lot of internal data reuse, so bandwidth is effectively amplified by 4x over a half-precision FMA operation. A single 4x4 matrix multiply-and-accumulate performs 128 flops, and there are 4 units per warp (1 per 8 threads, issued cooperatively; the programming model is strange, with each of those threads owning a portion of the registers), for a total of 16 flops per clock per thread. An FMA performs 2 flops (multiply and add) but is issued over 32 units (one per thread), for a total of 2 flops per clock per thread. Half precision has 64 units per warp (2-wide SIMD per thread) and can thus perform 4 flops per clock per thread.

    Since deep learning is basically a bunch of matrix multiplications, the tensor cores can run at full tilt, so you get the insane 120 TFLOPS number.
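    For anyone curious what that looks like from the programming side, here's a minimal sketch of the warp-level WMMA path CUDA exposes for the tensor cores: one warp multiplies a 16x16 FP16 tile pair and accumulates into FP32, i.e. the mixed-precision mode described above (the layouts and leading dimensions are just assumptions for the example):
    Code:
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes C = A*B + C on a single 16x16x16 tile,
    // FP16 inputs with an FP32 accumulator (mixed precision).
    __global__ void wmma_tile(const half* a, const half* b, float* c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        // Every thread in the warp owns a slice of each fragment; the loads,
        // the MMA and the store are all cooperative, warp-wide operations.
        wmma::load_matrix_sync(a_frag, a, 16);             // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // runs on the tensor cores

        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }
    Launched as a single warp, that one mma_sync amounts to 16x16x16 = 4096 FMAs, i.e. 8192 flops, issued cooperatively by 32 threads.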
     
  15. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    In addition to the answer from @keldor: the dot product instructions in V20 also perform more than the previous 2 flops per instruction, namely 3 or 4 flops depending on whether the accumulator can be passed in. (If it's just 3 flops, you unfortunately need another plain FP32 ADD, which gives only a single flop for a whole instruction.) That would result in an increase from 30 TFLOPS to 45 or 60 TFLOPS for FP16.

    I am confused about the numbers for the tensor cores; I thought they were just 30 TFLOPS in FP16, not 120, and that 120 was the INT4 perf, which was so creatively published under the caption "flops" too.

    Edit: And I got the math wrong again; the vectorized FP16 FMA instructions in Vega already counted as 4 flops too.

    Edit 2: Well, it is just 60 TFLOPS for the relevant FP32-accumulator operation mode.
     
    #5675 Ext3h, Nov 8, 2018
    Last edited: Nov 8, 2018
  16. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,833
    Likes Received:
    2,663
    Judging by NVIDIA's scores with the tensor cores, the MI60 has a lot less performance than the V100.
     
    A1xLLcqAgt0qc2RyMz0y likes this.
  17. Pressure

    Veteran Regular

    Joined:
    Mar 30, 2004
    Messages:
    1,355
    Likes Received:
    283
    I'm not sure what you would expect when comparing a 332mm² die (13.2 billion transistors) to an 815mm² die (21.1 billion transistors).
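    For rough scale, the arithmetic on those numbers gives quite different transistor densities:

    $$
    \frac{13.2\ \text{B}}{332\ \text{mm}^2} \approx 39.8\ \text{MTr/mm}^2
    \qquad \text{vs.} \qquad
    \frac{21.1\ \text{B}}{815\ \text{mm}^2} \approx 25.9\ \text{MTr/mm}^2
    $$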
     
    w0lfram, Lightman, hoom and 4 others like this.
  18. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,073
    Likes Received:
    4,651
    AMD Incompetenzzz obviously.
     
  19. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,195
    Likes Received:
    591
    Location:
    France
    Are they sold at a similar price in the same market?
     
  20. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,833
    Likes Received:
    2,663
    You forget we are talking about 7nm vs 12nm.
     