AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

AMD's strength with V20 is its FP64 and FP32 performance.
Maybe against Volta PCI-E, but Volta NVLink has 15.7 TF FP32, and 7.8 TF FP64. MI60 has 14.7 TF FP32 and 7.4 TF FP64. That's why Volta still remains faster.

True. However, this time, to my knowledge, AMD have not mentioned gaming at all, and they have said that they have Navi coming for gaming at 7nm.
David Wang on stage has specifically said that the card is not for consumers and is designed specifically for enterprise. They didn't even price that SKU, as they intend to sell it to cloud providers directly.

AMD can't realistically allocate enough 7nm capacity for Vega 20, Epyc 2 and Ryzen 3000 combined. Those will get priority before even Navi or any consumer GPU.
 
AMD Beats Intel, Nvidia to 7 nm
The company took a different approach to AI than its rival Nvidia, which bolted on multiply-accumulate units to its GPU. AMD added support in all of its compute units for formats from 4- and 8-bit integers to 16-, 32- and 64-bit floating-point math. They use mixed-precision 32-bit accumulators.

“We wanted a highly flexible accelerator, not one dedicated to FP16,” said Evan Groenke, an AMD senior product manager.

The result is a chip that generally delivers within 7% of a Volta’s performance before optimizations with less than half the die area (331 mm2 compared to 800+ mm2). “You don’t need large dedicated silicon blocks to get performance gains in machine learning,” said Groenke.

Specifically, AMD said that Vega will deliver 29.5 Tera FP16 operations/second for AI training. In inference jobs, it can hit 59 TOPS for 8-bit integer and 118 TOPS for 4-bit integer tasks.

In addition, AMD added hardware virtualization to the chip. Thus, one 7-nm Vega can support up to 16 virtual machines or a single virtual machine can split its work across more than eight GPUs.

Sales of the cards will depend on uptake of the open-source ROCm software that AMD released for GPU computing. The company announced an updated version of the code now supporting more machine-learning frameworks, math libraries, Docker, and Kubernetes.
https://www.eetimes.com/document.asp?_mc=RSS_EET_EDT&doc_id=1333944&page_number=2
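Those throughput figures are consistent with simple multiples of the card's base FP32 rate. A quick sanity check, assuming MI60's 64 CUs, 64 lanes per CU and a ~1.8 GHz peak clock (assumed specs, not from the article):

```python
# Hypothetical sanity check of the quoted MI60 throughput figures.
# Assumed specs (not from the article): 64 CUs, 64 lanes per CU, ~1.8 GHz peak.
cus, lanes, clock_ghz = 64, 64, 1.8

fp32_tflops = cus * lanes * 2 * clock_ghz / 1000  # an FMA counts as 2 flops
fp16_tflops = fp32_tflops * 2    # packed FP16 / dot2 doubles the rate
int8_tops   = fp32_tflops * 4    # v_dot4 quadruples it
int4_tops   = fp32_tflops * 8    # v_dot8 octuples it

print(f"FP32: {fp32_tflops:.1f} TFLOPS")   # ~14.7
print(f"FP16: {fp16_tflops:.1f} TFLOPS")   # ~29.5
print(f"INT8: {int8_tops:.1f} TOPS")       # ~59.0
print(f"INT4: {int4_tops:.1f} TOPS")       # ~118.0
```

So the 29.5 TFLOPS / 59 TOPS / 118 TOPS numbers are just 2x / 4x / 8x the base FP32 rate, not a separate dedicated block.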
 
I think the point is fairly obvious: AMD's Instinct product line has been clocked at pretty much the same frequencies as the consumer cards based on the respective chips, which contradicts the notion of "+20% for gaming Vega 20", academic as it may be since it is surely not happening.

?
Obviously, AMD is only switching Vega over from GlobalFoundries to TSMC once... which contradicts your whole post and its assumptions. AMD claims they are also getting an uptick in performance just from using TSMC's process and masks alone, before moving anything to 7nm. It supposedly has much to do with layering, thermals and tooling.



I don't see such rebuttals as arguments against 7nm consumer gaming cards (with their own "gamer's" masks).

Again, such a spin would be a cheaply remasked V20 for gaming. Given AMD's patents and the modularity of their GPU uArch, AMD could quite easily spin a new tape-out of Vega 20 without all the far-fetched (transistor-, space- and wattage-robbing) machine learning and HBM2 aspects of Vega 20, and present the gaming community with a small 7nm die aimed at being the king of sub-4K gaming configs. And with a good profit margin at sub-$500 prices.



The question is not whether AMD is able to deliver such a chip shortly... the question is whether it makes sense as a business decision, and whether Dr. Su has the balls. Such things are what we are discussing when mentioning a 7nm gaming card... not a truncated card using the V20 with reduced HBM2 memory, etc. (a new mask before Navi).

Vega & Zen moving forward will be at 7nm.

So how big would an RX Vega 20 die be, if it were a bare-bones Vega 20 with all the "business" aspects masked out? How much power would it draw? How much would it cost? AMD already paid the upfront cost for first access to TSMC's 7nm node. It would be relatively cheap for AMD to spin off a ~300mm2 Vega80 with GDDR6 at 275W. The question is whether doing so would hinder TSMC's capacity to make the more lucrative Vega 20 chips.
 
Yes, they are not.

In May 2017, V100 did ~600 images per second.
[Image: ResNet-50 images/second chart for V100]

https://devblogs.nvidia.com/inside-volta/

In May 2018, NVIDIA made software improvements, and V100 now does 1075 images per second.

https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
On what precision and settings? ResNet-50 can be run at several precisions; for example, the MI25 vs MI60 comparison was done at FP16, while MI60 vs Tesla was FP32.

RIV-13: Testing Conducted by AMD performance labs on October 31, 2018, on a system comprising of Dual Socket Intel Xeon Gold 6130, 256GB DDR4 system memory, Ubuntu 16.04.5 LTS, NVIDIA Tesla V100 PCIe with CUDA 10.0.130 and CUDNN 7.3, AMD Radeon Instinct MI60 graphics, ROCm 19.224 driver, TensorFlow 1.11. Benchmark application: Resnet50 FP32 batch size 256
 
Couple of mistakes there: they didn't "add" hardware virtualization to the chip; it has 3rd-gen hardware virtualization, and Vega 10 (MI25) had hardware virtualization too, so it's not even new to Vega.
Also, one virtual machine can spread work across a maximum of 8 GPUs, not more than 8.

Why would you restrict Volta to FP32 only when Tensor Cores are available?
To look better in the comparison, of course. I have no clue if there's a real-world reason to use FP32 there, but if ResNet allows that precision, I'm pretty sure there has to be a reason for it.
 
A card with a single mini-DP output and no fan because it's made for inserting into racks with standard airflow designs is not meant for consumers?

You don't say!
:runaway:
 
Why would you restrict Volta to FP32 only when Tensor Cores are available?
Code:
v_dot2_f32_f16
v_dot2_i32_i16
v_dot2_u32_u16
v_dot4_i32_i8
v_dot4_u32_u8
v_dot8_i32_i4
v_dot8_u32_u4
Why wouldn't you post FP16 performance numbers, if you have just added dot product instructions to the V20 ISA?

Assuming that they run at the usual 1T latency, that means V20 matches the Tensor Cores in terms of features (accumulation into higher precision register), and it's only 50% behind in performance on the interesting FP16 matrix multiplication with FP32 accumulator.

Well, it depends on whether these instructions take the form "v_dot fp32(inout), fp16[2](in), fp16[2](in)" or the form "v_dot fp32(out), fp16[2](in), fp16[2](in)". That makes the difference between approaching Tensor Core performance at 1:2 or falling behind at 1:4 for vector length n, due to the need for a reduction of partial sums.
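To make that reduction cost concrete, here is a toy instruction-count model (my own sketch, not from any AMD documentation) for an n-element FP16 dot product under the two hypothetical encodings:

```python
# Toy instruction-count model for an n-element FP16 dot product,
# comparing the two possible v_dot2 forms discussed above.

def dots_with_accumulator(n):
    # v_dot fp32(inout), fp16[2], fp16[2]: each instruction consumes
    # 2 elements and folds the result into the running sum.
    return n // 2

def dots_without_accumulator(n):
    # v_dot fp32(out), fp16[2], fp16[2]: n/2 partial sums are produced,
    # which then need n/2 - 1 plain FP32 adds to reduce.
    return n // 2 + (n // 2 - 1)

n = 1024
print(dots_with_accumulator(n))     # 512
print(dots_without_accumulator(n))  # 1023
```

For large n the accumulator-less form needs roughly twice the instructions, which is where the 1:2 vs 1:4 gap against the Tensor Cores comes from.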
 
Will be interesting to see how well V20 scales since no one does training or inferencing work with just a single GPU.
Did they mention what the batch size was for the ResNet-50 benchmark? In past V100 vs TPU2 benchmarks the results went either way based on batch sizes.

Likely will have to wait for independent reviews to get reliable results. Right now it's just marketing gibberish ...
 
Will be interesting to see how well V20 scales since no one does training or inferencing work with just a single GPU.
Did they mention what the batch size was for the ResNet-50 benchmark? In past V100 vs TPU2 benchmarks the results went either way based on batch sizes.

Likely will have to wait for independent reviews to get reliable results. Right now it's just marketing gibberish ...

To quote myself a few posts back, it's 256.
RIV-13: Testing Conducted by AMD performance labs on October 31, 2018, on a system comprising of Dual Socket Intel Xeon Gold 6130, 256GB DDR4 system memory, Ubuntu 16.04.5 LTS, NVIDIA Tesla V100 PCIe with CUDA 10.0.130 and CUDNN 7.3, AMD Radeon Instinct MI60 graphics, ROCm 19.224 driver, TensorFlow 1.11. Benchmark application: Resnet50 FP32 batch size 256

Also, for scaling: 2 cards @ ~1.99x, 4 cards @ ~3.98x, 8 cards @ ~7.64x.
RIV-11: Testing Conducted by AMD performance labs on October 31, 2018, on a system comprising of Dual Intel Xeon Gold 6132, 256GB DDR4 system memory, Ubuntu 16.04.5 LTS, AMD Radeon Instinct MI60 graphics running at 1600e/500m, ROCm 19.224 driver, TensorFlow 1.11. Benchmark application: Resnet50 FP32 batch size 256. 1x AMD Radeon Instinct MI60 = 278.63images/s, 2x Radeon Instinct MI60 = 553.98 images/s. Performance differential: 553.98/278.63 = 1.99x times more performance than 1x Radeon Instinct MI60. 4x Radeon Instinct MI60 = 1109.24 images/s. Performance differential: 1109.24/278.63 = 3.98x times more performance than 1x Radeon Instinct MI60. 8x Radeon Instinct MI60 = 2128.33 images/s. Performance differential: 2128.33/278.63 = 7.64x times more performance than 1x Radeon Instinct MI60. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations
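The quoted differentials check out against the raw images/s figures in that footnote:

```python
# Scaling efficiency computed from AMD's RIV-11 figures (images/s).
baseline = 278.63
results = {1: 278.63, 2: 553.98, 4: 1109.24, 8: 2128.33}

for gpus, rate in results.items():
    speedup = rate / baseline
    efficiency = speedup / gpus
    print(f"{gpus} GPU(s): {speedup:.2f}x speedup, {efficiency:.1%} efficiency")
```

So near-linear up to 4 cards (~99.5%), dropping to ~95.5% at 8, which is still very good for a single-node ResNet-50 run.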
 
Why wouldn't you post fp16 performance numbers, if you have just added dot product instructions to the V20 ISA?
In one of the videos the AMD representative said they would have lower performance if the comparison is made at FP16 with Volta's tensor cores.
 
In one of the videos the AMD representative said they would have lower performance if the comparison is made at FP16 with Volta's tensor cores.
Which would be odd, given the instruction set extensions, and the fact that FP16 perf with Tensor Cores is also just 2x FP32 perf. And at least for simpler FP16 instructions, Vega10 did already achieve the performance target, too.

Either the instruction set extension is the less ideal option of the two possibilities, some of the new instructions don't achieve full throughput, or there is an unexpected other bottleneck with FP16.
Either way, really odd. And it would still be interesting to know by how much it's actually slower, not just a generic "we don't want to tell numbers because they look worse". I know I'm repeating myself, but it shouldn't be by much.
 
Assuming that they run at the usual 1T latency, that means V20 matches the Tensor Cores both in terms of features (accumulation into higher precision register) as well as speed for vector-matrix and matrix-matrix multiplications, from FP16 input to FP32 target.

Could you please explain how this should work for V20, as I'm not really getting it. In my understanding both Volta and V20 have double-rate FP16, both around 30 TFLOPS. V20 adds 4x INT8 rate and 8x INT4 rate; that's it. Volta on the other side has the additional TCs, which get it to 120 TFLOPS with tensor cores. Or are AMD's and Nvidia's definitions of TOPS different?
Nothing in AMD's presentation or webpage for MI60 indicates that it could reach tensor core performance. As Kaotik wrote, AMD's MI60 vs Tesla comparison was made using FP32. This way Volta only uses its standard FP32 rate and isn't using its tensor cores, as the TCs only support mixed precision and no pure FP32.
 
You need to understand that Tensor cores are special-function hardware that does half-precision matrix multiplication for 16x16 matrices.
Could you please explain how this should work for V20, as I'm not really getting it. In my understanding both Volta and V20 have double-rate FP16, both around 30 TFLOPS. V20 adds 4x INT8 rate and 8x INT4 rate; that's it. Volta on the other side has the additional TCs, which get it to 120 TFLOPS with tensor cores. Or are AMD's and Nvidia's definitions of TOPS different?
Nothing in AMD's presentation or webpage for MI60 indicates that it could reach tensor core performance. As Kaotik wrote, AMD's MI60 vs Tesla comparison was made using FP32. This way Volta only uses its standard FP32 rate and isn't using its tensor cores, as the TCs only support mixed precision and no pure FP32.

Tensor cores are specialized matrix multiply-accumulate units, and don't behave by the normal rules with respect to "ordinary" ALU pipelines. The key to note is that they can run with input/output saturating the register bandwidth of the processor, but a matrix multiplication has a lot of data reuse internally, so bandwidth is effectively amplified by 4x over a half-precision FMA operation.

A single 4x4 matrix multiply-accumulate performs 128 flops, and there are 4 units per warp (1 per 8 threads, issued cooperatively; the programming model is strange, with each of those threads owning a portion of the registers), for a total of 16 flops per clock per thread. An FMA performs 2 flops (multiply and add), but is issued over 32 units (one per thread), for a total of 2 flops per clock per thread. Half precision has 64 units per warp (2-wide SIMD per thread), and thus can perform 4 flops per clock per thread.

Since deep learning is basically a bunch of matrix multiplication, the tensor cores are able to run at full tilt, so you get the insane 120 TFlops number.
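Scaling those per-unit rates up to a full chip reproduces the headline numbers. A rough check, assuming a V100 with 80 SMs, 64 FP32 lanes and 8 tensor cores per SM at a ~1530 MHz boost clock (assumed specs, not from the post above):

```python
# Rough peak-rate check for a V100, using assumed specs:
# 80 SMs, 64 FP32 lanes and 8 tensor cores per SM, ~1530 MHz boost.
sms, clock_ghz = 80, 1.53
fp32_lanes_per_sm = 64
tensor_cores_per_sm = 8

fp32_tflops = sms * fp32_lanes_per_sm * 2 * clock_ghz / 1000  # FMA = 2 flops
fp16_tflops = fp32_tflops * 2                                 # packed half rate
# Each tensor core does a 4x4x4 matrix FMA per clock: 64 muls + 64 adds = 128 flops.
tensor_tflops = sms * tensor_cores_per_sm * 128 * clock_ghz / 1000

print(f"FP32:   {fp32_tflops:.1f} TFLOPS")   # ~15.7
print(f"FP16:   {fp16_tflops:.1f} TFLOPS")   # ~31.3
print(f"Tensor: {tensor_tflops:.1f} TFLOPS") # ~125
```

Which matches the 15.7 TFLOPS FP32 figure quoted earlier in the thread, and puts the tensor cores at roughly 8x the FP32 rate.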
 
Could you please explain how this should work for V20, as I'm not really getting it. In my understanding both Volta and V20 have double-rate FP16, both around 30 TFLOPS. V20 adds 4x INT8 rate and 8x INT4 rate; that's it. Volta on the other side has the additional TCs, which get it to 120 TFLOPS with tensor cores. Or are AMD's and Nvidia's definitions of TOPS different?
Nothing in AMD's presentation or webpage for MI60 indicates that it could reach tensor core performance. As Kaotik wrote, AMD's MI60 vs Tesla comparison was made using FP32. This way Volta only uses its standard FP32 rate and isn't using its tensor cores, as the TCs only support mixed precision and no pure FP32.
In addition to the answer from @keldor, the dot product instructions in V20 perform not just the previous 2 flops per instruction but 3 or 4 flops, depending on whether the accumulator can be passed in. (If it's just 3 flops, you will unfortunately need another plain FP32 ADD, which only gives you a single flop for a whole instruction.) Which results in an increase from 30 TFLOPS to 45 or 60 TFLOPS for FP16.

I am confused about the numbers for the tensor cores; I thought they were just 30 TFLOPS in FP16, not 120. I thought 120 was the INT4 perf, which was so creatively published under the caption "flops" too.

Edit: And I did get the math wrong again; the vectorized FP16 FMA instructions in Vega did already count as 4 flops too.

Edit2: Well, it is just 60 TFLOPS for the relevant FP32-accumulator operation mode.
 
Either way, really odd. And it would still be interesting to know by how much it's actually slower, not just a generic "we don't want to tell numbers because they look worse". I know I'm repeating myself, but it shouldn't be by much.
Judging by NVIDIA's scores with the tensor cores, the MI60 has a lot less performance than the V100.
 