AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

sir doris · Nov 8, 2018

DavidGraham said:
You forget we are talking 7nm vs 12nm.

Surely 13b vs 21b transistors wouldn't be comparable regardless of the process used?

Alexko · Nov 8, 2018

Rootax said:
Are they sold at a similar price in the same market?

Same market, yes, but a similar price would be very surprising.

DavidGraham · Nov 8, 2018

sir doris said:
Surely 13b vs 21b transistors wouldn't be comparable regardless of the process used?

Maybe for evaluating price and cost (which are really far less important aspects to discuss in a tech forum), but for performance, architectural efficiency, power efficiency and scalability on different nodes, not so much.

You have 13b transistors consuming 300W on 7nm, vs 21b transistors doing the same thing on 12nm (basically 16nm). How is that irrelevant?

Shifty Geezer · Nov 8, 2018

DavidGraham said:
Maybe for evaluating price and cost (which are really far less important aspects to discuss in a tech forum), but for performance, architectural efficiency, power efficiency and scalability on different nodes, not so much.

You have 13b transistors consuming 300W on 7nm, vs 21b transistors doing the same thing on 12nm (basically 16nm). How is that irrelevant?

Watts doesn't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300w part on 16 nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.

Assuming you are choosing a part based on power efficiency. You may choose output per $.

Regardless, watts consumed means very little without useful benchmarks to compare workload.
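
The comparison above boils down to dividing useful output by power. A minimal sketch, using the made-up figures from the post; the function name and the "megayums" unit are purely illustrative, not a real benchmark:

```python
def work_per_watt(work_units: float, watts: float) -> float:
    """Return useful output per watt of board power."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return work_units / watts


# Hypothetical parts from the post: same 300 W budget, different output.
part_7nm = work_per_watt(50.0, 300.0)
part_16nm = work_per_watt(40.0, 300.0)
advantage = part_7nm / part_16nm  # relative efficiency of the 7nm part

print(f"7nm: {part_7nm:.3f}/W, 16nm: {part_16nm:.3f}/W, ratio {advantage:.2f}x")
```

The same function works for output per dollar if you pass a price instead of a wattage.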

Love_In_Rio · Nov 8, 2018

Shifty Geezer said:
Watts doesn't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300w part on 16 nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.

Assuming you are choosing a part based on power efficiency. You may choose output per $.

Regardless, watts consumed means very little without useful benchmarks to compare workload.

Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts: you would get 2/3 more fps. And if Nvidia was different, its 150-watt chips would be in both Sony and MS consoles.

DavidGraham · Nov 8, 2018

Shifty Geezer said:
Watts doesn't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300w part on 16 nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.

The more transistors you can put in a chip in a given TDP, the more performance and features you can have out of the chip. Volta is large because it's on an old node, and because it has extra Tensor Cores which enable it to have vastly more AI performance than MI60. It's also vastly more powerful in rasterization. It has more ROPs, TMUs and polygon throughput than Vega 20. Even a Titan V (which is a cut-down chip) is at least 50% faster in rasterization.

Seeing this situation, NVIDIA can double the RTX hardware on 7nm, considerably increase their rasterization hardware, maintain chip size, and have extra features all while staying on the same power envelope.

Shifty Geezer said:
Regardless, watts consumed means very little without useful benchmarks to compare workload.

Agreed, that's why I included various performance aspects into that discussion.

Deleted member 13524 · Nov 9, 2018

Love_In_Rio said:
Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts: you would get 2/3 more fps.

Maybe.

But first it would have to exist.

keldor · Nov 9, 2018

Ext3h said:
In addition to the answer from @keldor, the dot product instructions in V20 also perform more than ~~the previous~~ 2 flops per instruction, but rather 3 or 4 flops depending on whether the accumulator can be passed in. (If it's just 3 flop you will unfortunately need another plain FP32 ADD which only gives you a single flop for a single instruction.) ~~Which results in an increase from 30 Tflops to 45 or 60 Tflops for FP16.~~

I am confused about the numbers for the tensor-cores, thought they were just 30 Tflops in FP16, not 120. Thought 120 was int4 perf, which was so creatively published under the caption "flops" too.

Edit: And I did get the math wrong again, and the FP16, vectorized FMA instructions in Vega did already count as 4 flops too.

Edit2: Well, it is just 60Tflops for the relevant FP32 accumulator operation mode.

I was using the Volta numbers. IIRC, Turing cut down on the Tensor core count by 1/2. It's a tradeoff - fewer Tensor cores means more of something else.

Anarchist4000 · Nov 9, 2018

Love_In_Rio said:
Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts: you would get 2/3 more fps. And if Nvidia was different, its 150-watt chips would be in both Sony and MS consoles.

Except that metric is largely useless because of power curves. Really need to equalize for mm2 to compare architectures, and even that can be tricky. SRAM for example is transistor dense so transistors may not be valid for comparison. Fabs and process can vary comparisons a bit.

Practically any sized chip can be made to consume 150W and the chip with more area and lower clocks/voltages will almost always be more efficient on a similar node for parallel processing. Doubling processors will double performance at roughly twice the power. Double clocks and power explodes as the curve is exponential.

With the die size in that comparison, you're probably looking at 3-4 Vega 20s versus a single Volta. Equalize power and the Vegas may be more efficient and offer far more bandwidth.
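
The power-curve argument can be sketched with a toy model. This assumes dynamic power scales as units × V² × f and that voltage scales linearly with clock, which ignores leakage and the minimum-voltage floor; under that assumption power grows with the cube of the clock. The function names are illustrative only:

```python
def relative_power(units: float, clock: float) -> float:
    """Dynamic power relative to one unit at clock = 1.0.

    Assumes P ~ units * V^2 * f with V proportional to f (a simplification
    that ignores leakage and the minimum-voltage floor).
    """
    voltage = clock  # assumed linear V/f relationship
    return units * voltage ** 2 * clock


def relative_throughput(units: float, clock: float) -> float:
    """Throughput for a perfectly parallel workload (assumed)."""
    return units * clock


for label, units, clock in [
    ("1 chip, stock clock", 1, 1.0),
    ("2 chips, stock clock", 2, 1.0),
    ("1 chip, double clock", 1, 2.0),
    ("2 chips, half clock", 2, 0.5),
]:
    perf = relative_throughput(units, clock)
    power = relative_power(units, clock)
    print(f"{label:22s} perf={perf:.2f} power={power:.2f} perf/W={perf / power:.2f}")
```

In this model two chips at half clock match the throughput of one chip at stock clock for a quarter of the power, which is the wide-and-slow trade-off described above. Real silicon will deviate once voltage hits its floor.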

Ext3h · Nov 9, 2018

keldor said:
I was using the Volta numbers. IIRC, Turing cut down on the Tensor core count by 1/2. It's a tradeoff - fewer Tensor cores means more of something else.

It's still the same number of cores. Respectively what they could fit within the power budget, assuming a primarily one-sided workload stressing only one core type to the limit simultaneously.
Then there has also been the T4 announcement, which runs at full core count, but only half the clock, for half the Tensor Core throughput of the consumer cards.

So to sum that topic up, for the fastest chip in each generation (Vega64, MI60, Tesla V100, Quadro RTX 6000). All according to published data sheets, this time.

Matrix-multiplication, with FP16 input, FP16 accumulator:

Vega10: ~28 Tflops
Vega20: ~30 Tflops
Pascal: ~22 Tflops
Volta: ~30 Tflops on FP32 core OR ~120 Tflops on Tensor Core
Turing: ~33 Tflops on FP32 core OR ~130 Tflops on Tensor Core

Matrix-multiplication, with FP16 input, FP32 accumulator:

Vega10: ~14 Tflops
Vega20: ~30 Tflops (to be confirmed, but likely)
Pascal: ~11 Tflops
Volta: ~15 Tflops on FP32 core OR ~120 Tflops on Tensor Core
Turing: ~16 Tflops on FP32 core OR ~130 Tflops on Tensor Core (only 57 Tflops on GeForce)

Matrix-multiplication, with FP32 input, FP32 accumulator:

Vega10: ~14 Tflops
Vega20: ~15 Tflops
Pascal: ~11 Tflops
Volta: ~14 Tflops
Turing: ~16 Tflops

DavidGraham · Nov 9, 2018

Anarchist4000 said:
With the die size in that comparison, you're probably looking at 3-4 Vega 20s versus a single Volta.

Before you do any die size comparison, you should equalize the node first.

Anarchist4000 said:
Equalize power and the Vegas may be more efficient and offer far more bandwidth.

Volta offers 900GB/s of HBM2 bandwidth vs 1TB/s for Vega 20, not a huge difference. The PCIe Volta also requires only 250W, 50W less than Vega 20. It's more power efficient even on the 12/16nm node. At 7nm it would consume far less power at the same clocks (125W?) and have its size shrunk from 815 to at most 600mm2.

Anarchist4000 said:
the chip with more area and lower clocks/voltages will almost always be more efficient on a similar node for parallel processing.

That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding power efficiency for a given architecture, which is the more determining factor really.

Anarchist4000 said:
Double clocks and power explodes as the curve is exponential.

We are not talking about a situation where the clock is doubled here.

Ext3h said:
Matrix-multiplication, with FP32 input, FP32 accumulator:
Vega20: ~15 Tflops
Volta: ~14 Tflops
Turing: ~16 Tflops

I am assuming you are using the lower clocked PCIE V100, right?
If you base your numbers on the NVLink version of V100, the Volta value should be 15.7 TF.
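
The 14 vs 15.7 TF gap follows from the usual peak-rate formula: cores × 2 flops (one FMA) × clock. A minimal sketch; the core count and boost clocks below are assumed from NVIDIA's public V100 spec sheets and are not stated in the thread:

```python
def peak_fp32_tflops(shader_cores: int, boost_mhz: float) -> float:
    """Peak FP32 rate, counting one FMA per core per clock as 2 flops."""
    return shader_cores * 2 * boost_mhz * 1e6 / 1e12


# Assumed spec-sheet values: 5120 FP32 cores,
# 1380 MHz boost (PCIe card) and 1530 MHz boost (NVLink/SXM2 module).
v100_pcie = peak_fp32_tflops(5120, 1380)
v100_nvlink = peak_fp32_tflops(5120, 1530)

print(f"V100 PCIe:   {v100_pcie:.1f} Tflops")
print(f"V100 NVLink: {v100_nvlink:.1f} Tflops")
```

These are theoretical peaks; sustained matrix-multiplication throughput depends on clocks actually held under load.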

SpaceBeer · Nov 9, 2018

If the 14nm -> 7nm die shrink didn't bring significant power savings for AMD, it won't for nVidia either. Especially if TSMC's 16/12nm is already better than GloFo's 14nm. I.e. if nVidia clocks their 7nm chips 10-15% higher than Volta/Turing, they will also consume 250-300W.

Anarchist4000 · Nov 9, 2018

DavidGraham said:
Before you do any die size comparison, you should equalize the node first.

Which is exactly why I stated just that in my post.

DavidGraham said:
Volta offers 900GB/s of HBM2 bandwidth vs 1TB/s for Vega 20, not a huge difference. The PCIe Volta also requires only 250W, 50W less than Vega 20. It's more power efficient even on the 12/16nm node. At 7nm it would consume far less power at the same clocks (125W?) and have its size shrunk from 815 to at most 600mm2.

Except the comparison is against multiple Vegas with 2-3x or more aggregate bandwidth. Your whole power argument is plain silly because of the curves and parallel nature of the workload. Could probably make 10 Vegas use less power than Volta with more performance if you wanted. Doubt that's cost effective unless memory bandwidth crucial.

DavidGraham said:
That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding power efficiency for a given architecture, which is the more determining factor really.

Again, power curves which was the entire point of my post. Dial it back a bit and more FLOPs for less power. Not even accounting for node or other factors. Which again as I stated in my post need to be taken into account.

DavidGraham said:
We are not talking about a situation where the clock is doubled here.

Sure we are when you consider power curves and the theoretical model. Cut the clocks in half and power plummets, with perf/watt skyrocketing in the process. Did I mention power curves? Just want to make sure the premise of my post wasn't misunderstood.

DavidGraham · Nov 9, 2018

SpaceBeer said:
If the 14nm -> 7nm die shrink didn't bring significant power savings for AMD, it won't for nVidia either.

Of course it does bring savings to Vega 20; the problem is that these savings are offset by the clock increase, IO additions and extra chip features. Things Volta already paid for on 12nm.

Anarchist4000 said:
Could probably make 10 Vegas use less power than Volta with more performance if you wanted. Doubt that's cost effective unless memory bandwidth crucial.

We are not arguing theoreticals here, but actual implementations. I can claim you can make 20 Voltas consume 100w. Doesn't mean it's true, or it's doable in any practical or useful manner.

Anarchist4000 said:
Again, power curves which was the entire point of my post. Dial it back a bit and more FLOPs for less power. Not even accounting for node or other factors.

And that's your problem right there: you are talking in a vacuum. Consider other factors like architectural power efficiency, nodes, features, etc., and your power curve point suddenly becomes moot, as it actually applies to all architectures.

Anarchist4000 · Nov 9, 2018

DavidGraham said:
We are not arguing theoreticals here, but actual implementations. I can claim you can make 20 Voltas consume 100w. Doesn't mean it's true, or it's doable in any practical or useful manner.

If the clocks were adjusted, that would be the implementation. No requirement to run the cards at stock settings. 20 Voltas probably won't work with the power floor, but yes, the same analogy would work if Voltas were significantly cheaper than the Vegas. Last I checked they weren't, as we're comparing a die a fraction of the size.

DavidGraham said:
And that's your problem right there: you are talking in a vacuum. Consider other factors like architectural power efficiency, nodes, features, etc., and your power curve point suddenly becomes moot, as it actually applies to all architectures.

I did consider them in the last two posts I made and explicitly pointed it out. Not sure I see the problem other than you missing the comparison.

Kaotik · Nov 9, 2018

DavidGraham said:
That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding power efficiency for a given architecture, which is the more determining factor really.

Vega 10 is clocked and using voltages far closer to actual limit of the chip (which is combination of architecture and process) than any model of GP104. They don't use equal processes either.
For what it's worth, many users actually achieve higher performance and lower consumption with their Vegas by simply lowering the voltage a tad

DavidGraham · Nov 9, 2018

Kaotik said:
For what it's worth, many users actually achieve higher performance and lower consumption with their Vegas by simply lowering the voltage a tad

That's chip lottery. Also it doesn't really guarantee 100% stability across all workloads.
On the other end of the spectrum, many GP104 users are running their cards @2.1GHz, getting extra performance while still consuming far less power than Vega.

Kaotik said:
Vega 10 is clocked and using voltages far closer to actual limit of the chip (which is combination of architecture and process) than any model of GP104.

I don't get that sentence, please elaborate.

manux · Nov 9, 2018

Ext3h said:
It's still the same number of cores. Respectively what they could fit within the power budget, assuming a primarily one-sided workload stressing only one core type to the limit simultaneously.
Then there has also been the T4 announcement, which runs at full core count, but only half the clock, for half the Tensor Core throughput of the consumer cards.

So to sum that topic up, for the fastest chip in each generation (Vega64, MI60, Tesla V100, Quadro RTX 6000). All according to published data sheets, this time.

Matrix-multiplication, with FP16 input, FP16 accumulator:

Vega10: ~28 Tflops
Vega20: ~30 Tflops
Pascal: ~22 Tflops
Volta: ~30 Tflops on FP32 core OR ~120 Tflops on Tensor Core
Turing: ~33 Tflops on FP32 core OR ~130 Tflops on Tensor Core

Matrix-multiplication, with FP16 input, FP32 accumulator:

Vega10: ~14 Tflops
Vega20: ~30 Tflops (to be confirmed, but likely)
Pascal: ~11 Tflops
Volta: ~15 Tflops on FP32 core OR ~60 Tflops on Tensor Core
Turing: ~16 Tflops on FP32 core OR ~65 Tflops on Tensor Core

Matrix-multiplication, with FP32 input, FP32 accumulator:

Vega10: ~14 Tflops
Vega20: ~15 Tflops
Pascal: ~11 Tflops
Volta: ~14 Tflops
Turing: ~16 Tflops

Tensor core fp32 accumulation on volta and quadro turing is full speed. 2080ti is half speed and that's probably where confusion comes from. Turing also has 8bit and 4bit tensor cores.

https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/

Deleted member 13524 · Nov 9, 2018

beyondtest said:

Some interesting tidbits from this video, about Vega 20.

Bandwidth is better explained at ~22m00s.
GPU-to-GPU total bandwidth is 200GB/s from the IF link, plus 64GB/s through the 16 PCIe 4.0 lanes, so a total of 264GB/s.
He claims the latency of Vega 20 A to access Vega B's memory pool is 60-70ns, and one "complete loop" between neighbor GPUs is 140-170ns.
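
As a sanity check on the 64GB/s figure, here is a rough sketch of PCIe 4.0 x16 bandwidth from the per-lane signalling rate (16 GT/s with 128b/130b encoding). It assumes the quoted number counts both directions; the 200GB/s IF figure is taken from the post as-is:

```python
def pcie_gbytes_per_s(lanes: int, gt_per_s: float, encoding: float) -> float:
    """Usable bandwidth in GB/s for one direction of a PCIe link."""
    return lanes * gt_per_s * encoding / 8  # 8 bits per byte


# PCIe 4.0: 16 GT/s per lane with 128b/130b line encoding.
one_way = pcie_gbytes_per_s(16, 16.0, 128 / 130)
both_ways = 2 * one_way  # assumption: the quoted figure is bidirectional

if_link = 200.0  # GB/s, figure quoted in the post
total = if_link + both_ways

print(f"PCIe 4.0 x16: {one_way:.1f} GB/s per direction, {both_ways:.1f} GB/s both ways")
print(f"IF link + PCIe: {total:.0f} GB/s")
```

After encoding overhead the PCIe part comes out slightly under the nominal 64GB/s, so the combined figure lands at roughly 263GB/s before any protocol overhead.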

Kaotik · Nov 9, 2018

DavidGraham said:
I don't get that sentence, please elaborate.

Every chip has a limit on how high it can clock. Once you get closer to it, the voltage requirements start to ramp up rapidly and the power consumption grows exponentially. The limit depends on both the architecture and the process the chip is built on (as well as individual variation between chips). (Also, GP104 and Vega 10 are not built on the same or equal processes.)

Vega models are clocked, and are using voltages, far closer to the actual limits of the chip; these are chosen so that every card should be able to meet the advertised clock speeds in a pre-set situation, even if it means using higher voltage on many of the cards. Being close to its limits, lowering the voltage and/or clocks a bit makes a huge difference in power consumption here.
This behaviour is nicely demonstrated for example in TPU's Vega 64 review (https://www.techpowerup.com/reviews/AMD/Radeon_RX_Vega_64/)
The (primary) Balanced profile has a GPU power limit of 220W, the Turbo profile 253W and the Power Saver profile 165W. Vega being already so close to its limits, the Turbo profile gets you about 1 % higher performance for 15 % higher consumption. On the other hand, for the very same reason, the Power Saver profile cuts your performance by only 4 % while your power limit goes down by 25 % (in other words, losing 4 % performance gives you roughly 28 % higher energy efficiency, since 0.96 / 0.75 = 1.28). No card running in a "comfortable range" would have such extreme differences between the profiles.
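
The profile comparison can be worked through in a few lines. This sketch assumes actual power draw tracks the profile's power limit, and uses the performance deltas quoted from the TPU review in the post. Dividing relative performance by relative power gives about 28 % for Power Saver; ignoring the 4 % performance loss (1 / 0.75) would give 33 %.

```python
def efficiency_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative change in performance per watt versus the baseline profile."""
    return perf_ratio / power_ratio - 1.0


# Ratios relative to the Balanced profile (220 W), taken from the post.
# Assumption: real power draw scales with the profile's power limit.
turbo = efficiency_gain(1.01, 253 / 220)
power_saver = efficiency_gain(0.96, 165 / 220)

print(f"Turbo:       {turbo:+.1%} perf/W")
print(f"Power Saver: {power_saver:+.1%} perf/W")
```
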

NVIDIA on the other hand has had the luxury to be more moderate with their clocks and thus voltage, they had a lot of headroom on both the clocks and the voltage to go higher, but they didn't need to. Being in more comfortable, dare I even say optimal clockrange for the chip, lowering the voltage and/or clocks a bit makes a smaller difference here.

DavidGraham said:
That's chip lottery. Also it doesn't really guarantee 100% stability across all workloads.
On the other end of the spectrum, many GP104 users are running their cards @2.1GHz, getting extra performance while still consuming far less power than Vega.

Just like GP104 performance even out of the box, but I don't think I've heard a single one that didn't benefit from lowering voltage.
