Surely 13b vs 21b transistors wouldn't be comparable regardless of the process used?
You forget we are talking 7nm vs 12nm.
Are they sold at a similar price in the same market?
Maybe for evaluating price and cost (which are really far less important aspects to discuss in a tech forum), but for performance, architectural efficiency, power efficiency and scalability on different nodes, not so much.
Surely 13b vs 21b transistors wouldn't be comparable regardless of the process used?
Watts doesn't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300w part on 16 nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.
Maybe for evaluating price and cost (which are really far less important aspects to discuss in a tech forum), but for performance, architectural efficiency, power efficiency and scalability on different nodes, not so much.
You have 13b transistors consuming 300W on 7nm, vs 21b transistors doing the same thing on 12nm (basically 16nm). How is that irrelevant?
Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts; you would get 2/3 more fps. And if Nvidia was different its 150-watt chips would be in both Sony and MS consoles.
Watts doesn't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300w part on 16 nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.
Assuming you are choosing a part based on power efficiency. You may choose output per $.
Regardless, watts consumed means very little without useful benchmarks to compare workload.
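To make the two metrics concrete, here is a minimal sketch using the made-up figures from the example above (two 300W parts rendering 50 vs 40 "megayums"). The `Part` class and both prices are hypothetical placeholders, not real product data; the only point is that the ranking can flip depending on whether you divide by watts or by dollars.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    watts: float   # board power
    work: float    # useful output; "megayums" is the thread's made-up unit
    price: float   # ASSUMED price in dollars, purely illustrative

    def work_per_watt(self) -> float:
        return self.work / self.watts

    def work_per_dollar(self) -> float:
        return self.work / self.price


# Work and watt figures come from the example in the thread;
# the prices are invented placeholders.
parts = [
    Part("7nm, 21b transistors", watts=300, work=50, price=1000),
    Part("16nm, 13b transistors", watts=300, work=40, price=700),
]

best_by_watt = max(parts, key=Part.work_per_watt)
best_by_dollar = max(parts, key=Part.work_per_dollar)

print("best work/watt:", best_by_watt.name)
print("best work/$:   ", best_by_dollar.name)
```

With these placeholder prices the 7nm part wins on work per watt while the 16nm part wins on work per dollar, which is why the metric has to be agreed on before the parts are compared.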
The more transistors you can put in a chip in a given TDP, the more performance and features you can have out of the chip. Volta is large because it's on an old node, and because it has extra Tensor Cores which enable it to have vastly more AI performance than MI60. It's also vastly more powerful in rasterization. It has more ROPs, TMUs and polygon throughput than Vega 20. Even a TitanV (which is a cut down chip) is at least 50% faster in rasterization.
Watts doesn't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300w part on 16 nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.
Agreed, that's why I included various performance aspects into that discussion.
Regardless, watts consumed means very little without useful benchmarks to compare workload.
Maybe.
Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts; you would get 2/3 more fps.
In addition to the answer from @keldor, the dot product instructions in V20 also perform more than the previous 2 flops per instruction: 3 or 4 flops, depending on whether the accumulator can be passed in. (If it's just 3 flops, you will unfortunately need another plain FP32 ADD, which only gives you a single flop for a single instruction.)
Which results in an increase from 30 Tflops to 45 or 60 Tflops for FP16.
I am confused about the numbers for the tensor-cores, thought they were just 30 Tflops in FP16, not 120. Thought 120 was int4 perf, which was so creatively published under the caption "flops" too.
Edit: And I did get the math wrong again, and the FP16, vectorized FMA instructions in Vega did already count as 4 flops too.
Edit2: Well, it is just 60 Tflops for the relevant FP32 accumulator operation mode.
Except that metric is largely useless because of power curves. Really need to equalize for mm2 to compare architectures, and even that can be tricky. SRAM for example is transistor dense so transistors may not be valid for comparison. Fabs and process can vary comparisons a bit.
Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts; you would get 2/3 more fps. And if Nvidia was different its 150-watt chips would be in both Sony and MS consoles.
It's still the same number of cores. Respectively what they could fit within the power budget, assuming a primarily one-sided workload stressing only one core type to the limit simultaneously.
I was using the Volta numbers. IIRC, Turing cut down on the Tensor core count by 1/2. It's a tradeoff - fewer Tensor cores means more of something else.
Before you do any die size comparison, you should equalize the node first.
With the die size in that comparison, you're probably looking at 3-4 Vega 20s versus a single Volta.
Volta offers 900 GB/s of HBM2 bandwidth, vs 1 TB/s for Vega 20, not a huge difference. Volta PCIE also requires only 250W, less than Vega 20 by 50W. It's more power efficient even on the 12/16nm node. At 7nm it will consume far less power at the same clocks (125W?) and have its size shrunk from 815 to at most 600mm2.
Equalize power and the Vegas may be more efficient and offer far more bandwidth.
That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding power efficiency for a given architecture, which is the more determining factor really.
the chip with more area and lower clocks/voltages will almost always be more efficient on a similar node for parallel processing.
We are not talking about a situation where the clock is doubled here.
Double clocks and power explodes as the curve is exponential.
I am assuming you are using the lower clocked PCIE V100, right?
Matrix-multiplication, with FP32 input, FP32 accumulator:
Vega20: ~15 Tflops
Volta: ~14 Tflops
Turing: ~16 Tflops
Which is exactly why I stated just that in my post.
Before you do any die size comparison, you should equalize the node first.
Except the comparison is against multiple Vegas with 2-3x or more aggregate bandwidth. Your whole power argument is plain silly because of the curves and parallel nature of the workload. Could probably make 10 Vegas use less power than Volta with more performance if you wanted. Doubt that's cost effective unless memory bandwidth is crucial.
Volta offers 900 GB/s of HBM2 bandwidth, vs 1 TB/s for Vega 20, not a huge difference. Volta PCIE also requires only 250W, less than Vega 20 by 50W. It's more power efficient even on the 12/16nm node. At 7nm it will consume far less power at the same clocks (125W?) and have its size shrunk from 815 to at most 600mm2.
Again, power curves, which was the entire point of my post. Dial it back a bit and more FLOPs for less power. Not even accounting for node or other factors. Which again, as I stated in my post, need to be taken into account.
That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding power efficiency for a given architecture, which is the more determining factor really.
Sure we are when you consider power curves and the theoretical model. Cut the clocks in half and power plummets. Perf/watt skyrocketing in the process. Did I mention power curves? Just want to make sure the premise of my post wasn't misunderstood.
We are not talking about a situation where the clock is doubled here.
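For anyone who wants to see the "cut the clocks in half" argument as numbers, here is a rough sketch of the usual first-order CMOS model, where dynamic power goes as C * V^2 * f. The big assumption, flagged in the code as well, is that voltage has to scale roughly linearly with clock, which makes power go as the cube of the clock (steep, though cubic rather than literally exponential). Leakage and the minimum-voltage floor are ignored, so this flatters the low-clock case; the function names are just illustrative.

```python
def relative_power(clock_scale: float) -> float:
    """Dynamic power relative to stock when the clock is scaled by clock_scale.

    First-order model: P ~ C * V^2 * f.
    ASSUMPTION: voltage scales linearly with clock, so P ~ f^3.
    Static/leakage power and the voltage floor are ignored.
    """
    voltage_scale = clock_scale  # assumed linear V-f relation
    return voltage_scale ** 2 * clock_scale


def relative_perf_per_watt(clock_scale: float) -> float:
    """Perf/watt relative to stock, assuming performance scales with clock."""
    return clock_scale / relative_power(clock_scale)


half = 0.5
print("power at half clock:      ", relative_power(half))          # 0.125
print("perf/watt at half clock:  ", relative_perf_per_watt(half))  # 4.0

# Two chips at half clock match one stock chip's throughput on a
# perfectly parallel workload, at a quarter of the power in this model.
print("power of 2 half-clock chips:", 2 * relative_power(half))    # 0.25
```

The same function applies to any architecture, so it says nothing on its own about which vendor's chip ends up more efficient; it only shows why wide-and-slow configurations look good on paper.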
Of course it does bring savings to Vega 20; the problem is these savings are offset by the clock increase, IO additions and extra chip features. Things Volta already paid for on 12nm.
If 14nm -> 7nm die didn't bring significant power savings for AMD, it won't for nVidia either.
We are not arguing theoreticals here, but actual implementations. I can claim you can make 20 Voltas consume 100w. Doesn't mean it's true, or it's doable in any practical or useful manner.
Could probably make 10 Vegas use less power than Volta with more performance if you wanted. Doubt that's cost effective unless memory bandwidth is crucial.
And that's your problem right there, you are talking in a vacuum. Consider other factors like architectural power efficiency, nodes, features, etc., and your power curve point suddenly becomes moot, as it actually applies to all architectures.
Again, power curves which was the entire point of my post. Dial it back a bit and more FLOPs for less power. Not even accounting for node or other factors.
If the clocks were adjusted, that would be the implementation. No requirement to run the cards at stock settings. 20 Voltas probably won't work with the power floor, but yes the same analogy would work if Voltas were significantly cheaper than the Vegas. Last I checked they weren't as we're comparing a die a fraction of the size.
We are not arguing theoreticals here, but actual implementations. I can claim you can make 20 Voltas consume 100w. Doesn't mean it's true, or it's doable in any practical or useful manner.
I did consider them in the last two posts I made and explicitly pointed it out. Not sure I see the problem other than you missing the comparison.
And that's your problem right there, you are talking in a vacuum. Consider other factors like architectural power efficiency, nodes, features, etc., and your power curve point suddenly becomes moot, as it actually applies to all architectures.
Vega 10 is clocked and using voltages far closer to the actual limit of the chip (which is a combination of architecture and process) than any model of GP104. They don't use equal processes either.
That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding power efficiency for a given architecture, which is the more determining factor really.
That's chip lottery. Also it doesn't really guarantee 100% stability across all workloads.
For what it's worth, many users actually achieve higher performance and lower consumption with their Vegas by simply lowering the voltage a tad
I don't get that sentence, please elaborate.
Vega 10 is clocked and using voltages far closer to the actual limit of the chip (which is a combination of architecture and process) than any model of GP104.
It's still the same number of cores. Respectively what they could fit within the power budget, assuming a primarily one-sided workload stressing only one core type to the limit simultaneously.
Then there has also been the T4 announcement, which runs at full core count, but only half the clock, for half the Tensor Core throughput of the consumer cards.
So to sum that topic up, for the fastest chip in each generation (Vega64, MI60, Tesla V100, Quadro RTX 6000). All according to published data sheets, this time.
Matrix-multiplication, with FP16 input, FP16 accumulator:
Vega10: ~28 Tflops
Vega20: ~30 Tflops
Pascal: ~22 Tflops
Volta: ~30 Tflops on FP32 core OR ~120 Tflops on Tensor Core
Turing: ~33 Tflops on FP32 core OR ~130 Tflops on Tensor Core
Matrix-multiplication, with FP16 input, FP32 accumulator:
Vega10: ~14 Tflops
Vega20: ~30 Tflops (to be confirmed, but likely)
Pascal: ~11 Tflops
Volta: ~15 Tflops on FP32 core OR ~60 Tflops on Tensor Core
Turing: ~16 Tflops on FP32 core OR ~65 Tflops on Tensor Core
Matrix-multiplication, with FP32 input, FP32 accumulator:
Vega10: ~14 Tflops
Vega20: ~15 Tflops
Pascal: ~11 Tflops
Volta: ~14 Tflops
Turing: ~16 Tflops
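As a quick sanity check on how these figures relate to each other, here is a small sketch that computes ratios from the FP16-input, FP32-accumulator list above. The dictionary just restates the numbers as quoted in this post (the Vega20 value is the unconfirmed one); the labels and the `ratio` helper are only illustrative.

```python
# Peak Tflops for FP16 input with FP32 accumulator, as listed above.
# The Vega20 figure is marked "to be confirmed" in the post.
fp16_in_fp32_acc = {
    "Vega10": 14,
    "Vega20": 30,
    "Pascal": 11,
    "Volta (FP32 core)": 15,
    "Volta (Tensor Core)": 60,
    "Turing (FP32 core)": 16,
    "Turing (Tensor Core)": 65,
}


def ratio(table: dict, a: str, b: str) -> float:
    """How many times faster entry `a` is than entry `b`, on paper."""
    return table[a] / table[b]


print(ratio(fp16_in_fp32_acc, "Volta (Tensor Core)", "Volta (FP32 core)"))
print(ratio(fp16_in_fp32_acc, "Volta (Tensor Core)", "Vega20"))
print(ratio(fp16_in_fp32_acc, "Vega20", "Vega10"))
```

On these paper numbers the Volta Tensor Cores are about 4x its own FP32 cores and about 2x Vega20 in this mode; these are peak rates only and say nothing about sustained throughput.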
Every chip has a limit on how high it can clock. Once you get closer to it, the voltage requirements start to ramp up rapidly and the power consumption grows exponentially. The limit depends on both the architecture and the process the chip is built on (as well as individual variation between chips). (Also, GP104 and Vega 10 are not built on the same or equal processes.)
I don't get that sentence, please elaborate.
Just like GP104 performance even out of the box, but I don't think I've heard a single one that didn't benefit from lowering voltage.
That's chip lottery. Also it doesn't really guarantee 100% stability across all workloads.
On the other end of the spectrum, many GP104 users are running their cards @2.1GHz, getting extra performance while still consuming far less power than Vega.