Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.
Surely 13b vs 21b transistors wouldn't be comparable regardless of the process used?
Same market, yes, but a similar price would be very surprising.
Maybe for evaluating price and cost (which are really far less important aspects to discuss in a tech forum), but for performance, architectural efficiency, power efficiency and scalability on different nodes, not so much.
You have 13b transistors consuming 300W on 7nm, vs 21b transistors doing the same thing on 12nm (basically 16nm). How is that irrelevant?
Watts don't matter. It's flops/watt (and flops/mm², etc), or rather, usable work per watt that matters. If a 300W part on 7nm with 21 gigatrannies can render 50 megayums of graphical lovelies, and a 300W part on 16nm with 13 gigatrannies can render 40 megayums of graphical lovelies, the 7nm part is more effective.
Assuming you are choosing a part based on power efficiency. You may choose output per $.
Regardless, watts consumed means very little without useful benchmarks to compare workload.
Right, and that's why at 7nm an Nvidia chip at 150 watts would trounce an AMD part at 150 watts; you would get 2/3 more fps. And if Nvidia were different, its 150 watt chips would be in both Sony and MS consoles.
The more transistors you can put in a chip in a given TDP, the more performance and features you can have out of the chip. Volta is large because it's on an old node, and because it has extra Tensor Cores, which enable it to have vastly more AI performance than MI60. It's also vastly more powerful in rasterization. It has more ROPs, TMUs and polygon throughput than Vega 20. Even a Titan V (which is a cut-down chip) is at least 50% faster in rasterization.
Seeing this situation, NVIDIA can double the RTX hardware on 7nm, considerably increase their rasterization hardware, maintain chip size, and have extra features all while staying on the same power envelope.
Agreed, that's why I included various performance aspects into that discussion.
But first it would have to exist.
I was using the Volta numbers. IIRC, Turing cut down on the Tensor core count by 1/2. It's a tradeoff - fewer Tensor cores means more of something else.
Except that metric is largely useless because of power curves. Really need to equalize for mm2 to compare architectures, and even that can be tricky. SRAM for example is transistor dense so transistors may not be valid for comparison. Fabs and process can vary comparisons a bit.
Practically any sized chip can be made to consume 150W and the chip with more area and lower clocks/voltages will almost always be more efficient on a similar node for parallel processing. Doubling processors will double performance at roughly twice the power. Double clocks and power explodes as the curve is exponential.
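To put rough numbers on the wide-and-slow versus narrow-and-fast argument, here is a minimal sketch using the textbook dynamic power relation P ~ units * C * V^2 * f. The assumption that voltage has to scale linearly with clock is mine, not from the post, and it ignores leakage and the voltage floor. Under that assumption power grows with the cube of the clock (steep, though not literally exponential). The function names are just for illustration.

```python
def dynamic_power(units, clock, v_per_clock=1.0, capacitance=1.0):
    """Relative dynamic power: units * C * V^2 * f.

    Assumption (not from the thread): voltage scales linearly with clock.
    """
    voltage = v_per_clock * clock
    return units * capacitance * voltage ** 2 * clock


def throughput(units, clock):
    """Relative throughput for a perfectly parallel workload."""
    return units * clock


configs = {
    "baseline (1 unit, 1.0x clock)": (1, 1.0),
    "double units": (2, 1.0),
    "double clock": (1, 2.0),
    "double units, half clock": (2, 0.5),
}

for name, (units, clock) in configs.items():
    perf = throughput(units, clock)
    power = dynamic_power(units, clock)
    print(f"{name}: perf={perf:.2f} power={power:.2f} perf/W={perf / power:.2f}")
```

In this toy model, doubling the units doubles both performance and power, doubling the clock gives the same performance for 8x the power, and two half-clocked units match the baseline performance at a quarter of the power. Real chips will land somewhere less extreme.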
With the die size in that comparison, you're probably looking at 3-4 Vega 20s versus a single Volta. Equalize power and the Vegas may be more efficient and offer far more bandwidth.
It's still the same number of cores. Respectively what they could fit within the power budget, assuming a primarily one-sided workload stressing only one core type to the limit simultaneously.
Then there has also been the T4 announcement, which runs at full core count, but only half the clock, for half the Tensor Core throughput of the consumer cards.
So to sum that topic up, for the fastest chip in each generation (Vega64, MI60, Tesla V100, Quadro RTX 6000). All according to published data sheets, this time.
Matrix-multiplication, with FP16 input, FP16 accumulator:
Vega10: ~28 Tflops
Vega20: ~30 Tflops
Pascal: ~22 Tflops
Volta: ~30 Tflops on FP32 core OR ~120 Tflops on Tensor Core
Turing: ~33 Tflops on FP32 core OR ~130 Tflops on Tensor Core
Matrix-multiplication, with FP16 input, FP32 accumulator:
Vega10: ~14 Tflops
Vega20: ~30 Tflops (to be confirmed, but likely)
Pascal: ~11 Tflops
Volta: ~15 Tflops on FP32 core OR ~120 Tflops on Tensor Core
Turing: ~16 Tflops on FP32 core OR ~130 Tflops on Tensor Core (only 57 Tflops on GeForce)
Matrix-multiplication, with FP32 input, FP32 accumulator:
Vega10: ~14 Tflops
Vega20: ~15 Tflops
Pascal: ~11 Tflops
Volta: ~14 Tflops
Turing: ~16 Tflops
Before you do any die size comparison, you should equalize the node first.
Volta offers 900GB/s of HBM2 bandwidth, vs 1TB/s for Vega 20, not a huge difference. Volta PCIe also requires only 250W, less than Vega 20 by 50W. It's more power efficient even on the 12/16nm node. At 7nm it will consume far less power at the same clocks (125W?) and have its size shrunk from 815 to at most 600mm2.
That's an inaccurate generalization. Vega 10 is wider, has a bigger area and lower clocks, yet it consumes far more power than GP104, which is clocked to the max. You are excluding the power efficiency of a given architecture, which is really the more decisive factor.
We are not talking about a situation where the clock is doubled here.
I am assuming you are using the lower clocked PCIE V100, right?
If you base your numbers on the NVLink version of V100, the Volta value should be 15.7 TF.
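For reference, the usual way these peak figures are derived is cores x clock x 2 (one fused multiply-add counts as two flops). A quick sketch follows. The core count (5120) and boost clocks (1380 MHz for PCIe, 1530 MHz for NVLink/SXM2) are what I recall from NVIDIA's public V100 spec sheets, not numbers quoted in this thread, so treat them as assumptions.

```python
def peak_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPS.

    flops_per_core_per_clock=2 assumes one FMA per core per clock.
    """
    return cores * clock_mhz * 1e6 * flops_per_core_per_clock / 1e12


# Assumed spec-sheet values, not taken from the thread.
V100_CORES = 5120
v100_pcie = peak_tflops(V100_CORES, 1380)
v100_sxm2 = peak_tflops(V100_CORES, 1530)

print(f"V100 PCIe FP32:   {v100_pcie:.1f} TFLOPS")
print(f"V100 NVLink FP32: {v100_sxm2:.1f} TFLOPS")
```

With those inputs the PCIe card comes out around 14 TF and the NVLink card around 15.7 TF, which lines up with the two values being discussed.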
If the 14nm -> 7nm die shrink didn't bring significant power savings for AMD, it won't for nVidia either. Especially if TSMC's 16/12nm is already better than GloFo's 14nm. I.e. if nVidia clocks their 7nm chips 10-15% higher than Volta/Turing, they will also consume 250-300W.
Which is exactly why I stated just that in my post.
Except the comparison is against multiple Vegas with 2-3x or more aggregate bandwidth. Your whole power argument is plain silly because of the curves and parallel nature of the workload. Could probably make 10 Vegas use less power than Volta with more performance if you wanted. Doubt that's cost effective unless memory bandwidth is crucial.
Again, power curves which was the entire point of my post. Dial it back a bit and more FLOPs for less power. Not even accounting for node or other factors. Which again as I stated in my post need to be taken into account.
Sure we are when you consider power curves and the theoretical model. Cut the clocks in half and power plummets, with perf/watt skyrocketing in the process. Did I mention power curves? Just want to make sure the premise of my post wasn't misunderstood.
Of course it does bring savings to Vega 20; the problem is these savings are offset by the clock increase, IO additions and extra chip features. Things Volta already paid for on 12nm.
We are not arguing theoreticals here, but actual implementations. I can claim you can make 20 Voltas consume 100w. Doesn't mean it's true, or it's doable in any practical or useful manner.
And that's your problem right there: you are talking in a vacuum. Consider other factors like architectural power efficiency, nodes, features, etc., and your power curve point suddenly becomes moot, as it actually applies to all architectures.
If the clocks were adjusted, that would be the implementation. No requirement to run the cards at stock settings. 20 Voltas probably won't work with the power floor, but yes the same analogy would work if Voltas were significantly cheaper than the Vegas. Last I checked they weren't as we're comparing a die a fraction of the size.
I did consider them in the last two posts I made and explicitly pointed it out. Not sure I see the problem other than you missing the comparison.
Vega 10 is clocked and using voltages far closer to actual limit of the chip (which is combination of architecture and process) than any model of GP104. They don't use equal processes either.
For what it's worth, many users actually achieve higher performance and lower consumption with their Vegas by simply lowering the voltage a tad.
That's chip lottery. Also it doesn't really guarantee 100% stability across all workloads.
On the other end of the spectrum, many GP104 users are running their cards @2.1GHz, getting extra performance while still consuming far less power than Vega.
I don't get that sentence, please elaborate.
Tensor core fp32 accumulation on volta and quadro turing is full speed. 2080ti is half speed and that's probably where confusion comes from. Turing also has 8bit and 4bit tensor cores.
Some interesting tidbits from this video, about Vega 20.
Bandwidth is better explained at ~22m00s
GPU-to-GPU total bandwidth is 200GB/s from the IF link, plus 64GB/s through the 16 PCIe 4.0 lanes, so a total of 264GB/s.
He claims the latency of Vega 20 A to access Vega B's memory pool is 60-70ns, and one "complete loop" between neighboring GPUs is 140-170ns.
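As a sanity check on the GPU-to-GPU bandwidth figures above, here is a small sketch of where the PCIe number comes from. The 200GB/s IF link figure is from the post. The PCIe 4.0 parameters (16 GT/s per lane, 128b/130b encoding) are from the PCIe spec as I remember it, and I am assuming the 64GB/s quoted is the nominal bidirectional figure for an x16 link.

```python
IF_LINK_GBS = 200.0  # Infinity Fabric link bandwidth quoted in the post


def pcie_bandwidth_gbs(lanes, gt_per_s, encoding=128 / 130, directions=2):
    """Usable PCIe bandwidth in GB/s.

    Each transfer carries one bit per lane; encoding overhead is applied
    before converting bits to bytes. directions=2 counts both directions.
    """
    return lanes * gt_per_s * encoding / 8 * directions


pcie4_x16 = pcie_bandwidth_gbs(lanes=16, gt_per_s=16.0)
total = IF_LINK_GBS + pcie4_x16

print(f"PCIe 4.0 x16 (both directions): {pcie4_x16:.1f} GB/s")
print(f"IF link + PCIe:                 {total:.1f} GB/s")
```

After encoding overhead the PCIe part is closer to 63GB/s than 64GB/s, so the combined figure is about 263GB/s, essentially the same as the number in the video.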
Every chip has a limit on how high it can clock. Once you get closer to it the voltage requirements start to ramp up rapidly and the power consumption grows exponentially. The limit depends on both the architecture and the process the chip is built on (as well as individual variation between each chip). (Also, GP104 and Vega 10 are not built on the same or equal processes.)
Vega models are clocked and are using voltages far closer to the actual limits of the chip, chosen so that every card should be able to meet the advertised clockspeeds in a pre-set situation even if it means using higher voltage on many of the cards. With the chip being close to its limits, lowering the voltage and/or clocks a bit makes a huge difference in power consumption here.
This behaviour is nicely demonstrated for example in TPU's Vega 64 review (https://www.techpowerup.com/reviews/AMD/Radeon_RX_Vega_64/)
Using the (primary) Balanced profile has the GPU consumption limit at 220W, the Turbo profile at 253W and the Power Saver profile at 165W. Vega being already so close to its limits, you can reach about 1% higher performance with 15% higher consumption. On the other hand, for the very same reason, using the Power Saver profile cuts your performance by only 4% while your power limit goes down by 25% (in other words, losing 4% performance gives you roughly 28% higher energy efficiency). No card running in a "comfortable range" would have such extreme differences between the profiles.
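The profile arithmetic above can be checked with a few lines. This is only a sketch: it treats each profile's power limit as the actual consumption and uses the relative performance figures given in the post (Balanced = 1.00), both of which are simplifications. Note that the 25% power cut on its own would be worth 1/0.75, about 33%, while accounting for the 4% performance loss brings the perf/W gain to about 28%.

```python
# (relative performance, GPU power limit in watts), as quoted in the post
profiles = {
    "Balanced": (1.00, 220),
    "Turbo": (1.01, 253),
    "Power Saver": (0.96, 165),
}


def relative_efficiency(perf, watts, base_perf=1.00, base_watts=220):
    """Perf/W of a profile relative to the Balanced profile."""
    return (perf / watts) / (base_perf / base_watts)


for name, (perf, watts) in profiles.items():
    eff = relative_efficiency(perf, watts)
    print(f"{name}: {watts}W, perf {perf:.2f}, perf/W vs Balanced {eff:.2f}x")
```

Turbo ends up at roughly 0.88x the perf/W of Balanced, and Power Saver at roughly 1.28x.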
NVIDIA, on the other hand, has had the luxury of being more moderate with their clocks and thus voltage. They had a lot of headroom on both the clocks and the voltage to go higher, but they didn't need to. Being in a more comfortable, dare I even say optimal, clock range for the chip, lowering the voltage and/or clocks a bit makes a smaller difference here.
Just like GP104 performance even out of the box, but I don't think I've heard a single one that didn't benefit from lowering voltage.