Nvidia Volta Speculation Thread

No, I am not. Please read my postings more carefully. My point is: P100 is power limited at 300W. If you made a P100 with 50% more cores (= V100), you still would not get a single FLOP more, because you are power limited.
Ah ok, yes, I didn't read that part properly. Indeed, you are likely right that they couldn't achieve the higher core count unless efficiency was gained as well, allowing them to utilize more cores in the same power envelope. Sorry, my bad.
 
If you made a P100 with 50% more cores (= V100), you still would not get a single FLOP more, because you are power limited.
Example to the contrary: Kepler, namely
GK110: Titan Black, 250W, 5644.8 GFLOPS @ 980 MHz, 22.58 GFLOPS/W
GK104: GTX 770, 230W, 3333 GFLOPS @ 1084 MHz, 14.49 GFLOPS/W

The math, greatly simplified, goes like this: the more units you employ, the lower the frequency you need to hit for the same throughput. And the lower your target frequency (down to a certain lower bound per process node, near-threshold voltage), the lower the voltage needed. And voltage influences power consumption much more than frequency alone.
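
A minimal numeric sketch of that argument, using the standard dynamic-power approximation P ∝ units · f · V² and the rough assumption that voltage scales linearly with clock. All numbers are made up for illustration, not taken from any real chip:

```python
# Illustrative only: dynamic power model P ~ units * f * V^2,
# with the (rough) assumption that voltage scales linearly with clock.

def throughput_gflops(units, clock_ghz, ops_per_clock=2):
    """Peak throughput: units * clock * ops per clock (FMA counts as 2)."""
    return units * clock_ghz * ops_per_clock

def relative_power(units, clock_ghz, volts_per_ghz=0.7):
    """Relative dynamic power: units * f * V^2, with V assumed ~ f."""
    v = volts_per_ghz * clock_ghz  # crude linear V-f relationship
    return units * clock_ghz * v ** 2

# Two hypothetical configurations tuned for identical throughput:
configs = {
    "narrow/fast": (1536, 1.2),   # fewer units, higher clock
    "wide/slow":   (2304, 0.8),   # 50% more units, 33% lower clock
}

for name, (units, clock) in configs.items():
    t = throughput_gflops(units, clock)
    p = relative_power(units, clock)
    print(f"{name}: {t:.0f} GFLOPS, relative power {p:.0f}")
```

Same modeled throughput, but the wide-and-slow configuration draws less than half the modeled power, because voltage enters quadratically.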

Further proof: Fury X vs. Fury Nano. Or HD 7970 vs. HD 7950 (here: not TDPs but real-world power consumption).
To my knowledge, Nvidia only gave boost clocks for V100, not base clocks. Hence my scepticism.
 
Example to the contrary: Kepler, namely
GK110: Titan Black, 250W, 5644.8 GFLOPS @ 980 MHz, 22.58 GFLOPS/W
GK104: GTX 770, 230W, 3333 GFLOPS @ 1084 MHz, 14.49 GFLOPS/W
Sorry, Kepler is in my opinion way too old to compare to Pascal/Volta.

But your case does not work, because the 770 was not power limited; on the contrary, it never hit its power limit. So of course a bigger chip is capable of putting out more FLOPS. But imagine the 770 were, like P100, hitting a power limit of 250W, and you then had a GK110 with 90% more shaders at the same clock rate. How should that work? Where would the power come from?
 
Sorry, Kepler is in my opinion way too old to compare to Pascal/Volta.
Physics and math have changed in the meantime? Ok.

But your case does not work, because the 770 was not power limited; on the contrary, it never hit its power limit.
Take the 770 and 780, if you will. Neither hit the power limit, yet the 780 still cranked out more FLOPS per watt.

So of course a bigger chip is capable of putting out more FLOPS.
That's not what I was saying. It can crank out the same amount of FLOPS, usually at lower power draw, because of what I wrote (in short: more units → lower clocks, lower clocks → lower voltage, lower voltage → lower power).

But imagine the 770 were, like P100, hitting a power limit of 250W, and you then had a GK110 with 90% more shaders at the same clock rate. How should that work? Where would the power come from?
No one talked about the same clockrates except for you. Maybe that's where our misunderstanding lies.
 
I don't know how big the difference in perf/power between "12nm" and 16nm at TSMC is, but I doubt it is a 50% improvement.
As Carsten said, perf/watt is a function of clockspeed that isn't linear. Adding SMs at a lower clock will increase perf/watt, and Volta is a much larger chip. Most of Volta's gains wouldn't seem to be from per-SM efficiency gains.

Very simply put, for a fixed architecture, perf/watt scales roughly with the inverse square of clockspeed once voltage has to rise with frequency, so more processors at a lower clock win at the same power.
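
A hedged sketch of where that rule of thumb comes from, using the standard dynamic-power approximation and the simplifying assumption that voltage scales roughly linearly with frequency (which only holds over a limited range):

```latex
% perf scales with units N and clock f; dynamic power with N C V^2 f.
% Assuming V \propto f over the relevant range:
\mathrm{perf} \propto N f, \qquad
P_{\mathrm{dyn}} \propto N C V^{2} f \propto N f^{3}
\quad\Longrightarrow\quad
\frac{\mathrm{perf}}{P_{\mathrm{dyn}}} \propto \frac{N f}{N f^{3}} = \frac{1}{f^{2}}
```

So at a fixed power budget, adding units and dropping the clock raises the achievable throughput, which is presumably what Volta's extra SMs are for.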
 
GV100 has 40% more compute units than GP100. Nvidia advertises that V100 delivers 42% (SXM2) and 50% (PCIe) more compute performance. Maybe we can just accept that most gains come from the new SM...
As it was claimed that the 12FFN process of GV100 delivers 25% lower power at the same performance compared to 16FF, we can also conclude the gain may come in pretty equal parts from both design and process.
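
A back-of-the-envelope split, purely as a sanity check. It assumes, rather generously, that a claimed power saving at iso-performance can be reinvested 1:1 into more throughput at iso-power; a lower figure for the process (some claim under 10%) is shown as an alternative:

```python
# Rough split of V100's claimed gain over P100 into a process factor
# and a design factor. Inputs are the claims quoted above, not measurements.
total_gain = 1.50          # claimed V100 (PCIe) vs P100 compute gain
power_saving_claim = 0.25  # claimed: 12FFN saves 25% power vs 16FF

# If the saved power were fully reinvested into more units at the same
# efficiency, the process alone would buy:
process_factor = 1 / (1 - power_saving_claim)   # ~1.33x
design_factor = total_gain / process_factor     # ~1.13x
print(f"process ~{process_factor:.2f}x, design ~{design_factor:.2f}x")

# If the process in reality only buys ~10%, as some claim:
process_alt = 1 / (1 - 0.10)                    # ~1.11x
print(f"process ~{process_alt:.2f}x, design ~{total_gain / process_alt:.2f}x")
```

Depending on which power figure you believe, the split ranges from process-dominated to design-dominated, so "pretty equal parts" sits plausibly in between.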
 
The Fudzilla report may be borderline unreadable, but I don't think it's that unreasonable, since something similar happened with the Pascal generation.

With GP100, in addition to the oft-quoted DP units, Nvidia halved the SM size, doubled the number of registers per scheduler, added packed half-precision math and increased shared memory (less per SM but more per core).
(in fact, if we assume Nvidia is using 8T SRAM, the larger register files alone account for half a billion of the 3 billion transistor difference between GP100 and GP102)
The extra registers were presumably added to allow for more occupancy and more memory requests in flight for the HBM2 memory. On the GDDR5 variants Nvidia apparently thought that wasn't necessary, and all other Pascal chips were essentially Maxwell with better texture compression.
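
The half-billion figure checks out arithmetically, assuming the commonly published configurations (256 KiB of registers per SM on both chips, 60 SMs on a full GP100 vs 30 on GP102) together with the 8T-SRAM assumption above:

```python
# Sanity check: extra register-file transistors on GP100 vs GP102,
# assuming 8T SRAM cells and the published SM counts / RF sizes.
rf_bytes_per_sm = 256 * 1024     # 256 KiB register file per SM (both chips)
gp100_sms, gp102_sms = 60, 30    # full-die SM counts

extra_bits = (gp100_sms - gp102_sms) * rf_bytes_per_sm * 8
extra_transistors = extra_bits * 8   # 8 transistors per bit for 8T SRAM
print(f"~{extra_transistors / 1e6:.0f} million transistors")  # ~503 M
```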

Now Volta does add some features that could also be useful for gaming, especially the capability to co-issue INT and FP operations, but Nvidia will have to evaluate whether this is worth it on a perf/mm² basis.

In the end, it's not necessarily a bad thing for gamers; after all, what most people are interested in is perf/$.
 
As it was claimed that the 12FFN process of GV100 delivers 25% lower power at the same performance compared to 16FF, we can also conclude the gain may come in pretty equal parts from both design and process.
Where did you get this 25% process improvement? Nvidia's custom 12nm is refined 16nm, with the "12" being little more than a marketing label... I've heard that 12nm brings less than 10% power saving.
 
As it was claimed that the 12FFN process of GV100 delivers 25% lower power at the same performance compared to 16FF, we can also conclude the gain may come in pretty equal parts from both design and process.

That's not right. What we know is that 12FFC delivers 25% lower power than 16FFC. But that's already where the trouble begins: it's hard to find comparisons of 16FFC to 16FF+, and 16FF+ should be the better-performing of the two, as 16FFC is more of a low-cost process. So we don't have a real comparison of these two, and additionally we don't know how 12FFN compares to 12FFC.
 
That's not right. What we know is that 12FFC delivers 25% lower power than 16FFC. But that's already where the trouble begins: it's hard to find comparisons of 16FFC to 16FF+, and 16FF+ should be the better-performing of the two, as 16FFC is more of a low-cost process. So we don't have a real comparison of these two, and additionally we don't know how 12FFN compares to 12FFC.
Isn't FFC newer than FF+, and designed to offer the same power/performance with lower price? Similar to 14nm LPC?
 
Isn't FFC newer than FF+, and designed to offer the same power/performance with lower price? Similar to 14nm LPC?
FFC is designed for lower price and (as a result) higher transistor density, albeit with lower power/performance. Otherwise FF+ wouldn't really need to exist.
 
Last year they claimed the gains in performance as "architecture" when they just straight used the benefits of the transition from 28nm to 16nm to hit the performance targets. So I would not be surprised if they are doing the same thing again with 12nm; since there is no shrink, all the benefits must be on the performance/power side.
 
That's not right. What we know is that 12FFC delivers 25% lower power than 16FFC. But that's already where the trouble begins: it's hard to find comparisons of 16FFC to 16FF+, and 16FF+ should be the better-performing of the two, as 16FFC is more of a low-cost process. So we don't have a real comparison of these two, and additionally we don't know how 12FFN compares to 12FFC.
By the looks of things, 12FFN doesn't seem to be anything but 12FFC risk production.
 
Don't know where to put this information, but Nvidia is working on an MCM (Multi-Chip Module) design that will come on a 7nm process. The paper below was presented last month at ISCA 2017:
http://research.nvidia.com/sites/de...U:-Multi-Chip-Module-GPUs//p320-Arunkumar.pdf
Interesting info from this paper:
- Nvidia seems to be looking at a big monolithic die of 128 SMs on 7nm
- The projected MCM GPU is 4 dies that bring a whopping total of 256 SMs

Extract from the paper:
Many of today’s important GPU applications scale well with GPU compute capabilities and future progress in many fields such as exascale computing and artificial intelligence will depend on continued GPU performance growth. The greatest challenge towards building more powerful GPUs comes from reaching the end of transistor density scaling, combined with the inability to further grow the area of a single monolithic GPU die. In this paper we propose MCM-GPU, a novel GPU architecture that extends GPU performance scaling at a package level, beyond what is possible today. We do this by partitioning the GPU into easily manufacturable basic building blocks (GPMs), and by taking advantage of the advances in signaling technologies developed by the circuits community to connect GPMs on-package in an energy efficient manner. We discuss the details of the MCM-GPU architecture and show that our MCM-GPU design naturally lends itself to many of the historical observations that have been made in NUMA systems. We explore the interplay of hardware caches, CTA scheduling, and data placement in MCM-GPUs to optimize this architecture. We show that with these optimizations, a 256 SMs MCM-GPU achieves 45.5% speedup over the largest possible monolithic GPU with 128 SMs. Furthermore, it performs 26.8% better than an equally equipped discrete multi-GPU, and its performance is within 10% of that of a hypothetical monolithic GPU that cannot be built based on today’s technology roadmap.
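
Simple arithmetic on the quoted figures gives a feel for the scaling (reading "within 10%" as reaching 90% of the hypothetical monolith is my interpretation, not the paper's exact wording):

```python
# Arithmetic on the speedups quoted in the MCM-GPU abstract.
mcm_speedup = 1.455     # 256-SM MCM-GPU vs largest buildable 128-SM GPU
gap_to_hypo = 0.10      # MCM is "within 10%" of a 256-SM monolith

# Doubling the SM count yields 1.455x, i.e. ~73% scaling efficiency:
print(f"scaling efficiency vs an ideal 2x: {mcm_speedup / 2:.0%}")

# Implied speedup of the unbuildable 256-SM monolithic GPU:
print(f"implied monolithic 256-SM speedup: ~{mcm_speedup / (1 - gap_to_hypo):.2f}x")
```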

Navi won't be alone...
 
Physics and math have changed in the meantime? Ok.
No. The architecture, especially the power management.

Take the 770 and 780, if you will. Neither hit the power limit, yet the 780 still cranked out more FLOPS per watt.
770: 1046 MHz, 780: 863 MHz - a ~20% clockspeed difference
P100: 1455 MHz, V100: 1480 MHz - a ~2% clockspeed difference.

Your example just sucks, sorry to tell you that.

That's not what I was saying. It can crank out the same amount of FLOPS, usually at lower power draw, because of what I wrote (in short: more units → lower clocks, lower clocks → lower voltage, lower voltage → lower power).
Of course you could run the shaders in the Volta architecture at lower clock speeds. But besides being power limited, you are also area limited; see the die size of V100. So what do you gain when you clock 20% lower but then the chip, with the ~25% more shaders needed to compensate, is too big to sell at a profit?

No one talked about the same clockrates except for you. Maybe that's where our misunderstanding lies.
P100 and V100 have basically identical clock rates, so you have to talk about the same clock rates, too.
 
I'll make this short, since your post does not make much sense to me at all. You're mixing clock speeds, die sizes, GFLOPS and power, taking only whichever factor your part of the argument needs while dismissing all the others. Sorry to tell you that.

P100 and V100 have basically identical clock rates, so you have to talk about the same clock rates, too.
They have basically identical Turbo clock rates - and that's what I've been talking about all the time: it's not about the short-term maximum Turbo, but what the respective cards can crank out over time. To the best of my knowledge, Nvidia has not (yet) stated the relevant base clock for Tesla V100, nor have we seen tests of what clock rates are sustainable under different loads.
 
I'll make this short, since your post does not make much sense to me at all. You're mixing clock speeds, die sizes, GFLOPS and power, taking only whichever factor your part of the argument needs while dismissing all the others. Sorry to tell you that.
Sorry, that's on you. I made my point about power consumption and perf/power. You came along with a totally different architecture and totally different clock speeds, which was not the topic we were discussing.

I have no doubt that Nvidia will introduce a very big improvement in terms of power efficiency with Volta.

They have basically identical Turbo clock rates
A comparison in terms of Turbo clock rates would have yielded the same results, so my point still stands. If you want to compare oranges to my stated apples, be my guest, but then don't tell me afterwards that my arguments make no sense, when you are the one throwing oranges into the mix.
 
I'll make this short, since your post does not make much sense to me at all. You're mixing clock speeds, die sizes, GFLOPS and power, taking only whichever factor your part of the argument needs while dismissing all the others. Sorry to tell you that.


They have basically identical Turbo clock rates - and that's what I've been talking about all the time: it's not about the short-term maximum Turbo, but what the respective cards can crank out over time. To the best of my knowledge, Nvidia has not (yet) stated the relevant base clock for Tesla V100, nor have we seen tests of what clock rates are sustainable under different loads.

To my knowledge, no Nvidia card in the last 3+ generations has failed to reach its turbo clockrates. Any reason to believe V100 will be any different?

Any reason to doubt the various performance numbers provided so far?

[Image: hpc_app_perf_v100.png - V100 HPC application performance]
 