Nvidia Volta Speculation Thread

Unified Memory for CUDA Beginners
June 19, 2017
On Pascal and later GPUs, the CPU and the GPU can simultaneously access managed memory, since they can both handle page faults; however, it is up to the application developer to ensure there are no race conditions caused by simultaneous accesses.
...
In our simple example, we have a call to cudaDeviceSynchronize() after the kernel launch. This ensures that the kernel runs to completion before the CPU tries to read the results from the managed memory pointer. Otherwise, the CPU may read invalid data (on Pascal and later), or get a segmentation fault (on pre-Pascal GPUs).

Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.

In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.

Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.

Demand paging can be particularly beneficial to applications that access data with a sparse pattern. In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.
https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/#more-7937
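The "simple example" the article keeps referring to looks roughly like this (a minimal sketch, not the blog's exact code; the kernel and array size here are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: add 1.0f to every element.
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, visible to both CPU and GPU. On Pascal and later,
    // pages migrate on demand when either processor faults on them.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes

    addOne<<<(n + 255) / 256, 256>>>(x, n);    // GPU touches the same pointer

    // Without this synchronization, the CPU read below races with the
    // kernel: invalid data on Pascal and later, a segfault on pre-Pascal.
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);               // expect 2.0

    cudaFree(x);
    return 0;
}
```

Note there is no cudaMemcpy anywhere; the demand-paging behavior described above is what makes the single pointer work on both sides.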
 
In fact, did you understand what has been written? :-|
I personally didn't...

Yes, and there are even some words that are, to say the least, misplaced ("GeForce fans" speaking about "GeForce products")...

That said, they say everything and nothing, and in the end we don't learn anything from it...

- We already know that GeForce GPUs will not include Tensor Cores (especially the G104 versions), since that wouldn't make sense, and, as with Maxwell and Pascal, no real FP64 support (well, the bare minimum), and even FP16 will surely be cut. So yes, the cores will be quite different. In fact, it is clear that so far the only "evolution" we have seen in the architecture is Tensor Cores and FP16 performance... all for AI and DP, not much for gaming...

For them this is the reason they don't want to call it Volta, but that is a bit like saying the 1080 is not based on Pascal.

So they don't say the next gaming GPUs will be based on Volta, but they also don't say it will formally be just a "die-shrunk" Pascal...
 
It is vague enough that you could interpret it pretty much any way you wanted!

Fairly standard stuff: make a vague enough statement and you can claim to have been right no matter what. There's an Italian website called Bits and Chips that does the exact same thing.

First they claim that Skylake-X will have a revision using a soldered IHS, then they say that if Skylake-X sells well they might not release said revision.

No matter what happens, they 'predicted it' xD
Hilarious
 
In fact, it is clear that so far the only "evolution" we have seen in the architecture is Tensor Cores and FP16 performance... all for AI and DP, not much for gaming...
What you've seen so far is what NVIDIA wants you to see. GV100's customers are compute users, so those are the abilities they're talking up. The graphics capabilities on the other hand? NVIDIA won't talk about that until they're good and ready to release graphics products.

What we know about Volta could fill a bucket. What we don't know about Volta could fill a pool.:eek:
 
So yes, the cores will be quite different. In fact, it is clear that so far the only "evolution" we have seen in the architecture is Tensor Cores and FP16 performance... all for AI and DP, not much for gaming...
That's not quite right.

Hello :)

NVIDIA claims, besides the FP16/FP64 support and the Tensor Core blabla, that the new shader architecture is 50% more power efficient for FP32 (i.e., standard gaming shader calculations) than the previous Pascal generation:

The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope.
https://devblogs.nvidia.com/parallelforall/inside-volta/
Maybe that is exaggerated, but even if it is only 30 or 40 percent, that is a huge increase. And I have no doubt that NVIDIA will bring this architecture to the gaming Volta; anything else makes no sense at all. Why would they create another architecture if they have such a big improvement?

Considering the process, they can then choose to make the chip bigger (like Volta) or increase clock speed.
 
That's not quite right.

Hello :)

NVIDIA claims, besides the FP16/FP64 support and the Tensor Core blabla, that the new shader architecture is 50% more power efficient for FP32 (i.e., standard gaming shader calculations) than the previous Pascal generation:

Maybe that is exaggerated, but even if it is only 30 or 40 percent, that is a huge increase. And I have no doubt that NVIDIA will bring this architecture to the gaming Volta; anything else makes no sense at all. Why would they create another architecture if they have such a big improvement?

Considering the process, they can then choose to make the chip bigger (like Volta) or increase clock speed.
As long as no metric is given, that is a basically worthless statement.
It could, for example, be derived from the peak TFLOPS numbers comparing P100 and V100, with no indication of how long those boost clocks can be sustained on each chip (and, more importantly, how far the base clock is missed under full load).

I am not saying there is no improvement, but how large it is under defined workloads remains to be seen.
 
It could, for example, be derived from the peak TFLOPS numbers comparing P100 and V100, with no indication of how long those boost clocks can be sustained on each chip (and, more importantly, how far the base clock is missed under full load).
If you compare P100 and V100, you can see that there is definitely a huge improvement between the two chips. P100 was already power limited at 300 W. V100 is also at 300 W. Yet V100 has 50% higher TFLOPS in both FP32 and FP64 compared to P100. If there were no improvement in perf/watt, V100 would be just as fast as P100.

I don't know how big the perf/watt difference between TSMC's "12nm" and 16nm processes is, but I doubt it is a 50% improvement.

I am not saying there is no improvement, but how large it is under defined workloads remains to be seen.
That's what I already said. And even half the claimed 50% improvement would be a huge step forward for the Volta generation compared to Pascal, considering it is still essentially the same manufacturing node.
 
Again, you're comparing spec-sheet performance here. Or do you have the respective Tesla cards at hand?
 
Yes. Why not?
Because there is a difference between hitting your boost clock (or even just holding your base clock) for mere seconds and sustaining it for hours. Which of those it ultimately is remains to be seen, and until then I remain skeptical of a blanket claim of 50 (or 30, or 20) percent higher energy efficiency.
 
@Malo/Carstens:

If you have a proper workload (shader-limited) and don't have any other bottlenecks, then you should see a 1:1 increase. I don't recall people using P100 complaining that it does not reach the expected speeds.

Also, V100 is a much larger chip with more SMs.
Yes, but the chip was power limited. Without power-efficiency improvements, V100 would be just as fast as P100.
 
Without power-efficiency improvements, V100 would be just as fast as P100.
This is getting off topic, but that's simply not true. P100 has 3584 CUDA cores; V100 has 5120. If you had said the per-SM efficiency is higher, then sure: once V100 is out for testing, it can be shown to be the case. But you're claiming higher TFLOPS based on efficiency alone.

Your statement was this:
If there were no improvement in perf/watt, V100 would be just as fast as P100.

Which, going by the CUDA core count that determines the TFLOPS rating, is simply not true.
 
This is getting off topic, but that's simply not true. P100 has 3584 CUDA cores; V100 has 5120. If you had said the per-SM efficiency is higher, then sure: once V100 is out for testing, it can be shown to be the case. But you're claiming higher TFLOPS based on efficiency alone.
No, I am not. Please read my posts more carefully. My point is: P100 is power limited at 300 W. If you built a P100 with 50% more cores (= V100), you still would not get a single extra FLOP, because you are power limited.

Only with increased perf/watt can you get more performance at the same power.
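For a rough sanity check, take the published peak numbers for the SXM2 parts (spec-sheet figures, not measured sustained rates): P100 is rated at about 10.6 TFLOPS FP32 at 300 W TDP, V100 at about 15.7 TFLOPS FP32 at the same 300 W:

```latex
\frac{15.7\ \text{TFLOPS (V100, FP32)}}{10.6\ \text{TFLOPS (P100, FP32)}}
\approx 1.48
\qquad \text{at the same } 300\ \text{W TDP}
```

About 48% more peak FP32 throughput in the same power envelope. The extra cores alone cannot deliver that inside a fixed power budget unless each FLOP costs less energy, which is exactly the point being argued here.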
 