Nvidia Volta Speculation Thread

Unified Memory for CUDA Beginners
June 19, 2017
On Pascal and later GPUs, the CPU and the GPU can simultaneously access managed memory, since they can both handle page faults; however, it is up to the application developer to ensure there are no race conditions caused by simultaneous accesses.
...
In our simple example, we have a call to cudaDeviceSynchronize() after the kernel launch. This ensures that the kernel runs to completion before the CPU tries to read the results from the managed memory pointer. Otherwise, the CPU may read invalid data (on Pascal and later), or get a segmentation fault (on pre-Pascal GPUs).

Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.

In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.

Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.

Demand paging can be particularly beneficial to applications that access data with a sparse pattern. In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.
https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/#more-7937
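The "simple example" the article keeps referring to looks roughly like this (a minimal sketch, not the blog's exact code; the kernel and array size here are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: add 1.0f to every element.
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, visible to both CPU and GPU. On Pascal and later,
    // pages migrate on demand when either processor faults on them.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes

    addOne<<<(n + 255) / 256, 256>>>(x, n);    // GPU touches the same pointer

    // Without this synchronization, the CPU read below races with the
    // kernel: invalid data on Pascal and later, a segfault on pre-Pascal.
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);               // expect 2.0

    cudaFree(x);
    return 0;
}
```

Note there is no cudaMemcpy anywhere; the demand-paging behavior described above is what makes the single pointer work on both sides.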
 
In fact, did you understand what has been written? :-|
I personally didn't...

Yes, and there are even some words that are, to say the least, misplaced ("GeForce fans" speaking about "GeForce products")...

That said, they say everything and nothing, and in the end we don't learn anything from it...

- We already know that GeForce GPUs will not include Tensor Cores (especially the G104 versions), since that wouldn't make sense, and, as with Maxwell and Pascal, no real FP64 support (well, the bare minimum), and even FP16 will surely be cut. So yes, the cores will be quite different. In fact, it is clear that so far the only "evolution" we have seen in the architecture is Tensor Cores and FP16 performance... all for AI and DP, not much for gaming...

For them this is the reason they don't want to call it Volta, but that is a bit like saying the 1080 is not based on Pascal.

So they don't say the next gaming GPUs will be based on Volta, but they also don't say it will formally be just a "die-shrunk" Pascal...
 
It is vague enough that you could interpret it pretty much any way you wanted!

Fairly standard stuff: make a vague enough statement and you can claim to have been right no matter what. There's an Italian website called Bits and Chips that does the exact same thing.

First they claim that Skylake-X will have a revision using a soldered IHS, then they say that if Skylake-X sells well they might not release said revision.

No matter what happens, they 'predicted it' xD
Hilarious
 
In fact, it is clear that so far the only "evolution" we have seen in the architecture is Tensor Cores and FP16 performance... all for AI and DP, not much for gaming...
What you've seen so far is what NVIDIA wants you to see. GV100's customers are compute users, so those are the abilities they're talking up. The graphics capabilities on the other hand? NVIDIA won't talk about that until they're good and ready to release graphics products.

What we know about Volta could fill a bucket. What we don't know about Volta could fill a pool.:eek:
 
So yes, the cores will be quite different. In fact, it is clear that so far the only "evolution" we have seen in the architecture is Tensor Cores and FP16 performance... all for AI and DP, not much for gaming...
That's not quite right.

Hello :)

NVIDIA claims, besides the FP16/FP64 support and the Tensor Core blabla, that the new shader architecture is 50% more power efficient for FP32 (i.e., standard gaming shader calculations) than the previous Pascal generation:

The new Volta SM is 50% more energy efficient than the previous generation Pascal design, enabling major boosts in FP32 and FP64 performance in the same power envelope.
https://devblogs.nvidia.com/parallelforall/inside-volta/
Maybe that is exaggerated, but even if it is only 30 or 40 percent, that is a huge increase. And I have no doubt that NVIDIA will bring this architecture to the gaming Volta; anything else makes no sense at all. Why would they create another architecture if they have such a big improvement?

Considering the process, they can then choose to make the chip bigger (like Volta) or increase clock speed.
 
That's not quite right.

Hello :)

NVIDIA claims, besides the FP16/FP64 support and the Tensor Core blabla, that the new shader architecture is 50% more power efficient for FP32 (i.e., standard gaming shader calculations) than the previous Pascal generation:

Maybe that is exaggerated, but even if it is only 30 or 40 percent, that is a huge increase. And I have no doubt that NVIDIA will bring this architecture to the gaming Volta; anything else makes no sense at all. Why would they create another architecture if they have such a big improvement?

Considering the process, they can then choose to make the chip bigger (like Volta) or increase clock speed.
As long as no metric is given, that is a basically worthless statement.
It could, for example, be derived from the peak TFLOPS numbers comparing P100 and V100, with no indication of how long those boost clocks can be sustained on each chip (and, more importantly, how far the base clock is missed under full load).

I am not saying there is no improvement, but how large it is under defined workloads remains to be seen.
 
It could, for example, be derived from the peak TFLOPS numbers comparing P100 and V100, with no indication of how long those boost clocks can be sustained on each chip (and, more importantly, how far the base clock is missed under full load).
If you compare P100 and V100, you can see that there is definitely a huge improvement between the two chips. P100 was already power limited at 300 W. V100 is also at 300 W. Yet V100 has 50% higher TFLOPS in both FP32 and FP64 compared to P100. If there were no improvement in perf/watt, V100 would be just as fast as P100.

I don't know how big the perf/watt difference between TSMC's "12nm" and 16nm processes is, but I doubt it is a 50% improvement.

I am not saying there is no improvement, but how large it is under defined workloads remains to be seen.
That's what I already said. And even half the claimed 50% improvement would be a huge step forward for the Volta generation compared to Pascal, considering it is still essentially the same manufacturing node.
 
Again, you're comparing spec-sheet performance here. Or do you have the respective Tesla cards at hand?
 
Yes. Why not?
Because there is a difference between hitting your boost clock (or even just holding your base clock) for mere seconds and sustaining it for hours. Which of those it ultimately is remains to be seen, and until then I remain skeptical of a blanket claim of 50 (or 30, or 20) percent higher energy efficiency.
 
@Malo/Carstens:

If you have a proper workload (shader-limited) and don't have any other bottlenecks, then you should see a 1:1 increase. I don't recall people using P100 complaining that it does not reach the expected speeds.

Also, V100 is a much larger chip with more SMs.
Yes, but the chip was power limited. Without power-efficiency improvements, V100 would be just as fast as P100.
 
Without power-efficiency improvements, V100 would be just as fast as P100.
This is getting off topic, but that's simply not true. P100 has 3584 CUDA cores; V100 has 5120. If you had said the per-SM efficiency is higher, then sure: once V100 is out for testing, it can be shown to be the case. But you're claiming higher TFLOPS based on efficiency alone.

Your statement was this:
If there were no improvement in perf/watt, V100 would be just as fast as P100.

Which, going by the CUDA core count that determines the TFLOPS rating, is simply not true.
 
This is getting off topic, but that's simply not true. P100 has 3584 CUDA cores; V100 has 5120. If you had said the per-SM efficiency is higher, then sure: once V100 is out for testing, it can be shown to be the case. But you're claiming higher TFLOPS based on efficiency alone.
No, I am not. Please read my posts more carefully. My point is: P100 is power limited at 300 W. If you built a P100 with 50% more cores (= V100), you still would not get a single extra FLOP, because you are power limited.

Only with increased perf/watt can you get more performance at the same power.
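For a rough sanity check, take the published peak numbers for the SXM2 parts (spec-sheet figures, not measured sustained rates): P100 is rated at about 10.6 TFLOPS FP32 at 300 W TDP, V100 at about 15.7 TFLOPS FP32 at the same 300 W:

```latex
\frac{15.7\ \text{TFLOPS (V100, FP32)}}{10.6\ \text{TFLOPS (P100, FP32)}}
\approx 1.48
\qquad \text{at the same } 300\ \text{W TDP}
```

About 48% more peak FP32 throughput in the same power envelope. The extra cores alone cannot deliver that inside a fixed power budget unless each FLOP costs less energy, which is exactly the point being argued here.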
 