Nvidia Volta Speculation Thread

I don't know how important this kind of stuff is in terms of new algorithms.
But if it is, and it sees adoption, AMD had better make sure to add the same feature.
 
I have contacted a local GPU supplier in Beijing; the listed price of a Tesla V100 is a bit cheaper than I thought, and it will become available sooner as well.

The Tesla V100 PCIe will cost about the same as the Tesla P100 PCIe did at launch, which is about 10%-20% more than the P100 costs now, and it will become available in China next month.

As for the FP16 rate, according to their tech manager, it seems that aside from the tensor cores' mixed-precision computation, V100 doesn't have the 2x FP16 rate that P100 has. This is also confirmed in the CUDA 9.0 RC programming guide: the FP16 rate is the same as the FP32 rate. So with V100, Nvidia moved all its low-precision DL work into the tensor cores (with better precision), which I think is a good idea.
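
For reference, a minimal sketch of how CUDA 9 exposes the tensor cores through the warp-level WMMA API (based on the 9.0 RC documentation; the kernel name and launch details here are made up for illustration):
Code:
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile by another 16x16 FP16 tile,
// accumulating into FP32 on the tensor cores.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);       // start from a zero accumulator
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // D = A*B + C
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

The whole warp cooperates on one 16x16x16 multiply-accumulate, with FP16 inputs and an FP32 (or FP16) accumulator, which is where the separate tensor throughput numbers come from.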
 
Volta actually fixes livelocks, not deadlocks. Take a simplified producer/consumer example like this:
Code:
*flag = 1;
if(consumer) {
  while(*flag) {} // spinloop
} else { // producer
  *flag = 0;
}

On existing GPUs if the producer and consumer are in the same warp, and the consumer (the while loop) executes first, the consumer will simply sit and spin forever. If the producer and consumer are on different warps everything works fine.

For Volta this Just Works, no matter where threads are or what order the program executes in.
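
To make the toy example concrete, here is a hedged, compilable version of the same idea; the kernel name, the choice of lanes 0 and 1, and the volatile pointer are my additions for illustration (the host is assumed to set *flag to 1 before launch):
Code:
// Lane 1 is the consumer, lane 0 is the producer; *flag starts at 1.
__global__ void producer_consumer(volatile int *flag)
{
    if (threadIdx.x == 1) {          // consumer
        while (*flag) { }            // spin until the producer clears the flag
    } else if (threadIdx.x == 0) {   // producer
        *flag = 0;                   // release the consumer
    }
}

On pre-Volta parts, if the compiler schedules the consumer side of the divergent branch first, the whole warp spins and the producer lane never runs; Volta's independent thread scheduling lets the producer lane make progress regardless of that ordering.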
 
Couldn't it be interpreted that the producer thread is waiting for the consumer thread to yield execution priority? The two threads are waiting for the other thread to do something, even if not through the same variable. The producer thread is static as far as its state goes, so would it be considered live enough for livelock?
 
Couldn't it be interpreted that the producer thread is waiting for the consumer thread to yield execution priority? The two threads are waiting for the other thread to do something, even if not through the same variable. The producer thread is static as far as its state goes, so would it be considered live enough for livelock?
The producer waiting for the consumer to yield execution is exactly what's happening. The consumer thread will continuously read *flag, see it's 1, read it again, see it's 1. The producer never executes. The system as a whole is executing instructions so the system is live, but no thread is making any forward progress so the system is livelocked.

Deadlock would be if all threads in the warp entirely stopped issuing instructions. This should not be possible on any GPU, unless there's a HW scheduler bug.
 
Deadlock would be if all threads in the warp entirely stopped issuing instructions. This should not be possible on any GPU, unless there's a HW scheduler bug.
Thanks for the explanation of the difference between the two, but if we reasonably assume that most CPUs and GPUs on the market have no crippling scheduling bugs, doesn't that mean that everything we call a deadlock is actually a livelock (with the exception of, say, embedded systems where a CPU with a HALT instruction can only be woken up by an interrupt of some sort)?

What I mean is: deadlock or livelock, for the user the system is halted and dead. :)
 
I looked up the differences between deadlock and livelock:
https://en.m.wikipedia.org/wiki/Deadlock

IMO, from a warp point of view, it's a deadlock. There is no change in state for the warp.
You could argue that it's a livelock from an SM point of view, since other warps can still progress, but that will eventually stop once all available warps complete, at which point the whole GPU is also deadlocked.
 
Point. From a software or program point of view it's deadlocked since no state is changing. From a hardware point of view it's livelocked since instructions are being issued. I'm a hardware guy so I think more on that side.

But whatever you call it, I think that producer/consumer example is the simplest way to understand the issue.
 
You really are obsessed with Polygons.
There are three possibilities, since synthetic tests show only marginal differences between GP104 and GP102:
A) There's a block in the driver to keep vertex data from flooding the pipeline, much like what was implemented for R600 in post-launch drivers, where the 50x geometry advantage over competing solutions just evaporated.
B) There's a single hardware block limiting the throughput, which is not scaled/scalable between GP104 and GP102.
C) There's another limitation somewhere in the pipeline that results in the same constraints for GP102 and GP104. The only thing I can think of here is the transfer rate between individual L2 partitions, since they need to share information about the triangles for Nvidia's distributed geometry approach, where each SM fetches one vertex every other or every third clock cycle for processing.

I'm open to other suggestions, of course.
 
As for the FP16 rate, according to their tech manager, it seems that aside from the tensor cores' mixed-precision computation, V100 doesn't have the 2x FP16 rate that P100 has. This is also confirmed in the CUDA 9.0 RC programming guide: the FP16 rate is the same as the FP32 rate. So with V100, Nvidia moved all its low-precision DL work into the tensor cores (with better precision), which I think is a good idea.

This puts to rest the question of where Volta's FP16 throughput comes from. The 30 TFLOPs FP16 come from the tensor cores.

Unless I'm reading it wrong and Volta only does 15 TFLOPs unless they're specific tensor operations? I can't find any slide claiming 30 TFLOPs FP16 for Volta; only Anandtech's article says so in its tables.

I guess this also means the tensor cores can't work in parallel with the FP32 units (which should be able to handle FP16 as well through promotion).
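
For reference, and assuming the 80-SM, ~1455 MHz boost configuration Nvidia quoted at GTC, the headline numbers work out roughly as: 80 SMs × 64 FP32 cores × 2 FLOPs per FMA × 1.455 GHz ≈ 15 TFLOPS FP32; a packed-FP16 path, if present, would simply double that to ~30 TFLOPS; and 80 SMs × 8 tensor cores × 128 FLOPs per clock × 1.455 GHz ≈ 120 TFLOPS for the tensor cores. So the 30 TFLOPS figure only exists if the 2x FP16 path exists, which is exactly what's in question.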
 
Anandtech's piece was written immediately upon launch; maybe the information was not as clear-cut at that point. To my knowledge, neither the CUDA 9.0 RC documentation nor the Volta whitepaper hints at double-rate FP16 in other contexts the way Pascal GP100 has it. Table 1 on page 10 only gives numbers for FP32, FP64 and Tensor-FLOPS.

Maybe, and this is just a wild guess, tensor cores will stay GV100-exclusive, while the gaming-oriented brethren get 2× rate FP16 to partially compensate for the lack of tensor cores. Or Nvidia will completely ignore FP16 for gaming, which would be interesting to see once the first games make use of it.
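
For context, the "2× rate FP16" on GP100 comes from packing two FP16 values into one 32-bit register and operating on the pair with a single instruction. A minimal sketch using CUDA's half2 intrinsics (the kernel itself is just an illustration, not taken from any Volta material):
Code:
#include <cuda_fp16.h>

// AXPY on packed FP16 pairs: each __half2 holds two FP16 values, so one
// __hfma2 performs two multiply-adds per thread per instruction.
__global__ void haxpy2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);
}

Whether a hypothetical gaming Volta runs this at 2× the FP32 rate, at 1× via promotion, or only in a slow path is exactly the open question here.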
 
Volta actually fixes livelocks, not deadlocks. Take a simplified producer/consumer example like this:
Code:
*flag = 1;
if(consumer) {
  while(*flag) {} // spinloop
} else { // producer
  *flag = 0;
}

On existing GPUs if the producer and consumer are in the same warp, and the consumer (the while loop) executes first, the consumer will simply sit and spin forever. If the producer and consumer are on different warps everything works fine.

For Volta this Just Works, no matter where threads are or what order the program executes in.

Just to make things interesting, the above code may or may not actually terminate; it all has to do with whether the compiler decides to schedule the then or the else branch first. If it schedules the else first, then the producer will release the lock and the consumer will happily go on to the next step. Otherwise, deadlock. Since the ordering of mutually exclusive branches, only meaningful in SIMT processors, is undefined, who knows what will happen. People have reported that in CUDA the else clause is in fact executed first. Unless the optimizer decides to... Yeah, there's a reason that intra-warp locks produce very brittle code in the very best case.
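
For what it's worth, the usual pre-Volta workaround is to keep the critical section and the release inside the same branch as the successful acquire, so that correctness doesn't depend on which side of the divergent branch the compiler schedules first. A hedged sketch (function and variable names are mine):
Code:
// Fragile pattern: the winning thread leaves the loop, but its warp-mates
// can't, so the release is never reached and the warp spins forever.
//   while (atomicCAS(lock, 0, 1) != 0) { }
//   /* critical section */
//   atomicExch(lock, 0);

// Common workaround: acquire, do the work and release inside one branch.
__device__ void locked_update(int *lock, int *data)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try to acquire
            *data += 1;                      // critical section (example)
            __threadfence();                 // make the update visible
            atomicExch(lock, 0);             // release
            done = true;
        }
    }
}

Even this version leans on the compiler laying out the divergent branch in a helpful way, which is why such code stays brittle before Volta's independent thread scheduling.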
 
News of Volta at Hot Chips 2017:
https://www.servethehome.com/nvidia-v100-volta-update-hot-chips-2017/
Overall perf update:
NVIDIA-V100-and-P100-Comparison.jpg


and interesting architecture slides below:
NVIDIA-Volta-GV100-SM.jpg


NVIDIA-Volta-V100-SM-Microarchitecture.jpg


NVIDIA-Volta-V100-Sub-Core.jpg


NVIDIA-Volta-V100-Shared-Memory.jpg
 
Tom's Hardware just published its own article about the Volta presentation at Hot Chips:
http://www.tomshardware.com/news/nvidia-volta-gv100-gpu-ai,35297.html
The Volta die resides on a block of steel, so the GV100 has quite a bit of heft to it. Nvidia equipped the bottom of the GV100 with two mezzanine connectors. One connector primarily serves typical PCIe traffic, and the other is dedicated to NVLink connections. The GV100 modules are secured to custom boards (Nvidia offers its HGX reference board) via eight fasteners, and the boards reside inside server chassis of varying heights.

A hefty array of 16 inductors and voltage regulators line the edge of the card. The package pulls an average of 300W at a little below 1V, so over 300A flows into the die. Nvidia provides reference cooling designs, but most of its HPC customers opt for custom liquid cooling solutions, while many hyperscalers go with air cooling. The thermal solution attaches to the four silver-edged holes next to the die.

http://media.bestofmicro.com/Q/6/704670/original/IMG_0995.jpg

http://media.bestofmicro.com/Q/4/704668/original/IMG_1023.jpg
 
Good podcast about Volta and CUDA C++ (interesting stuff starts around 20 minutes):
https://player.fm/series/cppcast/volta-and-cuda-c-with-olivier-giroux

Rob and Jason are joined by Olivier Giroux from NVidia to talk about programming for the Volta GPU.

Olivier Giroux has worked on eight GPU and four SM architecture generations released by NVIDIA. Lately, he works to clarify the forms and semantics of valid GPU programs, present and future. He was the programming model lead for the new NVIDIA Volta architecture. He is a member of WG21, the ISO C++ committee, and is a passionate contributor to C++'s forward progress guarantees and memory model.
 