Nvidia Volta Speculation Thread

He throws Tensor-OPs, which are anything but general FLOPS, into the mix.
What percentage of GV100 chips are ending up in deep learning data centers right now?

I wouldn't be surprised if it's 90% or more.

Since that's the application that can make heavy use of tensor cores, most customers are effectively seeing that kind of increase in FP16 FLOPS.
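For scale, a back-of-the-envelope check of what those tensor cores buy, using the launch figures as I recall them (640 tensor cores on Tesla V100, each doing a 4×4×4 FMA = 128 FLOPs per clock, 5120 FP32 cores, ~1455 MHz boost); my arithmetic, not from the post:

```
// Rough sanity check of the advertised V100 throughput numbers (assumed
// figures: 640 tensor cores, 128 FLOPs each per clock, 5120 FP32 cores,
// ~1.455 GHz boost).
#include <cstdio>

int main()
{
    const double clock_ghz = 1.455;
    const double tensor_tflops = 640 * 128 * clock_ghz / 1000.0;  // ~119 -> "120 TFLOPS"
    const double fp32_tflops   = 5120 * 2 * clock_ghz / 1000.0;   // ~14.9 -> "15 TFLOPS"

    printf("tensor: %.0f TFLOPS, fp32: %.1f TFLOPS\n", tensor_tflops, fp32_tflops);
    return 0;
}
```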
 
The context is that he was comparing their TFLOPS output to the numbers listed in the Top 500 list, which is strictly FP64.
I might be wrong on this, but I cannot remember any of the Top 20 supercomputers in the Top 500 being devised specifically as an AI machine.
 
NVIDIA is claiming a 50% efficiency advantage in FP32 performance for Volta over Pascal
I believe the comment was regarding power efficiency, not that their FP32 performance is 50% higher.
 
The context is that he was comparing their TFLOPS output to the numbers listed in the Top 500 list, which is strictly FP64.
I might be wrong on this, but I cannot remember any of the Top 20 supercomputers in the Top 500 being devised specifically as an AI machine.
That's why I only quoted the sentence that I quoted. ;-)
 
I believe the comment was regarding power efficiency, not that their FP32 performance is 50% higher.
Efficiency = work per unit of power. That's what I wrote. So they can achieve up to 50% more FP32 performance at the same clock speed by adding more shader cores, like they did in GV100.
 
Since Tensor cores won't be in a gaming card, what are the differences between Pascal and Volta apart from core count and clock speed?
For the gaming card it's uncertain, but more cache, async compute, possibly improved thread divergence, and unified memory seem likely. Probably all the top tier DX12 features. The unified memory, packed math, and true bindless are a bit of an unknown if they segment pro features.
 
If Nvidia can repeat the Maxwell die sizes, a 600 mm² Pascal with HBM2 could pack quite a punch, and I doubt Volta wouldn't improve on it enough to go 50% clear of the Titan Xp.
 
I know for a fact that GeForce Volta samples have already been up and running at Nvidia for a few weeks. So it looks like they want to maximize Pascal profit (because they feel no threat from Vega).

If this info is correct, then they are not holding Volta back. Samples in the lab for a few weeks means a launch next year; that's just the normal timeframe, launching a GPU at least 6 months after first samples. It's only sometimes possible earlier, if the first silicon is so good that it doesn't need a metal spin, but that's not often the case.
 
How much space on that 815 mm² die would you guess those tensor cores take up...?
Depends on if they're independent from the shaders. They could be standalone units or simply a nice marketing name for packed FP16 with custom swizzle patterns and extra adders.

Integrated makes the most sense, as independent units would result in a larger chip that still won't come close to competing with Google's TPU on power, performance, or cost. They'd be better off with a learning-focused chip if they really wanted to compete; there should be more than enough market to warrant it.
 
They have ~34% more die area to work with while transistor density has only increased by ~3% and transistor count by ~38%. Yet, there's 50% more L2 cache and 40% more SMs while also adding Tensor cores and INT32. From those numbers, Tensor cores alone cannot be THAT large (or THAT standalone IMHO). And since frequencies are similar to GP100 as well, I doubt they could save a lot of space in accelerating the critical path.
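For what it's worth, those ratios check out against the published GP100/GV100 numbers (610 mm² and 15.3B transistors vs. 815 mm² and 21.1B); a quick sanity check, my arithmetic rather than anything from the post:

```
// Ratio check from the published die sizes and transistor counts:
// GP100: 610 mm^2, 15.3B transistors; GV100: 815 mm^2, 21.1B transistors.
#include <cstdio>

int main()
{
    const double gp100_area = 610.0,  gv100_area = 815.0;    // mm^2
    const double gp100_xtor = 15.3,   gv100_xtor = 21.1;     // billions

    printf("area:    +%.0f%%\n", (gv100_area / gp100_area - 1.0) * 100.0);   // ~34%
    printf("xtors:   +%.0f%%\n", (gv100_xtor / gp100_xtor - 1.0) * 100.0);   // ~38%
    printf("density: +%.0f%%\n", ((gv100_xtor / gv100_area) /
                                  (gp100_xtor / gp100_area) - 1.0) * 100.0); // ~3%
    return 0;
}
```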
 
With a little math it might be possible to extrapolate an estimate from Google's TPU. Unless they've added functionality, built out the register file, etc., it's a multiplier with an adder attached (see the sketch after this post). The only reason to make it independent would be equal, concurrent workloads in most situations.

The real space hog with Volta likely comes from the register file accommodating their new thread sync scheduling, if it's actually packing 32 threads out of a group dynamically to handle divergence and increase utilization.
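To put the "multiplier with an adder attached" in concrete terms: per Nvidia's description, each tensor core computes D = A×B + C on 4×4 tiles, FP16 multiplies feeding an FP32 accumulator (64 FMAs per core per clock). A minimal scalar sketch of that dataflow (function name and layout are mine):

```
// Scalar model of one tensor-core step as Nvidia describes it: D = A*B + C on
// 4x4 tiles, FP16 inputs multiplied and accumulated in FP32. Purely
// illustrative; the real unit does all 64 multiply-adds per clock.
#include <cuda_fp16.h>

__device__ void tensor_op_4x4(const __half A[4][4], const __half B[4][4],
                              const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];                 // FP32 accumulator
            for (int k = 0; k < 4; ++k)
                acc += __half2float(A[i][k]) * __half2float(B[k][j]);
            D[i][j] = acc;
        }
}
```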
 
The real space hog with Volta likely comes from the register file accommodating their new thread sync scheduling, if it's actually packing 32 threads out of a group dynamically to handle divergence and increase utilization.
Are you sure it's doing that?

From my reading of the white paper, the new threading stuff just makes sure that you can't run into a whole class of deadlocks anymore, but I didn't see anything about dynamically packing threads together.

(In fact, I think there was an explicit mention about how the thread stuff does not actually increase utilization.)
 
They have ~34% more die area to work with while transistor density has only increased by ~3% and transistor count by ~38%. Yet, there's 50% more L2 cache and 40% more SMs while also adding Tensor cores and INT32. From those numbers, Tensor cores alone cannot be THAT large (or THAT standalone IMHO). And since frequencies are similar to GP100 as well, I doubt they could save a lot of space in accelerating the critical path.

Yes; additionally, L1/shared memory increased by 45% per SM and the NVLink count increased 50%, from 4 to 6. So quite a few things have grown by more than the transistor count did. But you offset this a bit with the memory interface/ROPs, which didn't grow.
 
Are you sure it's doing that?

From my reading of the white paper, the new threading stuff just makes sure that you can't run into a whole class of deadlocks anymore, but I didn't see anything about dynamically packing threads together.

(In fact, I think there was an explicit mention about how the thread stuff does not actually increase utilization.)
Possibly I'm wrong, but I'd have thought flow control would have addressed that problem already.

I'll have to go look for that reference, but in the case of diverged threads a deadlock shouldn't be possible, or should be trivial enough to resolve that it wouldn't make a major feature. It may be an Nvidia thing, as AMD would have the scalar unit doing flow control, which could likely correct it. The wording in the blog is a bit ambiguous, but it mentions optimizing threads across warps. I wouldn't call that optimal, as some portion of the SIMT width is masked off. I recall a mention somewhere of hardware scheduling taking more space but counter-intuitively yielding better performance.

Given a per-lane PC, a packing mechanism for threads guaranteed to be executing the same instruction within a group (same kernel) makes sense. The hard part would be coalescing the accesses, but that already exists with diverged threads.
 
I'll have to go look for that reference, but in the case of diverged threads a deadlock shouldn't be possible, or should be trivial enough to resolve that it wouldn't make a major feature.
The deadlock is due to programmers manually inserting __syncwarp statements inside divergent sections of code. It's a programmer's bug, but if you visit the CUDA devtalk forum, it's clearly a very common mistake, and sometimes not an obvious one.
I don't know how much of a feature it is in terms of HW implementation, but I think it could be a big deal for programmers.

It may be an Nvidia thing, as AMD would have the scalar unit doing flow control, which could likely correct it.
I'm not sure about this. It doesn't seem like something that's very architecture specific if you have one PC and execution masks (which GCN has, just like Pascal and earlier) instead of one PC per thread.
Does OpenCL have the equivalent of the CUDA syncwarp operation? If not, that would go a long way in avoiding the issue. :)
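To make the bug class concrete, here's a hypothetical CUDA sketch (names and data layout are mine, not from the thread) of a warp-level sync dropped into a divergent branch:

```
// Hypothetical example of __syncwarp() misuse inside divergent code. The
// default mask names all 32 lanes, but only the even lanes ever reach the
// call, so the sync waits on lanes that never arrive -- undefined behaviour
// that typically shows up as a hang.
__global__ void broken_pair_sum(float *data)
{
    const int lane = threadIdx.x & 31;

    if ((lane & 1) == 0) {
        data[lane] += data[lane + 1];
        __syncwarp();   // bug: full-warp mask inside a divergent branch
    }

    // Correct forms: move the __syncwarp() outside the branch, or pass a mask
    // covering only the lanes that actually take it, e.g. __syncwarp(0x55555555).
}
```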
 
Deadlock could occur with synchronization if code was written that took the SIMT model at face value. It could be a warp instruction, or, as Nvidia's marketing notes, something like implementing spin locks in each lane and then putting the lane with the lock on the wrong side of a branch.
It was one of the big leaks in the SIMT abstraction that GPGPU has had ever since Nvidia coined the term SIMT.

While I'm not sure how much it matters in the grander scheme of things, it makes GPGPU a little bit less stupid and prone to killing itself than it has been since its inception. How much may depend on what limits there are to Volta's ability to let the threads roam relative to one another.
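For anyone who hasn't run into it, the per-lane spin lock case reads roughly like this (hypothetical sketch, my own names):

```
// Naive per-lane spin lock. With several lanes of one warp contending,
// pre-Volta SIMT can hang: the lane that wins the CAS is parked at the loop's
// reconvergence point until its warp-mates also exit, which they never do
// because the release below never executes. Volta's independent thread
// scheduling lets the winner run ahead and free the lock.
__global__ void per_lane_lock(int *mutex, int *counter)
{
    while (atomicCAS(mutex, 0, 1) != 0) { }   // acquire: spin until we flip 0 -> 1

    *counter += 1;                            // critical section
    atomicExch(mutex, 0);                     // release
}
```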
 