Nvidia Volta Speculation Thread

Can you come up with a good reason to not use the Volta SM going forward in gaming GPU?

I’m not aware of anything in Volta that is different from Pascal in terms of feeding the SMs (geometry handling etc.)

So as long as the new SMs don’t regress in anything compared to the old ones, I don’t see why Nvidia would choose Pascal.

The new SMs seem to be similar in terms of area (if you ignore the extraneous stuff), they are more power efficient, they have much better caches, and they are much better at integer work. The clocks are in the same ballpark as well, as long as the chip doesn’t get throttled due to power limits, which could be explained by the leakage of the large, idle FP64 and tensor cores.

So just strip the FP64 units, replace the FP16 tensor cores by INT, remove whatever ECC stuff is in there, and done!

I think the only question would be whether it makes sense to use 64 CUDA cores per SM (like P100 and V100) or 128 CUDA cores per SM (like every other Maxwell/Pascal GPU).
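If anyone wants to sanity-check how a given card is carved up, something like this minimal CUDA device query should do it; the cores-per-SM mapping below is just the documented compute-capability table, nothing vendor-blessed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// FP32 cores per SM by compute capability; the 6.0 (GP100) vs 6.1
// (GP102/GP104/...) split is exactly the 64-vs-128 question above.
static int coresPerSM(int major, int minor) {
    if (major == 5) return 128;                     // Maxwell
    if (major == 6) return (minor == 0) ? 64 : 128; // Pascal: GP100 vs rest
    if (major == 7) return 64;                      // Volta
    return -1;                                      // unknown/other
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int cps = coresPerSM(prop.major, prop.minor);
    printf("%s: %d SMs, CC %d.%d, %d FP32 cores/SM, %d total\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor,
           cps, cps > 0 ? cps * prop.multiProcessorCount : -1);
    return 0;
}
```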
I am sure the Quadro GP100 was tested by some people with games, anyone with a better memory remember where?
Maybe worth revisiting that to see if there is a correlation with the CUDA core/SM ratio and whether it is potentially detrimental to some games; I am pretty sure I heard it may not be ideal, and we see quite a few games not reaching the expected performance scaling even outside of ROP limits, while a few others do reach the expected 30-40% improvement.
Apart from that, like you I would expect Volta to be the next gaming arch minus Tensor/FP64, albeit possibly with a differentiated name outside of the flagship mixed-precision/Tensor GPU.
Jonah Alben has mentioned the Volta architecture generally works well with games.
 
I am sure the Quadro GP100 was tested by some people with games, anyone with a better memory remember where?
Never tested with games unfortunately, only pro apps.

I'm pretty sure the Titan V is somewhat impacted by its lower ROP and geometry rates compared to the Xp and 1080 Ti. At least in some games.

And due to the low number of games tested so far, a significant driver improvement in such games isn't out of the question IMO, which would dramatically improve the current perception of how fast it is.
We have multiple factors indeed: low clocks, low ROP count, immature drivers (obvious from the frame pacing issues and fps locks that several games have). Nevertheless, NVIDIA is listing the Titan V as part of the 10 series on the driver selection page, so a completely new product is almost a given at this point.
 
I think the only question would be whether it makes sense to use 64 CUDA cores per SM (like P100 and V100) or 128 CUDA cores per SM (like every other Maxwell/Pascal GPU).
I don’t understand what the practical difference is between the 128-core and the 64-core Pascal SM. From where I stand, they are identical. Even Nvidia is confused about it, sometimes giving GP100 thirty 128-core SMs and sometimes sixty 64-core SMs.
 
I don’t understand what the practical difference is between the 128-core and the 64-core Pascal SM. From where I stand, they are identical. Even Nvidia is confused about it, sometimes giving GP100 thirty 128-core SMs and sometimes sixty 64-core SMs.
You are changing the size of the register file, instruction buffer/instruction cache, and texture units/cache relative to the number of CUDA cores, all of which affects warps/threads/thread blocks by doubling the SM-to-CUDA-core ratio.
It is efficient in many ways, but how it behaves in some workloads/code/gaming, *shrug*.
There must be a reason Nvidia never went this route with any of its other Pascal GPUs, especially the higher-end GP102 (remember, shared with the gaming segment); the Tesla P40 was Nvidia's highest-performing FP32 HPC/scientific card.
Hmm, I have never seen Nvidia describe the GP100 as having 128 CUDA cores per SM; the reality is 56 SMs, as the P100 has 3584 FP32 cores active (64 per SM).
60 SMs is in theory possible, as that would be the fully active die, but it was never released as such.
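For reference, the two descriptions are numerically interchangeable, which is probably where the confusion comes from:

$$56 \times 64 = 3584 \;\;\text{(P100 as shipped)}, \qquad 60 \times 64 = 3840 = 30 \times 128 \;\;\text{(full die)}$$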

If we had gaming results for both the Quadro GP100 and the Titan V it would help to give a better picture when it comes to games.
 
For reference purposes.
For one thing, as Nvidia co-founder and CEO Jensen Huang said during his call with Wall Street analysts, it costs about $1,000 to make the hardware in a Volta GPU card, give or take, and this is very expensive compared to prior GPU cards – all due to the many innovations that are encapsulated in the Volta GPU, from HBM2 memory from Samsung to 12 nanometer FinFET processes from Taiwan Semiconductor Manufacturing Corp to the sheer beastliness of the Volta GPU.
https://www.nextplatform.com/2017/08/11/nvidia-textbook-case-sowing-reaping-markets/


The full Pascal GP100 chip had 56 streaming multiprocessors, or SMs, and a total of 3,584 32-bit cores and 1,792 64-bit cores; with caches, registers, memory controllers, and NVLink interconnect ports, this Pascal GP100 chip weighed in at 15.3 billion transistors using the 16 nanometer FinFET process from Taiwan Semiconductor Manufacturing Corp.

The Volta GV100 chip has 84 SMs, with a total of 5,376 32-bit floating point cores, 5,376 32-bit integer cores, 2,688 64-bit floating point cores, and 672 TensorCores for deep learning; this chip is implemented in the very cutting edge 12 nanometer FinFET node from TSMC, which is one of the ways that Nvidia has been able to bump up the transistor count to a whopping 21.1 billion for this chip. To help increase yields, Nvidia is only selling Tesla V100 accelerator cards that have 80 of the 84 SMs on the GV100 chip activated, but in time, as yields improve with the 12 nanometer process, we expect the full complement of SMs to be activated and the Volta chips to get a slight further performance boost from it. If you extrapolate, there is another 5 percent performance boost in there, and more if the clock speed can be inched up a bit.
https://www.nextplatform.com/2017/05/10/nvidias-tesla-volta-gpu-beast-datacenter/
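The 5 percent extrapolation at the end is just the SM ratio at equal clocks:

$$\frac{84\ \text{SMs}}{80\ \text{SMs}} = 1.05$$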
 
It’s a slide that Nvidia presented at HotChips 2016...
Hmm..
And their 2016 P100 whitepaper has the correct number at 56; can you link the article or slides?
Edit:
NVM, and thanks, you're right; found the HotChips one, but that is the first I have seen that is wrong, and they assumed it had the same SM layout as the rest of the GPU line, which it obviously does not.

To put it into perspective, the 56-SM figure was on their primary devblog page well before that August presentation, and in the whitepaper, also well before August, so no idea how they messed up the HotChips presentation *shrug*.
https://devblogs.nvidia.com/parallelforall/inside-pascal/
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
 
It was bugging me, as I know John Danskin is pretty much on the ball as an engineer, and a senior one at that; the scope of the HotChips talk was specifically silicon/NVLink, but that is still not an excuse for the slide to be wrong (John Danskin is the Nvidia name associated with the HotChips presentation).
So I dug around for some of his other presentations; later on he shows the GP100 with 64 CUDA cores per SM, meaning double the SM-to-core ratio of the other GPUs, and 56/60 SMs.
HETEROGENEOUS COMPUTING: http://salishan.ahsc-nm.org/uploads/4/9/7/0/49704495/2017-danskin.pdf
You will notice the information with the same/similar die image (page 5 in the above link) has now changed, and the next slide shows the correct Pascal GP100 structure of 56/60 SMs.

Slightly off-tangent, but considering the P100 was never released as the fully active die (60 SMs), it makes me think we will see the same again with V100 staying at 80 SMs rather than eventually 84; just saying, as some feel it may eventually launch as a fully active GPU, but some thought that with the P100 as well.
 
Maybe we should wait for a multi-chip GPU to make its appearance first before declaring victory, and besides, Epyc isn't a GPU either so I don't really see how it is applicable.
We've seen papers from AMD and Nvidia on the matter. Both showed rather significant gains (30-40% as I recall), but didn't really go into scenarios beyond single-die fabrication limits. The physics are simple: multiple chips will win, as more silicon at lower voltage is more efficient due to less leakage. At least up until monolithic designs are running at threshold voltages. Then it's a question of absolute performance versus efficiency.

Epyc may not be a GPU, but we've seen cost and performance tradeoffs versus Intel's.
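For what it's worth, the gains in those papers line up with the usual first-order dynamic-power argument. An idealized sketch, ignoring leakage floors and inter-die link overhead, and assuming voltage scales linearly with clock:

$$P_{\text{dyn}} = \alpha C V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^3$$

so spreading the same throughput across two dies, each at half the clock and half the voltage, gives

$$P_{2\,\text{dies}} = 2\,\alpha C \left(\tfrac{V}{2}\right)^2 \tfrac{f}{2} = \tfrac{1}{4}\,P_{1\,\text{die}}$$

Real voltage scaling is nowhere near that generous, so the achievable gains are much smaller than this idealized bound.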
 
Hmm.. NVM, and thanks, you're right; found the HotChips one, but that is the first I have seen that is wrong, and they assumed it had the same SM layout as the rest of the GPU line, which it obviously does not.
I don't think it matters very much whether it has 30 SMs that have 128 cores or 60 SMs that have 64. But I still would like to know the reason why. :)
 
Basically, each SM has a block of shared memory/L1 cache associated with it. In GP100, they doubled the L1/shared memory by having two blocks of it per what used to be an SM. Due to addressing/whatever, each half of the original SM can only see one of the blocks, so it behaves like two 64-core SMs. There are likely structures (likely the SFUs) still shared by both halves, so it's not as clear cut as it could be.
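You can see the capacity side of this from software. A rough sketch, assuming only the documented limits (64 KiB of shared memory per SM on GP100 vs 96 KiB on GP102/GP104); the kernel is a made-up example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A block that statically claims 32 KiB of shared memory. Shared memory
// is private to the block's SM, so on GP100 (64 KiB/SM) at most 2 such
// blocks can be resident per SM, while on GP104 (96 KiB/SM) 3 can,
// registers and other limits permitting.
__global__ void smemHog(float *out) {
    __shared__ float buf[32 * 1024 / sizeof(float)];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x] = buf[threadIdx.x % 64];
}

int main() {
    int blocksPerSM = 0;
    // Ask the runtime how many of these blocks fit on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, smemHog, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("Resident 32 KiB blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```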
 
I don't think it matters very much whether it has 30 SMs that have 128 cores or 60 SMs that have 64. But I still would like to know the reason why. :)
I said earlier it changes the size of the register file, instruction buffer/instruction cache, and texture units/cache relative to the number of CUDA cores, all of which affects warps/threads/thread blocks by doubling the SM-to-CUDA-core ratio.

Apply Occam's Razor:
If it made no difference between 128 and 64 cores, then why double the SM structure in the first place? Although, like I said, Nvidia did report it is generally more efficient.
If it does make an efficiency difference, then why only on the P100 and V100, the flagship mixed-precision HPC-dedicated GPUs?
And lastly, why is it not applied to what used to be their fastest FP32 HPC GPU, the GP102-based Tesla P40? An important difference between GP102 and GP100 is that the former is also shared with the gaming segment and is not an FP64/FP32/FP16 mixed-precision GPU.

I mentioned earlier it is more efficient, but it may also not be ideal for all workloads/code, which includes some or many games; you notice the gaming results can be anywhere from 11% to 40% faster than a Titan Xp, and it cannot all be explained away by ROP/CPU limits - some games just cannot use the cores/SMs with the front end that well.
Unfortunately, the only way to really know is to also test the Quadro GP100 with games, to see if there is a trend/behaviour.
 
It's pretty clearly a mistake on the HotChips slide since every other NVIDIA paper mentions 60 (56+4)
edit: It probably should say "TPC" instead of "SM" on the HotChips slide, since each TPC on GP100 has 2 SMs
 
I don't think it matters very much whether it has 30 SMs that have 128 cores or 60 SMs that have 64. But I still would like to know the reason why. :)
I think the biggest high-level difference is 2x the amount of shared memory bandwidth per ALU, and ~1.33x the shared memory capacity (2x64KiB vs 1x96KiB per 128 ALUs). The bandwidth is especially important, as the lack of shared memory bandwidth was a big bottleneck for Maxwell (vs Kepler and especially GCN/Fermi), which showed up a lot more in compute than gaming.

On GV100, the shared memory architecture is completely different, so it's hard to predict what they're going to do for future gaming GPUs... 128KB of hybrid L1/shared memory per 64 FP32 ALUs seems like an expensive luxury for gaming...

BTW, this is completely unrelated to the increase in register file size, as registers are effectively per warp scheduler. Some of the other points that CSI PC brings up, e.g. instruction buffer/cache sizes, are interesting and I hadn't thought about them before - I don't really know whether HPC kernels are so large that they might justify a larger cache size compared to gaming (and again how it plays with the different instruction cache architecture in GV100).
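On the GV100 point, that hybrid design is at least software-visible: Volta added a carveout hint for splitting the unified 128 KiB between L1 and shared memory, plus an explicit opt-in for blocks wanting more than 48 KiB of dynamic shared memory. A minimal sketch (the kernel is a made-up placeholder):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel using dynamic shared memory.
__global__ void myKernel(float *data) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = data[threadIdx.x];
}

int main() {
    // Hint: prefer most of the 128 KiB unified SRAM as shared memory.
    // It is a preference, not a guarantee; pre-Volta parts ignore it.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         /*carveoutPercent=*/100);
    // Volta requires an explicit opt-in for >48 KiB of dynamic shared
    // memory per block (up to 96 KiB on compute capability 7.0).
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         96 * 1024);
    return 0;
}
```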
 
I think the biggest high-level difference is 2x the amount of shared memory bandwidth per ALU, and ~1.33x the shared memory capacity (2x64KiB vs 1x96KiB per 128 ALUs). The bandwidth is especially important, as the lack of shared memory bandwidth was a big bottleneck for Maxwell (vs Kepler and especially GCN/Fermi), which showed up a lot more in compute than gaming.
That makes a lot of sense.

But that means that GP100 can’t fit some shared data structures in shared memory, because the maximum size per SM is actually smaller. This is probably a rare exception, though.

I wasn’t aware that Maxwell was a regression in shared memory bandwidth compared to Kepler. (To be honest: I’ve never given it any thought!)
 