Nvidia Pascal Speculation Thread

If we're going to be comparing Pascal/Volta to KNL, then we really need to bring up NVLink, since we're already in the specialized hardware realm. They're claiming 50-200 GB/s for it, depending on the number of links - that is, access to host memory at full bandwidth. That means that you'll get just as much DDR4 capacity connected to the Tesla card over NVLink as KNL gets.
 
KNL's DDR4 is about 90 GB/s, supposedly.
SGEMM isn't main memory bandwidth bound in any meaningful fashion (<< 0.1 byte per FLOP in any decent implementation), so ~480GB/s should support around 7 TFLOPs without breaking a sweat.
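As a rough sanity check on that (a back-of-the-envelope sketch with an assumed blocking factor, not measured numbers): a cache-blocked SGEMM that computes a b x b tile of C only needs to stream about 4/b bytes per FLOP, which fits comfortably under a 480 GB/s budget at 7 TFLOPS.

Code:
# Rough roofline check for blocked FP32 SGEMM (illustrative, assumed tile size).
BYTES_PER_ELEM = 4  # FP32

def sgemm_bytes_per_flop(tile=128):
    # A tile x tile block of C streams ~2*K*tile elements of A and B from
    # memory for 2*K*tile*tile FLOPs, i.e. BYTES_PER_ELEM / tile bytes per FLOP.
    return BYTES_PER_ELEM / tile

flop_rate = 7e12      # ~7 TFLOPS target
bandwidth = 480e9     # ~480 GB/s

print(bandwidth / flop_rate)      # bytes/FLOP the memory system can feed: ~0.069
print(sgemm_bytes_per_flop(128))  # what a 128-wide blocked SGEMM needs:   ~0.031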

It's interesting how badly optimised Caffe is:

https://software.intel.com/en-us/ar...d-training-on-intel-xeon-e5-series-processors

From what I gather, a lot of the deep learning stuff depends on multiplication of oddly shaped matrices, that is, long skinny ones. In these cases, you end up needing a lot more bandwidth than for square matrices, since you're getting close to vector-matrix multiplication. I believe this is why batch size affects performance to the extent it does.
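To put rough numbers on that (a minimal sketch using the optimistic minimum-traffic model, where each operand is read or written exactly once; the shapes are made-up examples):

Code:
# Arithmetic intensity of an (M x K) @ (K x N) FP32 GEMM, assuming each operand
# is moved between memory and the chip exactly once (an optimistic lower bound).
def gemm_flops_per_byte(M, K, N, elem_bytes=4):
    flops = 2 * M * K * N
    traffic = elem_bytes * (M * K + K * N + M * N)
    return flops / traffic

print(gemm_flops_per_byte(4096, 4096, 4096))  # square:            ~683 FLOPs/byte
print(gemm_flops_per_byte(4096, 4096, 64))    # skinny (batch 64):  ~31 FLOPs/byte
print(gemm_flops_per_byte(4096, 4096, 1))     # matrix-vector:      ~0.5 FLOPs/byte

So even in this idealised model a batch of 64 costs roughly 20x more bandwidth per FLOP than the square case, and the pure matrix-vector limit sits at ~0.5 FLOPs/byte regardless of size.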
 
If we're going to be comparing Pascal/Volta to KNL, then we really need to bring up NVLink, since we're already in the specialized hardware realm. They're claiming 50-200 GB/s for it, depending on the number of links - that is, access to host memory at full bandwidth. That means that you'll get just as much DDR4 capacity connected to the Tesla card over NVLink as KNL gets.
Technically, this is true. In practice, though, I'm not planning to port my software to IBM Power, which is sadly necessary if one wants to use NVLink - it's a lot of work to get software running on a different system.
 
Do you see a benefit in GPU to GPU nvlink? One of the major benefits that was listed during the GTC keynote was that fast exchange of weights between GPUs would be a major advantage.
 
Do you see a benefit in GPU to GPU nvlink? One of the major benefits that was listed during the GTC keynote was that fast exchange of weights between GPUs would be a major advantage.
It would be useful, for sure. My big problem with NVLink is that it's not routable, and it can't use ribbon cables, so the topology has to be burned into the motherboard. I'm concerned that this will make things low volume and expensive.
 
If we're going to be comparing Pascal/Volta to KNL, then we really need to bring up NVLink, since we're already in the specialized hardware realm. They're claiming 50-200 GB/s for it, depending on the number of links - that is, access to host memory at full bandwidth. That means that you'll get just as much DDR4 capacity connected to the Tesla card over NVLink as KNL gets.
I wonder if NVidia will lock "high performance" cuDNN on Pascal to the Tesla variants, like it did with double-precision.
 
From what I gather, a lot of the deep learning stuff depends on multiplication of oddly shaped matrices, that is, long skinny ones. In these cases, you end up needing a lot more bandwidth than for square matrices, since you're getting close to vector-matrix multiplication. I believe this is why batch size affects performance to the extent it does.
SGEMV wants, say, 25 times more bandwidth per FLOP than SGEMM on square matrices, so yes, skinny matrices sound like a problem.
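That ~25x figure roughly falls out of comparing SGEMV's unavoidable cost of one 4-byte matrix element per two FLOPs against a blocked SGEMM; a tiny sketch with an assumed ~50-wide blocking factor:

Code:
# Where a "~25x" ratio could come from (illustrative, assumed blocking factor).
elem = 4                            # FP32 bytes
sgemv_bytes_per_flop = elem / 2     # each matrix element feeds only 2 FLOPs -> 2.0
sgemm_bytes_per_flop = elem / 50    # blocked SGEMM with ~50-wide tiles     -> 0.08
print(sgemv_bytes_per_flop / sgemm_bytes_per_flop)   # 25.0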

Are these matrices sparse too? Are the longest dimensions in the 10s or 100s of thousands?
 
Are these matrices sparse too? Are the longest dimensions in the 10s or 100s of thousands?
The convolutional layers are, say, 256x256 images that are multiplied, for each pixel, with multiple 3x3 or 5x5 kernels. Those are definitely not sparse.
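For reference, such a convolution is commonly lowered to a single dense GEMM via im2col; a sketch of the resulting shapes, with hypothetical channel counts (the post only gives the 256x256 image size and 3x3/5x5 kernels):

Code:
# im2col view of a 3x3 convolution over a 256x256 feature map.
# C_in and C_out are assumed channel counts, not from the post.
H = W = 256
C_in, C_out, k = 64, 128, 3

filters = (C_out, k * k * C_in)   # (128, 576)   - dense
patches = (k * k * C_in, H * W)   # (576, 65536) - dense, one column per output pixel
output  = (C_out, H * W)          # (128, 65536)

flops = 2 * filters[0] * filters[1] * patches[1]
print(filters, patches, output, f"{flops / 1e9:.1f} GFLOPs")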
 
SGEMV wants, say, 25 times more bandwidth per FLOP than SGEMM on square matrices, so yes skinny matrices sounds like a problem.

Are these matrices sparse too? Are the longest dimensions in the 10s or 100s of thousands?
The models I train are primarily RNN (LSTM), so they have dimensions like (1024 X 1024) X (1024 X 64) => 1024 X 64.
They are dense.
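Plugging that exact shape into the same minimum-traffic model from above (a rough sketch; the 7 TFLOPS rate is just the reference figure mentioned earlier in the thread):

Code:
# Bandwidth needed to keep a (1024 x 1024) @ (1024 x 64) FP32 GEMM compute-bound,
# assuming each operand is moved between memory and the chip only once.
M, K, N, elem = 1024, 1024, 64, 4
flops = 2 * M * K * N                       # ~134 MFLOPs per multiply
traffic = elem * (M * K + K * N + M * N)    # ~4.7 MB per multiply
print(flops / traffic)                      # ~28 FLOPs/byte
print(7e12 * traffic / flops / 1e9, "GB/s needed at 7 TFLOPS")   # ~246 GB/s

The 1024x1024 weight matrix dominates the traffic unless the batch dimension grows, which is the batch-size effect again.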
 
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.

Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)
I noticed the following fact a little while ago and I didn't see it mentioned here: on page 7 of the presentation, the Tesla GPU for 2012 is the K20 and not the K20X, even though both parts were launched that year. The same situation holds for 2010, since the M2050 is in the graph when both it and the M2070 were 2010 products.

Code:
                    K20   K20X   TITAN
        Chip SMs     15     15      15
     Enabled SMs     13     14      14
Core clock (MHz)    706    732     837
       SP GFLOPS   3520   3940    4500
       DP GFLOPS   1170   1310    1500
             TDP    225    235     250
     SP GFLOPS/W     16     17      18

If the 2016 Pascal lineup includes two Teslas then I think it's likely that the 4 DP TFLOPS refers to the lower-end one. In this case, two Pascal Teslas and a Pascal TITAN that are similar to the above Kepler parts in terms of the proportions of enabled SMs, ratios of GFLOPS, and TDPs could be along these lines:

Code:
                    [P]   [PX]  TITAN "Y"

        Chip SMs     32     32         32   (case 1)
     Enabled SMs     28     30         30
Core clock (MHz)   1100   1150       1300   (approximate)

        Chip SMs     36     36         36   (case 2)
     Enabled SMs     31     34         34
Core clock (MHz)   1000   1000       1150   (approximate)

       SP GFLOPS   8000   9000      10000   (approximate)
       DP GFLOPS   4000   4500       5000   (approximate, assuming 1:2 DP)
             TDP    225    235        250
     SP GFLOPS/W     36     38         40
That gives a more reasonable FLOPS increase of 45-50% from the TITAN X to the higher-end Pascal Tesla and 60-65% from one TITAN to the next. (1.3 GHz seems rather high, though. If it actually has 32 SMs then I expect the die to be a fair amount smaller than the GK110, so we might see (for example) 29/31/31 instead of the 28/30/30 calculated from a direct comparison.)
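For what it's worth, the SP figures in both tables line up with SMs x cores-per-SM x 2 FLOPs x clock if you assume a Maxwell-style 128 CUDA cores per Pascal SM (Kepler's SMX had 192); a quick sketch reproducing them, as speculative as the tables themselves:

Code:
# Reproduce the SP GFLOPS figures from SM count, cores per SM and clock.
# 128 cores per Pascal SM is an assumption carried over from Maxwell.
def sp_gflops(sms, clock_mhz, cores_per_sm=128):
    return sms * cores_per_sm * 2 * clock_mhz / 1000   # 2 FLOPs/core/clock (FMA)

print(sp_gflops(13, 706, cores_per_sm=192))  # K20:               ~3524 (table: 3520)
print(sp_gflops(28, 1100))                   # [P],  case 1:      ~7885 (~8000)
print(sp_gflops(30, 1150))                   # [PX], case 1:      ~8832 (~9000)
print(sp_gflops(30, 1300))                   # TITAN "Y", case 1: ~9984 (~10000)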
 
It'd be odd, though, for Nvidia to go and brag about its lowest-performing chip. Perhaps it's the lowest-end part they're confident of shipping based on their first engineering samples and yields, and they might reveal a more functional/higher-clocked chip later if yields turn out good enough?
 
I noticed the following fact a little while ago and I didn't see it mentioned here: on page 7 of the presentation, the Tesla GPU for 2012 is the K20 and not the K20X, even though both parts were launched that year. The same situation holds for 2010, since the M2050 is in the graph when both it and the M2070 were 2010 products.
Perhaps it was just done for marketing reasons, to make the graph's trend look better. Otherwise the year-over-year improvement from the K20X to the K40 would be rather meagre.
The M2050 and M2070 would produce identical results in the page 7 graphs, since the M2070 just doubles the memory capacity.
 
Perhaps it was just done for marketing reasons, to make the graph's trend look better. Otherwise the year-over-year improvement from the K20X to the K40 would be rather meagre.
The M2050 and M2070 would produce identical results in the page 7 graphs, since the M2070 just doubles the memory capacity.
The M2070 isn't represented in the graphs; I think you'll find that it is the M2090 that is shown. The M2050/M2070 is GF100-powered while the M2090 is GF110 - higher core count and higher frequencies, and thus higher bandwidth and floating-point throughput.
 
From the second slide on the TechPowerUp page:
9 inception layers


Honestly I'm surprised by the low SP FLOPS of the DRIVE PX 2. It only has slightly more SP FLOPS than the TITAN X and I can't imagine that the Tegra parts take more than a small piece of the 250 W TDP. Is there something I'm missing here?
 
Indeed, kind of low. We are probably talking about 2x ~2B transistors for the Tegras + 2x ~6B transistors for the Pascal GPUs.

My guess for the 8 TFLOPs:

Tegra: 2x 384SPs @ 0.8GHz = 1.2 TFLOPs
Pascals: 2x ~2048SPs (~4096 total) @ ~0.85GHz = ~6.8 TFLOPs

Since this supercomputer must operate under harsh conditions (cars heating up beyond 100°C), the clock rates would have to be kept that low.

Power Budget could be:
Tegra: 2x 20W
GDDR5: 2x 16W
Pascals: 2x ~80W
Boards/IO: ~18W.
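A quick sanity check of that guess (using the post's own assumed figures, with the Pascal GPUs read as ~2048 SPs each; none of this is an official spec):

Code:
# Sanity check of the guessed DRIVE PX 2 breakdown (assumed figures from the post).
tegra_tflops  = 2 * 384  * 2 * 0.80 / 1000   # ~1.2 TFLOPs
pascal_tflops = 2 * 2048 * 2 * 0.85 / 1000   # ~7.0 TFLOPs (post rounds to ~6.8)
print(tegra_tflops + pascal_tflops)          # ~8.2, close to the quoted 8 TFLOPs

power_w = 2 * 20 + 2 * 16 + 2 * 80 + 18      # Tegras + GDDR5 + Pascals + board/IO
print(power_w)                               # 250, matching the 250W budget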
 
With a 250W budget, NVIDIA managed to put 2 Denver-based Tegras and 2 Pascal GPUs on a single board. Judging by this, I expect NV to launch dual-Pascal, HBM2-based Tesla cards from the very beginning of the Pascal life cycle. That would be ~2TB/sec of memory bandwidth per GPU slot (two GPUs at ~1TB/sec of HBM2 each), a significant performance jump compared to the K80. Cannot wait.
 
From that it now becomes clearer what the next big Pascal, i.e. GP200, will look like:
4096 SP / 8 TFLOPS SP / 4 TFLOPS DP
Regarding FP16 performance, NV has created a new metric, DLTOPS (deep learning tera operations per second), which would sit at 24 DLTOPS. The new name indicates it's not the same thing as 24 TFLOPS of FP16.

For compute applications that need DP and DL it will be much better than GM200.
But for gaming and SP it would only be about a 25% improvement, which is rather small.
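A rough cross-check supports that reading (assuming the usual 2x packed-FP16 rate over FP32; the exact definition of DLTOPS isn't given in these slides):

Code:
# If DLTOPS were simply packed FP16 FMAs at 2x the FP32 rate, 8 TFLOPS SP
# would give 16, not the quoted 24 - so the metric must count something more.
sp_tflops = 8
print(2 * sp_tflops)    # 16, what plain 2x FP16 would give
print(24 / sp_tflops)   # 3.0x ratio of DLTOPS to SP TFLOPS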
 