Nvidia Volta Speculation Thread

Why would it need to be undervolted? That would surely help, but nobody is undervolting in the server market. If anything they simply run cards slower to be more power efficient. The savings between memory types are well established, so feel free to educate yourself before making a fool of yourself like you just did here.
Because memory is only one (small) part of the power equation. For example, on your beloved 300W MI25 accelerator board, HBM2 consumes only 30 watts, i.e. 10%. The silicon is still by far the biggest consumer, and Vega is behind Pascal, whatever the single AI bench AMD showed us may say, with their now-legendary fake results and an old version of CUDA. So I will wait for independent results before admitting the MI25 is faster than the now-discontinued P100, which is in any case irrelevant as the V100 is now on the market...

So your reality broke down and you substituted figures to make it fit? Take 4 chips, disable three of them, then presume the one remaining chip still consumes the same amount of power? I suppose that's one way to make 180 TFLOPS @ 250W less than 120 TFLOPS @ 300W.
Check your facts. TPU2 is rated at 250W for a single chip. One blade of 4 TPU2s uses redundant 1500W PSUs. If you had ever looked at the MA-SSI-VE heatsink of the TPU2, you would have kept quiet. I'll save you a Google search with the picture below:

[Attached image: a10.jpg]


I think you're the one in need of a reality check. Compute and graphics are entirely separate areas. Vega is already beating P100 in some tests as expected, yet you feel Nvidia's GDDR5 offerings are superior to even their largest chip? That's just **** stupid, yet you're accusing me of being a fanboy for actually posting accurate information while you link marketing BS? Please show yourself out, because this crap is hardly worth the effort of a response.
See my reply above; up to now we have a single biased AI compute bench presented by AMD on old Nvidia hardware and old software, and you still call it a win? As for gaming Vega, let's wait for independent benches before drawing any definitive conclusion, especially against the Volta range, the real competition.
Then I remind you that the market is not only the single top chip; customers buy far more of the smaller cards (from both vendors, i.e. look at the Instinct range). That is why Tesla and Instinct are also available with GDDR5/5X memory, hence my remark. QED
Finally, we agree on one point. I won't spend any more of my time responding to your nonsense and FUD either. I still wish you a nice day :smile:
 
Vega is already beating P100 in some tests as expected, [...]
Apart from IHV launch decks - is there any actual data available? I mean substantial data, not abusing multi-billion-transistor monsters for Direct Draw2D? I would be glad to see some independent analysis, since most outfits (ours included) do not have access to P100-class cards. IOW, measuring the chip's performance, not how artificially crippled its drivers are or whether they've got an SSD on board. For the latter, I know the numbers, but they could have been achieved with a similarly outfitted Polaris GPU already, since nothing is as slow as going off-board.
 
That's just **** stupid, yet you're accusing me of being a fanboy for actually posting accurate information while you link marketing BS?
The only one linking marketing BS here is you, actually; the only one obsessing over fantastical gains without a shred of evidence beyond vague and obscure marketing materials.

Vega is only beating GP100 in an old AMD marketing slide that has been bombarded with criticism over its accuracy and validity. And we all know how credible AMD's marketing has been lately, with stunts like hiding fps, blind tests, selectively cherry-picking certain results, or outright lying (480 CF beating the 1080, Premium VR, Vega/Fury X beating the 1080 in minimum fps), and the list goes on. They didn't even bother to repeat the test during Radeon Instinct's launch. All they did was post some lame theoretical flop-count comparisons on their official page, even adding a disclaimer that they never verified external tests:
AMD has not independently tested or verified external/third party results/data and bears no responsibility for any errors or omissions therein.

So unless you have any confirmation from a "trustworthy" source, NO, Vega isn't beating GP100.
 
The story would be different if AMD didn't launch Vega (a uarch that relies heavily on software) with half-baked software surrounding it.
On a scale of 1 to Fermi, this is a Fermi in terms of bad launches.

The half-baked software has always been there. Hawaii wouldn't have needed the jet turbine if AMD had had better drivers at the start, the 390 series saw massive tessellation improvements at launch, and the 7970 saw a massive improvement in BF3 almost a year after launch. This is worse than Fermi, since AMD didn't have a Fermi-die-sized card at the time to pummel Fermi with the way Nvidia has the 1080 Ti now, nor were they tied with the GTX 480. The point of the 'if' scenario is that AMD has to push their cards hard, and they take a big hit to efficiency for it. AMD would have a much better standing with enthusiasts this round if not for the 1080 Ti still being 10% better at stock than what Vega can do with UV/OC and more power.

Google Nvidia Max-Q laptop reviews and you will see that Pascal can be much more power efficient than what we see on desktop cards.

That might be so, but I don't see the relevance: both the AMD and Nvidia cards I have show the same ~10% more performance for ~30% higher power draw while being very similar in stock performance and power draw, so reducing power draw on the stock 1070 wouldn't have the same impact.
 
Apart from IHV launch decks - is there any actual data available? I mean substantial data, not abusing multi-billion-transistor monsters for Direct Draw2D? I would be glad to see some independent analysis, since most outfits (ours included) do not have access to P100-class cards. IOW, measuring the chip's performance, not how artificially crippled its drivers are or whether they've got an SSD on board. For the latter, I know the numbers, but they could have been achieved with a similarly outfitted Polaris GPU already, since nothing is as slow as going off-board.
No good third-party benchmarks, as that stuff tends to stay internal while companies run their own data. Other technologies aside, there is a subset of HPC (fluids, sparse matrices, particle sims, image manipulation) that falls in line with pure FLOPS. So Linpack will give a fairly accurate indication of performance and track the theoretical numbers, hence the scarcity of custom benchmarks. Same thing with tensor hardware, with everyone designing systolic arrays since the access patterns are highly predictable and repetitive. Essentially ultra-wide cascaded SIMDs. That is the very reason supercomputer clusters can scale to that many nodes.

I haven't touched cluster stuff since college, but the math won't have changed. Those workloads hit one bottleneck and hit it hard; that's just the nature of the data, and complex branching behavior simply doesn't exist in many cases. If it's not TFLOPS, it's memory capacity, since the problems/sims/systems end up being huge once you get past classroom demos, turning into SANs and unified/distributed-memory systems that often fall back on storage arrays.
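To make the "Linpack tracks theoretical numbers" point concrete, here's a quick back-of-envelope sketch; the accelerator figures in it are made up for illustration, not any vendor's specs:

```python
# Rough sketch: for FLOP-bound dense solvers, the measured Linpack number (Rmax)
# sits close to the theoretical peak (Rpeak), so the spec sheet already tells the story.
# All inputs are illustrative placeholders, not vendor data.

def theoretical_peak_tflops(num_units, flops_per_unit_per_clock, clock_ghz):
    """Rpeak = execution units * FLOPs per unit per clock * clock (GHz), in TFLOPS."""
    return num_units * flops_per_unit_per_clock * clock_ghz / 1e3

rpeak = theoretical_peak_tflops(4096, 2, 1.4)   # hypothetical: 4096 FP32 lanes, FMA, 1.4 GHz
rmax = 0.85 * rpeak                             # GEMM-heavy codes typically land near peak

print(f"Rpeak ~ {rpeak:.1f} TFLOPS, Rmax ~ {rmax:.1f} TFLOPS ({rmax / rpeak:.0%} efficiency)")
```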

The point of the 'if' scenario is that AMD has to push their cards hard, and they take a big hit to efficiency for it. AMD would have a much better standing with enthusiasts this round if not for the 1080 Ti still being 10% better at stock than what Vega can do with UV/OC and more power.
Raja did say raw performance was the largest factor in sales. The same applies to Nvidia, but they limit performance to position themselves safely ahead of the competition. UV/OC would generally benefit AMD more simply due to where they sit on the exponential voltage/frequency curves. The real issue is software, not the underlying hardware. I really wouldn't be surprised if Nvidia had TBDR-style optimizations infringing some patents. Power consumption isn't necessarily worse, but deliberately trashed, as evidenced by the power-saving modes: negligible performance hits with double-digit power drops.
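To put rough numbers on the exponential-curve point, a quick sketch with dynamic power scaling as roughly f * V^2; the percentages are made up for illustration, not measured on any card:

```python
# Illustrative only: why a small clock/voltage reduction buys a double-digit power drop.
# Dynamic power scales roughly as P ~ f * V^2 (switching capacitance held constant).

def relative_power(freq_scale, volt_scale):
    return freq_scale * volt_scale ** 2

stock = relative_power(1.00, 1.00)
eco = relative_power(0.95, 0.93)   # hypothetical power-saving mode: -5% clocks, -7% voltage

print(f"~5% performance hit, power at {eco / stock:.0%} of stock (~{1 - eco / stock:.0%} drop)")
```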
 
xpea and Antichrist4000: please refrain from personal insults and accusations of phanboism. Thanks.
 
Because memory is only one (small) part of the power equation. For example, on your beloved 300W MI25 accelerator board, HBM2 consumes only 30 watts, i.e. 10%.
You mean ~100W with GDDR5: a full third or more of the power budget, which is what necessitated HBM in the first place. It's almost as if there is a reason everyone uses it on their high-end chips. Feel free to misconstrue the comparisons though; GDDR5 products will have a difficult time in power-efficiency comparisons, not to mention density.

Check your facts. TPU2 is rated at 250W for a single chip. One blade of 4 TPU2s uses redundant 1500W PSUs. If you had ever looked at the MA-SSI-VE heatsink of the TPU2, you would have kept quiet. I'll save you a Google search with the picture below:
Again, please check your facts, as you put it: that picture with P100s, Xeons, and whatever else is in there is a poor reference. Someone might almost mistake a P100 for a TPU2. You're seriously comparing a GPU with far more functionality to a product that is almost entirely tensor cores, and expecting it to be vastly superior. The systolic arrays in TPU2 are about as efficient as you can get, and the memory is comparable.
 
Again, please check your facts, as you put it: that picture with P100s, Xeons, and whatever else is in there is a poor reference. Someone might almost mistake a P100 for a TPU2. You're seriously comparing a GPU with far more functionality to a product that is almost entirely tensor cores, and expecting it to be vastly superior. The systolic arrays in TPU2 are about as efficient as you can get, and the memory is comparable.
From Google's own blog (it cannot be more official):
https://www.blog.google/topics/google-cloud/google-cloud-offer-tpus-machine-learning/
you can find the exact same picture of the TPU2 blade board with the huge heat sinks attached to the TPU2s:
[Image from Google's blog: TPU2 blade board with heat sinks]

As we say in France, "There is none so blind as he who will not see" :no:

Please, once and for all, accept the truth, apologize for your error, and move on; you will greatly benefit from it.
 
https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-machine-learning-clusters/

Google’s first generation TPU consumed 40 watts at load while performing 16-bit integer matrix multiplies at a rate of 23 TOPS. Google doubled that operational speed to 45 TFLOPS for TPU2 while increasing the computational complexity by upgrading to 16-bit floating point operations. A rough rule of thumb says that is at least two doublings of power consumption – TPU2 must consume at least 160 watts if it does nothing else other than double the speed and move to FP16. The heat sink size hints at much higher power consumption, somewhere above 200 watts.

The size of these heat sinks screams “over 200W each.” It is easy to see that they are much larger than the 40 watt heat sink on the original TPU. These heat sinks fill two Google vertical 1.5-inch Google form factor units, so they are almost three inches tall. (Google rack unit height is 1.5 inches, a little shorter than the industry standard 1.75-inch U-height).
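Their rule of thumb, spelled out (just the article's back-of-envelope arithmetic, not a measurement):

```python
# The article's estimate: TPU1 drew ~40 W at 23 TOPS (int16); TPU2 roughly doubles
# throughput (45 TFLOPS) and moves from int16 to FP16, i.e. ~two doublings of power.
tpu1_power_w = 40
doublings = 2                                     # 1) ~2x throughput, 2) int16 -> FP16 datapath
tpu2_power_floor_w = tpu1_power_w * 2 ** doublings

print(f"Estimated TPU2 power floor: {tpu2_power_floor_w} W")   # 160 W; heat sinks hint at >200 W
```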
 
Please keep in mind, before making any comparison with competing products, that the next set of TPU2 drivers is practically guaranteed to reduce the heat sink size by 35%.
 
HPC Innovation Lab, September 27, 2017
In this blog, we will introduce the NVIDIA Tesla Volta-based V100 GPU and evaluate it with different deep learning frameworks. We will compare the performance of the V100 and P100 GPUs. We will also evaluate two types of V100: V100-PCIe and V100-SXM2. The results indicate that in training V100 is ~40% faster than P100 with FP32 and >100% faster than P100 with FP16, and in inference V100 is 3.7x faster than P100. This is one blog of our Tesla V100 blog series. Another blog of this series is about the general HPC applications performance on V100 and you can read it here.
...
A single Tensor Core performs the equivalent of 64 FMA operations per clock (for 128 FLOPS total), and with 8 such cores per Streaming Multiprocessor (SM), 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in a SM only generate 256 FLOPS per clock. So in scenarios where these cores can be used, V100 is able to deliver 4x the performance versus P100.
...
As in our previous deep learning blog, we still use the three most popular deep learning frameworks: NVIDIA’s fork of Caffe (NV-Caffe), MXNet and TensorFlow. Both NV-Caffe and MXNet have been optimized for V100. TensorFlow still does not have any official release to support V100, but we applied some patches obtained from TensorFlow developers so that it is also optimized for V100 in these tests. For the dataset, we still use ILSVRC 2012 dataset whose training set contains 1281167 training images and 50000 validation images. For the testing neural network, we chose Resnet50 as it is a computationally intensive network. To get best performance, we used CUDA 9-rc compiler and CUDNN library in all of the three frameworks since they are optimized for V100.
http://en.community.dell.com/techcenter/b/techcenter
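For reference, plugging the per-SM figure quoted above into the published V100 SXM2 specs (80 SMs, ~1.53 GHz boost) reproduces the headline tensor number; a quick sketch of that arithmetic:

```python
# Back-of-envelope Tensor Core peak from the figures in the Dell blog:
# 64 FMAs per Tensor Core per clock = 128 FLOPs, 8 Tensor Cores per SM.
flops_per_tensor_core_per_clock = 64 * 2          # FMA counts as 2 FLOPs
tensor_cores_per_sm = 8
sm_count = 80                                     # published V100 SM count
boost_clock_ghz = 1.53                            # published V100-SXM2 boost clock

per_sm_per_clock = flops_per_tensor_core_per_clock * tensor_cores_per_sm   # 1024 FLOPs/clock/SM
peak_tflops = per_sm_per_clock * sm_count * boost_clock_ghz / 1e3

print(f"{per_sm_per_clock} FLOPs/clock/SM -> ~{peak_tflops:.0f} TFLOPS FP16 tensor peak")
# ~125 TFLOPS, versus ~31 TFLOPS from plain FP16 CUDA cores (256 FLOPs/clock/SM): the 4x gap.
```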
 
Are the DGX-1 and DGX Station currently the only systems using NVLink2? Curious, since the Dell benchmark post mentioned the only hardware changes were the GPUs.
All the performance results in this blog were measured on a PowerEdge Server C4130 using Configuration G (4x PCIe V100) and Configuration K (4x V100-SXM2). Both these configurations have been used previously in P100 testing. Also except for the GPU, the hardware components remain identical to those used in the P100 tests as well: dual Intel Xeon E5-2690 v4 processors, 256GB (16GB*16 2400 MHz) Memory and an NFS file system mounted via IPoIB on InfiniBand EDR were used. Complete specs details are included in our previous blog. Moreover, if you are interested in other C4130 configurations besides G and K, you can find them in our K80 blog.
 
Found an interesting page at Thinkmate where you can customize many rack servers, with prices:
http://www.thinkmate.com/systems/se...ter&utm_campaign=ced8f9355e-NVIDIA-v100-Volta

If you click on the "GPX XT4-24S1 4NVLINK" model, you reach this page:
http://www.thinkmate.com/system/gpx-xt4-24s1-4nvlink
From here you can add up to 4 "NVIDIA® Tesla™ V100 GPU Computing Accelerator - 16GB HBM2 - SXM2 NVLink" at $7999 each, much lower than the initial $16k that people were expecting. All in all, V100 is only slightly more expensive than P100, which gives Volta better value for money than Pascal. It also suggests that yields are good, very surprising for such a mammoth chip!
 
Oracle joins the long list of cloud providers offering GPU-accelerated services. They are adding P100 and V100 to their racks:
... we’re excited to announce that you’ll be able to utilize NVIDIA’s Pascal-based Tesla GPUs on our newly announced X7 hardware. With no hypervisor overhead, you’ll have access to bare metal compute instances on Oracle Cloud Infrastructure with two NVIDIA Tesla P100 GPUs to run CUDA-based workloads allowing for more than 21 TFLOPS of single-precision performance per instance.

...

Oracle is also working closely with NVIDIA to provide the next generation of GPUs based on the Volta Architecture in both bare metal and virtual machine compute instances soon, allowing for up to 8 NVIDIA GPUs, all interconnected with NVIDIA NVLink. These instances enable larger workloads to fit a single compute instance, while optimizing communication between GPUs with NVIDIA NVLink. This is going to be a game changer for customers, allowing them to essentially rent a supercomputer by the hour!

https://blogs.oracle.com/oracle-and-nvidia-provide-accelerated-compute-offerings

Another big win for Nvidia.
 
This is technically "post-Volta", but I think this thread might be the next best place to share.

https://www.anandtech.com/show/1191...-pegasus-at-gtc-europe-2017-feat-nextgen-gpus

130 TOPS in ~220W is a pretty sizeable increase considering the V100 does 120 TOPS in 300W.

At 320 TOPS for a dual GPU plus dual Xavier system, this works out at 130 TOPS per GPU (the Xavier SoCs are already quoted as 30 TOPS at 30W). Meanwhile on power consumption, with Xavier already speced for 30W each, this means we're looking at around 220W for each GPU. In other words, these are high-end Gx102/100-class GPU designs. Coincidentally, this happens to be very close to the TOPS performance of the current Volta V100, which is rated for 120 TOPS. However the V100 has a 300W TDP versus an estimated 220W TDP for the GPUs here, so you can see where NVIDIA wants to go with their next-generation design.
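Spelling out the arithmetic behind that estimate (assuming the ~500W total board power Nvidia announced for Pegasus; the per-GPU split is an estimate, not an official spec):

```python
# Per-GPU estimate implied by the AnandTech numbers for DRIVE PX Pegasus.
total_tops = 320
total_power_w = 500                                # announced board power (assumption here)
xavier_count, xavier_tops, xavier_power_w = 2, 30, 30

gpu_tops = (total_tops - xavier_count * xavier_tops) / 2           # ~130 TOPS per GPU
gpu_power_w = (total_power_w - xavier_count * xavier_power_w) / 2  # ~220 W per GPU

print(f"Per next-gen GPU: ~{gpu_tops:.0f} TOPS at ~{gpu_power_w:.0f} W (V100: 120 TOPS at 300 W)")
```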
 