Nvidia Volta Speculation Thread

Ryan Smith · Dec 21, 2017

silent_guy said:
I was about to comment on how enjoyable it is to read a well written piece of journalism. And then I stumbled into a repeat of this thing:

Please by all means drop me a line. We strive for technical accuracy.

Arun said:
Ryan (or anyone else from Anand) - how did you set up/implement the GEMM tests? I'm guessing it's cuBLAS multiplying two matrices which are *both* very large?

Correct. 2560 by 2048 by 16384.
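
For scale, a GEMM of that shape works out to roughly 172 GFLOP per call under the usual 2*M*N*K convention (one multiply and one add per inner-product term). A minimal Python sketch of the arithmetic follows; which of the three numbers is M, N or K and the 2 ms runtime are assumptions for illustration, not figures from the article.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C[m x n] = A[m x k] @ B[k x n], counting each multiply-add as 2."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS given a measured wall-clock time for one GEMM call."""
    return gemm_flops(m, n, k) / seconds / 1e12


# Dimensions quoted in the post; which one is M, N or K does not change the product.
flops = gemm_flops(2560, 2048, 16384)
print(flops)  # 171798691840

# Hypothetical runtime of 2 ms, purely to show the conversion.
print(round(achieved_tflops(2560, 2048, 16384, 2e-3), 1))  # 85.9
```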

BTW on the Beyond3D "Estimated FMA latency" test - it doesn't really make sense for GCN to be 4.5 cycles. There are possible HW explanations for non-integer latencies but they're not very likely. The test inherently has some overhead (which can be amortised by trading off how long it runs in various ways) so maybe it's just higher on GCN for some reason which makes it "look" like 4.5 cycles when it's really 4 cycles; I'm not quite sure.

I agree half a cycle is very unlikely. But our tests report that half of a cycle very consistently. So at least for the moment it's what we stick with until we can better determine why it's so adamantly in there.

Arun · Dec 21, 2017

Ryan Smith said:
I agree half a cycle is very unlikely. But our tests report that half of a cycle very consistently. So at least for the moment it's what we stick with until we can better determine why it's so adamantly in there.

Fair enough! I wrote that test several years ago, in retrospect I wish a lot of these tests were auto-tuning to minimise overhead, it can be done in the framework but not in a very nice way sadly.
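
To illustrate how fixed overhead can show up as a fractional latency: if a run of N dependent FMAs costs `overhead + N * latency` cycles, dividing the total by N overestimates the latency, while the slope between two chain lengths cancels the overhead. The Python sketch below uses made-up numbers (a 4-cycle FMA and 64 cycles of overhead); it is a toy model, not the Beyond3D test, and says nothing about what GCN actually does.

```python
def measured_cycles(chain_length: int, latency: int, overhead: int) -> int:
    """Toy model: total cycles for a chain of dependent FMAs plus fixed overhead."""
    return overhead + chain_length * latency


def naive_estimate(total_cycles: int, chain_length: int) -> float:
    """Single-run estimate; the overhead is smeared across the chain."""
    return total_cycles / chain_length


def slope_estimate(cycles_a: int, len_a: int, cycles_b: int, len_b: int) -> float:
    """Two-run estimate; the fixed overhead cancels in the difference."""
    return (cycles_b - cycles_a) / (len_b - len_a)


# Hypothetical hardware: 4-cycle FMA latency, 64 cycles of fixed overhead.
short = measured_cycles(128, latency=4, overhead=64)   # 576 cycles
long_ = measured_cycles(256, latency=4, overhead=64)   # 1088 cycles

print(naive_estimate(short, 128))              # 4.5
print(slope_estimate(short, 128, long_, 256))  # 4.0
```

In this model the naive error is `overhead / N`, so lengthening the chain is one way of amortising the overhead by running longer.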

I'm tempted to rewrite some of these tests in Vulkan for Windows-only and open source the whole thing; I don't really want to spend too much time on graphics-only workloads as it's not my focus anymore (mostly deep learning), but Vulkan is arguably the best way to analyse compute in a vendor-independent way too (since NVIDIA has never taken OpenCL very seriously) so maybe it wouldn't be wasted effort.

firstminion · Dec 21, 2017

CSI PC said:
You do not see the appeal of a dense populated 150W solution from Nvidia competing against other products in the next 12 months?
Which current product do you see matching this?

Sure it will be superseded by the next generation from Nvidia, but then so is every generation.

The profit margin/costs-logistics/R&D probably makes more sense for Nvidia to continue with half-length 150W GPUs going forward that target more of the DL aspects where clients do not require the full hybrid mixed-precision implementation; though there is a large market of HPC-science that require the full hybrid and especially so as AI-DL matures.
There is the Tegra solution, who knows what will happen down the line with ARM tech as a server solution as Nvidia has never given up that HPC research.

I see the appeal but expect a denser solution in mid-term. Probably an in-house solution from google in the next 18-mo and one from nvidia in 2-years.

Arun · Dec 21, 2017

firstminion said:
I see the appeal but expect a denser solution in mid-term. Probably an in-house solution from google in the next 18-mo and one from nvidia in 2-years.

These parts aren't horribly cost-sensitive, I don't think density matters as much as power efficiency (of course it matters, but you're *much* better off with 25% better power efficiency rather than 50% lower cost, I think). Simply removing a lot of the graphics-related bits wouldn't improve power efficiency that much; it's all power gated or at least clock gated under deep learning workloads.

The real question is whether you can achieve *much* higher efficiency by creating a completely different kind of architecture than GPUs (e.g. Graphcore... or groq, the ex-Google TPU engineers start-up, claims 16x higher power efficiency than NVIDIA on their teaser webpage: https://groq.com/ - I'm guessing that's 7nm/FP8 rather than 12nm/FP16 though...)

GPUs are cool, but they aren't the answer to everything. I'd definitely like to write some articles exploring the design space and what the various trade-offs are if you create AI hardware from scratch, and whether the benefit is big enough to justify or if GPUs might still win by being able to address a wider market... hopefully will get around to writing that sooner than later

Ryan Smith · Dec 21, 2017

Arun said:
Fair enough! I wrote that test several years ago, in retrospect I wish a lot of these tests were auto-tuning to minimise overhead, it can be done in the framework but not in a very nice way sadly.

I'm tempted to rewrite some of these tests in Vulkan for Windows-only and open source the whole thing; I don't really want to spend too much time on graphics-only workloads as it's not my focus anymore (mostly deep learning), but Vulkan is arguably the best way to analyse compute in a vendor-independent way too (since NVIDIA has never taken OpenCL very seriously) so maybe it wouldn't be wasted effort.

You should totally do that. A new Vulkan-derived benchmark collection would be fantastic!

psurge · Dec 21, 2017

http://vathys.ai is another one - there is some detail about what they are doing at http://web.stanford.edu/class/ee380/Abstracts/171206-slides.pdf.

Arun · Dec 22, 2017

Yep I watched that video from Vathys a few days ago - smart guy, but typical excessively optimistic claims in hope of getting VC funding (it's not been properly funded yet AFAIK - it's effectively a Y Combinator project). Their claims rely on several near-magical circuit-level technologies working perfectly, despite the very small size of their team at the moment.

I practically rolled my eyes when they were talking about their 5x denser on-chip memory than SRAM. There's a very long history of people inventing things like that and failing to productise it...

Anyway to get back to the thread at hand - what I'm saying is there might be a lot of opportunity to make deep learning HW that's more efficient than a GPU, but just removing the graphics bits of Volta won't really achieve that, as they're already powered off for deep learning. And the HPC-specific bits allow them to reuse the same chip for those markets, so it wouldn't make much financial sense to remove those either.

NVIDIA is certainly willing to design specialised hardware for deep learning - they've already done so for inference with NVDLA, originally in Jetson, and now open sourced(!) - who knows whether they'll do the same for training, or if they'll just add more and more bits onto the GPU to accelerate it.

CSI PC · Dec 22, 2017

Arun said:
These parts aren't horribly cost-sensitive, I don't think density matters as much as power efficiency (of course it matters, but you're *much* better off with 25% better power efficiency rather than 50% lower cost, I think). Simply removing a lot of the graphics-related bits wouldn't improve power efficiency that much; it's all power gated or at least clock gated under deep learning workloads.

The real question is whether you can achieve *much* higher efficiency by creating a completely different kind of architecture than GPUs (e.g. Graphcore... or groq, the ex-Google TPU engineers start-up, claims 16x higher power efficiency than NVIDIA on their teaser webpage: https://groq.com/ - I'm guessing that's 7nm/FP8 rather than 12nm/FP16 though...)

GPUs are cool, but they aren't the answer to everything. I'd definitely like to write some articles exploring the design space and what the various trade-offs are if you create AI hardware from scratch, and whether the benefit is big enough to justify or if GPUs might still win by being able to address a wider market... hopefully will get around to writing that sooner than later

It also comes down to how much they can deviate from their traditional modern CUDA architecture in terms of the GPC/SM/cache/register aspects relating to instructions, etc.; there is probably a very good reason why there are only 8 Tensor Cores per SM.
Then there is also the massive complexity of how to incorporate that into the CUDA ecosystem if they radically deviate from said architecture; it would require multiple different solutions to be learnt by scientists and devs who are just getting to grips with the WMMA API and the CUTLASS/CUDA libraries using Tensor Cores.
This is further exacerbated by the fact that you still want ecosystem synergy with a key segment, that being Nvidia Tegra, or more specifically products such as the Xavier SoC (Volta CUDA cores), which is to be used out in the world for a very large and diverse range of opportunities, from autonomous driving to transport or warehouse automation, amongst other scenarios.

If an HPC client has to decide between multiple DGX-1 nodes scaled out that have great performance and flexibility, or separate dedicated DL training pure Tensor accelerator/separate inferencing pure Tensor accelerator/dedicated mixed-precision accelerator nodes/different ecosystems, then some hard decisions have to be made from an ecosystem-performance-power demand and also space perspective.

I remember the Google engineers also saying the TPU2 was much more efficient than Volta but that turned out not to be true going by a lot of the discussions around, just saying is all.
Nvidia's design is very performance efficient even as is, and we have not seen the figures for the 150W version that still has some flexibility (if it comes to market), but yeah there are other competitors emerging but not sure they have the same level of resources/influence as Intel/Nvidia.
Sort of reminds me how ARM has yet to make it in the HPC segment even though for years they have been touted as being a challenger, maybe nextgen Fujitsu but even then we will have to see how that pans out.

Arun · Dec 22, 2017

I think NVIDIA would try to support cuDNN but not CUDA and none of the other APIs if they made a deep learning-only accelerator. Again I'm skeptical it makes sense for them to do that, but it's a sensible option without breaking compatibility with existing frameworks too much...

I really want to write a longer article about all this but basically... I think the largest inefficiency for GPUs in deep learning is that GPUs are highly optimised for hiding memory latency, because prefetch is typically impossible for graphics workloads. Currently, deep learning is more a DSP-like workload, where it is possible to prefetch all of the data you need in advance into a SRAM buffer (think Cell SPEs if you're not familiar with modern DSPs).
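
A rough way to see what the SRAM prefetch buys you: with two buffers, the DMA for the next tile overlaps with compute on the current one, so the steady-state cost per tile is the larger of the two rather than their sum. The Python sketch below is a toy timing model under assumed conditions (one DMA engine, one compute unit, two buffers, made-up per-tile costs); it does not describe any particular DSP or GPU.

```python
def serial_time(tiles: int, load: int, compute: int) -> int:
    """No overlap: every tile is loaded, then computed, back to back."""
    return tiles * (load + compute)


def double_buffered_time(tiles: int, load: int, compute: int) -> int:
    """Two buffers: the load of tile i+1 overlaps with the compute of tile i."""
    load_done = [0] * tiles
    compute_done = [0] * tiles
    for i in range(tiles):
        prev_load = load_done[i - 1] if i >= 1 else 0
        # The buffer for tile i was last used by tile i-2; wait until it is free.
        buffer_free = compute_done[i - 2] if i >= 2 else 0
        load_done[i] = max(prev_load, buffer_free) + load
        prev_compute = compute_done[i - 1] if i >= 1 else 0
        compute_done[i] = max(load_done[i], prev_compute) + compute
    return compute_done[-1] if tiles else 0


# Hypothetical costs: 3 time units to DMA a tile, 5 to multiply it.
print(serial_time(4, load=3, compute=5))           # 32
print(double_buffered_time(4, load=3, compute=5))  # 23
```

In this model, once the first load is done the pipeline is limited only by whichever of load and compute is slower, which is why the accesses being known in advance matters so much.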

The interesting thing is that highly optimised assembly-based matrix multiplication kernels on GPUs tend to only have 1 to 4 warps in flight per scheduler (typically 2 IIRC, rather than the maximum 16/scheduler). The kernels are optimised to extract a massive amount of ILP, which GPU compilers aren't smart enough to do on their own (this is a *hard* problem from having worked a bit on it as a hobby while at PowerVR). I'd like to know if that's still the case with the tensor cores, or if NVIDIA is using more warps this time, but I suspect it's something like 2 warps/scheduler - one of the many things I'd like to figure out once I have a card...

But really because the memory accesses are perfectly known in advance, you don't need any of that. You could just have 1 thread with 1 big block of SRAM and a matrix-multiply unit, and with good assembly, you'd achieve excellent efficiency. You don't need any dynamic scheduling between threads or anything... Looking at TPU2 diagrams, it looks suspiciously like an old-school single-threaded DSP with a big pool of SRAM and a 128x128 matrix multiplication accelerator (scheduled by the DSP, and reading from the SRAM directly).

I think GPUs can close the gap if the HW engineers optimise power efficiency for those extreme cases, rather than tuning the implementation around gaming workloads with completely different characteristics - e.g. as an extreme case, use a completely distinct HW path if you only have 1 warp per scheduler, and power gate the other scheduling logic! Not a very good example, but you get the idea. There's still one downside though: by having all this idle logic, you've increased the wire distance between the logic you do use, so you've effectively reduced power efficiency by increasing wire distance (lower locality)... I'm not convinced that's a very big effect, but it would be one reason why an AI-only Volta might have a power advantage vs V100.

The other question is whether AI is really nothing but 8-to-32-bit matrix multiplies. I'm skeptical myself - so there's definitely a risk to making an overly specialised accelerator if the algorithms change significantly! And then it becomes a trade-off between flexibility and performance, just starting from scratch again in a few years if needed...

Frenetic Pony · Dec 22, 2017

Arun said:
These parts aren't horribly cost-sensitive, I don't think density matters as much as power efficiency (of course it matters, but you're *much* better off with 25% better power efficiency rather than 50% lower cost, I think). Simply removing a lot of the graphics-related bits wouldn't improve power efficiency that much; it's all power gated or at least clock gated under deep learning workloads.

The real question is whether you can achieve *much* higher efficiency by creating a completely different kind of architecture than GPUs (e.g. Graphcore... or groq, the ex-Google TPU engineers start-up, claims 16x higher power efficiency than NVIDIA on their teaser webpage: https://groq.com/ - I'm guessing that's 7nm/FP8 rather than 12nm/FP16 though...)

GPUs are cool, but they aren't the answer to everything. I'd definitely like to write some articles exploring the design space and what the various trade-offs are if you create AI hardware from scratch, and whether the benefit is big enough to justify or if GPUs might still win by being able to address a wider market... hopefully will get around to writing that sooner than later

The answer to this depends entirely on the kind of NN being trained. Image recognition isn't going to get much better on specialized hardware. The standard neural net pattern for it fits incredibly close to the way the neurons in our eyes work, which is all spatial associativity, which is exactly what GPUs are designed to do. Other things might benefit though, recurrent neural nets might need larger caches than GPUs have to store and access previous temporal results more quickly. Hell here's the summary of a paper about, almost, exactly that. Though in the paper they store the relevant information in a memristor "reservoir" that I assume is just on chip.

Unfortunately that's not what Volta does. In fact I'd compare Volta to Sony's bullshit it tried pulling over a decade ago when they were fairly ascendant across a lot of electronics stuff. They tried making expensive locked in standards to abuse their monopoly, and it failed as competitors went in different directions. Volta's need for exclusive development lock in on already expensive hardware feels like the same thing, if not worse.

It's not a bet I'd be willing to make if I were a large company, dev time is already the most starved and bottlenecked thing for AI development. Buying a ton of chips might be expensive, but it's just not as expensive as potentially wasting years of multi million dollar salaries developing specifically for an architecture that could suddenly not exist, or get superseded all of a sudden. Much rather stick to more open, standardized things in the expectation that portability will be superior.

Arun · Dec 22, 2017

Frenetic Pony said:
The standard neural net pattern for it fits incredibly close to the way the neurons in our eyes work, which is all spatial associativity, which is exactly what GPUs are designed to do.

Not sure what you mean by "all spatial associativity" - yes, there are deep similarities with the human visual system, but the way it's implemented is quite different - it's still basically linear algebra on very large matrices/tensors. GPUs were not designed for that kind of workload, they were originally designed for pixel shaders which are a mix of scalar and small vector instructions with dependent texture instructions where the address depends on the previous computations. There's nothing like that in deep learning at the moment...

Other things might benefit though, recurrent neural nets might need larger caches than GPUs have to store and access previous temporal results more quickly.

The way GPUs work at the moment is you can reasonably expect to get all the activations from external memory - for many (not all) workloads the datasets are too large relative to the cache to hit much. What the L2 cache is mostly good for is improving reuse between different SMs reading the same data inside a single layer.

Hell here's the summary of a paper about, almost, exactly that. Though in the paper they store the relevant information in a memristor "reservoir" that I assume is just on chip.

I really like the idea of using memristors for AI - it feels like a very good fit to some things, and I'm still reasonably excited about memristors despite their relative lack of progress (vs early claims). But that summary seems to imply 91% accuracy on MNIST which isn't very impressive... It's more about something "good enough" at very low cost than a state of the art solution.

Unfortunately that's not what Volta does. In fact I'd compare Volta to Sony's bullshit it tried pulling over a decade ago when they were fairly ascendant across a lot of electronics stuff. They tried making expensive locked in standards to abuse their monopoly, and it failed as competitors went in different directions. Volta's need for exclusive development lock in on already expensive hardware feels like the same thing, if not worse.

It's not a bet I'd be willing to make if I were a large company, dev time is already the most starved and bottlenecked thing for AI development. Buying a ton of chips might be expensive, but it's just not as expensive as potentially wasting years of multi million dollar salaries developing specifically for an architecture that could suddenly not exist, or get superseded all of a sudden. Much rather stick to more open, standardized things in the expectation that portability will be superior.

Maybe I'm misunderstanding what you're trying to say, but it sounds wrong to me. All of the deep learning frameworks use NVIDIA's cuDNN framework (which is mostly hand-written assembly by NVIDIA). It was the first non-CPU deep learning API supported by a wide variety of frameworks, and NVIDIA was deeply involved with the framework developers to add support for it.

NVIDIA basically applied the same strategy they had with "The Way It's Meant To Be Played" by spending their own engineering resources to help those framework developers - that's why they have widespread support and nobody else does. Some of the newer frameworks like newer TensorFlow and MXNet support systems that allow third party HW vendors to more easily add their own acceleration - but that wasn't the case back then. There's an argument this is unfair in the same way TWIMTBP was unfair, but your claims about wasting years of multi million dollar salaries developing specifically for Volta feels completely implausible to me.

silent_guy · Dec 22, 2017

Frenetic Pony said:
Unfortunately that's not what Volta does. In fact I'd compare Volta to Sony's bullshit it tried pulling over a decade ago when they were fairly ascendant across a lot of electronics stuff. They tried making expensive locked in standards to abuse their monopoly, and it failed as competitors went in different directions. Volta's need for exclusive development lock in on already expensive hardware feels like the same thing, if not worse.

It’s been a while since I last looked at the cuDNN documentation, but back then it was just a collection of pretty high level API calls with nothing that was particularly linked to a GPU architecture.

In other words: relatively easy for others to duplicate for their own architecture. I suspect that this is exactly what AMD will do for their cuDNN competitor that will be released real soon now.

The real value is in the low level optimization of those API calls, but that’s something that would be hard to make portable to begin with ... if you don’t want to give up some performance.

And I just don’t see how Volta adds a lot of highly specialized lock-in features. A matrix multiplication is about as generic as it gets.

Anarchist4000 · Dec 22, 2017

CSI PC said:
Nvidia's design is very performance efficient even as is, and we have not seen the figures for the 150W version that still has some flexibility (if it comes to market), but yeah there are other competitors emerging but not sure they have the same level of resources/influence as Intel/Nvidia.

There isn't a lot of room to make it much more efficient anyways. Keep in mind this is stupidly simple straight line math with perfect predictability. More efficient means as closely tailored to the matrices involved and more efficient process tech. The trade-off is what other capabilities get added to make it more versatile. I wouldn't put it past a large company rolling their own ASIC once R&D figured out their algorithm.

Arun said:
Currently, deep learning is more a DSP-like workload, where it is possible to prefetch all of the data you need in advance into a SRAM buffer (think Cell SPEs if you're not familiar with modern DSPs).

Could prefetch, but it seems more likely sufficient bandwidth or capacity wouldn't exist. Warps are probably a function of matrix size. Scheduled in parallel until all accumulators are expended. Using a 64 wide SIMD where DSPs couldn't clock high enough to maintain that throughput.

CSI PC · Dec 22, 2017

Anarchist4000 said:
There isn't a lot of room to make it much more efficient anyways. Keep in mind this is stupidly simple straight line math with perfect predictability. More efficient means as closely tailored to the matrices involved and more efficient process tech. The trade-off is what other capabilities get added to make it more versatile. I wouldn't put it past a large company rolling their own ASIC once R&D figured out their algorithm.

Not sure how you can simplify it that much and correlate how much power/efficiency a design has.
Would be like saying all GPUs have same power-efficiency with their FMA cores using similar functions.

One major advantage with the Nvidia approach is how they manage to achieve comparative accuracy to FP32 with the FP16 Tensor operations for training, which requires several steps and an architecture able to support it effectively (linked in the past the Baidu paper showing real world implementation and results).
The downside is the required tweaking of the loss scaling; broader real world tests are needed, and competitors may find unique solutions applicable to their own architectures.
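
For anyone unfamiliar with why loss scaling is needed at all: small gradient values underflow to zero in FP16, so the loss (and therefore every gradient) is multiplied by a constant before the FP16 step and divided back out in higher precision. The Python sketch below only demonstrates the underflow and the recovery using the standard library's half-precision packing; the gradient value and the scale factor of 1024 are assumptions chosen for illustration, and this is not Nvidia's or Baidu's training recipe.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical scale factor; in practice this needs tuning
true_grad = 1e-8     # hypothetical small gradient, below FP16's smallest subnormal

# Without scaling, the gradient is lost entirely when stored in FP16.
unscaled = to_fp16(true_grad)
print(unscaled)  # 0.0

# With scaling, the FP16 value is non-zero; unscale in higher precision afterwards.
scaled = to_fp16(true_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE
print(recovered)  # close to 1e-8, with a small rounding error
```

Too small a scale still underflows and too large a scale overflows the larger gradients, which is where the tweaking comes in.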

Arun · Dec 22, 2017

Anarchist4000 said:
Could prefetch, but it seems more likely sufficient bandwidth or capacity wouldn't exist. Warps are probably a function of matrix size. Scheduled in parallel until all accumulators are expended. Using a 64 wide SIMD where DSPs couldn't clock high enough to maintain that throughput.

Sorry, maybe I'm misunderstanding what you're trying to say, but yes - in addition to the memory latency, for a typical SIMD unit, you need "aggregate SIMD width * ALU latency" data elements running in parallel to hide the ALU latency.

AMD GPUs do that (effectively - it's a bit more complicated) with thread-level parallelism. NVIDIA GPUs do it with a mix of thread-level and instruction-level parallelism. DSPs do it with just instruction-level parallelism traditionally. Effectively you DMA into SRAM and double/triple/... buffer to hide memory latency, then use ILP to hide ALU/accelerator latency.
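
Putting rough numbers on that rule: the independent work needed is the SIMD width times the ALU latency, and it can come from more warps, more independent instructions per warp, or both. The Python sketch below assumes a simplified scheduler that issues one full-width instruction per cycle, and the width of 32 and latency of 4 cycles are hypothetical values, not measured figures for any GPU.

```python
import math


def elements_in_flight(simd_width: int, alu_latency: int) -> int:
    """Independent data elements needed to keep the ALUs busy every cycle."""
    return simd_width * alu_latency


def ilp_per_warp(alu_latency: int, warps_per_scheduler: int) -> int:
    """Independent instructions each warp must expose if the warps share the load.

    Assumes one instruction issued per cycle and round-robin between warps.
    """
    return math.ceil(alu_latency / warps_per_scheduler)


# Hypothetical machine: 32-wide SIMD, 4-cycle ALU latency.
print(elements_in_flight(32, 4))  # 128

# Pure thread-level parallelism: plenty of warps, no ILP required.
print(ilp_per_warp(4, 16))  # 1
# Mixed: 2 warps per scheduler, each needs 2 independent instructions.
print(ilp_per_warp(4, 2))   # 2
# DSP-style: a single thread has to supply all of it as ILP.
print(ilp_per_warp(4, 1))   # 4
```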

Anarchist4000 · Dec 22, 2017

CSI PC said:
Not sure how you can simplify it that much and correlate how much power/efficiency a design has.
Would be like saying all GPUs have same power-efficiency with their FMA cores using similar functions.

Using similar process, precision, and functions they would be roughly the same. Layouts for core functions are fairly well established. It's when the design scales with networking and scheduling differences can appear. In the case of Tensors the designs are exceedingly simple. That's part of why so many companies are experimenting. Performing 256 consecutive multiplications with no dependencies.

Arun said:
Sorry, maybe I'm misunderstanding what you're trying to say, but yes - in addition to the memory latency, for a typical SIMD unit, you need "aggregate SIMD width * ALU latency" data elements running in parallel to hide the ALU latency.

I'm saying the design will be memory bandwidth bound very easily and there would be very little reuse of data. Overlap more a matter of conserving die space. The most efficient design a single cascaded (FP16+FP32) SIMD sized for the largest matrix encountered. Or less precision in the case of TPU. As you mentioned a DSP would work well, but not have the throughput necessary without a lot of them. The design is sized where there are enough flops to cover bandwidth. Nothing bursty like most code. Latency less of a concern as the flow is highly predictable. Complicated scheduling just doesn't exist, and I'd imagine throughput decreases substantially with smaller matrices. No packing or mechanisms to improve utilization.

Deleted member 2197 · Dec 23, 2017

CSI PC · Dec 23, 2017

Anarchist4000 said:
Using similar process, precision, and functions they would be roughly the same. Layouts for core functions are fairly well established. It's when the design scales with networking and scheduling differences can appear. In the case of Tensors the designs are exceedingly simple. That's part of why so many companies are experimenting. Performing 256 consecutive multiplications with no dependencies.

Eh it is far from simple, even Google at a high level has mentioned that and pretty clear with the papers released.
Also worth noting there is a large discrepancy between TPU2 and V100 in terms of performance/efficiency; bear in mind the TPU2 is 4xTensor processors on a board and possibly equivalent would be at least 2xV100 or if focusing on DL with less mixed-precision flexibility 3-4x half length 150W V100 - and the TPU2 matches your criteria of a simple matrix multiplier coprocessor/accelerator.
Also you cannot distill electronics engineering into such a simplified process; even with compute the difference between FMA cores design doing 'simple' maths with AMD and Nvidia is massive in terms of efficiency.
That said simple metrics are not the way to really measure these accelerators.

If it was that simple to design and easy to do efficiently, then you better start seriously complaining about AMD not implementing this in the MI25 or Navi, or a similar product possibly aligned with Epyc.
With all the resources Intel has, where is their Tensor accelerator if it is simple to do? - They had to buy Nervana, which looks like the recent announced NNP product still needs to be evolved to catch up with Google and Nvidia.
There is a difference between doing this in theory and a scalable real world design.

Deleted member 2197 · Dec 26, 2017

Frenetic Pony · Dec 27, 2017

Arun said:
Not sure what you mean by "all spatial associativity" - yes, there are deep similarities with the human visual system, but the way it's implemented is quite different - it's still basically linear algebra on very large matrices/tensors. GPUs were not designed for that kind of workload, they were originally designed for pixel shaders which are a mix of scalar and small vector instructions with dependent texture instructions where the address depends on the previous computations. There's nothing like that in deep learning at the moment...

The logical connections for the human eye neurons, linking between the receptors and the brain, look almost exactly like any basic neural net layer visualization you'd see today. They are so close you'd mistake the two if you didn't pay attention. It's literally layers of neighboring neurons sending signals to the next layer up and so on and so on based on received signals. Just suggesting the reason GPUs happened to be better than CPUs is because GPUs were designed to work on producing visuals, which are processed by a set of real neurons hooked up in tight neighboring layers. So running a set of virtual neurons in tight neighboring layers is just doing something similar enough that a GPU ended up being pretty good at that as well.

This is especially true with training NNs. For deploying you can sit there and do your giant matrix calculations with specialized equipment much easier, but training NNs works so well on GPUs that major companies are happy to keep buying them instead of creating specialized chips (at least deploying NNs was the priority over replacing GPUs for training). But that's only true for today, as neural nets today, as mentioned, look (organizationally) a lot like the visual neurons we have attached to our eyes. Initial research suggests our brains look rather different in how the connections are laid out. And this could be why neural nets today can do better than humans at recognizing images, yet spend hundreds of thousands of hours and they still can't drive to save its (or our) life. And why you need something beyond a basic NN, like recurrent networks, to translate speech, etc. IE neural nets can do better than trained doctors at spotting problems in scans, but can't drive with a thousand sensors while a 16 year old (hopefully, it's half a joke) can drive with just two eyes.

The way GPUs work at the moment is you can reasonably expect to get all the activations from external memory - for many (not all) workloads the datasets are too large relative to the cache to hit much. What the L2 cache is mostly good for is improving reuse between different SMs reading the same data inside a single layer.

I really like the idea of using memristors for AI - it feels like a very good fit to some things, and I'm still reasonably excited about memristors despite their relative lack of progress (vs early claims). But that summary seems to imply 91% accuracy on MNIST which isn't very impressive... It's more about something "good enough" at very low cost than a state of the art solution.

Maybe I'm misunderstanding what you're trying to say, but it sounds wrong to me. All of the deep learning frameworks use NVIDIA's cuDNN framework (which is mostly hand-written assembly by NVIDIA). It was the first non-CPU deep learning API supported by a wide variety of frameworks, and NVIDIA was deeply involved with the framework developers to add support for it.

NVIDIA basically applied the same strategy they had with "The Way It's Meant To Be Played" by spending their own engineering resources to help those framework developers - that's why they have widespread support and nobody else does. Some of the newer frameworks like newer TensorFlow and MXNet support systems that allow third party HW vendors to more easily add their own acceleration - but that wasn't the case back then. There's an argument this is unfair in the same way TWIMTBP was unfair, but your claims about wasting years of multi million dollar salaries developing specifically for Volta feels completely implausible to me.

I'm basically arguing that Nvidia has added a lot of features that are, and may always be, only for Nvidia. AMD is trying to get into the market (And failing so far), and Intel's GPUs, and other chips, will probably be based heavily on training neural nets. But they probably also won't have the exact features put in by Nvidia so any work you do for Volta, any research for it, is simply non transferable. And Nvidia's CUDA libraries were even worse. You got to set up Neural Nets faster, but hey now you're locked into Nvidia (they hope) too bad you can't ever run those programs on anyone else's hardware! Come buy our $5k card because you spent all those months writing for our own tech and have no choice! The point is simple, Nvidia didn't build their libraries out of kindness. They did it because they gambled they'd make more money off it, by locking people in, than it cost in the first place.

Nvidia didn't make anything you couldn't with OpenCL, they just made it exclusive to them and tried to lure you into their proprietary ecosystem. And that spells nothing but trouble, it's what Sony used to do. Buy products that only work with other Sony products! It's what Apple and Android both did, or tried to, with their apps, you'll hesitate to switch if you've invested hundreds in apps that suddenly won't work anymore (not that anyone buys apps other than games anymore, and those are F2P so who cares). Point is, they lure you in by making it seem easy, then trap you by trying to lock all the work you've done exclusively to their hardware.
