Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    611
    Likes Received:
    1,052
    Location:
    PCIe x16_1
    Please by all means drop me a line. We strive for technical accuracy.:)

    Correct. 2560 by 2048 by 16384.

    I agree half a cycle is very unlikely. But our tests report that half cycle very consistently, so at least for the moment we're sticking with it until we can better determine why it's so adamantly in there.
     
    Lightman, Bludd, Arun and 1 other person like this.
  2. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Fair enough! I wrote that test several years ago; in retrospect I wish a lot of these tests were auto-tuning to minimise overhead. It can be done in the framework, but sadly not in a very nice way.

    I'm tempted to rewrite some of these tests in Vulkan for Windows-only and open source the whole thing; I don't really want to spend too much time on graphics-only workloads as it's not my focus anymore (mostly deep learning), but Vulkan is arguably the best way to analyse compute in a vendor-independent way too (since NVIDIA has never taken OpenCL very seriously) so maybe it wouldn't be wasted effort.
     
  3. firstminion

    Newcomer

    Joined:
    Aug 7, 2013
    Messages:
    217
    Likes Received:
    46
    I see the appeal but expect a denser solution in the mid-term. Probably an in-house solution from Google in the next 18 months and one from NVIDIA in 2 years.
     
  4. Arun

    These parts aren't horribly cost-sensitive, I don't think density matters as much as power efficiency (of course it matters, but you're *much* better off with 25% better power efficiency rather than 50% lower cost, I think). Simply removing a lot of the graphics-related bits wouldn't improve power efficiency that much; it's all power gated or at least clock gated under deep learning workloads.

    The real question is whether you can achieve *much* higher efficiency by creating a completely different kind of architecture than GPUs (e.g. Graphcore... or Groq, the ex-Google TPU engineers' start-up, which claims 16x higher power efficiency than NVIDIA on their teaser webpage: https://groq.com/ - I'm guessing that's 7nm/FP8 rather than 12nm/FP16 though...)

    GPUs are cool, but they aren't the answer to everything. I'd definitely like to write some articles exploring the design space and what the various trade-offs are if you create AI hardware from scratch, and whether the benefit is big enough to justify or if GPUs might still win by being able to address a wider market... hopefully will get around to writing that sooner than later :)
     
  5. Ryan Smith

    You should totally do that. A new Vulkan-derived benchmark collection would be fantastic!
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    xpea and Dayman1225 like this.
  7. Arun

    Yep, I watched that video from Vathys a few days ago - smart guy, but typical excessively optimistic claims in hope of getting VC funding (it's not been properly funded yet AFAIK - it's effectively a Y Combinator project). Their claims rely on several near-magical circuit-level technologies working perfectly, despite the very small size of their team at the moment :( I practically rolled my eyes when they were talking about their 5x denser on-chip memory than SRAM. There's a very long history of people inventing things like that and failing to productise them...

    Anyway to get back to the thread at hand - what I'm saying is there might be a lot of opportunity to make deep learning HW that's more efficient than a GPU, but just removing the graphics bits of Volta won't really achieve that, as they're already powered off for deep learning. And the HPC-specific bits allow them to reuse the same chip for those markets, so it wouldn't make much financial sense to remove those either.

    NVIDIA is certainly willing to design specialised hardware for deep learning - they've already done so for inference with NVDLA, originally in Jetson, and now open sourced(!) - who knows whether they'll do the same for training, or if they'll just add more and more bits onto the GPU to accelerate it.
     
    xpea likes this.
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It also comes down to how much they can deviate from their traditional modern CUDA architecture in terms of the GPC/SM/cache/register aspects relating to instructions etc.; there is probably a very good reason why there are only 8 Tensor Cores per SM.
    Then there is also the massive complexity of how to incorporate that into the CUDA ecosystem: if they radically deviate from said architecture, it would require multiple different solutions to be learnt by scientists and devs who are just getting to grips with the WMMA API and the CUTLASS/CUDA libraries using Tensor Cores.
    This is further exacerbated by the fact that you still want ecosystem synergy with a key segment, namely Nvidia Tegra, or more specifically products such as the Xavier SoC (Volta CUDA cores), which is to be used out in the world for a very large and diverse range of opportunities, from autonomous driving to transport or warehouse automation amongst other scenarios.

    If an HPC client has to decide between multiple scaled-out DGX-1 nodes that have great performance and flexibility, or separate dedicated DL training pure Tensor accelerators / separate inferencing pure Tensor accelerators / dedicated mixed-precision accelerator nodes / different ecosystems, then some hard decisions have to be made from an ecosystem, performance, power demand and also space perspective.

    I remember the Google engineers also saying the TPU2 was much more efficient than Volta, but that turned out not to be true going by a lot of the discussions around; just saying, is all.
    Nvidia's design is very performance-efficient even as is, and we have not seen the figures for the 150W version that still has some flexibility (if it comes to market). Yes, there are other competitors emerging, but I'm not sure they have the same level of resources/influence as Intel/Nvidia.
    It sort of reminds me of how ARM has yet to make it in the HPC segment even though for years it has been touted as a challenger; maybe next-gen Fujitsu, but even then we will have to see how that pans out.
     
    #988 CSI PC, Dec 22, 2017
    Last edited: Dec 22, 2017
    firstminion and Arun like this.
  9. Arun

    I think NVIDIA would try to support cuDNN but not CUDA and none of the other APIs if they made a deep learning-only accelerator. Again I'm skeptical it makes sense for them to do that, but it's a sensible option without breaking compatibility with existing frameworks too much...

    I really want to write a longer article about all this but basically... I think the largest inefficiency for GPUs in deep learning is that GPUs are highly optimised for hiding memory latency, because prefetch is typically impossible for graphics workloads. Currently, deep learning is more a DSP-like workload, where it is possible to prefetch all of the data you need in advance into a SRAM buffer (think Cell SPEs if you're not familiar with modern DSPs).

    The interesting thing is that highly optimised assembly-based matrix multiplication kernels on GPUs tend to only have 1 to 4 warps in flight per scheduler (typically 2 IIRC, rather than the maximum 16/scheduler). The kernels are optimised to extract a massive amount of ILP, which GPU compilers aren't smart enough to do on their own (this is a *hard* problem from having worked a bit on it as a hobby while at PowerVR). I'd like to know if that's still the case with the tensor cores, or if NVIDIA is using more warps this time, but I suspect it's something like 2 warps/scheduler - one of the many things I'd like to figure out once I have a card...

    But really because the memory accesses are perfectly known in advance, you don't need any of that. You could just have 1 thread with 1 big block of SRAM and a matrix-multiply unit, and with good assembly, you'd achieve excellent efficiency. You don't need any dynamic scheduling between threads or anything... Looking at TPU2 diagrams, it looks suspiciously like an old-school single-threaded DSP with a big pool of SRAM and a 128x128 matrix multiplication accelerator (scheduled by the DSP, and reading from the SRAM directly).
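
    To make the "1 thread, 1 big block of SRAM and a matrix-multiply unit" idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a description of how the TPU2 or any real DSP works: `load_tiles` stands in for a DMA transfer, `matmul_unit` stands in for the fixed-function multiplier, and all names and the tile size are hypothetical. The point is that the whole schedule is known up front, so the next tiles can be fetched into a second buffer while the current ones are being multiplied.

```python
def tiled_matmul_double_buffered(A, B, tile=2):
    """C = A @ B for square list-of-lists matrices whose size is a multiple of `tile`."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]

    def load_tiles(i0, j0, k0):
        # Hypothetical stand-in for a DMA transfer from external memory into SRAM.
        a = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
        b = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
        return a, b

    def matmul_unit(a, b, i0, j0):
        # Hypothetical stand-in for the matrix-multiply unit: C_tile += a @ b.
        for i in range(tile):
            for j in range(tile):
                C[i0 + i][j0 + j] += sum(a[i][k] * b[k][j] for k in range(tile))

    # The full access pattern is known before any data is touched.
    steps = [(i0, j0, k0)
             for i0 in range(0, n, tile)
             for j0 in range(0, n, tile)
             for k0 in range(0, n, tile)]

    buffers = [None, None]              # two SRAM buffers used in ping-pong fashion
    buffers[0] = load_tiles(*steps[0])  # prime the pipeline
    for s, (i0, j0, _) in enumerate(steps):
        if s + 1 < len(steps):
            # On real hardware this load would overlap with the compute below;
            # in this single-threaded sketch it simply runs first.
            buffers[(s + 1) % 2] = load_tiles(*steps[s + 1])
        a, b = buffers[s % 2]
        matmul_unit(a, b, i0, j0)
    return C


result = tiled_matmul_double_buffered([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1)
print(result)
```

    Note there is no dynamic scheduling anywhere in the loop: which buffer is loaded and which is consumed depends only on the step index.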

    I think GPUs can close the gap if the HW engineers optimise power efficiency for those extreme cases, rather than tuning the implementation around gaming workloads with completely different characteristics - e.g. as an extreme case, use a completely distinct HW path if you only have 1 warp per scheduler, and power gate the other scheduling logic! Not a very good example, but you get the idea. There's still one downside though: by having all this idle logic, you've increased the wire distance between the logic you do use, so you've effectively reduced power efficiency by increasing wire distance (lower locality)... I'm not convinced that's a very big effect, but it would be one reason why an AI-only Volta might have a power advantage vs V100.

    The other question is whether AI is really nothing but 8-to-32-bit matrix multiplies. I'm skeptical myself - so there's definitely a risk to making an overly specialised accelerator if the algorithms change significantly! And then it becomes a trade-off between flexibility and performance, just starting from scratch again in a few years if needed...
     
    #989 Arun, Dec 22, 2017
    Last edited: Dec 22, 2017
    nnunn, pharma, firstminion and 3 others like this.
  10. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    332
    Likes Received:
    87
    The answer to this depends entirely on the kind of NN being trained. Image recognition isn't going to get much better on specialized hardware. The standard neural net pattern for it fits incredibly close to the way the neurons in our eyes work, which is all spatial associativity, which is exactly what GPUs are designed to do. Other things might benefit though, recurrent neural nets might need larger caches than GPUs have to store and access previous temporal results more quickly. Hell here's the summary of a paper about, almost, exactly that. Though in the paper they store the relevant information in a memristor "reservoir" that I assume is just on chip.

    Unfortunately that's not what Volta does. In fact I'd compare Volta to the bullshit Sony tried pulling over a decade ago, when they were fairly ascendant across a lot of electronics stuff. They tried making expensive locked-in standards to abuse their monopoly, and it failed as competitors went in different directions. Volta's need for exclusive development lock-in on already expensive hardware feels like the same thing, if not worse.

    It's not a bet I'd be willing to make if I were a large company; dev time is already the most starved and bottlenecked thing for AI development. Buying a ton of chips might be expensive, but it's just not as expensive as potentially wasting years of multi-million-dollar salaries developing specifically for an architecture that could suddenly not exist, or get superseded all of a sudden. I'd much rather stick to more open, standardized things in the expectation that portability will be superior.
     
  11. Arun

    Not sure what you mean by "all spatial associativity" - yes, there are deep similarities with the human visual system, but the way it's implemented is quite different - it's still basically linear algebra on very large matrices/tensors. GPUs were not designed for that kind of workload, they were originally designed for pixel shaders which are a mix of scalar and small vector instructions with dependent texture instructions where the address depends on the previous computations. There's nothing like that in deep learning at the moment...

    The way GPUs work at the moment is you can reasonably expect to get all the activations from external memory - for many (not all) workloads the datasets are too large relative to the cache to hit much. What the L2 cache is mostly good for is improving reuse between different SMs reading the same data inside a single layer.

    I really like the idea of using memristors for AI - it feels like a very good fit to some things, and I'm still reasonably excited about memristors despite their relative lack of progress (vs early claims). But that summary seems to imply 91% accuracy on MNIST which isn't very impressive... It's more about something "good enough" at very low cost than a state of the art solution.

    Maybe I'm misunderstanding what you're trying to say, but it sounds wrong to me. All of the deep learning frameworks use NVIDIA's cuDNN library (which is mostly hand-written assembly by NVIDIA). It was the first non-CPU deep learning API supported by a wide variety of frameworks, and NVIDIA was deeply involved with the framework developers to add support for it.

    NVIDIA basically applied the same strategy they had with "The Way It's Meant To Be Played" by spending their own engineering resources to help those framework developers - that's why they have widespread support and nobody else does. Some of the newer frameworks like newer TensorFlow and MXNet support systems that allow third party HW vendors to more easily add their own acceleration - but that wasn't the case back then. There's an argument this is unfair in the same way TWIMTBP was unfair, but your claim about wasting years of multi-million-dollar salaries developing specifically for Volta feels completely implausible to me.
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    It’s been a while since I last looked at the cuDNN documentation, but back then it was just a collection of pretty high level API calls with nothing that was particularly linked to a GPU architecture.

    In other words: relatively easy for others to duplicate for their own architecture. I suspect that this is exactly what AMD will do for their cuDNN competitor that will be released real soon now.

    The real value is in the low level optimization of those API calls, but that’s something that would be hard to make portable to begin with ... if you don’t want to give up some performance.

    And I just don’t see how Volta adds a lot of highly specialized lock-in features. A matrix multiplication is about as generic as it gets.
     
  13. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    There isn't a lot of room to make it much more efficient anyways. Keep in mind this is stupidly simple straight-line math with perfect predictability. More efficient means tailoring the hardware as closely as possible to the matrices involved, plus more efficient process tech. The trade-off is what other capabilities get added to make it more versatile. I wouldn't put it past a large company to roll their own ASIC once R&D has figured out their algorithm.

    Could prefetch, but it seems more likely sufficient bandwidth or capacity wouldn't exist. Warps are probably a function of matrix size. Scheduled in parallel until all accumulators are expended. Using a 64 wide SIMD where DSPs couldn't clock high enough to maintain that throughput.
     
  14. CSI PC

    Not sure how you can simplify it that much and correlate it with how much power/efficiency a design has.
    It would be like saying all GPUs have the same power efficiency because their FMA cores use similar functions.

    One major advantage of the Nvidia approach is how they manage to achieve accuracy comparable to FP32 with the FP16 Tensor operations for training, which requires several steps and an architecture able to support it effectively (I linked the Baidu paper showing a real-world implementation and results in the past).
    The downside is the required tweaking of the loss scaling; broader real-world tests are needed, and competitors may find unique solutions applicable to their own architectures.
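
    As a rough illustration of why loss scaling is needed at all, here is a small Python sketch using only the standard library. It is a toy example, not Nvidia's or Baidu's implementation: the gradient value and the scale factor are made up, and `to_fp16` is a hypothetical helper that just rounds a number to half precision. Small gradients underflow to zero in FP16, whereas multiplying the loss (and therefore, by the chain rule, every gradient) by a constant keeps them representable until they are unscaled in higher precision.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 1e-8      # made-up gradient, below FP16's smallest subnormal (about 6e-8)
loss_scale = 1024.0   # made-up scale factor; in practice this is what needs tweaking

unscaled = to_fp16(true_grad)             # stored directly in FP16: underflows
scaled = to_fp16(true_grad * loss_scale)  # stored in FP16 after scaling: survives
recovered = scaled / loss_scale           # unscaled again in higher precision

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8
```

    Pick the scale too small and gradients still vanish; pick it too large and the bigger gradients overflow FP16's maximum of 65504, which is where the tweaking comes in.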
     
    #994 CSI PC, Dec 22, 2017
    Last edited: Dec 22, 2017
  15. Arun

    Sorry, maybe I'm misunderstanding what you're trying to say, but yes - in addition to the memory latency, for a typical SIMD unit, you need "aggregate SIMD width * ALU latency" data elements running in parallel to hide the ALU latency.

    AMD GPUs do that (effectively - it's a bit more complicated) with thread-level parallelism. NVIDIA GPUs do it with a mix of thread-level and instruction-level parallelism. DSPs do it with just instruction-level parallelism traditionally. Effectively you DMA into SRAM and double/triple/... buffer to hide memory latency, then use ILP to hide ALU/accelerator latency.
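
    The arithmetic behind that can be sketched in a few lines of Python. This is a deliberately simplified model with made-up numbers: it assumes one instruction issues per cycle per scheduler and that every instruction has the same latency, and the function names are hypothetical.

```python
import math


def elements_in_flight(simd_width, alu_latency_cycles):
    """Independent data elements needed to keep the ALUs busy every cycle."""
    return simd_width * alu_latency_cycles


def ilp_per_warp(alu_latency_cycles, warps_per_scheduler):
    """Independent instructions each warp must expose, given the warps available."""
    return math.ceil(alu_latency_cycles / warps_per_scheduler)


# Made-up numbers, for illustration only.
width, latency = 32, 4
print(elements_in_flight(width, latency))  # 128
print(ilp_per_warp(latency, 16))           # 1: threads alone hide the latency
print(ilp_per_warp(latency, 2))            # 2: the remainder has to come from ILP
print(ilp_per_warp(latency, 1))            # 4: DSP-style, ILP only
```

    With fewer warps in flight, the same number of elements has to be found inside each instruction stream instead, which is the job the hand-tuned assembly (or the DSP compiler) ends up doing.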
     
    pharma likes this.
  16. Anarchist4000

    Using similar process, precision, and functions they would be roughly the same. Layouts for core functions are fairly well established. It's when the design scales that networking and scheduling differences can appear. In the case of Tensors the designs are exceedingly simple. That's part of why so many companies are experimenting. Performing 256 consecutive multiplications with no dependencies.

    I'm saying the design will be memory bandwidth bound very easily and there would be very little reuse of data. Overlap more a matter of conserving die space. The most efficient design a single cascaded (FP16+FP32) SIMD sized for the largest matrix encountered. Or less precision in the case of TPU. As you mentioned a DSP would work well, but not have the throughput necessary without a lot of them. The design is sized where there are enough flops to cover bandwidth. Nothing bursty like most code. Latency less of a concern as the flow is highly predictable. Complicated scheduling just doesn't exist, and I'd imagine throughput decreases substantially with smaller matrices. No packing or mechanisms to improve utilization.
     
  17. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,930
    Likes Received:
    1,626
  18. CSI PC

    Eh, it is far from simple; even Google has mentioned that at a high level, and it is pretty clear from the papers released.
    Also worth noting there is a large discrepancy between TPU2 and V100 in terms of performance/efficiency; bear in mind the TPU2 is 4x Tensor processors on a board, and a possible equivalent would be at least 2x V100 or, if focusing on DL with less mixed-precision flexibility, 3-4x half-length 150W V100 - and the TPU2 matches your criteria of a simple matrix multiplier coprocessor/accelerator.
    Also you cannot distill electronics engineering into such a simplified process; even with compute, the difference between the FMA core designs doing 'simple' maths at AMD and Nvidia is massive in terms of efficiency.
    That said, simple metrics are not the way to really measure these accelerators.

    If it was that simple to design and easy to do efficiently, then you had better start seriously complaining about AMD not implementing this in the MI25 or Navi, or in a similar product possibly aligned with Epyc.
    With all the resources Intel has, where is their Tensor accelerator if it is simple to do? They had to buy Nervana, and it looks like the recently announced NNP product still needs to evolve to catch up with Google and Nvidia.
    There is a difference between doing this in theory and a scalable real-world design.
     
    #998 CSI PC, Dec 23, 2017
    Last edited: Dec 23, 2017
    nnunn, pharma, xpea and 1 other person like this.
  19. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,930
    Likes Received:
    1,626
  20. Frenetic Pony

    The logical connections for the human eye neurons, linking between the receptors and the brain, look almost exactly like any basic neural net layer visualization you'd see today. They are so close you'd mistake the two if you didn't pay attention. It's literally layers of neighboring neurons sending signals to the next layer up and so on and so on based on received signals. I'm just suggesting the reason GPUs happened to be better than CPUs is that GPUs were designed to work on producing visuals, which are processed by a set of real neurons hooked up in tight neighboring layers. So running a set of virtual neurons in tight neighboring layers is just doing something similar enough that a GPU ended up being pretty good at that as well.

    This is especially true with training NNs. For deploying you can sit there and do your giant matrix calculations with specialized equipment much more easily, but training NNs works so well on GPUs that major companies are happy to keep buying them instead of creating specialized chips (at least deploying NNs was the priority over replacing GPUs for training). But that's only true for today, as neural nets today, as mentioned, look (organizationally) a lot like the visual neurons we have attached to our eyes. Initial research suggests our brains look rather different in how the connections are laid out. And this could be why neural nets today can do better than humans at recognizing images, yet spend hundreds of thousands of hours and they still can't drive to save their (or our) lives. And why you need something beyond a basic NN, like recurrent networks, to translate speech etc. I.e. neural nets can do better than trained doctors at spotting problems in scans, but can't drive with a thousand sensors while a 16 year old (hopefully, it's half a joke) can drive with just two eyes.

    I'm basically arguing that Nvidia has added a lot of features that are, and may always be, only for Nvidia. AMD is trying to get into the market (and failing so far), and Intel's GPUs, and other chips, will probably be based heavily on training neural nets. But they probably also won't have the exact features put in by Nvidia, so any work you do for Volta, any research for it, is simply non-transferable. And Nvidia's CUDA libraries were even worse. You got to set up neural nets faster, but hey, now you're locked into Nvidia (they hope); too bad you can't ever run those programs on anyone else's hardware! Come buy our $5k card because you spent all those months writing for our own tech and have no choice! The point is simple: Nvidia didn't build their libraries out of kindness. They did it because they gambled they'd make more money off it, by locking people in, than it cost in the first place.

    Nvidia didn't make anything you couldn't with OpenCL, they just made it exclusive to them and tried to lure you into their proprietary ecosystem. And that spells nothing but trouble; it's what Sony used to do. Buy products that only work with other Sony products! It's what Apple and Android both did, or tried to, with their apps: you'll hesitate to switch if you've invested hundreds in apps that suddenly won't work anymore (not that anyone buys apps other than games anymore, and those are F2P so who cares). Point is, they lure you in by making it seem easy, then trap you by trying to lock all the work you've done exclusively to their hardware.
     
    Grall likes this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.