Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Well, Star Citizen, that is just an exercise in bad project management lol (I don't think they have ever hit a single drop date for any of their modules). Designers should never be project managers or producers; sorry, that's just the way it is. Designers want to get as many things into a game as possible just because it's cool and a great idea. It's just like making an artist into a project manager: they are never happy with their work, they feel like they can do better. Sometimes budgets and timelines are more important than giving the best of all aspects.
     
    CSI PC, Lightman and Cat Merc like this.
  2. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Can we expect a thread when it does?
     
  3. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
  4. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
  5. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Oh yeah, of course, by the end of this year or so :) We're working up to a Kickstarter campaign, but we need to get certain things finished first, so it's not like what happened with Star Citizen :).

    I'm the only one on my team working right now, just finishing up as many of the art assets as possible before the programmers start their work. Design is pretty much complete, and on paper everything seems to be evening out on the RPG side. I'm still working on the RTS side of things, but that module will be an expansion of the game and won't see the light of day until a couple of years after the single-player and multiplayer are done. The MMO will come around the same time too, and the math for that is pretty much equalized on the game design side.

    So many moving parts, but I'm on track with what I started last year.
     
  6. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
    I haven't seen this Micron press release posted yet.

    https://www.micron.com/about/blogs/2017/june/what-drives-our-commitment-to-16gbps-graphics-memory
    • GDDR6 will feature 16 Gb die running up to 16 Gbps.
    • GDDR5X has hit 16 Gbps in their labs.
    Their speed projections are below. No 16 Gbps stuff until 2019 (though I seriously doubt anyone expected it sooner).

    [Image: Micron graphics memory speed projections]

    And then they start digging into the differences between GDDR5X and GDDR6.

    [Image: GDDR5X vs. GDDR6 feature comparison]

    It looks like GDDR5X might have a place in future markets for Micron. We haven't seen 16 Gb GDDR5X yet.

    I'm interested in how Volta will use one or both technologies in varying parts of Nvidia's lineup.
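
    To put those per-pin rates in perspective, peak theoretical bandwidth is just the per-pin data rate times the bus width. Here's a minimal C++ sketch; the bus widths (a hypothetical 256-bit GDDR6 card, a 384-bit GDDR5X card, and a 4096-bit HBM2 part) are assumptions of mine purely for illustration.
    Code:
    #include <cstdio>

    // Peak theoretical bandwidth in GB/s = (per-pin rate in Gbps * bus width in bits) / 8
    double peak_bandwidth_gb_s(double gbps_per_pin, int bus_width_bits) {
        return gbps_per_pin * bus_width_bits / 8.0;
    }

    int main() {
        // Hypothetical configurations, purely for illustration.
        printf("16 Gbps GDDR6, 256-bit bus:   %.0f GB/s\n", peak_bandwidth_gb_s(16.0, 256));   // 512 GB/s
        printf("12 Gbps GDDR5X, 384-bit bus:  %.0f GB/s\n", peak_bandwidth_gb_s(12.0, 384));   // 576 GB/s
        printf("1.75 Gbps HBM2, 4096-bit bus: %.0f GB/s\n", peak_bandwidth_gb_s(1.75, 4096));  // 896 GB/s
        return 0;
    }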

    EDIT And just to properly source things, I got this from WCCFTech. They aren't without some merit!

    http://wccftech.com/micron-gddr6-gddr5x-memory-mass-production-early-2018/
     
  7. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    If you stick a keyboard in front of an untrained immortal monkey and let it just hammer at it for eternity, it will eventually churn out the complete works of Shakespeare, so there's that.
     
    Cat Merc and Razor1 like this.
  8. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    865
    Likes Received:
    267
    There were experiments, and they failed. It seems this is actually a myth. ;)
    I would compare it to "countable infinities". Yes, it has a "real probability", but that doesn't mean it can be achieved in reality. You can't count infinitely many natural numbers in reality.

    On the other hand: :D
    [Image]
     
    CSI PC, nnunn, DavidGraham and 7 others like this.
  9. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,424
    Likes Received:
    2,076
    Volta DGX Station package pricing - $69,000
    https://www.microway.com/preconfiguredsystems/nvidia-dgx-station-deep-learning-workstation/
     
    ImSpartacus, CSI PC, nnunn and 3 others like this.
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Nvidia would still need to provide a more traditional path for GEMM/cuBLAS.
    This can be seen in these actual results using the updated Caffe2 framework, which supports the updates to cuDNN 7 and TensorRT, separate from the actual Tensor cores.
    A key difference is that Caffe2 now supports FP16 for training with cuDNN 7 on Volta.
    As these are actual results, you also need to take into consideration the increased core count of V100.

    [Images: Caffe2 FP16 training results on V100]

    So yeah, it looks like packed FP16 is still there, and it would be a shocker tbh if they removed it on V100, as it just would not make sense; they did not talk about it IMO because the focus is on new features such as the mixed-precision Tensor cores.
    The point is just to show that Vec2 packed FP16 looks to still be there for V100, without getting into the facts about P100 in this instance.
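
    For reference, the "more traditional way" already exists in cuBLAS as the mixed-precision GEMM entry point cublasGemmEx (introduced in CUDA 8). A minimal sketch with FP16 storage and FP32 accumulation; the matrix dimensions, leading dimensions and device pointers are placeholders of my own, and error checking is omitted:
    Code:
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Mixed-precision GEMM sketch: C = alpha*A*B + beta*C with FP16 storage
    // for A/B/C and FP32 accumulation (the non-Tensor-core GEMM path).
    void half_gemm(cublasHandle_t handle,
                   const __half* dA, const __half* dB, __half* dC,
                   int m, int n, int k)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasGemmEx(handle,
                     CUBLAS_OP_N, CUBLAS_OP_N,
                     m, n, k,
                     &alpha,
                     dA, CUDA_R_16F, m,    // A: FP16, leading dimension m
                     dB, CUDA_R_16F, k,    // B: FP16, leading dimension k
                     &beta,
                     dC, CUDA_R_16F, m,    // C: FP16, leading dimension m
                     CUDA_R_32F,           // compute/accumulate in FP32
                     CUBLAS_GEMM_DFALT);   // default algorithm selection
    }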
    Cheers
     
    #370 CSI PC, Jun 5, 2017
    Last edited: Jun 5, 2017
    nnunn and Razor1 like this.
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    I'd agree, but it might be separated out exclusively into the Tensor cores. Packed math was originally for accelerating deep learning, which a Tensor core is obviously designed to do. It still seems an interesting omission, and there is still a question of whether the feature arrives on consumer products, which has bearing on the whole discussion of which architecture is targeted.

    This discussion needs to be moved to a more relevant thread.
     
  12. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    FP16 calculations are part of nV's mixed-precision capabilities in Pascal; in other words, the ALUs can do it, and they are not removing that in Volta. Why would they remove that from an already finished pipeline and do it only with Tensor cores? They already have it; they don't need to mention it again. They even added it to the Maxwell-based Tegra X1.
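
    For context, that packed path is exposed in CUDA through the half2 type and intrinsics such as __hfma2, which perform two FP16 fused multiply-adds per instruction on hardware with native FP16 support (compute capability 5.3+, e.g. Tegra X1 and GP100). A minimal sketch of my own, not anything Nvidia has published for Volta:
    Code:
    #include <cuda_fp16.h>

    // Packed "Vec2" FP16 sketch: each __hfma2 does two FP16 FMAs at once.
    // Computes y[i] = a * x[i] + y[i], processed as pairs of FP16 values.
    __global__ void axpy_half2(const __half2* x, __half2* y, __half2 a, int n2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) {
            y[i] = __hfma2(a, x[i], y[i]);
        }
    }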
     
    pharma, DavidGraham and CSI PC like this.
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,915
    Likes Received:
    2,237
    Location:
    Germany
    What about the possibility that those „cores“ (FP32, FP64, Tensor, INT32) are not as distinct as Nvidia depicts them in the first place? Most of it could be a matter of data flow.
     
    nnunn likes this.
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I would say the Tensor cores are distinct primarily due to the much improved clock cycles involved in their mixed-precision instruction-operation.
    I do think, though, that the Tensor cores can do more than what has been presented to date, albeit still within said matrix structure (and some of the highly experienced CUDA devs in the forum feel the same way). Anyway, here is one of the better-known CUDA devs' perspective on Tensor cores.
    In fact, just reading it supports my previous post that those Caffe2/ResNet/etc. results are Vec2 packed FP16:

    His later post also shows the difference in approach, with regard to mixed-precision operation, between the traditional 'FP32 CUDA' core and the Tensor core, and how they handle input/output/compute precision.
    The Tensor core would have greater throughput than SgemmEx.

    txbob has been around a long time on the Nvidia Devtalk forum and has a lot of experience with Nvidia products and CUDA; one of the best for information IMO.
    https://devtalk.nvidia.com/default/topic/1009558/perfomance-question-for-tesla-v100/?offset=4
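
    For what it's worth, txbob's description lines up with the warp-level matrix-multiply API Nvidia has previewed for CUDA 9, where a warp cooperatively computes a 16x16x16 D = A*B + C tile with FP16 inputs and FP32 accumulation on the Tensor cores. A minimal sketch against that preview API (names could still shift before release, and the pointers/leading dimensions are placeholders of my own):
    Code:
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes one 16x16x16 tile on the Tensor cores:
    // D = A*B + C, FP16 inputs, FP32 accumulation.
    __global__ void wmma_tile(const half* a, const half* b, float* d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                 // start with C = 0
        wmma::load_matrix_sync(a_frag, a, 16);               // load the A tile (leading dim 16)
        wmma::load_matrix_sync(b_frag, b, 16);               // load the B tile
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor core matrix multiply-accumulate
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }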

    Cheers
     
    #374 CSI PC, Jun 6, 2017
    Last edited: Jun 6, 2017
    Anarchist4000 likes this.
  15. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,915
    Likes Received:
    2,237
    Location:
    Germany
    It even says so in the footnotes.
    This of course is again apples and oranges, since we all deemed GP100 to also support 2×FP16, right?


    But that's not what I was getting at. I am talking about shared circuits between those „cores", i.e. the Tensor-part re-using some of the adders and/or multipliers from the „regular“ FP32 and/or FP64-ALUs.
     
    nnunn and Anarchist4000 like this.
  16. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,424
    Likes Received:
    2,076
    FP16 calculations are in Volta, as the TensorRT deep learning software mentions 3.5x faster inferencing using full (FP32) or reduced precision (INT8, FP16) when comparing Tesla V100 vs. P100.
    https://developer.nvidia.com/tensorrt
     
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Well, I am emphasising the point as confirmation, albeit specific to V100, as we do not know anything about the lower models; this conversation initially came about from debating whether V100 still supported Vec2 packed FP16 compared to Vega, as raised earlier by Anarchist.
    The last post did say "It still seems an interesting omission", so making the point with the information provided by txbob makes sense to me.
    But anyway, txbob provides further interesting background beyond this.
    Cheers
     
    #377 CSI PC, Jun 6, 2017
    Last edited: Jun 6, 2017
  18. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Higher performance doesn't imply the packed math though. It could simply come down to a wider processor with improved clocks. We already know the chip is a good deal larger with more SIMDs.

    That's what I've been wondering, especially with the parallel INT32 pipeline. Tensors logically might be replacing the traditional operations to keep the design compact. Ruling out certain instructions would also allow the clocks to change. They could possibly have the packed math but feed the results to that integer pipeline for FP32 math; Tensors and FP32 math wouldn't run concurrently under those conditions, and Nvidia would just be presenting the capability differently on the diagrams.

    Not necessarily, given the temporal nature of the tensor operations. As I suggested as a Vega possibility, the most efficient arrangement would be an FP16 ALU alternating between two adders, since FP32 takes longer to propagate than FP16, and piping data through that arrangement: refactor the timing around FP16 with clocks doubling, so standard FP32 operations then take two or more cycles to complete. A tensor core could be two SMs working together in a special arrangement. The actual tensor math was far simpler, as established in the tensor thread; without the data sharing it should be straightforward.
     
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    RPM (Rapid Packed Math) is just another way of saying two FP16 operations in one FP32 unit.

    The only things making nV's Volta die bigger are its Tensor and DP cores; take those out and it seems like it's going to be around 600 mm², which by all accounts is close to what Vega is supposed to be (Volta has more ALUs).
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,348
    Likes Received:
    3,879
    Location:
    Well within 3d
    There may be considerations between the FP and INT units that could keep them separate, since there is now dual-issue between the unit types and possibly variations in how their scheduling hardware would work.

    The data flow for the tensor unit is one area that (if we take Nvidia's statements about operations per cycle at face value) could have a significant need for specialized routing.
    They do outright say it in their blog about Volta:
    Getting 64 parallel multiplies sourced from 16-element operands, followed by a full-precision accumulate of those multiplies and a third 16-element operand into a clock cycle efficiently and at the target clocks sounds non-trivial. Shoehorning that into the existing pipelines may not be good for them, either.
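
    To make the operation count concrete, the per-clock tensor op Nvidia describes is D = A*B + C on 4x4 matrices: 4 multiplies plus an accumulate chain per output element, across 16 outputs, gives the 64 multiplies and 64 adds. Written out naively (a sketch of the math only, not how the hardware schedules it):
    Code:
    // Reference for one Tensor core op: D = A*B + C on 4x4 matrices.
    // 16 outputs x 4 multiply-adds each = 64 multiplies + 64 adds per clock.
    // On the hardware A and B are FP16 with an FP32 accumulate; plain float here.
    void tensor_op_4x4x4(const float A[4][4], const float B[4][4],
                         const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];             // third 16-element operand
                for (int k = 0; k < 4; ++k)
                    acc += A[i][k] * B[k][j];    // 4 multiplies per output element
                D[i][j] = acc;
            }
    }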
     
    pharma, DavidGraham and ieldra like this.