Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Discussion in 'Architecture and Products' started by Geeforcer, Nov 12, 2017.

Tags:
  1. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    That's interesting.
    There are definitely 6 GPCs, as that is the maximum for the Maxwell/Pascal/Volta design; the architecture has been scaling via SMs/TPCs (with Polymorph engines) per GPC.
    What can be a bit vague is where they disable the SMs.
    Because of the shared TPC/Polymorph engine, one could consider it as 7 SMs per GPC, since the TPC/Polymorph engine and SM are no longer in a 1:1 relationship. The context here is specifically geometry rather than compute, and yeah, I appreciate it is not fully accurate, but it helps keep the sharing aspect in perspective when comparing to the other Nvidia GPUs. Even accounting for this, your tool shows the V100 still not performing ideally and below expectation.
    Outside of P100 and V100, all SMs are meant to have a 1:1 relationship with the associated geometry engines, as per Fermi, which you see in your result for GP102.
    I think those GV100 results come back to the pros/cons that Nvidia very briefly touched upon at one point when asked about using that SM-GPC setup for gaming; I'm trying to find whether it was ever noted publicly.
    Do you know anyone you could share your code with who has access to a P100 (maybe as a Quadro GP100) and could run it to see if the behaviour aligns with the V100?

    Really nice tool there, especially as it is identifying quirks with the V100 design.
     
    #141 CSI PC, Mar 7, 2018
    Last edited: Mar 8, 2018
  2. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,618
    Likes Received:
    123
    Off-topic, but when should we expect Nvidia's first 7nm GPUs for the GeForce GTX gaming market? Late 2019?
     
  3. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    268
    Likes Received:
    68
    That's as good a guess as we'll get for a while.
     
  4. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,318
    Likes Received:
    418
    Location:
    Texas
    Given how long it’s taking for the Pascal successor to come out, I would expect 18-24 months after that.
     
  5. borntosoul

    Newcomer

    Joined:
    Oct 9, 2002
    Messages:
    201
    Likes Received:
    8
    Location:
    Au
    This is the most sensible way of looking at it: where space and efficiency are at a premium, why put things in a card that won’t be suitable for gaming?
     
  6. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    757
    Likes Received:
    195
    Thanks for the good info and speculation. I am wondering about some things (maybe they'll be answered in your blog post…):
    1. How large are the potential "significant power efficiency gains" for the Tensor Cores?
    2. If there is FP8 support, would the FP8 be twice the rate of FP16? (Also would this be for the Tensor Cores?)
    3. The Google TPU2 has a significantly lower FLOPS/byte ratio than the V100 (75 FLOPS/byte vs. ~130 FLOPS/byte). Do you expect the HPC/AI chip to also have a lower FLOPS/byte than the V100? I've been wondering if NVIDIA may go with more than 4 HBM2 stacks or some kind of multi-GPU solution to increase bandwidth, since 4 stacks of even the new 2.4 Gbps HBM2 give a maximum of 1.2 TB/s. I'm assuming that the HPC/AI chip has a minimum of 2x the FLOPS/W of the V100 (I estimate this lower bound mainly from the process), and not only is 1.2 TB/s "only" 37% more than the V100's 900 GB/s, but the V100 is already bandwidth-limited (or close to it) according to this post.
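    The figures in point 3 follow from simple peak-bandwidth arithmetic; here is a minimal sketch of the checks, assuming 1024-bit HBM2 stack interfaces, a 120 TFLOPS tensor peak for V100, and 45 TFLOPS / 600 GB/s for a TPU2 chip (the peak figures are the commonly quoted ones, not official numbers):

    ```python
    # Peak bandwidth of an HBM2 configuration: stacks x interface width x pin rate.
    def hbm2_bandwidth_gbs(stacks, gbps_per_pin, bits_per_stack=1024):
        """Return peak bandwidth in GB/s."""
        return stacks * bits_per_stack * gbps_per_pin / 8

    four_stack = hbm2_bandwidth_gbs(4, 2.4)  # 4 stacks of 2.4 Gbps HBM2
    print(four_stack)                        # 1228.8 GB/s, i.e. ~1.2 TB/s
    print(four_stack / 900)                  # ~1.37, only ~37% over V100's 900 GB/s

    # Peak FLOPS-per-byte ratios as discussed above:
    print(120e12 / 900e9)  # V100 tensor: ~133 FLOPS/byte
    print(45e12 / 600e9)   # TPU2: 75 FLOPS/byte
    ```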
    4. For the HPC/AI architecture, do you envision a single chip with both fast DP and lots of Tensor Cores, or one DP-focused chip and another Tensor-focused chip?
     
    ImSpartacus, pharma and nnunn like this.
  7. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    26
    Likes Received:
    20
    Since stacking HBM2 is so complex, for a GA102-level card, what factors would stop NV hooking up 18 Gbps GDDR6 to a 512 bit controller? Given the imbalance between bandwidth and compute, plus power savings since their last 512 bit controller (GTX 285?), imagine a new 512-bit GTX getting > 1TB/s before the HPC cards. :shock:
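    The >1 TB/s figure checks out; a minimal sketch of the peak-bandwidth arithmetic for the hypothetical 512-bit GDDR6 setup:

    ```python
    # Peak GDDR bandwidth in GB/s: (bus width in bits / 8) * per-pin rate in Gbps.
    def gddr_bandwidth_gbs(bus_bits, gbps_per_pin):
        return bus_bits / 8 * gbps_per_pin

    print(gddr_bandwidth_gbs(512, 18))  # 1152.0 GB/s -> past 1 TB/s
    print(gddr_bandwidth_gbs(512, 16))  # 1024.0 GB/s even at 16 Gbps
    ```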
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,737
    Likes Received:
    1,970
    Location:
    Germany
    There's no necessity right now and it would eat into their margins. That's what would stop them from my point of view.
     
  9. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,318
    Likes Received:
    418
    Location:
    Texas
    I’m not sure the next Ti card would need a 512-bit bus. GP102 on a 384-bit bus only uses 11 Gbps GDDR5X memory.

    Keeping a 384-bit bus and moving to GDDR6, the bandwidth could be increased by over 50%; GDDR6 tops out at 18 Gbps speeds.
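    The >50% claim can be verified with the same bus-width arithmetic; a minimal sketch, taking the full 384-bit GP102 configuration at 11 Gbps GDDR5X as the baseline:

    ```python
    # Bandwidth gain from swapping memory types on an unchanged 384-bit bus.
    def bandwidth_gbs(bus_bits, gbps_per_pin):
        return bus_bits / 8 * gbps_per_pin

    gddr5x = bandwidth_gbs(384, 11)  # 528.0 GB/s (full-bus GP102 today)
    gddr6 = bandwidth_gbs(384, 18)   # 864.0 GB/s (GDDR6 at its 18 Gbps top speed)
    print(gddr6 / gddr5x - 1)        # ~0.64 -> well over a 50% increase
    ```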
     
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    I doubt we will ever see Nvidia going higher than a 384-bit controller with GDDR6. It could be argued that the Tesla P40 ideally should have been a 512-bit controller, as it was promoted as the card for maximum-inference-throughput servers (pre-Volta) and other FP32 HPC requirements, but it had limited bandwidth due to 384-bit GDDR5 (not GDDR5X like GeForce, though even that could be deemed too little).
     
    #150 CSI PC, Mar 11, 2018
    Last edited: Mar 11, 2018
    nnunn likes this.
  11. huebie

    Newcomer

    Joined:
    Apr 10, 2012
    Messages:
    29
    Likes Received:
    5
    I wouldn't bet that 21 is the physical number of geometry engines, since the Titan V is a cut-down chip and this should be a fixed, hard-wired unit (and not a network approach). Did you consider this option? The visible triangles-per-clock figure is slightly above 2 because the frequency has not always been 100% stable for a couple of generations now (and it seems to fluctuate more on newer ASICs). How did you fix the clock? Via software?
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,737
    Likes Received:
    1,970
    Location:
    Germany
    What is a geometry engine?
     
  13. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    For me, I would say:
    At a very high level, it is the Polymorph engine (yeah, the name rather simplifies everything it does) and its relationship with the SM. In the context of Arun's tool, this shows the 1:1 relationship in Nvidia's GPUs (validated with his GP102 result), apart from V100, and probably P100 if it could be tested; both of those have 2 SMs per Polymorph engine and the associated overhead/sharing contention that creates.
    Worth noting that the Polymorph engine and all its functions were moved into the TPC with Pascal, rather than being integral to the SM as with Maxwell and earlier; the reason is the evolution we see with P100 and V100 (changes to the ratio of CUDA cores per SM, SMs per GPC, associated registers, etc.).

    At a more in-depth level, it comes back to the foundations set in place with Fermi for the Polymorph engine/raster engine/SM arrangement.
    Very late edit:
    And the tool IMO is possibly finding some of the cons in the setup of both P100 and V100 in the context of geometry performance, given their ratio and sharing contention.

    One interesting aspect, and a consideration going forward for the raster engine, is that there is one per GPC, originally designed around 4 SMs with 4 Polymorph engines per GPC and 4 GPCs in total; the GPC count has increased to a maximum of 6 in the largest designs from Maxwell onwards.
    However, and importantly, as the architecture continues to scale, how has Nvidia changed this internally for the raster engines? With Volta we now have 14 SMs (in gaming this would be set up as 7 SMs) and 7 Polymorph engines per GPC.
    Pascal increased the SMs and Polymorph engines per GPC from 4 to 5.
    That is a notable increase of throughput into each raster engine/unit since Pascal, if they did not revise it heavily from Fermi.
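    The per-GPC scaling described above can be tabulated; a minimal sketch using the full-die unit counts quoted in this post (counts are as stated here, not verified against Nvidia documentation):

    ```python
    # SMs and Polymorph engines per GPC for the full dies discussed above.
    per_gpc = {
        "Fermi": {"sm": 4, "polymorph": 4},
        "Pascal GP102": {"sm": 5, "polymorph": 5},
        "Volta GV100": {"sm": 14, "polymorph": 7},
    }
    for arch, c in per_gpc.items():
        # There is one raster engine per GPC, so the Polymorph count is also
        # the number of geometry engines feeding that single raster engine.
        print(arch, "SM:Polymorph =", c["sm"] / c["polymorph"])
    # Fermi and GP102 keep the 1:1 ratio; GV100 shares one engine per 2 SMs.
    ```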
     
    #153 CSI PC, Mar 18, 2018
    Last edited: Mar 18, 2018
  14. huebie

    Newcomer

    Joined:
    Apr 10, 2012
    Messages:
    29
    Likes Received:
    5
    Agreed, and since the geometry stages in the pipeline aren't decoupled from the topology, there is no feasible solution or explanation for 21 GEs. With 28 you are fine when dividing by 7, which gives 4 (math skill +10.000 :D).
     
    Man from Atlantis likes this.
  15. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    With the latest announcements around DX12 Raytracing at GDC 2018, one quote from Nvidia stands out, from Tony Tomasi in an article:
    https://www.pcgamesn.com/nvidia-rtx-microsoft-dxr-raytracing

    Yeah, it is not going to impact gamers for some time as a complete solution, but it will be interesting to see how this unfolds sooner in the professional world, especially with Volta onwards.
     
    pharma and nnunn like this.
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,737
    Likes Received:
    1,970
    Location:
    Germany
    Regarding the bolded part of your quote: Even larger caches would already accelerate raytracing, so that's basically a non-statement until elaborated upon further by Nvidia.
     
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    Cache is not a functionality, though *shrug*. Going down this route, you might as well say Volta has a smaller CUDA core to SM/register/etc. ratio as applicable functionality, but that is specifically V100 rather than the Volta architecture generally; yeah, it depends on whether one can actually differentiate there in the same way one can with Pascal and P100.

    If it were cache, then they would not be so hesitant to comment on that being the functionality, as it is already a known factor.
    Edit:
    I linked earlier the performance gains of V100 with and without AI denoise/reconstruction; the gains are considerable, which takes architecture such as the cache/SM/register structure out of the equation.
    This was in the Volta speculation thread.
    With the AI-Tensor aspect, the gains going back to 2017 were 8x at an SSIM rating of 0.93, and 4.8x at an SSIM rating of 0.95.
    The solution has matured since then, and that demo used the rendered Bistro scene, so this does come back to HW functionality rather than caches IMO.


    I included some other links in the Volta speculation thread.
     
    #157 CSI PC, Mar 20, 2018
    Last edited: Mar 20, 2018
    nnunn likes this.
  18. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,737
    Likes Received:
    1,970
    Location:
    Germany
    Why am I not surprised …
    In terms of marketing, it is. I'm not saying that there isn't something else, but larger caches would suffice for that quote not to be a lie; ergo the quote as given above is basically worthless.
     
    #158 CarstenS, Mar 20, 2018
    Last edited: Mar 20, 2018
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    Well, see my edit response with performance figures of V100 with and without AI denoise/reconstruction.
    It goes beyond cache/SM/registers/etc.
    I really do not think he is inferring cache.
    Otherwise they might as well say they have functionality in Volta for Amber.
     
    #159 CSI PC, Mar 20, 2018
    Last edited: Mar 20, 2018
  20. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,737
    Likes Received:
    1,970
    Location:
    Germany
    He is not inferring anything, but giving a marketing answer that's within the scope of his briefing, while at the same time giving the impression of having addressed the question asked.

    Tony's surname, btw, is Tamasi I believe, not Tomasi as in the pcgamesn article.
     
    #160 CarstenS, Mar 20, 2018
    Last edited: Mar 20, 2018

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.