Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

Discussion in 'Architecture and Products' started by Geeforcer, Nov 12, 2017.

  1. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    I wonder if it's feasible for NVIDIA to have a "compute-only" big chip on 7nm (sold at high margins) and keep the gaming line-up on 12nm?

    Volta was announced at GTC 2017 but barely had any availability before Q4 2017. In theory it could be similar for a 7nm successor - but I'm not sure whether that would still be too early for 7nm (even with a significant part of the chip disabled for yields)?
     
  2. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    You can train with the quoted TFLOPS on Volta. For example, large LSTM models do quite well.
     
    pharma likes this.
  3. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Why is Volta not a new gaming architecture, and what would it take for something to be a new gaming architecture?
     
  4. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Volta, though, was the project completion milestone following the Pascal P100 and was designed to go live on schedule for Nvidia's massive HPC contract obligations, meaning Q3-Q4 2017; the Nvidia presentations I have seen regarding Volta all point to HPC/scientific use, none to the broader Tesla or Quadro ranges.
    The surprise, IMO, was that the Titan support card for the V100 used the same die/GPU, when a 'GV102'-type design would have seemed more likely. This suggests either that Nvidia could not delay this cheaper support card (look at why the Titan P was originally launched) until the next node, or that there is a distinct break between Volta, with its full mixed-precision/Tensor acceleration, and the rest of the model range.

    But as I mentioned earlier, and as hinted by some posts after yours, there is synergy between all three segments, namely GeForce/Quadro/Tesla; the gaming hardware scientists/engineers at Nvidia feed into the other engineering teams what they would like within the architecture at R&D time, which keeps the architecture pretty tight across all three segments.
    It is worth remembering that Quadro also bridges features/design choices and solutions used in both gaming and Tesla, so breaking them apart becomes even more difficult and complex.
    The costs and logistics of trying to do this would be high, and I am not sure it really makes sense considering how well Maxwell and especially Pascal have done in all segments. That said, maybe they will have the same architecture spanning 7nm and 12nm *shrug*, but I do not think it worked out well when they tried something similar (architecture rather than node) with Maxwell Gen 1 and Maxwell Gen 2 (the latter being the successful one).

    Just for context, Titan P and Titan V are cheap when considered for their primary role within academic and scientific labs supporting directly or indirectly the P100 and V100.

    Edit:
    Just for clarification and to add: it could also be debated that some models in the range are already distinct from each other due to the compute capability version supported by each GPU. A simple case in point: the P100 (SM_60) does not have the same CUDA compute capability as the GP102 (SM_61); one supports native FP16 acceleration while the other supports the DP4A and DP2A vector dot products.
    The compute capability will also be different for the GPUs with Tensor cores.
    Just mentioning it, as it is a matter of perspective with regard to naming and design changes; it is too early to put any weight on how large the differences will be between two model names, or between the professional range and Volta (which IMO is distinct, serving the currently signed HPC/scientific projects that need full mixed precision with Tensor cores).
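
    To make that compute-capability split concrete, here is a minimal CUDA sketch (my own illustration with made-up kernel names, not anything from Nvidia's material): SM_60 parts such as GP100 expose packed FP16 maths through half2 intrinsics like __hfma2, while SM_61 parts such as GP102/GP104 expose the INT8 dot products through __dp4a/__dp2a.

    Code:
    // Sketch of the capability split: __hfma2 (packed FP16 FMA) needs
    // compute capability 6.0+, __dp4a (4-way INT8 dot product with 32-bit
    // accumulate) needs 6.1+. Compile with e.g. nvcc -arch=sm_61.
    #include <cuda_fp16.h>

    // SM_60 and up: two FP16 fused multiply-adds per instruction via half2.
    __global__ void fp16x2_fma(const __half2 *a, const __half2 *b, __half2 *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = __hfma2(a[i], b[i], out[i]);
    }

    // SM_61 and up: 4x INT8 dot product accumulated into a 32-bit integer.
    __global__ void int8_dp4a(const int *a, const int *b, int *acc, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) acc[i] = __dp4a(a[i], b[i], acc[i]); // a and b each hold 4 packed int8 values
    }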
     
    #124 CSI PC, Mar 4, 2018
    Last edited: Mar 4, 2018
  5. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I went back through some of the Nvidia presentations given late last year and early this year.
    Of interest is TensorRT 3, which is Nvidia's programmable inferencing accelerator/optimiser going forward; currently the slides show a push for Tesla P4, Tesla V100, Drive PX2 and Jetson TX2, with Xavier mentioned further down although it should be part of the same slide.
    So this may suggest that 2+ Tensor cores per SM will appear in whatever replaces the P4 and P40, along with other changes. Volta will probably remain distinct with its full mixed-precision capability, while the P40 replacement will be faster in FP32 compute but potentially offer less Tensor throughput due to SM/cache/bandwidth differences; the replacement for the P4 would be the scale-out, efficiency-focused GPU.
    With a minimum of 2 Tensor cores per SM, I am not sure where that leaves the INT8 DP2A and DP4A vector dot products going forward, or whether Nvidia will drop them *shrug*; at some point I assume Nvidia would want the Tensor cores to cover such possibilities as an inferencing option.

    If they do go with 2+ Tensor cores per SM, it would also suggest that native FP16 acceleration will be coming down the chain as well, possibly as part of GeForce, since it is more of a CUDA core feature, albeit one that also ties loosely into Tensor.
    For GeForce/Quadro they could disable those Tensor cores on the top GPUs (the GP104/GP102 replacements for GeForce and Quadro) or build another die without them (it will come down to costs/manufacturing/logistics); anything lower would not have Tensor cores anyway, looking at how Nvidia positions its Tesla models.

    The key point is that TensorRT 3 is designed to make the most of inferencing on Tensor cores, so looking at how Nvidia positions its products with TensorRT, it makes sense to expect 2+ Tensor cores per SM in whatever replaces the existing GP102 and GP104 Tesla GPUs.
     
    #125 CSI PC, Mar 4, 2018
    Last edited: Mar 4, 2018
  6. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    324
    Likes Received:
    84
    A noticeable improvement in performance per watt, or something like that, for gaming? We know the Titan V doesn't make a huge difference over a Titan Xp despite being nearly twice the die size at the same TDP. That's not a market proposition for gaming. It does, however, do much better in perf per watt for purely compute-heavy tasks, which is great if that's what you want. It's the same reason it's possible there are two new architectures coming from Nvidia instead of one, one for gaming and one for compute.
     
  7. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,980
    Likes Received:
    3,063
    Location:
    Pennsylvania
    You can't compare Titan Xp and Titan V for gaming, from a die size perspective, just because they have the same marketing family.
     
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Just to add further perspective that highlights the GPU's scope and purpose:
    the Titan V only has 33% more FP32 CUDA cores than the Titan Xp, but yes, even then it is not scaling as expected in games, for many reasons.
     
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    FWIW, in UHD resolution we saw +0% in GTA 5, +16% in Wolf2, +19% in Witcher3 and +68% (!) in Quantum Break:
    http://www.pcgameshardware.de/Kompl...ntel-Core-i9-7900X-Nvidia-Titan-V-1250265/#a4

    Not in the test, but the scene demo "2nd stage boss", which is basically pure compute, is twice as fast on a Titan V compared to an OC'ed GTX 1080 Ti (2,000e-6,000m), and also a little (~5%) faster than the former dominator in this test, the RX Vega 64 LCE. It was a showcase for Vega all along.
     
    #129 CarstenS, Mar 6, 2018
    Last edited: Mar 7, 2018
    fellix, Lightman, pharma and 2 others like this.
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The Quantum Break results show just how badly that game was optimised for Nvidia hardware, and I cannot help but think of how many times AMD complained about games being optimised for Nvidia (it seems rather cynical to me, as this was one of the games AMD promoted for its performance and DX12 design). A very interesting result, especially as it is the DX11 version.

    Regarding good performance scaling, Hellblade is another game that scales pretty well on Titan V, more so at 1440p; PCPer measured a 38% gain over the Titan Xp at 1440p and 29% at 4K.
    https://www.pcper.com/reviews/Graphics-Cards/NVIDIA-TITAN-V-Review-Part-1-Gaming/Hellblade
    Their Witcher 3 result pretty much aligns with PCGamesHardware's.
    Good performance scaling on Titan V in games seems rare, though.
     
    #130 CSI PC, Mar 6, 2018
    Last edited: Mar 6, 2018
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Interesting; our Hellblade scene indicates rather mediocre scaling between TV and TXp, but we're using a lower-fps scene anyway (~102 fps in WQHD for the TV, 83 for the TXp).

    Apart from the outlier Quantum Break, Elex and Sniper Elite 4 are able to utilize the TV's resources better than average, and of course our application test in Capture One (a raw converter), where the TV also eclipses the other cards by a very healthy margin despite being limited to its OpenCL boost of only 1,335 MHz.
     
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,166
    Likes Received:
    1,836
    Location:
    Finland
    So because AMD isn't happy about umpteen games being optimized for NVIDIA (including dirty tricks), one game optimized for their design (being developed for consoles using their hardware) means they should never have complained? :roll:

    edit: http://www.pcgameshardware.de/Gefor...178/Tests/Benchmark-Review-Release-1242393/2/
    Looking at the results at a tad lower resolution, Quantum Break DX11 was never badly optimized for NVIDIA to begin with; the results are pretty well in line with the usual performance difference between the cards. Volta just has something the game really, really likes - much more than it likes AMD hardware.
    For DX12 there are few benchmarks to draw conclusions from, but AMD has traditionally been relatively stronger in DX12 than in DX11.
     
    #132 Kaotik, Mar 6, 2018
    Last edited: Mar 6, 2018
  13. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,750
    Likes Received:
    2,519
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not sure myself, as AMD did not have great results in it back then because it is DX11, before they started making DX11 driver improvements while focusing on DX12 with devs.
    Case in point: look at all those early DX12 games and then compare their DX11 performance on AMD cards.
    Another factor is just how large the gains are for Volta over Pascal and Maxwell in this very specific title, which has been bad for Nvidia in general; other games that are heavily optimised and designed from scratch for both vendors show the gains one would expect, which is under 33%. A case in point is Wolfenstein 2, which still has a fair amount of emphasis on AMD but was also designed for Nvidia hardware.

    Some tests using PresentMon showed the Fury X being 20-25% faster than the 980 Ti in Quantum Break DX12.
    I appreciate this comes down to the scene tested and, importantly, the volumetric lighting and global illumination (which is what killed Nvidia performance).
     
    #134 CSI PC, Mar 6, 2018
    Last edited: Mar 6, 2018
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,750
    Likes Received:
    2,519
    But the reverse happened in DX11:


    [benchmark charts - see the ComputerBase link below]

    https://www.computerbase.de/2016-09/quantum-break-steam-benchmark/3/

    It could also be argued that NVIDIA suffered from the DX12 implementation of the game.
     
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, Nvidia suffered in most of the early DX12 games as they were more aligned with AMD; some of the more recent ones show Nvidia can actually perform OK in DX12.
    Hitman is another classic example that took ages (well after launch) to work well on Nvidia even in DX11, ignoring DX12; even then it can be chapter dependent, but at least its DX11 performance did improve very late in the day.
    Regarding this game.
    The 1st chart also has very different results to PCGamesHardware's DX11 result; ComputerBase measures 31% in favour of the 980 Ti, while at PCGamesHardware the Fury X was 2% faster in DX11, both at 1440p.
    Not a fan of GameGPU myself.
    It comes back to settings and scene, as Nvidia was hammered by the volumetric lighting and global illumination.
     
    #136 CSI PC, Mar 6, 2018
    Last edited: Mar 6, 2018
  17. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    A few quick points:
    1. According to my tests, Titan V has lower geometry performance: 21 geometry engines vs 28 geometry engines on Titan X (1 engine shared per 4 SMs vs 1 engine per SM).
    2. I've done some low-level analysis of the Volta tensor cores (including the instruction set), and my personal opinion is that it's a somewhat rushed and suboptimal design. I think there are significant power efficiency gains to be had by rearchitecting it in the next gen. (A small sketch of the public WMMA interface to these units follows at the end of this post.)
    3. There's some interest in FP8 (1s-5e-2m) even for training in the research community, and Scott Gray hinted it would be available on future GPUs/AI HW in the next year or two; so it's likely NVIDIA will support it, not least because Jen-Hsun likes big marketing numbers...
    4. groq (a startup by ex-Google TPU HW engineers) looks like they might be using 7nm for a chip sampling before the end of the year, so there might be some pressure for NVIDIA to follow sooner rather than later. Remember that 7nm might be (much?) less cost efficient initially, but should still be more power efficient.
    My personal guess is that we'll have 2 new NVIDIA architectures in 2018, both derived from Volta (e.g. still dual-issue with 32-wide warps and 32-wide register file but 16-wide ALUs) with incremental changes for their target markets:
    1. Gaming architecture on 12nm (also for Quadro). Might include 2xFP16 and/or DP4A-like instructions for inferencing, but no Tensor Cores or FP64. Availability within 3-6 months.
    2. HPC/AI architecture on 7nm with availability for lead customers in Q4 2018, finally removing the rasterisers/geometry engines/etc... (not usable for GeForce/Quadro).
    I'm going to write up what I've found out about the V100 Tensor Cores in the next week or so and hopefully publish it soon - probably just as a blog post on medium, not sure yet... (haven't written anything publicly in ages and sadly the Beyond3D frontpage doesn't have much traffic these days ;) other suggestions welcome though!)
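
    For anyone curious what those Tensor Cores look like from the public side, the CUDA 9 WMMA API is the documented warp-level interface; below is a minimal sketch of a single 16x16x16 FP16 tile multiply with FP32 accumulate (just the exposed API with an illustrative kernel name, not the HMMA instruction-level detail mentioned in point 2).

    Code:
    // Minimal Volta tensor-core tile multiply via the CUDA 9 WMMA API.
    // Requires sm_70 (e.g. nvcc -arch=sm_70). Computes D = A*B + C on one
    // 16x16x16 tile; launch with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(...).
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_16x16x16(const half *a, const half *b, const float *c, float *d) {
        // One warp cooperatively owns the whole 16x16 output tile.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::load_matrix_sync(a_frag, a, 16);              // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag); // FP16 multiply, FP32 accumulate
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }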
     
    Alexko, nnunn, ImSpartacus and 10 others like this.
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Well, Titan V has 7 TPCs per GPC to Pascal's 5 per GPC. One difference from the other models is the SM structure: the V100, like the P100, has 2 SMs per TPC, meaning the TPC is shared. It comes down to the V100 and P100 having 64 FP32 CUDA cores per SM, while the rest of Maxwell/Pascal have 128 FP32 CUDA cores per SM, which causes this divergence.
    I am pretty sure Nvidia very briefly touched on the pros/cons of this in the past, and it may have implications for geometry and the SM-TPC-PolyMorph engine arrangement.
    Going back to the Fermi whitepaper, where they talk about the PolyMorph engine and the SM:
    quite a lot still seems applicable to the current GPC-SM-TPC (with PolyMorph engine) design.
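
    As a quick sanity check on those unit counts (the constants below are public spec numbers, assumed here rather than measured by me), the TPC/SM arithmetic also lines up with the ~33% FP32 core-count gap over the Titan Xp mentioned earlier in the thread.

    Code:
    // Sanity check of the GPC/TPC/SM arithmetic discussed above.
    #include <cstdio>

    int main() {
        // GV100: 6 GPCs x 7 TPCs x 2 SMs, 64 FP32 cores per SM;
        // the Titan V ships with 80 of the 84 SMs enabled.
        int gv100_sms_full = 6 * 7 * 2;               // 84
        int titan_v_cores  = 80 * 64;                 // 5120 enabled FP32 cores

        // GP102: 6 GPCs x 5 TPCs x 1 SM, 128 FP32 cores per SM;
        // the Titan Xp ships fully enabled.
        int gp102_sms      = 6 * 5 * 1;               // 30
        int titan_xp_cores = gp102_sms * 128;         // 3840 FP32 cores

        printf("GV100 full-die SMs: %d, Titan V FP32 cores: %d\n", gv100_sms_full, titan_v_cores);
        printf("GP102 SMs: %d, Titan Xp FP32 cores: %d\n", gp102_sms, titan_xp_cores);
        printf("Titan V / Titan Xp FP32 core ratio: %.2f\n",
               (double)titan_v_cores / titan_xp_cores); // ~1.33, i.e. ~33% more
        return 0;
    }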
     
    pharma and Malo like this.
  19. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Yep, the GPC-SM-TPC design still seems applicable, but one twist is that my benchmarks suggest there are 21 geometry engines on V100, which would be 1 per 4 SMs... while only 80 of the 84 SMs are enabled, so all the geometry engines are active despite not all the SMs being active. More confusingly, there are supposedly 6 GPCs according to NVIDIA's diagrams (I haven't tested this), which means 14 SMs per GPC... but 14 isn't divisible by 4! So either the association between SMs and GPCs is orthogonal to the association between SMs and geometry engines, or possibly some SMs just don't have access to geometry engines - i.e. it has become impossible to have warps running on all SMs purely from vertex shaders (with no pixel or compute shaders). I don't know which is true (assuming my test is correct), and it doesn't really matter enough for me to spend more time on the question, but it's still intriguing.

    Sadly, I don't have any good low-level tessellation tests to run - the tests I'm using for geometry are actually the ones I wrote back in 2006 to test G80(!) and they're still better than anything publicly available that I know of - a bit sad, but oh well... :)

    EDIT: For anyone who's interested, the results of my old geometry microbenchmark on the Titan V @ 1200MHz fixed core clock...

    Code:
    CULL: EVERYTHING
    Triangle Setup 3V/Tri CST: 2782.608643 triangles/s
    Triangle Setup 2V/Tri AVG: 4413.792969 triangles/s
    Triangle Setup 2V/Tri CST: 4129.032227 triangles/s
    Triangle Setup 1V/Tri CST: 8347.826172 triangles/s
    
    CULL: NOTHING
    Triangle Setup 3V/Tri CST: 2445.859863 triangles/s
    Triangle Setup 2V/Tri AVG: 2445.859863 triangles/s
    Triangle Setup 2V/Tri CST: 2445.859863 triangles/s
    Triangle Setup 1V/Tri CST: 2461.538574 triangles/s
    i.e. ~21 indices/clock, ~7 unique vertices/clock and ~2 visible triangles/clock (not completely sure why it's very slightly above 2/clk; it could be a bug in my test). NVIDIA's geometry engines have had a rate of 1 index/clock for a very long time, which is what implies there are 21 of them in V100 (the same test shows 28 indices/clock for GP102).
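
    To make the arithmetic behind those per-clock figures explicit (assuming the results above are in Mtriangles/s, and that in the 1V/Tri case each triangle contributes 3 indices but only 1 unique vertex), converting at the fixed 1200 MHz clock gives:

    Code:
    // Back-of-the-envelope conversion of the microbenchmark results above
    // into per-clock rates at the fixed 1200 MHz core clock.
    #include <cstdio>

    int main() {
        const double clock_mhz = 1200.0;

        // "CULL: EVERYTHING", 1V/Tri CST: strip-like reuse, 1 new vertex per triangle.
        const double mtris_1v = 8347.826172;                    // Mtriangles/s
        double indices_per_clk  = mtris_1v * 3.0 / clock_mhz;   // 3 indices per triangle
        double vertices_per_clk = mtris_1v * 1.0 / clock_mhz;   // 1 unique vertex per triangle

        // "CULL: NOTHING": rate of visible (rasterised) triangles.
        const double mtris_visible = 2445.859863;               // Mtriangles/s
        double tris_per_clk = mtris_visible / clock_mhz;

        printf("indices/clock:         %.1f\n", indices_per_clk);   // ~20.9 -> ~21
        printf("unique vertices/clock: %.1f\n", vertices_per_clk);  // ~7.0
        printf("visible tris/clock:    %.1f\n", tris_per_clk);      // ~2.0
        return 0;
    }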

    I think it's likely that any future gaming-centric architecture will have a higher geometry-per-ALU throughput than V100; this is maybe a trade-off to save area on a mostly compute/AI-centric chip? Or maybe NVIDIA decided the geometry throughput was just too high, which is possible although it feels a bit low now relative to everything else...
     
    Cat Merc, Putas, pharma and 6 others like this.
  20. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    324
    Likes Received:
    84
    Not initially. New nodes are getting worse and worse in terms of transition time and initial yields; e.g. cost per transistor isn't going down nearly as much as it used to. I wouldn't expect a huge die on 7nm for a while. Supposedly AMD is trying MCM stuff for its 7nm parts, but who knows if they're still going for that after Raja's departure. And AFAIK Nvidia isn't anywhere close to that. Maybe that's why they went with 12nm? Volta proved they aren't thermally limited if they can build a colossal die. It could be that Nvidia is more comfortable putting out large dies on 12nm with a nice gain in performance per watt, as opposed to being limited by die size on 7nm, which initially has big benefits for power/thermals anyway (e.g. what AMD is limited by).
     