Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    They kind of demoed it last year, separate from DXR-RTX; it involved a car (not the holodeck stuff), but some of the demos in the other thread suggest they are using this real-time solution with Volta in their demos as well.
    https://forum.beyond3d.com/threads/directx-ray-tracing.60670/
     
    #1061 CSI PC, Mar 19, 2018
    Last edited: Mar 19, 2018
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,128
    Likes Received:
    2,887
    Location:
    Well within 3d
    Just off-the-cuff musing, but this sounds like applying something analogous to reconstruction from nearby samples in prior frames, or using some of the meta-type properties of the surface or object to create more locally accurate rules for interpolating values.
    In a checkerboarded method, there's accumulated information stored in the form of render targets written earlier, while in a trained network there's an element of prior history being built into the weights in the network in addition to other data.

    A network could find other correlations about a surface or specific properties that tend to behave similarly or are not perturbed within certain bounds.
    Perhaps a number of local networks will pick up that a given area of rays tends to behave similarly, and that prior samples in time can combine with the learned behaviour. Given the hardware's preference for SIMD layouts and array math, a good portion of the work lends itself to laying things out in a grid, and a network might infer that a good approximation can be derived from the samples' relationship with each other along some kind of plane.
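    (Tangent: the grid-and-plane intuition is easy to sketch outside of any network. A toy illustration, assuming nothing about NVIDIA's actual method: fit a plane z = a*x + b*y + c to the known samples on a grid and evaluate it to fill the gaps.)

```python
import numpy as np

def fill_by_plane_fit(samples):
    """Fill missing values (NaN) in a 2D sample grid by fitting a
    single plane z = a*x + b*y + c to the known samples via least
    squares, then evaluating that plane at the missing positions."""
    h, w = samples.shape
    ys, xs = np.nonzero(~np.isnan(samples))
    zs = samples[ys, xs]
    # Solve [x y 1] @ [a b c]^T = z in the least-squares sense.
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, zs, rcond=None)
    gy, gx = np.mgrid[0:h, 0:w]
    plane = coeffs[0] * gx + coeffs[1] * gy + coeffs[2]
    out = samples.copy()
    hole = np.isnan(out)
    out[hole] = plane[hole]
    return out
```

    A real denoiser would of course use far richer local models than one global plane; this only shows the "interpolate along a plane" idea in isolation.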
     
  3. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    This is probably the best example with further information on the approach, and I can also link an article that highlights the performance gain depending upon the GPU used and whether Tensor cores are involved:
    http://research.nvidia.com/publicat...rlo-image-sequences-using-recurrent-denoising
    https://www.aecmag.com/technology-m...deliver-8x-speed-boost-to-ray-trace-rendering

    The 1st link describes the recurrent denoising autoencoder approach, while the 2nd link shows the interesting performance gains when comparing P100, V100, and V100+AI.
    Still, the demos in the rendering thread are more applicable anyway IMO, and more recent, so a more mature solution.

    Edit:
    I knew I had seen the demo/presentation somewhere; it relates to the AECmag ray tracing image:
    http://on-demand.gputechconf.com/si...-michael-thamm-advanced-rendering-nvidia.html
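    For anyone skimming the paper: the "recurrent" part is essentially carrying history between frames. A toy stand-in for that idea (my own simplification, not the paper's network, which is a full autoencoder with learned weights and warped history) is plain exponential temporal accumulation:

```python
import numpy as np

def temporal_accumulate(noisy_frames, alpha=0.2):
    """Exponentially blend each new noisy frame into a running
    history: out = alpha * current + (1 - alpha) * previous_out.
    A crude stand-in for the history a recurrent denoiser carries
    in its hidden state (no reprojection, no learned weights)."""
    out = np.asarray(noisy_frames[0], dtype=np.float64)
    for frame in noisy_frames[1:]:
        out = alpha * frame + (1.0 - alpha) * out
    return out
```

    On a static scene this already cuts noise variance substantially; the learned network's job is to do something similar while coping with motion and changing lighting.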
     
    #1063 CSI PC, Mar 19, 2018
    Last edited: Mar 19, 2018
    pharma and DavidGraham like this.
  4. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    #1064 CSI PC, Mar 19, 2018
    Last edited: Mar 19, 2018
  5. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    6,154
    Likes Received:
    928
    Location:
    still camping with a mauler
    That's great, but it would still fail in complex scenes with lots of depth discontinuities and motion. Unless their AI can predict the future.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,128
    Likes Received:
    2,887
    Location:
    Well within 3d
    There is a level of artifacting and conservative fall-back that are part of the trade-offs in checkerboard rendering methods. Some of those methods could be applied, just in a more localized manner.
    It's more of a joke about how often a network being trained is going to find correlations based on intersections and reconstructed points stored in a 2D grid.

    A lower density of rays also plays havoc with the granularity of the GPU hardware, which opens up the prospect of "spare" capacity in alignment, fetch, and ALU throughput that can be filled less expensively with historical information or additional analytical work.
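    The checkerboard analogy is simple to make concrete. A minimal sketch (my own illustration, not any shipping implementation): reconstruct the unrendered half of a checkerboard frame from the four rendered neighbours of each missing pixel.

```python
import numpy as np

def checkerboard_fill(img, parity=1):
    """Fill the unrendered half of a checkerboard frame by averaging
    the four rendered neighbours of each missing pixel. Every
    neighbour of a missing pixel has the opposite parity, so (away
    from the borders) the average reads only rendered samples."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
             padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    gy, gx = np.mgrid[0:h, 0:w]
    missing = (gx + gy) % 2 == parity
    out = img.copy()
    out[missing] = neigh[missing]
    return out
```

    Real checkerboard pipelines add the history buffers and conservative fall-backs mentioned above precisely because this naive spatial average breaks down at depth discontinuities.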
     
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    It wouldn’t surprise me if some AI networks are really good at this. Whether or not it can run in real time is a bit of a different story.
     
  8. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    That's not far off what asynchronous spacewarp was already doing with VR, where it would really smooth out animation using motion vectors. Stratifying the scene into various depths possibly addresses some occlusion issues and may even benefit the ray tracing. The different depths could likely do with being updated/traced at different frequencies as well; trees and mountains off in the distance shouldn't need to be updated as frequently as objects in focus, especially in the case of VR, where the outer edges of the screen are blurred or rendered at lower resolution anyway.
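    The core of the spacewarp idea is just warping the last frame along motion vectors. A toy backward-warp sketch (hypothetical and integer-motion only; real ASW also handles sub-pixel motion, disocclusion, and the depth stratification discussed above):

```python
import numpy as np

def reproject(prev_frame, motion):
    """Backward-warp: each output pixel pulls its value from the
    previous frame at (y - dy, x - dx), i.e. from wherever that
    pixel's content came from; out-of-bounds sources are clamped."""
    h, w = prev_frame.shape
    gy, gx = np.mgrid[0:h, 0:w]
    sy = np.clip(gy - motion[..., 0], 0, h - 1)
    sx = np.clip(gx - motion[..., 1], 0, w - 1)
    return prev_frame[sy, sx]
```

    With a uniform motion field this reduces to shifting the frame; per-pixel fields are what let it extrapolate independently moving objects.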
     
  9. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Well, it sounds like we can expect to see tensor cores in consumer-level graphics hardware. I’m not sure where a big perf jump over Pascal is going to come from if the process remains 16/12nm, though: the main thing to cut from the gargantuan V100 die seems to be FP64 throughput, and it’s hard for me to imagine that freeing up enough room for big gains in lower-precision shader power (at affordable die sizes). I guess the tensor cores on the consumer parts will probably be targeted at inference (e.g. int8 or fp8 instead of fp16)?
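    On the inference point: int8 inference generally means linear quantization of weights/activations. A minimal sketch of the common symmetric scheme (generic illustration, not tied to any NVIDIA API):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization to int8 for inference:
    q = round(x / scale) with scale = max|x| / 127, so the largest
    magnitude maps to +/-127 and zero stays exactly zero."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * np.float32(scale)
```

    The appeal for consumer parts is that the int8 math paths are much cheaper in area and power than fp16 accumulate, at the cost of this bounded rounding error.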
     
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah.
    A little while back I was surmising that they will have to implement Tensor cores on the GVx02 and GVx04 (however they end up named) Tesla models, as that is a tier of cards Nvidia also pushes heavily for inferencing, replacing either the P40 or P4 a company may have installed.
    They would have different performance/efficiency characteristics, and so give buyers options on capability along with price, as the Tensor cores are limited in number per GPC.
    I am not sure how much sense it would make to totally redo the GPU for GeForce without the Tensor cores, rather than just disabling them for now, especially when one also considers the potential on Quadro.
     
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,791
    Likes Received:
    2,602
    HOCP did a fresh Titan V review: the card is on average 30% faster than the 1080 Ti now @4K, and in some games, like Kingdom Come, it's 44% faster. In The Division, Doom, Deus Ex and Sniper Elite 4 it's 33% faster. Overclocking easily adds 10~15% more performance. There are still some kinks though: some games flat out refuse to work @4K, and some still crash. And performance @1440p suffers from CPU limitations.

    https://www.hardocp.com/article/2018/03/20/nvidia_titan_v_video_card_gaming_review/5
     
  12. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    They could decide to increase the die size from ~470mm² closer to the ~600mm² seen with P100 for their 'V'102 cards, while the 'V'104 moves up to ~470mm².
    Also, looking back at P100 relative to GP102, the FP64 cores increased the die by 29%, and in a more compute-oriented application such as Amber the 1080 Ti is around 9-14% faster (the Titan GP102 a bit faster again); very crude and simplistic, I know.
    However, a simplistic reduction of V100 is still not enough to get even to 610mm², and how expensive would it make the 'V'102 to be a mature die close to 600mm²? Although yields should be reasonable now, with flexibility to disable at least 2 SMs.
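    Making the crude arithmetic explicit (using the public die sizes GP102 ≈ 471 mm², P100 ≈ 610 mm², V100 ≈ 815 mm², and assuming the same FP64 overhead ratio carries over, which it may not):

```python
# Public die sizes in mm^2; the (crude) assumption is that the
# P100-vs-GP102 delta is the cost of FP64 and that the same ratio
# would carry over to a hypothetical V100 derivative.
gp102, p100, v100 = 471.0, 610.0, 815.0

fp64_overhead = p100 / gp102 - 1.0        # roughly the 29% quoted above
v100_no_fp64 = v100 / (1.0 + fp64_overhead)

print(f"FP64 overhead: {fp64_overhead:.1%}")
print(f"V100 minus FP64 (crude estimate): {v100_no_fp64:.0f} mm^2")
```

    This comes out around 630mm², still above 610mm², which is the point being made: stripping FP64 alone doesn't get a full-fat V100 derivative down to P100-class area.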
     
  13. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    IBM has come up with a nice machine learning library specifically optimised for Power9-NVLink2-Volta V100 nodes:
    https://arxiv.org/pdf/1803.06333.pdf

    It needs more validation, but they are suggesting it is much faster than TensorFlow on their system, ridiculously faster in fact, while also making very good use of NVLink2 relative to PCIe.
    The NVLink2 performance is in section 5.4 (Profiling) of the paper.
    Anyway, it explains nicely their approach to parallelism and also their efficient use of sparse data structures.

    It is not going to replace TensorFlow, but some of the HPC implementations would probably be looking at it at some point.
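    For readers unfamiliar with why sparse formats matter here: a CSR layout stores and traverses only the nonzeros. A minimal toy version (illustration only, far simpler than the paper's actual data structures):

```python
import numpy as np

def to_csr(dense):
    """Compress a dense matrix into CSR form: nonzero values, their
    column indices, and per-row pointers into those arrays."""
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                cols.append(j)
        rowptr.append(len(vals))
    return (np.array(vals, dtype=float),
            np.array(cols, dtype=int),
            np.array(rowptr, dtype=int))

def csr_matvec(vals, cols, rowptr, x):
    """y = A @ x touching only the stored nonzeros of A."""
    y = np.zeros(len(rowptr) - 1)
    for i in range(len(y)):
        s, e = rowptr[i], rowptr[i + 1]
        y[i] = vals[s:e] @ x[cols[s:e]]
    return y
```

    For the very sparse datasets the paper targets, work and memory traffic scale with the nonzero count rather than the full matrix size, which is where the large speedups come from.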
     
    nnunn and pharma like this.
  14. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,791
    Likes Received:
    2,602
    Apparently there is a hardware bug affecting the Titan V in certain scientific applications: engineers tried running identical simulations repeatedly on the Titan V, but each run returned slightly different results. This is not a problem on the Titan Xp, although the author notes that older hardware sometimes had issues like this that got resolved through patches.

    Some people theorized it's a memory issue. It also apparently doesn't affect gaming, though there are still a lot of gaming instabilities with the Titan V.

    https://www.theregister.co.uk/2018/03/21/nvidia_titan_v_reproducibility/
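    Worth noting that run-to-run drift doesn't automatically mean broken silicon: float32 addition is not associative, so a parallel reduction whose grouping varies between runs (e.g. with thread scheduling) can legitimately return different sums on healthy hardware. A minimal demonstration:

```python
import numpy as np

# The same three float32 values summed in two different orders.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

left = (a + b) + c    # cancellation happens first, then + 1
right = a + (b + c)   # b + c rounds back to -1e8 at float32 precision

print(left, right)
```

    Whether that explains the Titan V reports is another matter, since the same binaries reportedly behave on Titan Xp, but it is one reason reproducibility claims need care.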
     
    CSI PC, Picao84, fellix and 3 others like this.
  15. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    They are fortunate this happened before GTC :)
    Regarding the memory though, it is not pushed that hard by Nvidia, possibly due to HBM2's thermal characteristics when it is.
    It may also come down to having the Tensor cores enabled when not using them while doing traditional FP32 compute; the CUDA documentation linked earlier implies mixed-precision mode is used in certain functions when doing so.
    Possibly a behaviour they need to modify unless deliberately selected, if it is seen to be contributing to this problem.
     
    #1075 CSI PC, Mar 22, 2018
    Last edited: Mar 22, 2018
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Shame none of them contacted the Amber team to see if their results are accurate with the FP32 solvent benchmarks for V100.
    I should have said, regarding memory, that I was only commenting in the context of the article talking about when memory is pushed, not whether it is something else, say ECC HBM2 function related.

    But then why was it not identified with P100 if the issue is memory? That was probably running as close to the limits as V100 does, given the more mature HBM2 production now.
    Maybe it is some kind of memory/cache-related issue due to changes such as Unified Memory, amongst others, but it could, as already mentioned, come back to feature behaviour when Tensor cores are not disabled.
     
    #1076 CSI PC, Mar 22, 2018
    Last edited: Mar 22, 2018
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,128
    Likes Received:
    2,887
    Location:
    Well within 3d
    It wasn't made clear how significant the errors were, just how often they happened for some cards.
    The magnitude of an error could be massive if it's random flips anywhere from bit 0 to 31 or 63.
    If it's more constrained, perhaps it's a timing issue in specific operations' blocks, or conversion logic using the wrong precision or producing errors in a specific subset of bits.

    Perhaps downclocking and overvolting the silicon or memory can help narrow down whether the instability is in one place or the other.
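    On the magnitude point: a single bit flip in a float32 can change the value by anywhere from one ulp to infinity, depending on which bit lands. A quick sketch:

```python
import struct

def flip_bit(x, k):
    """Return float32 x with bit k flipped (bit 0 = mantissa LSB,
    bits 23-30 = exponent, bit 31 = sign)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << k)))
    return y

# Flipping different bits of 1.0: an ulp, the exponent, or the sign.
for k in (0, 23, 30, 31):
    print(k, flip_bit(1.0, k))
```

    A flip in the low mantissa bits is nearly invisible, while a flip in the high exponent bits blows the value up completely, which is why "how big were the errors" matters as much as "how often".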
     
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    They made it sound like it was very reproducible on all the cards.
    The HBM2 memory is clocked lower than AMD's and well within spec from what we can tell.
    Like I said, the P100 was running closer to the edge with HBM2 (with what was available at the time) but does not seem to have these problems.
     
  19. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,186
    Likes Received:
    1,841
    Location:
    Finland
    They specifically said it happened on 2 out of 4 cards, not that it was reproducible on all of them. The only spots mentioning anything about reproducibility were the ones discussing how important reproducibility is for scientific calculations.
     
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    You're right, 2 of the 4.
    However, it is reproducible in the sense that it happens 10% of the time on half their cards (a small sample size for statistics, I agree).
    But we have no more information than that, compounded by that last sentence I quoted; the P100 was running closer to the limits of memory for when it was available and was not downclocked.
     
    #1080 CSI PC, Mar 23, 2018
    Last edited: Mar 23, 2018