Nvidia Turing Architecture [2018]

Discussion in 'Architecture and Products' started by pharma, Sep 13, 2018.

  1. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    987
    Likes Received:
    278
Is there a video of him juggling those spinning plates? :-?
     
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    Is Variable Rate Shading just Rapid Packed Math on steroids?

    NVIDIA showed Wolfenstein II running with a technique called Content Adaptive Shading, which uses Variable Rate Shading to dynamically identify portions of the screen that have low detail or large swathes of similar colors and shade those at a lower rate, more so when you're in motion. Switching that technique on yielded an increase of up to 20 fps compared to leaving it off. (A rough sketch of the general idea follows the link below.)

    https://www.pcworld.com/article/330...cs/nvidia-turing-gpu-geforce-rtx-2080-ti.html
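    For illustration, a rough sketch of the general idea, not NVIDIA's actual heuristic (the tile size and the variance thresholds here are invented): measure how much a tile's content varies and shade the flat tiles at a coarser rate.

    Code:
    #include <cstdio>
    #include <cmath>

    // Hypothetical shading rates, coarsest to finest.
    enum ShadingRate { RATE_4X4, RATE_2X2, RATE_1X1 };

    // Pick a shading rate for one 16x16 tile of luminance values based on how
    // much detail (variance) the tile contains. Thresholds are made up.
    ShadingRate pickRate(const float* luma, int tileW, int tileH)
    {
        int n = tileW * tileH;
        float mean = 0.0f;
        for (int i = 0; i < n; ++i) mean += luma[i];
        mean /= n;

        float var = 0.0f;
        for (int i = 0; i < n; ++i) {
            float d = luma[i] - mean;
            var += d * d;
        }
        var /= n;

        if (var < 0.0005f) return RATE_4X4;  // flat region: 1 sample per 4x4 pixels
        if (var < 0.01f)   return RATE_2X2;  // mild detail: 1 sample per 2x2 pixels
        return RATE_1X1;                     // detailed region: full-rate shading
    }

    int main()
    {
        float flatTile[256], noisyTile[256];
        for (int i = 0; i < 256; ++i) {
            flatTile[i]  = 0.5f;                                    // uniform color
            noisyTile[i] = 0.5f + 0.3f * (float)std::sin(i * 0.7);  // high-frequency detail
        }
        std::printf("flat tile  -> rate %d\n", pickRate(flatTile, 16, 16));
        std::printf("noisy tile -> rate %d\n", pickRate(noisyTile, 16, 16));
        return 0;
    }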
     
  3. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Isn’t RPM just 2xFP16 squeezed into FP32?

    I don't see the link with VRS; it seems to be something completely different.
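    For reference, here's roughly what "2xFP16 squeezed into FP32" looks like from the software side: two half-precision values packed into one 32-bit register, with a single instruction operating on both halves. A minimal CUDA sketch using the half2 intrinsics from cuda_fp16.h (the data is just dummy values, and there's no error checking):

    Code:
    #include <cuda_fp16.h>
    #include <cstdio>

    // Each thread packs two FP16 values into one 32-bit half2 register and issues
    // a fused multiply-add that operates on both halves at once.
    __global__ void packed_fma(const float2* a, const float2* b,
                               const float2* c, float2* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        __half2 ha = __floats2half2_rn(a[i].x, a[i].y);  // pack 2 x FP16 into 32 bits
        __half2 hb = __floats2half2_rn(b[i].x, b[i].y);
        __half2 hc = __floats2half2_rn(c[i].x, c[i].y);
        __half2 r  = __hfma2(ha, hb, hc);                // two FP16 FMAs, one instruction
        out[i] = __half22float2(r);                      // unpack to FP32 for printing
    }

    int main()
    {
        const int n = 2;
        float2 ha[n] = { {1.0f, 2.0f}, {3.0f, 4.0f} };
        float2 hb[n] = { {0.5f, 0.5f}, {0.25f, 0.25f} };
        float2 hc[n] = { {1.0f, 1.0f}, {1.0f, 1.0f} };
        float2 hout[n];

        float2 *da, *db, *dc, *dout;
        cudaMalloc((void**)&da, sizeof(ha));  cudaMalloc((void**)&db, sizeof(hb));
        cudaMalloc((void**)&dc, sizeof(hc));  cudaMalloc((void**)&dout, sizeof(hout));
        cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
        cudaMemcpy(dc, hc, sizeof(hc), cudaMemcpyHostToDevice);

        packed_fma<<<1, 32>>>(da, db, dc, dout, n);
        cudaMemcpy(hout, dout, sizeof(hout), cudaMemcpyDeviceToHost);

        for (int i = 0; i < n; ++i)
            printf("%.2f %.2f\n", hout[i].x, hout[i].y);  // a*b+c per 16-bit half
        return 0;
    }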
     
    Alexko, pharma and DavidGraham like this.
  4. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,716
    Likes Received:
    5,813
    Location:
    ಠ_ಠ
    Seems to be a more flexible evolution of Multi-Res Shading.
     
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    I see. I thought it worked through double-rate FP16, which Turing now supports in the consumer line.
     
  6. AlBran

    AlBran Ferro-Fibrous
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,716
    Likes Received:
    5,813
    Location:
    ಠ_ಠ
    The concept in the slides, at least, seems to point to shading at the granularity of a given grouping of pixels (I think).

    Dynamically switching between fp16/fp32 is probably not what's going on there.
     
    DavidGraham likes this.
  7. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
    But if you compare the diagrams, they now look the same.

    And if you look at the Turing asteroids tech demo, they also talk about the huge number of polygons. That was one major aspect of Vega.

    For Vega, AMD claimed that they could cull at a very early stage. Does Turing also have this feature?
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    There is a similarity in how the pipelines go from fixed → programmable → programmable → fixed → programmable → programmable → fixed to fixed → programmable → fixed → programmable → fixed.
    What happens at the input and output of each programmable block also changes in several parts of Nvidia's diagram. AMD's discussion of its new geometry pipeline kept the input assembler and tessellation stages as they were, which may constrain the changes somewhat by not giving it the same kinds of flexible inputs and outputs Nvidia's method can choose. It may be that it has moved portions of those intervening non-programmable functions into the combined programmable stages.
    The biggest emphasis from AMD was enhanced culling, which Nvidia's section discussing its mesh and task shaders didn't really emphasize. Granted, the apparent flexibility could mean that the minimum promised by primitive shaders could be handled as well.

    The asteroid demo's change was having the front-end shader select a different variant of the model based on how much detail was actually necessary, not reading in and then culling out non-contributing triangles with extra shader code on top of the existing shaders. The primitive shader is taking orders from a standard draw call where the decision making was done earlier, not selecting different models on the fly.
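    (A crude CPU-side sketch of that kind of selection, purely illustrative, with the screen-coverage metric and thresholds being mine rather than the demo's: estimate how many pixels the object covers and pick one of the pre-built model variants from that, which is the sort of per-object decision a task shader could make before any per-triangle work happens.)

    Code:
    #include <cstdio>
    #include <cmath>

    // Estimate the on-screen diameter (in pixels) of an object's bounding sphere
    // and pick one of several pre-built model variants. Thresholds are arbitrary.
    int selectLod(float distance, float boundingRadius,
                  float verticalFovRadians, int screenHeightPx)
    {
        // Projected diameter in pixels of a sphere of this radius at this distance.
        float pixelsPerUnitAngle =
            screenHeightPx / (2.0f * (float)std::tan(verticalFovRadians * 0.5f));
        float projected = (2.0f * boundingRadius / distance) * pixelsPerUnitAngle;

        if (projected > 400.0f) return 0;  // full-detail mesh
        if (projected > 100.0f) return 1;  // medium
        if (projected > 10.0f)  return 2;  // low
        return 3;                          // tiny proxy / billboard
    }

    int main()
    {
        const float fov = 1.0f;     // ~57 degree vertical field of view
        const int height = 1440;    // screen height in pixels
        float distances[] = { 5.0f, 50.0f, 500.0f, 5000.0f };
        for (float d : distances)
            std::printf("distance %6.0f -> LOD %d\n", d, selectLod(d, 2.0f, fov, height));
        return 0;
    }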


    Other changes indicate some tweaks to the other stages like the hand-off from rasterizer to pixel shader stage, where coverage and shading can be varied on what appears to be a granularity close to a rasterization tile rather than whole swaths of screen space.
    The texture-space shading, in combination with the other changes, reminds me of some thoughts I had about VR rendering a while back, where the quick and dirty way of scaling performance was having a GPU per eye. A modified rendering scheme that could borrow some of the reconstruction techniques that leverage temporal coherence might be used to skip duplicated rendering operations on objects too far from the viewer to have a significant difference in angle or contribution. Further, with variable-rate and foveated rendering, more work could be skipped due to the limited space from which the human visual system can infer depth or perceive significant detail.

    Some of these shortcuts don't work as well if the system cannot account for the viewer's eyes moving, however.


    Perhaps I missed it, but is there any detail on the whereabouts of the PolyMorph blocks? (edit: never mind, found them after squinting at the block diagram)
     
    #28 3dilettante, Sep 15, 2018
    Last edited: Sep 15, 2018
  9. dogen

    Regular Newcomer

    Joined:
    Oct 27, 2014
    Messages:
    335
    Likes Received:
    259
    Regarding the part about mesh shaders in that Hexus article: "Running on older hardware, though possible, would require a multi-pass compute shader to be used, negating the benefits entirely."

    Isn't that BS? Aren't there games that do exactly that already?
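    For what it's worth, the "multi-pass compute shader" path games already use is roughly: one compute pass tests each triangle (or cluster) and compacts the survivors into a buffer, and a later pass draws only those. A stripped-down CUDA sketch of that compaction step, backface test only, with the names and data layout being mine:

    Code:
    #include <cuda_runtime.h>
    #include <cstdio>

    // Compute pass of a GPU-driven pipeline: test each triangle and append the
    // survivors to a compacted buffer; a later pass draws only those.
    __global__ void cullBackfaces(const float3* pos, const uint3* tris, int numTris,
                                  uint3* visibleTris, unsigned int* visibleCount,
                                  float3 viewDir)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= numTris) return;

        uint3 tri = tris[t];
        float3 a = pos[tri.x], b = pos[tri.y], c = pos[tri.z];
        // Face normal via the cross product of two edges.
        float3 e0 = make_float3(b.x - a.x, b.y - a.y, b.z - a.z);
        float3 e1 = make_float3(c.x - a.x, c.y - a.y, c.z - a.z);
        float3 n  = make_float3(e0.y * e1.z - e0.z * e1.y,
                                e0.z * e1.x - e0.x * e1.z,
                                e0.x * e1.y - e0.y * e1.x);
        // Keep only triangles facing the viewer.
        if (n.x * viewDir.x + n.y * viewDir.y + n.z * viewDir.z < 0.0f) {
            unsigned int slot = atomicAdd(visibleCount, 1u);
            visibleTris[slot] = tri;
        }
    }

    int main()
    {
        // Two triangles with opposite winding: one front-facing, one back-facing.
        float3 hPos[3]  = { {0.0f, 0.0f, 0.0f}, {1.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f} };
        uint3  hTris[2] = { {0, 1, 2}, {0, 2, 1} };
        float3 viewDir  = make_float3(0.0f, 0.0f, -1.0f);  // camera looking down -Z

        float3* dPos;  uint3 *dTris, *dVisible;  unsigned int* dCount;
        cudaMalloc((void**)&dPos, sizeof(hPos));
        cudaMalloc((void**)&dTris, sizeof(hTris));
        cudaMalloc((void**)&dVisible, sizeof(hTris));
        cudaMalloc((void**)&dCount, sizeof(unsigned int));
        cudaMemcpy(dPos, hPos, sizeof(hPos), cudaMemcpyHostToDevice);
        cudaMemcpy(dTris, hTris, sizeof(hTris), cudaMemcpyHostToDevice);
        cudaMemset(dCount, 0, sizeof(unsigned int));

        cullBackfaces<<<1, 32>>>(dPos, dTris, 2, dVisible, dCount, viewDir);

        unsigned int hCount = 0;
        cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
        printf("%u of 2 triangles survive culling\n", hCount);
        return 0;
    }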
     
    3dcgi likes this.
  10. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    331
    Likes Received:
    85
    Most of the articles on the architecture are BS, not surprisingly. "Variable rate shading"? You mean the F*ing software paper on that from recently, which has shit all to do with hardware? Yeah, thanks for including that in a hardware paper. Why is the "white paper" filled with PR bullshit? I'm just trying to read about your computer architecture, guys; keep it out of the hands of the god damned PR people.

    That being said, there is some cleverness here. The restructured low-level cache seems like a good idea and a straight-up win. Depending on the separate INT cores' actual silicon area, though, it may not justify their stated maximum throughput improvement of 36%, and what that area actually is just isn't known. I'd also question the placement of their tensor cores in the same SM as the FP/INT compute. A huge amount of the energy used by inferencing comes from shuttling data to and from memory, which is why inferencing-specific chips have huge local caches, far bigger than those usually needed for other sorts of GPU tasks. It still feels like a compromise between the older training purpose of tensor cores and what games would use them for in the near future, which is solely inferencing. If they're really diverging gaming and high-performance compute, then shoving the cores into their own block with their own cache structure could be a bigger win.

    From that paper that's all I can really conclude. There's a lot of shit and not a lot of info. Fortunately Anandtech has done a proper job and shows that the CUDA cores are responsible for BVH construction. How this is done isn't shown, and I wonder what the performance is; too many moving parts could quickly bottleneck a game. Regardless, one of the biggest things is simply how big everything is. For reference, a GTX 1080 (GP104) is a mere 341 mm², while the 2080 is a massive 545 mm², and that's on the smaller 12 nm process. Assuming the linked leaks are real, performance per mm² has, uhmm, gone down since Pascal. Just to equal Pascal going from a 1080 to a 2080 would need roughly a 60% performance increase; the actual increase is more along the lines of 40-45%. Keep in mind that's not taking into account that there are more transistors per mm² with the improved process. Right now, considering talented programmers can get the same level of raytracing performance out of a 1080 Ti as Nvidia claims can come out of their new RTX cards, well, consider me unimpressed.
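    For what it's worth, that die-area arithmetic spelled out; the die sizes are the ones quoted above, and the 40-45% uplift is the figure assumed from the leaks, not a measurement:

    Code:
    #include <cstdio>

    int main()
    {
        const double gp104_mm2 = 341.0;  // GTX 1080 die area quoted above
        const double tu104_mm2 = 545.0;  // RTX 2080 die area quoted above

        // Performance increase needed for the 2080 to match the 1080's perf/mm^2.
        double areaRatio = tu104_mm2 / gp104_mm2;  // ~1.60
        double needed    = areaRatio - 1.0;        // ~0.60, i.e. ~60%
        double assumed   = 0.425;                  // midpoint of the rumoured 40-45%

        printf("area ratio: %.2fx -> need ~%.0f%% more performance to hold perf/mm^2\n",
               areaRatio, needed * 100.0);
        printf("assumed uplift ~%.1f%% -> perf/mm^2 relative to Pascal: %.2f\n",
               assumed * 100.0, (1.0 + assumed) / areaRatio);
        return 0;
    }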
     
    compres likes this.
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    Which talented programmer? Who even made such a preposterous claim?
     
    homerdog, pharma and OCASM like this.
  12. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    360
    Likes Received:
    252
    I guess he is talking about distance-field ray tracing, or some kind of voxel cone tracing, or sphere tracing. Obviously, he doesn't realise there is a difference once we're talking about tracing polygon soup, which is what current games are.
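    For anyone unfamiliar with those: sphere tracing, for example, marches a ray through a signed distance field, stepping each time by the distance to the nearest surface, which works nicely for analytic or voxelized distance fields but doesn't map onto arbitrary triangle soup. A minimal sketch with a single analytic sphere (parameters made up):

    Code:
    #include <cstdio>
    #include <cmath>

    struct Vec3 { float x, y, z; };

    static float length3(Vec3 v) { return (float)std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

    // Signed distance to a unit sphere centred at the origin.
    static float sdfSphere(Vec3 p) { return length3(p) - 1.0f; }

    // Sphere tracing: advance along the ray by the SDF value until the surface is
    // reached (distance ~ 0) or the ray escapes. Returns the hit distance or -1.
    static float sphereTrace(Vec3 origin, Vec3 dir)
    {
        float t = 0.0f;
        for (int i = 0; i < 128; ++i) {
            Vec3 p = { origin.x + t * dir.x, origin.y + t * dir.y, origin.z + t * dir.z };
            float d = sdfSphere(p);
            if (d < 1e-4f) return t;   // close enough: hit
            t += d;                    // safe step: nothing is closer than d
            if (t > 100.0f) break;     // ray escaped the scene
        }
        return -1.0f;
    }

    int main()
    {
        Vec3 origin  = { 0.0f, 0.0f, -5.0f };
        Vec3 hitDir  = { 0.0f, 0.0f, 1.0f };  // straight at the sphere
        Vec3 missDir = { 0.0f, 1.0f, 0.0f };  // parallel to it, misses
        std::printf("hit  at t = %.3f\n", sphereTrace(origin, hitDir));   // ~4.0
        std::printf("miss at t = %.3f\n", sphereTrace(origin, missDir));  // -1
        return 0;
    }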
     
    #32 OlegSH, Sep 15, 2018
    Last edited: Sep 15, 2018
  13. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Even if the new RTX hardware were only twice as fast as the fastest possible GPU ray tracer, that would still be good.
    Published results for GPU ray tracers put primary rays in the range of 1-3 Grays/s on a Pascal GPU with scenes of ~1 million polygons. Turing does 12 Grays/s on the Stanford Buddha (100K polygons, if I'm not mistaken) on page 33 of the white paper. On the other hand, the bounding box of the Buddha model covers only about 1/3 of the screen, whereas the scenes in those published results cover the whole screen. Taking that into account, it works out to ~2 Grays/s for Pascal compared to ~4 Grays/s for Turing.
    The question is whether primary rays are even relevant, as they can be done much faster with rasterization. Secondary rays can be much more incoherent, thrashing caches, but those also improved a lot on Turing.
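    Spelling that normalization out; all numbers as quoted above, and the 1/3 screen-coverage figure is an eyeball estimate rather than a measurement:

    Code:
    #include <cstdio>

    int main()
    {
        const double turing_buddha   = 12.0;       // Grays/s on the Stanford Buddha (white paper)
        const double buddha_coverage = 1.0 / 3.0;  // rough fraction of the screen the model fills
        const double pascal_low = 1.0, pascal_high = 3.0;  // published full-screen results

        // If only ~1/3 of primary rays actually hit geometry, a comparable
        // full-screen figure for Turing is roughly a third of the quoted rate.
        double turing_fullscreen = turing_buddha * buddha_coverage;   // ~4 Grays/s
        double pascal_mid        = 0.5 * (pascal_low + pascal_high);  // ~2 Grays/s

        printf("Turing, full-screen equivalent: ~%.1f Grays/s\n", turing_fullscreen);
        printf("Pascal, published mid-point:    ~%.1f Grays/s\n", pascal_mid);
        printf("ratio: ~%.1fx\n", turing_fullscreen / pascal_mid);
        return 0;
    }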
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    I was entertaining the thought that dedicated INT32 cores might consume less energy doing their INT32 work than having to shove it through the FP32 pipe. Welcome to the world of energy over space. I might be wrong, but further idle thoughts came up that led me to consider the idea that Turing is a contingency plan for 7 nm not being ready in time, not being available in large enough volume, or not living up to expectations energy-wise. That would be supported both by the separation of cores, which is not solely for a performance increase but also for energy efficiency, and by the immense amount of chip area invested in consumer products. It would also explain why no one outside of Nvidia had ever heard of Turing until a couple of months ago. Maybe it was intended to be Ampere at 7 nm and with ~40% less die space.

    I find it amusing that, after the 2006/2007 push to unify everything on the chip, we are now back to dedicated units for a lot of special compute cases again. Not counting rasterizers, TMUs, etc., which have been there all along, we now have:
    • SMs with shader ALUs („CUDA cores“), including
      • FP32 groups
      • INT32 groups
      • FP64 groups
      • Tensor groups
      • RT „cores“ (BVH traversers / triangle-intersection checks)
      • L/S + TMU
     
  15. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,431
    Likes Received:
    551
    Location:
    Texas
    I think that’s a very reasonable idea. It all depends on the complexity of the pipeline. It could also be to shorten timing paths in the pipeline stages to allow for higher clock speeds.
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Why isn’t DLSS 2x the standard mode? Seems a little bizarre to push a lower resolution render that just matches TAA instead of the mode that actually improves IQ.

    Hopefully the larger network requirement doesn’t limit the availability of the 2x mode.
     
  17. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    DLSS 2X doesn't give big increases in fps. It comes at the cost of some performance.
    Exactly.

    Here is what Sebbi had to say about RTX (hardware BVH) and his cone tracing implementation:



     
    Kej, senis_kenis, Heinrich4 and 3 others like this.
  18. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    If you want to compare with real ray tracing for voxel based rendering, there are some videos here.
    The bottom Colon video renders at 4K, with 16 views covering the whole screen, at 2 Grays/s on a Pascal GP102 GPU (only 100 Mrays/s on an 8-core CPU, with most of the time obviously spent on trilinear voxel interpolation). No BVH is used, as that would be far from optimal and also not flexible enough to handle transfer-function changes. I can create some extra videos.
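    Since trilinear voxel interpolation is where the CPU time goes there, this is roughly what that inner loop amounts to; a minimal sketch over a tiny dense grid, with the layout and sizes purely illustrative:

    Code:
    #include <cstdio>
    #include <cmath>

    // Trilinear interpolation of a dense voxel grid stored as data[z][y][x]:
    // the eight neighbouring samples blended by the fractional position.
    // No bounds checking, for brevity.
    float sampleTrilinear(const float* data, int nx, int ny, int nz,
                          float x, float y, float z)
    {
        (void)nz;  // depth not needed for addressing; kept for a realistic signature
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y), z0 = (int)std::floor(z);
        int x1 = x0 + 1, y1 = y0 + 1, z1 = z0 + 1;
        float fx = x - x0, fy = y - y0, fz = z - z0;

        // Fetch the 8 corner voxels.
        #define V(ix, iy, iz) data[((iz) * ny + (iy)) * nx + (ix)]
        float c000 = V(x0, y0, z0), c100 = V(x1, y0, z0);
        float c010 = V(x0, y1, z0), c110 = V(x1, y1, z0);
        float c001 = V(x0, y0, z1), c101 = V(x1, y0, z1);
        float c011 = V(x0, y1, z1), c111 = V(x1, y1, z1);
        #undef V

        // Blend along x, then y, then z.
        float c00 = c000 + fx * (c100 - c000), c10 = c010 + fx * (c110 - c010);
        float c01 = c001 + fx * (c101 - c001), c11 = c011 + fx * (c111 - c011);
        float c0 = c00 + fy * (c10 - c00), c1 = c01 + fy * (c11 - c01);
        return c0 + fz * (c1 - c0);
    }

    int main()
    {
        // A 2x2x2 grid whose value is x + 10*y + 100*z at the corners, so the
        // interpolated value at (0.5, 0.5, 0.5) should come out as 55.5.
        float grid[8];
        for (int z = 0; z < 2; ++z)
            for (int y = 0; y < 2; ++y)
                for (int x = 0; x < 2; ++x)
                    grid[(z * 2 + y) * 2 + x] = x + 10.0f * y + 100.0f * z;
        std::printf("%.2f\n", sampleTrilinear(grid, 2, 2, 2, 0.5f, 0.5f, 0.5f));
        return 0;
    }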
     
    #38 Voxilla, Sep 15, 2018
    Last edited: Sep 15, 2018
  19. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,029
    Likes Received:
    3,101
    Location:
    Pennsylvania
    I don't think it's really a valid comparison with all the extra die space dedicated to Tensor and RT cores, since in comparing to Pascal you're referring to perf/mm² in standard rasterized games. If you discount the die space for the additional hardware, the uplift is probably the expected amount.

    Alternatively, take a game with RT and DLSS support, using the TAA-level mode for comparison, and the perf/mm² is probably huge for Turing.

    Yes, it starts getting harder now to simply compare perf/mm² since there's a lot of hardware in there sitting idle at the moment. I just don't think it's directly comparable anymore, at least until next-gen Turing.
     
    compres, xpea, pharma and 1 other person like this.
  20. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
    Sebbi is also talking about the primitive shader / mesh shader
    (more information in the Twitter feed).
     
    OCASM, Kej, Silent_Buddha and 5 others like this.