Nvidia Turing Architecture [2018]

Discussion in 'Architecture and Products' started by pharma, Sep 13, 2018.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    Hopefully someone does a good article on Primitive/Mesh/Task shaders with real world examples. Still not clear to me what’s so terrible about the existing pipeline and what’s so great about the combined stages. It’s all the same code in the end isn’t it?
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,796
    Likes Received:
    2,054
    Location:
    Germany
    You could end up with a lot fewer draw calls using NV's or AMD's revised pipelines. But the feature has to a) work and b) be implemented by developers, AFAIU.
     
    pharma likes this.
  3. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,148
    Likes Received:
    570
    Location:
    France
    Honestly, for AMD I think primitive shaders are dead for Vega at this point. They were never even made available to devs. So I guess we'll see with Navi, and how it compares to the nVidia solution. Pretty interesting.
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    Seems it’s dead unless Microsoft adopts it. Pretty fundamental changes to the pipeline.
     
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The way I read the Variable Rate Shading section of the whitepaper, Turing has hardware support for something that, in the past, was done in software. If you can specify, in hardware, for each 16x16 rectangle on the screen that the shading rate can be lowered to 1/2 or 1/4 of the pixels, then you are reducing the load on the shaders and texture units by a similar amount. (I assume that the ROPs will still be operating at the same rate?)
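
    If you want to put rough numbers on that, here's a back-of-the-envelope sketch in plain C++ (the tile policy, names, and rate set are invented for illustration; Turing's actual rate set also includes rectangular modes like 1x2 and 2x1):

    Code:
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical per-16x16-tile shading rates. Three modes suffice to
    // illustrate; real hardware exposes more.
    enum class ShadingRate : uint8_t { Rate1x1, Rate2x2, Rate4x4 };

    // Fraction of full-rate pixel-shader invocations a tile costs.
    double invocationFraction(ShadingRate r) {
        switch (r) {
            case ShadingRate::Rate1x1: return 1.0;        // every pixel shaded
            case ShadingRate::Rate2x2: return 1.0 / 4.0;  // one shade per 2x2 quad
            case ShadingRate::Rate4x4: return 1.0 / 16.0; // one shade per 4x4 quad
        }
        return 1.0;
    }

    int main() {
        const int tilesX = 120, tilesY = 68; // ~1920x1088 in 16x16 tiles
        std::vector<ShadingRate> rateImage(tilesX * tilesY, ShadingRate::Rate1x1);

        // Toy policy: shade the left/right quarters of the screen at 2x2.
        for (int y = 0; y < tilesY; ++y)
            for (int x = 0; x < tilesX; ++x)
                if (x < tilesX / 4 || x >= 3 * tilesX / 4)
                    rateImage[y * tilesX + x] = ShadingRate::Rate2x2;

        double shaded = 0.0;
        for (ShadingRate r : rateImage)
            shaded += 16.0 * 16.0 * invocationFraction(r);

        const double full = 16.0 * 16.0 * tilesX * tilesY;
        std::printf("invocations: %.0f of %.0f (%.1f%%)\n",
                    shaded, full, 100.0 * shaded / full);
    }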

    I'm trying to read about computer architecture too, and my reading is clearly very different from yours. I must be biased.

    But I'm sensing a pattern in your line of argumentation here:
    iff <could be done in software> and <Turing does it in hardware> then <it's a bullshit feature>

    First ray tracing (news flash: I ran ray tracing on my 80286), now this. Why don't you start by considering performance aspects as well before making judgement?

    Here's another conclusion from the whitepaper: the RT cores may simply be 'an extra instruction', as you wrote earlier, but it seems pretty clear from the whitepaper that this extra instruction kicks off a LOT of hardware. I can't wait to see how Vega will pull that off with a software driver.

    I wonder how you came to this conclusion. Here's all I can find in the Anandtech article about BVH construction:
    Maybe I missed something, but the whitepaper doesn't talk about BVH construction at all either.

    Oh boy, here we go again.
     
    OCASM, Heinrich4, Geeforcer and 6 others like this.
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    The article does make the claim but I don’t see anything in the white paper. Maybe AT got some info offline.
     
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Thanks for pointing that out. I finally found the quote that I was looking for:
     
  8. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    For me, "variable rate shading" screams "foveated rendering" for VR. Yes, you can do it in software on older hardware, but for various reasons (SIMD utilization is the big one. Basically, the rasterizer thinks it should render all the pixels, so you get divergence in all your pixel shaders since your SIMDs get populated with all the pixels you're dropping) it turns out to be pretty slow. The paper I read was reporting something like a 10-15% performance increase, but this is with them subsampling down to only 50% or less of the original pixel count! Having the fixed function bits and pieces around the pixel shader actually designed with variable sampling rate in mind should basically make pixel shading costs scale 1:1 with your sampling rate.
     
    OCASM, pharma, DavidGraham and 2 others like this.
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    Thanks for sharing.

    So far what I’ve gathered is:

    - Mesh shaders ultimately spit out triangles to be rasterized into pixels to be shaded
    - Mesh shaders can do much of the object-level culling and LOD selection work currently done on the CPU
    - This stuff could all be done using compute shaders, but a lot of memory allocation and caching optimizations are taken care of by Nvidia if you use the mesh pipeline
    - Because of these optimizations and the offloading of work from the CPU, we can now handle much more complex objects made up of many triangles by breaking them into smaller chunks of work (meshlets; a sketch of one follows below)

    What I don’t get is how does this help with having lots more objects?
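
    For reference, a minimal sketch of what a meshlet description might look like (plain C++; the field names are invented, though the 64-vertex / 126-triangle limits match the figures in Nvidia's mesh shader introduction):

    Code:
    #include <cstdint>
    #include <vector>

    // Illustrative meshlet description; not an actual API.
    struct Meshlet {
        uint32_t vertexOffset;   // first entry in the shared vertex-index list
        uint32_t triangleOffset; // first entry in the local triangle list
        uint8_t  vertexCount;    // up to 64 unique vertices referenced
        uint8_t  triangleCount;  // up to 126 triangles
        float    boundingSphere[4]; // xyz center, w radius, for cluster culling
    };

    struct MeshletGeometry {
        std::vector<Meshlet>  meshlets;
        std::vector<uint32_t> vertexIndices;  // indices into the real vertex buffer
        std::vector<uint8_t>  localTriangles; // 3 bytes per triangle, each byte
                                              // indexing the meshlet's own vertices
    };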
     
    Lightman, jlippo, Heinrich4 and 2 others like this.
  10. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    I think part of it is that you completely bypass a bunch of fixed function geometry hardware that does not scale as well as the number of SMs.

    If that hardware was a bottleneck before, you have now eliminated that.

    Another one is that it potentially allows for more reuse of data, which reduces redundant calculations.
     
    OCASM, Alexko, Lightman and 8 others like this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,118
    Likes Received:
    2,860
    Location:
    Well within 3d
    My reading of the method is that decisions about cluster and LOD handling would occur in the task shader, which would then control how many mesh shader invocations would be created, and what it is they would be working on.
    Per-primitive culling is something generally recommended to be left to the fixed-function hardware, which has also been improved since Pascal. The video showed a few scenarios where the conventional and mesh pipelines were rather close in terms of performance, highlighting the balancing act of managing culling in cases where not enough work is saved later in the process to justify the obligatory up-front overhead. While still a generally significant improvement, the video presentation had an interesting, almost awkward, tone about how the fixed-function hardware had kept improving and made the new shaders less impressive in certain places.
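
    A rough sketch of that task-stage control flow, in plain C++ standing in for a task shader (the types, two-LOD setup, and distance metric are all invented for illustration):

    Code:
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Vec3 { float x, y, z; };
    struct Cluster {
        Vec3     center;
        float    radius;
        uint32_t meshletFirst[2]; // meshlet range per LOD (two LODs assumed)
        uint32_t meshletCount[2];
    };
    // Inward-facing plane: a sphere is inside if dot(n, c) + d >= -radius.
    struct Plane { Vec3 n; float d; };

    static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // Task-shader-like stage: cull whole clusters, choose an LOD, and append
    // only the surviving meshlets for the mesh stage to consume.
    void taskStage(const std::vector<Cluster>& clusters,
                   const Plane frustum[6], Vec3 camera,
                   std::vector<uint32_t>& meshletsToRun) {
        for (const Cluster& c : clusters) {
            bool visible = true;
            for (int i = 0; i < 6 && visible; ++i)
                visible = dot(frustum[i].n, c.center) + frustum[i].d >= -c.radius;
            if (!visible) continue; // cluster culled: no mesh shaders launched

            Vec3 toCam{camera.x - c.center.x, camera.y - c.center.y,
                       camera.z - c.center.z};
            int lod = std::sqrt(dot(toCam, toCam)) > 50.0f ? 1 : 0; // toy metric

            for (uint32_t m = 0; m < c.meshletCount[lod]; ++m)
                meshletsToRun.push_back(c.meshletFirst[lod] + m);
        }
    }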

    At least for the compute methods discussed by the presentation, using them at the front of the overall pipeline means they can stay on chip and reduces the overhead introduced by the launch of large compute batches. I presume there's cache thrashing and pipeline events that must write back to memory if the generating shader is in a prior compute pass.

    The base pipeline always has a primitive distributor that runs sequentially through a single index buffer, creates a de-duplicated in-pipeline vertex batch, and builds a list of in-pipeline primitives that refer back to the entries from the vertex batch they use.
    When much of the workload does not change what this one serial unit has to work through every frame, this means a lot of duplicated work bound by straight-line serial processing speed.

    The meshlets effectively take the in-pipeline vertex and primitive lists and package them for reuse. Individual mesh shaders can then read from multiple mesh contexts, increasing throughput. On top of that, they are permitted to use a topology that reuses vertices more than the traditional triangle strip, and can be made to reduce the amount of attribute data that is passed around. This further compresses the bandwidth needs of the process. The overall path also has an advantage over multi-draw calls: if an instance is small enough to fit in the shader context, instances can be processed in parallel, rather than through the serial iteration a multi-draw indirect command performs at the top of the pipeline.
    In combination with the task shader, the overall pipeline can distribute work more concurrently, align primitive processing more effectively with the SIMD architecture, leverage the existing on-chip management, and it leaves open programmable methods for compression and culling.
    What specifically happens for the more dynamic parts of the workload that do not benefit as much from reuse wasn't in the video, though it was noted these methods were considered an adjunct to the existing and still-improving traditional pipeline. More variable work may not "compress" as well, and the existing tessellation path is still more efficient for the specific patterns it works for.
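
    As a sketch of the serial pass that distributor performs, and of what packaging its output into meshlets amounts to doing once up front (plain C++; structure names invented, limits again borrowed from Nvidia's example figures):

    Code:
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // One de-duplicated batch: unique vertex indices plus triangles that
    // refer to slots within that batch (the in-pipeline lists in the text).
    struct Batch {
        std::vector<uint32_t> uniqueVertices; // indices into the vertex buffer
        std::vector<uint8_t>  localTriangles; // 3 batch-local indices per triangle
    };

    // Serial walk over a triangle-list index buffer, mimicking the primitive
    // distributor: emit a new batch whenever vertex or triangle limits hit.
    // Doing this once and storing the result is essentially meshlet packaging.
    std::vector<Batch> buildBatches(const std::vector<uint32_t>& indices,
                                    size_t maxVerts = 64, size_t maxTris = 126) {
        std::vector<Batch> batches(1);
        std::unordered_map<uint32_t, uint8_t> remap; // vertex index -> local slot

        for (size_t i = 0; i + 2 < indices.size(); i += 3) {
            // Count how many of this triangle's vertices are new to the batch.
            size_t newVerts = 0;
            for (int k = 0; k < 3; ++k)
                newVerts += remap.count(indices[i + k]) ? 0 : 1;

            if (batches.back().uniqueVertices.size() + newVerts > maxVerts ||
                batches.back().localTriangles.size() / 3 + 1 > maxTris) {
                batches.emplace_back(); // start a fresh batch
                remap.clear();
            }
            Batch& cur = batches.back();
            for (int k = 0; k < 3; ++k) {
                uint32_t v = indices[i + k];
                auto it = remap.find(v);
                if (it == remap.end()) {
                    it = remap.emplace(v, uint8_t(cur.uniqueVertices.size())).first;
                    cur.uniqueVertices.push_back(v); // de-duplicated vertex
                }
                cur.localTriangles.push_back(it->second);
            }
        }
        return batches;
    }

    When the geometry is static, running this scan every frame in a single serial unit is exactly the duplicated, straight-line work described above; precomputed meshlets skip it entirely.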


    There are some parallels with what AMD has described for its geometry handling. Both methods have merged shader stages, and there's a similar split into two parts at a juncture where there is the potential for data amplification, roughly where the tessellation block is, or where a task shader begins to farm out work to child mesh shaders.
    However, which parts of the traditional pipeline are kept as-is or have parts of their functionality replaced differs.
    Culling at a cluster or mesh level happens earlier in the Nvidia pipeline, whereas AMD's primitive shaders have a clearer affinity with the vertex shaders.
    One emphasizes per-primitive culling and dynamic context management for batching injected into a more traditional hardware flow, while the other has a more significant break exposed to the software that has more management incorporated into the code. The various batch sizes and primitive counts exist in one form or another for both, but what the motivations are and when they are exposed differs along with whether they are transparently handled suggestions versus structure definitions to the shaders.
    The more explicit break with task and mesh shaders does seem to offer a wider range of options, though it is less transparent to developers than primitive shaders would have been (had they done what was initially promised).
     
    pharma, OCASM, Jupiter and 5 others like this.
  12. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,277
    Likes Received:
    8,469
    Location:
    Cleveland
    A series of Tweets from Sebi … about a slide from the Nvidia Turing Architecture Whitepaper, page 33, discussing the Giga Rays/sec performance number.




     
    compres, OCASM, Lightman and 8 others like this.
  13. pixeljetstream

    Newcomer

    Joined:
    Dec 7, 2013
    Messages:
    30
    Likes Received:
    60
    @3dilettante well written.
    One thing though: the compute overhead is for small dispatches, not large ones (those would be hidden). As you said, there are also extra waits for completing previous tasks. Now it may be doable (on console, I guess) to tune producer/consumer batch sizes to architecture-specific L2 sizes to keep things on chip, but that is very much not portable.
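
    The kind of arithmetic involved in that tuning, as a trivial sketch (plain C++; every constant here is an illustrative assumption, not any real GPU's value, which is exactly why it doesn't port):

    Code:
    #include <cstddef>
    #include <cstdio>

    // Toy sizing of a producer/consumer batch so in-flight data stays in L2.
    constexpr size_t kL2Bytes         = 4 * 1024 * 1024; // assumed L2 size
    constexpr size_t kBytesPerPrim    = 64;              // assumed payload per primitive
    constexpr size_t kInFlightBatches = 2;               // producer/consumer overlap

    constexpr size_t kBatchPrims =
        kL2Bytes / (kBytesPerPrim * kInFlightBatches);

    int main() {
        std::printf("primitives per batch: %zu\n", kBatchPrims);
    }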

    Also want to stress that the meshlet data structures shown just serve as a basic example.

    When this was originally designed it was not certain whether the fixed-function blocks would see further improvement, hence my "awkward" comment ;) However, as scaling can be demonstrated, and depending on the feature's success, it opens the door to not having to improve fixed function at some point. (The market decides.)

    Not sure what you mean by "less transparent" about the task/mesh split, given that it is all up to the developer?
     
    #54 pixeljetstream, Sep 18, 2018
    Last edited: Sep 18, 2018
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,118
    Likes Received:
    2,860
    Location:
    Well within 3d
    The lead designer for the PS4 indicated there was a triangle sieve option for compiling code from a vertex shader, which would produce a shader containing only position-related calculations for culling. This would then be linked to the geometry front end running a normally compiled version of the vertex shader by a firmware-managed ring buffer hosted in the L2 cache, which was something Sony considered a tweak for their architecture.
    It was an optional feature that needed some evaluation to determine whether it would lead to an improvement, although I have not seen references since then on how often that tweak was used.
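
    A sketch of what such a sieve amounts to (plain C++; the real feature was compiler/firmware plumbing, so the position-only shading is abstracted here into a precomputed clip-space position array):

    Code:
    #include <cstdint>
    #include <vector>

    struct Vec4 { float x, y, z, w; };

    // Screen-space signed area after perspective divide; <= 0 means the
    // triangle is back-facing or degenerate under a CCW front-face convention.
    static float signedArea(const Vec4& a, const Vec4& b, const Vec4& c) {
        float ax = a.x / a.w, ay = a.y / a.w;
        float bx = b.x / b.w, by = b.y / b.w;
        float cx = c.x / c.w, cy = c.y / c.w;
        return (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    }

    // The "sieve": clipPos holds positions produced by a stripped,
    // position-only compile of the vertex shader. Triangles that fail the
    // test are dropped; the compacted index buffer feeds the normal front
    // end, which runs the fully compiled vertex shader only on survivors.
    std::vector<uint32_t> sieve(const std::vector<uint32_t>& indices,
                                const std::vector<Vec4>& clipPos) {
        std::vector<uint32_t> surviving;
        surviving.reserve(indices.size());
        for (size_t i = 0; i + 2 < indices.size(); i += 3) {
            const Vec4& p0 = clipPos[indices[i]];
            const Vec4& p1 = clipPos[indices[i + 1]];
            const Vec4& p2 = clipPos[indices[i + 2]];
            if (signedArea(p0, p1, p2) > 0.0f) { // front-facing, non-degenerate
                surviving.push_back(indices[i]);
                surviving.push_back(indices[i + 1]);
                surviving.push_back(indices[i + 2]);
            }
        }
        return surviving;
    }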

    It's a frequently retold story for designs facing the programmable/dedicated wheel of reincarnation; sometimes it's a matter of where on the cycle a design finds itself. Perhaps a future design will be able to heavily leverage the more programmable path and leave the traditional hardware path alone, and then someone will start to think about what hardware could accelerate the new one.

    Transparent in this case would be whether the shaders need to expose elements like the number of primitives per mesh or limits on the amount of context being passed from one stage to the next. Some alternate methods for refactoring the geometry pipeline had thresholds for batch or mesh context, but they could be hidden from the developer to varying degrees, either by the driver or, in some cases, by hardware that could start and end batches based on conditions during execution. Granted, being totally unaware of those limits could lead to less than peak performance, and some of those hardware-linked behaviors add a dependence on other parts of the hardware which the explicitly separate task and mesh shaders discard.
     
    #55 3dilettante, Sep 18, 2018
    Last edited: Sep 18, 2018
    Silent_Buddha and Malo like this.
  15. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,416
    Likes Received:
    533
    Location:
    Texas
    Metro Exodus GeForce RTX global illumination demo:

     
  16. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,718
    Likes Received:
    2,454
    Wow, this is going to be the best RTX demo for the near future.
     
  17. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    6,952
    Likes Received:
    3,032
    Location:
    Pennsylvania
    It's going to be fantastic playing something like that in 3-5 years.
     
    ToTTenTranz, eloyc and Silent_Buddha like this.
  18. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    15,947
    Likes Received:
    4,903
    So much nicer without the AO. AO always seemed like a weird, awkward, and inaccurate kludge. Some people liked it, but I always felt it made things look worse: even less natural than no AO at all.

    Regards,
    SB
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    Without which AO - SSAO or RTAO? :)

    Some kind of AO is absolutely necessary in scenes that don’t use static prebaked lighting. Otherwise everything looks very flat and floaty.

    Aside from the obvious improvement in dynamic lighting it’s hard to tell whether overall IQ is actually improved under RTX in the Exodus demo. It seems with RTX on the ambient lighting term is greatly reduced but other surfaces seem overly bright.
     
    Billy Idol likes this.