Nvidia Turing Architecture [2018]

Discussion in 'Architecture and Products' started by pharma, Sep 13, 2018.

  1. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    187
    Likes Received:
    79
    Interesting thing. The primitive shader patent states that when tessellation is off, there is no surface shader either, and the primitive shader acts directly after the input assembler.

    https://patents.google.com/patent/US20180082399A1/en

    Also, Mike Mantor stated that AMD can do a lot of culling before vertices form a polygon.

    https://www.gamersnexus.net/guides/3010-primitive-discarding-in-vega-with-mike-mantor
     
    #121 Digidi, Oct 1, 2018
    Last edited: Oct 1, 2018
  2. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    187
    Likes Received:
    79
    #122 Digidi, Oct 1, 2018
    Last edited: Oct 1, 2018
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,894
    Likes Received:
    2,240
    Location:
    Well within 3d
    AMD's Vega whitepaper summarized the surface shader as a shader type generated when tessellation was enabled. It encapsulates the initial vertex shader and hull shader stages, while the primitive shader is adjusted to encapsulate the domain and geometry shader stages. The surface shader is associated with the fixed-function tessellation stage, and it is this optional stage that governs whether the pipeline is split at a point of work amplification.

    Task shaders are optional, though not necessarily because of tessellation-like behavior. They can spawn extra child mesh shader workgroups, but this isn't required. The pipeline's subdivision point then becomes less straightforward than the one the tessellation hardware presents, since it may not be a customary expansion when the customary tessellation unit is not there.
    The task shader is indicated as being able to make object and inter-mesh level decisions in terms of selecting meshlets, LOD, and meshlet culling. Surface shaders are described as taking the same sorts of inputs as the vertex+hull shader stage, and presenting the usual output. It falls upon the later primitive shader stage to try culling as effectively as it can from the output of the tessellation unit. The broader set of inputs/outputs and upstream decision making of the task shader were not mentioned.

    Mesh shaders are indicated as being more closely aligned with vertex shaders, requiring some preprocessing instructions. This is similar to primitive shaders' being generated from vertex shader code. Mesh shaders may have some level of culling, but the general tenor is that if the task shader has performed cluster culling sufficiently, it is more effective to let the mesh run through to the rasterizer and beyond rather than pile up more up-front work in the shader path (one subset involving heavy attribute usage mooted a PS-like prospect). Primitive shaders have limited detail as to all the transformations they perform, but a conservative triangle sieve is the most straightforward description of the initial offering. How exactly this process is performed or how it would look in the shader code was not described.
    The mesh shader has an element of precomputation for meshlets in static geometry, which is something the descriptions of primitive shaders do not mention.
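    The "conservative triangle sieve" idea can be sketched in a few lines. This is a hedged illustration only, not shader code or any vendor's actual implementation: cheap per-triangle tests (zero-area and back-facing in projected 2D) that only discard triangles which provably cannot produce pixels.

```python
# Illustrative Python sketch of a conservative triangle sieve: discard
# only triangles that provably contribute no pixels (back-facing or
# degenerate in the projected plane). Not actual shader code.

def signed_area_2d(a, b, c):
    # Twice the signed area of the projected triangle (CCW positive).
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def sieve(triangles):
    """Keep triangles with positive (front-facing, non-degenerate) area."""
    return [t for t in triangles if signed_area_2d(*t) > 0.0]

tris = [((0, 0), (1, 0), (0, 1)),   # front-facing, kept
        ((0, 0), (0, 1), (1, 0)),   # back-facing, culled
        ((0, 0), (1, 1), (2, 2))]   # degenerate (zero area), culled
print(len(sieve(tris)))  # 1
```

A real implementation would add frustum and small-primitive tests, but the conservative property is the same: a sieve may pass extra triangles, never drop a visible one.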

    At this point, I think I am repeating my suspicion that these are methods with some evolutionary similarities derived from facing similar conditions in the same location in the pipeline, but with different decisions as to the scope and integration with their individual standard pipelines. Saying one element is the same as the other may gloss over the areas where their priorities and judgement calls differ.
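    The task shader's cluster-level decision making described above can be illustrated with a CPU-side sketch. Everything here (the Meshlet record, the bounding-sphere test, the plane list) is an assumption for illustration, not any real API: a task stage inspects whole meshlets and only launches mesh workgroups for clusters that survive a coarse cull.

```python
# Hypothetical CPU-side sketch of task-shader cluster culling: whole
# meshlets are tested against frustum planes, and only survivors would
# get mesh workgroups launched for them.

from dataclasses import dataclass

@dataclass
class Meshlet:
    center: tuple        # bounding-sphere center (x, y, z)
    radius: float        # bounding-sphere radius
    triangle_count: int

def sphere_outside_plane(center, radius, plane):
    # plane = (nx, ny, nz, d); the sphere is fully outside if the signed
    # distance of its center is below -radius.
    nx, ny, nz, d = plane
    dist = nx * center[0] + ny * center[1] + nz * center[2] + d
    return dist < -radius

def task_stage(meshlets, frustum_planes):
    """Return the meshlets a 'mesh stage' would still have to process."""
    survivors = []
    for m in meshlets:
        if any(sphere_outside_plane(m.center, m.radius, p)
               for p in frustum_planes):
            continue  # whole cluster culled, no mesh workgroup launched
        survivors.append(m)
    return survivors

# One plane at x = 0 keeping the half-space x > 0.
planes = [(1.0, 0.0, 0.0, 0.0)]
meshlets = [Meshlet((5.0, 0.0, 0.0), 1.0, 84),    # inside, survives
            Meshlet((-5.0, 0.0, 0.0), 1.0, 84)]   # fully outside, culled
print(len(task_stage(meshlets, planes)))  # 1
```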
     
    Digidi, Malo and pharma like this.
  4. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    831
    Likes Received:
    231
    Ok, some thoughts. AMD has no "vertex shader"; it's a compute shader with "interpolator outputs" (or not, just use UAVs instead), which can easily mean spill-to-LDS. The TaskShader for AMD is the wavefront kicker (probably the ACE/HWS); it takes the number of instances to create from the draw command and instances them, that's it. The next [programmable] stage gets the lane number and then does everything manually and drops it somewhere inside the chip/CU.
    The magic is vertex connectivity, but from the software perspective that's just some cleverly coded, likely implicit (from ordering) information. So the hull shader is only really producing new instances programmatically, albeit constrained, and allows you to specify the instance seed values yourself, also possibly through LDS. Whether that is really fixed-function in GCN ... who knows. So the tessellator is also only a TaskShader.
    Now comes the real difference. It appears that in the MeshShader you have to write the connectivity yourself, while under the old API model the connectivity was implicit (the recursive subdivision rule). That means you now have to store that information for sure. But then the DomainShader is just the suffix code of a MeshShader, positioning the connected vertices.

    I don't see how all of this wasn't just minor differences in convention to obscure the scheduling part, having it implicit instead of explicit for the sake of ... not sure what. Maybe it was Nvidia itself when DX11 was specified, because the DX9 tessellator extension from AMD was completely free of this stuff: the subdivision level was simply a multiplier for the instance generator and you got context-free triangles + barycentrics to do whatever you wanted in the vertex shader (kind of like a triangle shader, not unlike the DomainShader).
    The only different thing I see is that you now control amplification without implicit data generation, and with explicit connectivity. And it seems to allow recursive amplification? Well, the tessellator was also specified as recursive amplification. That's just a straightforward generalization of the heterogeneous multiple amplification points in the DX11 API.

    Which tessellation hardware? What would you reasonably expect to be dedicated hardware? Try to imagine the most basic generalized building blocks needed to implement it. You don't want overly specialized silicon; I've not seen a tessellation block on a die shot (there might be some, but it's not big and complicated and ugly). The principal problem is data routing, and that's probably the only fixed-function part of it: put data from somewhere to somewhere else, 1:N style.
    Remember AMD suggests using the DomainShader to cull triangles in the 1:1 amplification case, basically using the stage as a decimator instead. How is that a sensible suggestion if the stage is fixed-function?

    Yeah, but that's all just creative imagination. "You could possibly". If something is free of implicitness I can do all kinds of creative things; that's not a checkbox for MeshShader, it's a checkbox for ThinkOutOfTheMuchLargerBox.

    That's just the same limitation as the factor of 64 for tessellation. Seems like coincidence? I don't think so. I assume <= 64 uses the previously employed internal storage and above it spills to memory, just like geometry shader stream-out. They don't say you can't do it, they say it performs predictably up to 64/126.

    It's one thing to say one is different from the other because of hardware differences (and _not_ performance differences), and another to say they differ because of ecosystemic points of view.
    Nvidia dropped this thing as an extension to DirectX into NVAPI. It's the way Nvidia is going forward; they like, and are able, to spam extensions. The last time AMD wanted to communicate its vision, it led to an entire alternative data-management paradigm (and API) instead of shoehorning it into DX. That shoehorning came from MS, and it's ... huh ... a bit ugly honestly.
     
    Digidi likes this.
  5. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    245
    Likes Received:
    60
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,894
    Likes Received:
    2,240
    Location:
    Well within 3d
    There is an internal stage in the geometric pipeline that performs the function. It didn't seem to have a bearing on the comparison between the different geometry pipelines whether the stage was a direct map to an API shader or not, just what external shader type seems to be most closely aligned with it. For primitive shaders, their description and patents on their automatic generation point to their reliance on vertex shaders as a source, which has some commonality with mesh shaders.
    Surface shaders have more limited disclosure, and were described solely in terms of the VS+HS stage they are analogous to, per the Vega whitepaper. The task shader as described does not bear much similarity to either.

    I have lost track of whether graphics wavefront launch can use the ACE dispatch controllers, but even so, from this description the "task shader" in this scenario is missing several functions of the Nvidia-described task shader. There's been no description of a way to give the ACE or dispatch controller the sort of access or user-level programmability to evaluate geometry or LOD, or to run arbitrary shader code itself.
    This leads to the question of how this variation of the task shader can redirect the next stage in the pipeline to different index/primitive buffers. The Vega whitepaper has this process after input assembly or the tessellation stage, which is further downstream and more local to an object than a list of objects submitted by the graphics command stream.

    I haven't seen a description of AMD doing fully away with the tessellation unit. Nvidia's task shader is described as being able to perform a more arbitrary level of amplification, although it is described as not being as efficient at matching the standardized patterns of the tessellation path. As another difference, Nvidia suggested that it could be optimal to run task and mesh shaders alongside tessellation shaders, which as a matter of definition is not a distinction for surface+primitive shaders.

    I would think the choice of implicit vs. explicit is not so easily glossed over. Explicit creates fewer bounds on what each stage can do or what data it can communicate. Implicit behaviors need to maintain a smaller and more consistent set of behaviors and inputs/outputs. At least for the initial primitive shader offering, it was not going too far from the index and primitive formats already defined.

    This is a comparison between the task/mesh path and the alternatives--either the standard pipeline or surface/primitive shaders. The task and mesh shader would not have dedicated hardware, although the juncture that serves as a discontinuity in the number of work items is situated similarly to where the tessellation stage would have been. Task shaders can be enabled with or without activity similar to subdivision or amplification, and have other actions they can perform. The surface/primitive shader path would have surface shaders enabled if tessellation was enabled, and would not if it wasn't.

    I'm afraid I'm not sure which item this is in reference to. The descriptions of the task shader's decision-making and object processing were disclosed in Nvidia's announcement. The surface shader's encapsulation of the vertex and hull shader stages is from the Vega whitepaper. Neither of the descriptions I discussed mentioned the mesh shader stage.
     
    pixeljetstream, pharma and Digidi like this.
  7. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,425
    Likes Received:
    255
    There's nothing in any API that says vertices can only be shaded once. No IHV has perfect reuse across a thousand triangles, let alone a couple hundred thousand. There was an HPG 2018 paper that showed Intel had the largest vertex reuse batching at 128.
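    The effect of a bounded reuse batch can be shown with a toy model. The batching scheme below is a deliberate simplification (flush when a batch would exceed N unique vertices; shade every unique vertex in a batch once), not any IHV's actual mechanism:

```python
# Toy model of batched vertex reuse: an indexed triangle list is split
# into batches holding at most `batch_size` unique vertices; every
# unique vertex in a batch is shaded once, and vertices shared across
# batch boundaries get re-shaded. A simplification for illustration.

def shaded_vertex_count(indices, batch_size):
    shaded = 0
    batch = set()
    for tri in range(0, len(indices), 3):
        tri_verts = set(indices[tri:tri + 3])
        if len(batch | tri_verts) > batch_size:
            shaded += len(batch)   # flush: this batch's vertices were shaded
            batch = set()
        batch |= tri_verts
    return shaded + len(batch)

# A strip-like mesh: triangle i uses vertices (i, i+1, i+2), so interior
# vertices are referenced by up to three triangles.
indices = []
for i in range(98):
    indices += [i, i + 1, i + 2]

small = shaded_vertex_count(indices, 32)    # smaller batches
large = shaded_vertex_count(indices, 128)   # Intel-sized batches
print(small, large)  # 106 100
```

Even on this friendly strip, the smaller batch re-shades vertices at every flush; irregular real-world index orders make the gap wider.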
     
  8. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    45
    Likes Received:
    48
    How do you deal with UAV access then? It strikes me as a very big problem if you can't tell how many times a given piece of code that modifies global state will be run...
     
  9. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,425
    Likes Received:
    255
    I'm not sure what's done in that case though vertex reuse could get turned off. I know Nvidia changes their behavior when using UAVs. For example, at one point I saw they stop binning when UAVs are used. That might have been UAVs in the PS.
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,979
    Likes Received:
    94
    The app just has to deal with the fact that this isn't known.
    From ARB_shader_image_load_store:
    And further:
     
  11. pixeljetstream

    Newcomer

    Joined:
    Dec 7, 2013
    Messages:
    11
    Likes Received:
    14
    Agreeing with the differences @3dilettante explained.

    The 64 has nothing to do with tessellation factors or spilling. We mentioned that the sum of outputs should be <= 16 KB for each mesh workgroup (we actually enforce that in the spec).

    Tessellation evaluation also operates on 32 vertices, you can use gl_WarpID and gl_SMID to color triangles to see how the hw distributes work (like I showed in the life-of-a-triangle blog post).

    Why 64/126? Well, I played with various CAD datasets and those numbers worked well. Actually, 64/84 is a more realistic vertex-reuse figure in many datasets.
    As mentioned in the talk, it's a trade-off between higher on-chip storage (less occupancy) and potentially higher vertex reuse. You would not want to blow through 16 KB unless you have a very good reason...

    I'd also think 64 is probably what people from consoles are used to.
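    A back-of-the-envelope check shows why 64/126 sits comfortably inside the 16 KB output budget. The attribute layout below is an assumption for illustration; actual output packing is implementation-defined:

```python
# Rough check of the 16 KB per-workgroup output budget against the
# 64-vertex / 126-primitive recommendation. The per-vertex attribute
# layout is an assumed example, not a spec-mandated packing.

MAX_OUTPUT_BYTES = 16 * 1024

def mesh_output_bytes(vertices, primitives, floats_per_vertex):
    vertex_bytes = vertices * floats_per_vertex * 4   # 4 bytes per float
    index_bytes = primitives * 3 * 4                  # 3 indices per tri
    return vertex_bytes + index_bytes

# gl_Position (vec4) plus e.g. a normal (vec3) and uv (vec2) = 9 floats.
used = mesh_output_bytes(64, 126, 9)
print(used, used <= MAX_OUTPUT_BYTES)    # 3816 True

# A very heavy 64-float vertex blows the budget at 64 vertices.
heavy = mesh_output_bytes(64, 126, 64)
print(heavy, heavy <= MAX_OUTPUT_BYTES)  # 17896 False
```

This matches the trade-off described above: fatter vertices or bigger meshlets eat on-chip storage and cost occupancy well before they hit any hard limit.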
     
    pharma likes this.
  12. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,425
    Likes Received:
    255
    Wanted to correct myself. Nvidia did not stop binning when using UAVs. My memory was faulty.
     
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,019
    Likes Received:
    3,654
    What did I miss? Aren't the Turing cards being sold at MSRP, at the moment?
     
  14. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,485
    Likes Received:
    1,237
    GeForce Experience 3.15 Release adds DLSS support. Now all we need are the games.
    October 15, 2018

    https://www.geforce.com/geforce-experience/download
     
    OCASM likes this.
  15. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    640
    Likes Received:
    202
    Not sure if anybody noticed, but tensor TFLOPS are reduced by half on GeForce RTX versus Quadro RTX. That is for mixed-precision FP16/FP32, which is important for neural network training.

    Peak FP16 Tensor TFLOPS with FP32 Accumulate
    RTX 2080 Ti | 53.8 / 56.9
    Quadro RTX 6000 | 130.5

    Only FP16 with FP16 accumulate runs at full speed, but that is useless for NN training AFAIK.
    Peak FP16 Tensor TFLOPS with FP16 Accumulate
    RTX 2080 Ti | 107.6 / 113.8
    Quadro RTX 6000 | 130.5
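    The table's numbers can be reproduced from tensor core counts and boost clocks. The counts and clocks below are assumptions taken from public spec sheets, not from this thread: 544 cores at 1545/1635 MHz for the 2080 Ti (reference / Founders Edition), 576 cores at 1770 MHz for the Quadro RTX 6000.

```python
# Recomputing peak tensor TFLOPS from core count and clock. Each Turing
# tensor core performs 64 FP16 FMAs per clock = 128 FLOPs; GeForce parts
# run FP32-accumulate at half rate. Core counts and clocks are assumed
# from public spec sheets.

def tensor_tflops(cores, clock_ghz, fp32_accumulate=False, geforce=True):
    flops = cores * 128 * clock_ghz * 1e9
    if fp32_accumulate and geforce:
        flops /= 2   # half-rate FP32 accumulate on GeForce
    return flops / 1e12

print(round(tensor_tflops(544, 1.545), 1))                        # 107.6
print(round(tensor_tflops(544, 1.635), 1))                        # 113.8
print(round(tensor_tflops(544, 1.545, fp32_accumulate=True), 1))  # 53.8
print(round(tensor_tflops(576, 1.770, fp32_accumulate=True,
                          geforce=False), 1))                     # 130.5
```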
     
    Ike Turner, BRiT, Malo and 1 other person like this.
  16. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    932
    Likes Received:
    212
    The Quadro RTX 6000 sells for $6300 whereas the RTX 2080 Ti only sells for $1200. So cutting down the performance on the Tensor cores on the RTX 2080 Ti is expected really.
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,720
    Likes Received:
    1,939
    Location:
    Germany
    Interestingly, the CUDA Turing Compatibility Guide mentions something along these lines, but does not discriminate between Quadro and GeForce -- or rather, does not give away the Quadro's hidden feature.
    „Most applications compiled for Volta should run efficiently on Turing, except if the application uses heavily the Tensor Cores, or if recompiling would allow use of new Turing-specific instructions. Volta's Tensor Core instructions can only reach half of the peak performance on Turing. Recompiling explicitly for Turing is thus recommended.“ [my bold]
     
    pharma likes this.
  18. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    83
    Likes Received:
    73
    I think that's because of Int8 speed. Turing has double Int8 speed compared to Volta.
     
    pharma likes this.
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,720
    Likes Received:
    1,939
    Location:
    Germany
    Not sure if those are connected. The quoted peak performance was with an FP16 requirement. It should not auto-demote to INT8, since that is a manual optimization for a reason: it is only valid once you have ascertained that reducing precision does not invalidate your algorithm.
     
  20. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    187
    Likes Received:
    79
