Nvidia Turing Architecture [2018]

Discussion in 'Architecture and Products' started by pharma, Sep 13, 2018.

  1. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Put some ARM cores on the GPU and get rid of driver and PCIe overhead :)
     
    Kej, milk, Man from Atlantis and 2 others like this.
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
I dread what's gonna happen once 7nm GPUs are released. We already have a CPU limitation even @1440p. If the 3080Ti is released at the end of 2019, that CPU bottleneck will creep into 4K too! With the stagnation of CPUs expected to continue for the foreseeable future, extensive multi-core support seems the only way forward at this point.
     
  3. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,170
    Likes Received:
    576
    Location:
    France

I remember when T&L was a thing and the feature was sold like "NOW THE CPU HAS NOTHING TO DO!". Aaaaah, the good old times...
     
    Kej, milk, jlippo and 4 others like this.
  4. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
What we're seeing now is that scheduling is getting moved off the CPU and onto the GPU. It's been possible for a couple of generations now for full scheduling and resource management to happen on the GPU in the CUDA world, and we already have ExecuteIndirect in DirectX allowing the GPU to issue its own draw calls from shader code. Now we have task/mesh shaders, which combine to form another way for the GPU to feed itself.

We just need device-side malloc/free and memcpy to come to the graphics APIs. As I mentioned, this already exists in CUDA, and I suspect the holdup in getting it across to the graphics APIs has more to do with trying to shoehorn it into a very complicated system that expects all memory to be managed host-side than with hardware limitations. I'm not sure what the status is on AMD/HIP (which doesn't work on Windows in any case), though I don't believe there is any such capability on Intel/OpenCL.
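For the curious, here's a minimal CUDA sketch of the device-side allocation being described: a kernel that mallocs, memcpys and frees a buffer entirely on the GPU, with the device heap size the only thing the host has to set up. Kernel and variable names are just for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel that allocates, fills, copies and frees memory entirely on the
// device; no host-side management of that buffer at all.
__global__ void device_alloc_demo(int n)
{
    // malloc/free and memcpy are callable from device code; the heap they
    // draw from is sized once on the host with cudaDeviceSetLimit below.
    int* scratch = static_cast<int*>(malloc(n * sizeof(int)));
    if (scratch == nullptr) return;  // device heap exhausted

    for (int i = 0; i < n; ++i)
        scratch[i] = i * i;

    int local_copy[8];
    memcpy(local_copy, scratch, sizeof(local_copy));  // device-side memcpy
    printf("thread %d: local_copy[3] = %d\n", threadIdx.x, local_copy[3]);

    free(scratch);
}

int main()
{
    // Reserve 8 MB for the device-side heap before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 << 20);
    device_alloc_demo<<<1, 1>>>(8);
    cudaDeviceSynchronize();
    return 0;
}
```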
     
    w0lfram, nnunn and Heinrich4 like this.
  5. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    What is going to be the functionality of this scheduling moving forward though? Are we talking about continued modified DX11 support or DX12/Vulkan w/async compute?
     
  6. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    We're talking about abandoning DX11.
     
    BRiT likes this.
  7. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
Well, I would hope so. It seems to be naturally good at non-DX11 titles.
     
  8. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    331
    Likes Received:
    85
An easier programming interface is probably appreciated by programmers who aren't doing high-performance work. But with raytracing, and with Vulkan being merged with OpenCL, that seems a lot easier even than DX11, at least once it becomes common enough to use for the mass market.

Regardless, if AMD needs to get anything from Turing (other than non-compute-shader-fallback DXR support), it's mesh and task shaders. I'll still complain that games should move to a different, non-polygonal model altogether for the art pipeline. Subdivision surfaces, or voxels, or... well, something that gets rid of stuff like UV mapping and all the other junk making art slow and expensive today. There's got to be some clever representation out there that could be fast to render, low on memory, very easy to manipulate, etc. But realistically, that's going to happen too slowly for mesh and task shaders not to be useful for quite a while. And even if some "new" model is adopted, they could well be useful afterwards too.
     
  9. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
Not the problem of driver and low-level API makers, but of language developers, who need no affiliation with the hardware companies. Lack of inertia, capitalization, stability, independence?
This is such a complicated topic ... GPUs should have been programmable co-processors for a while now; they are not because ... man, this is a really difficult topic. Maybe it's because GPU makers are more worried and concerned with quality of service (runnable hardware, drivers, etc.) and competition at any moment in time than with programmers' wrath. ;)

Why? Has it passed the test of time already? Maybe it's too complicated to program for. Maybe primitive shaders are the better generalization.

    Content creation and display have very little to do with each other. Improve the translation from one domain to the other and your problems go away. It hopefully will, in time.
     
  10. iamw

    Newcomer

    Joined:
    Jul 20, 2010
    Messages:
    21
    Likes Received:
    44
[attached image: TIM截图20180930102633.jpg]
    Interesting
     
    sonen and nnunn like this.
  11. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    Well, one of the nice features about mesh/task shaders is that they're already exposed in Vulkan and OpenGL extensions, so you can use them today. No such luck with AMD's primitive shaders.

Anyway, having read everything I can find about both technologies, I still only have a vague idea of what primitive shaders actually do and where they live in the pipeline, beyond the bit about them replacing the vertex and geometry shaders.

Mesh and task shaders are well documented here: https://devblogs.nvidia.com/introduction-turing-mesh-shaders/

    They appear to be no less than direct control of the geometry pipeline through something similar to compute shaders. This could be very useful for generating geometry with algorithms that don't map well to the traditional pipeline. Things like REYES, volumetric, subdivision surfaces, and even optimizations like various forms of culling and level of detail. The important point is that the mesh shader can directly feed the triangle setup without a round trip to memory, so there's a sizable bandwidth saving from just generating triangles in a compute shader. Also, the fact that the task shader can recursively generate tasks is a major win in certain cases, such as level of detail when you don't want the overhead of instance rendering. Give the task shader raw object data and let it select levels of detail and dispatch to mesh shaders, and perhaps even do camera dependent culling.
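To make that concrete, here's a rough CPU-side sketch (plain C++, not the actual GLSL/HLSL extension API) of the control flow the task stage would run: pick a level of detail from the view distance, cull whole clusters against the frustum, and emit the survivors as mesh-shader work. The Meshlet/LodChain/task_stage names are made up for illustration; the 64/126 limits are the ones from the blog post.

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

struct Meshlet {                            // one cluster of the source mesh
    uint32_t vertex_offset, vertex_count;   // <= 64 vertices (per the blog)
    uint32_t index_offset,  triangle_count; // <= 126 triangles
    float    center[3], radius;             // bounding sphere for culling
};

// One pre-built meshlet list per discrete level of detail.
struct LodChain { std::vector<std::vector<Meshlet>> levels; };

// Signed-distance test of a bounding sphere against one frustum plane (a,b,c,d).
static bool sphere_outside_plane(const float p[4], const float c[3], float r)
{
    return p[0] * c[0] + p[1] * c[1] + p[2] * c[2] + p[3] < -r;
}

// "Task stage": choose an LOD, cull clusters, emit the rest as mesh work.
void task_stage(const LodChain& lod, const float frustum[6][4],
                float view_distance, std::vector<const Meshlet*>& mesh_work)
{
    // Arbitrary distance-based LOD pick; assumes lod.levels is non-empty.
    size_t level = std::min(lod.levels.size() - 1,
                            static_cast<size_t>(view_distance / 50.0f));
    for (const Meshlet& m : lod.levels[level]) {
        bool culled = false;
        for (int p = 0; p < 6 && !culled; ++p)
            culled = sphere_outside_plane(frustum[p], m.center, m.radius);
        if (!culled)
            mesh_work.push_back(&m);  // on hardware: one mesh-shader workgroup
    }
}
```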

    The biggest disadvantage I see to mesh/task shaders is that they don't seem to have access to the tessellation hardware. Otherwise, they're a close to the metal replacement for the entire geometry pipeline. The old geometry pipeline is a rather poor abstraction of modern hardware anyway.

    As for how mesh shaders compare to primitive shaders, it's anyone's guess. They might operate almost identically. Of course until AMD releases documentation and API support, the question is academic.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    I think relative to the traditional pipeline and something like the NGG pipeline and primitive shaders, the task/mesh path in some parts encompasses a wider scope while excluding others.
    It is a more programmable and exposed method that can reach further upstream in terms of using objects as inputs rather than an index buffer, but it is also considered an adjunct to the existing path for things like tessellation. Also, some of the biggest improvements stem from pre-computation of meshlets and their benefit for static geometry. What the calculus is for the non-static component may be different, and the full set of features a developer might want to use in DX12 may involve running these paths concurrently.
    Primitive shaders do add programmability, but there seems to be a stronger preference for insinuating them into the existing pipeline, which may constrain them somewhat relative to the broader swath of inputs and options available to a pipeline that doesn't try to remain compatible with the existing primitive setup path and tessellation behaviors. This does mean the difference from the prior form of geometry setup is less, and there's still one path for everything. Potentially, the originally offered form of them would have enabled higher performance with very little developer involvement.

    Both schemes hew more to using generalized compute as their model, so it seems possible that one could be programmed to arrive at similar results to the other. Off-hand, there's a brief reference to perhaps using a culling phase in the mesh shader description, and the Vega whitepaper posits some future uses of primitive shaders that can lead to similar deferred attribute calculations and may allow for precomputing some of the workload.
There is a difference in terms of emphasis, and I am less clear on how fully primitive shaders change the mapping of the workload to threads relative to how it is documented for mesh shaders. There is some merging of shader stages, though the management may differ in terms of what the hardware pipeline manages versus the shader compiler, and how much lives in the local data share versus the caches, etc. At least initially, primitive shaders are more concerned with culling, and they do this culling more in the period where mesh shaders would run. Task-shader cluster culling is the more discussed path for Nvidia, and there is more discussion of what to do when, even despite culling, there are a lot of triangles.
    The difference is marked enough that Nvidia has more faith in the culling or rasterization capabilities of its hardware in the post-cluster culling set of mesh shaders, whereas primitive shaders seem to have the cycle penalty of even one extra triangle reaching the rasterizer much closer to the forefront of their marketing.

    The NGG pipeline also introduces another level of batching complexity at the end of the pipeline which task/mesh shaders do not. That may not have directly interfered with the design decisions of the earlier stages, but it may have introduced considerations for later raster and depth limitations into the optimal behaviors of the primitive shaders, which task/mesh may not be as strongly bound by.
     
  13. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
I suspect that simple backface culling might be a whole lot more efficient at the meshlet level. Instead of testing every single triangle, store, for each meshlet, a bounding box for the geometry as well as a "bounding box" for the triangle normals. Then you can decide "all back-facing" vs. "maybe some front-facing" for an entire meshlet at once. If we assume that content authors have some halfway decent software to cluster triangles, I can easily imagine 30% of meshlets culled as back-facing. Moreover, this should be more efficient than trying to do it automagically in drivers and hardware, since the developers will have a clear picture of how their culling system works, and thus how to optimally order and cluster triangles, rather than guessing at a black box.
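A sketch of that test, with the "bounding box of normals" expressed as a cone (unit axis plus half-angle), which is how I'd expect a tool to store it. This is the directional/orthographic form for brevity; a perspective camera additionally needs the meshlet's positional bounds to bound the view vector. Names are illustrative.

```cuda
struct NormalCone {
    float axis[3];   // unit "average" normal of the meshlet's triangles
    float sin_angle; // sine of the cone half-angle enclosing all the normals
};

// view_dir: unit vector pointing from the camera into the scene.
// Returns true when every triangle in the meshlet is guaranteed back-facing,
// so the whole cluster can be dropped before any per-triangle work happens.
__host__ __device__ bool meshlet_backfacing(const NormalCone& c,
                                            const float view_dir[3])
{
    float d = c.axis[0] * view_dir[0]
            + c.axis[1] * view_dir[1]
            + c.axis[2] * view_dir[2];
    // All triangle normals lie within half-angle of the axis, so if the angle
    // between the axis and the view direction is under 90 degrees minus the
    // half-angle, every normal faces away from the camera. In dot-product
    // terms: cos(angle(axis, view)) > sin(half-angle).
    return d > c.sin_angle;
}
```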

Another important thing is that the developer now has full visibility into, and control of, warp and block scheduling, rather than the hardware just assigning vertices however it likes in black-box fashion. This is VERY useful for things like subdivision surfaces, where you need a lot of adjacency information.
     
    pharma likes this.
  14. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    Thinking about it some more, how does the hardware go from rasterizer to pixel shader scheduling? Triangles are so small now that it would be grossly inefficient to not set up pixel shader blocks on many triangles at a time. Care with meshlet shape could therefore provide the pixel shaders with better locality at block assignment time, leading to better utilization and coherency, especially considering that pixels are shaded in 2x2 quads, making them all the more vulnerable to fragmentation.
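To put a rough number on the quad point: because shading happens in 2x2 quads, a triangle that covers a single pixel still launches four fragment invocations, so only 25% of those lanes do visible work, and even a triangle covering a handful of pixels that straddles quad boundaries can leave half the lanes as helper invocations.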
     
    milk likes this.
  15. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
I still wouldn't say something is better in the absence of comparability, or that something has to learn from something else in the absence of visibility/knowledge.
The meshlet definition breaks compatibility with current assets, and in the least intrusive implementation you have to transcode the index buffers on the CPU. When you're already CPU-bound, that leaves a bit of a bad taste.

AMD has had compute front-ends for vertex shaders for a while now (ES), and it has no IA, so vertex buffer reads have been software fetches for a while as well. The tessellator has had a spill path since forever, so mesh amplification to memory (and the performance results) is also old hat. Doesn't sound like they are too far away from what's being proposed. And I would believe you can do the task/mesh shader pipeline at an API level easily enough with GCN; maybe they just need to bump the caches to be able to pass the magic 64 vertices and 126 triangles ;) under all circumstances from stage to stage. GCN is a compute design in the first place, so being able to put wave/compute functionality into vertex shaders is a given (UAVs in vertex shaders, etc.).
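For what it's worth, the least intrusive CPU-side transcode isn't much code either. A minimal sketch of chopping an existing index buffer into meshlets under those 64-vertex / 126-triangle limits (names made up; real tooling like Tootle-style clustering would also optimize locality and emit per-cluster culling data):

```cuda
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CpuMeshlet {
    std::vector<uint32_t> vertices;   // unique global vertex indices, <= 64
    std::vector<uint8_t>  triangles;  // local indices, 3 per tri, <= 126 tris
};

std::vector<CpuMeshlet> build_meshlets(const std::vector<uint32_t>& indices)
{
    std::vector<CpuMeshlet> out(1);
    std::unordered_map<uint32_t, uint8_t> local;  // global -> local index

    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        CpuMeshlet* m = &out.back();

        // How many of this triangle's vertices are new to the current meshlet?
        int new_verts = 0;
        for (int k = 0; k < 3; ++k)
            new_verts += (local.find(indices[i + k]) == local.end());

        // Cut a new meshlet if either limit would be exceeded.
        if (m->vertices.size() + new_verts > 64 ||
            m->triangles.size() / 3 >= 126) {
            out.emplace_back();
            local.clear();
            m = &out.back();
        }

        // Append the triangle, remapping global indices to local ones.
        for (int k = 0; k < 3; ++k) {
            uint32_t v = indices[i + k];
            auto it = local.find(v);
            if (it == local.end()) {
                it = local.emplace(v, static_cast<uint8_t>(m->vertices.size())).first;
                m->vertices.push_back(v);
            }
            m->triangles.push_back(it->second);
        }
    }
    return out;
}
```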

So, let's say, optimistically, that task/mesh shaders are trivial for AMD, really just an API choice. Now, are primitive shaders just an API choice? Is there something more to them? I don't know.

    Personally I would hope we get some post-rasterizer scheduling capability, maybe we can sort and eliminate and amplify pixels in there.
     
    w0lfram and Digidi like this.
  16. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    Another use for task/mesh shaders might be for deferred shading. Cluster pixels by material, then feed them to mesh shaders which render directly to a UAV rather than actually doing mesh processing. This could lead to lower divergence. Might be *really* tricky to figure out an algorithm for the clustering, though.
     
    OCASM and pharma like this.
  17. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    They have: AMD's Tootle
    Look into section 4: Fast linear clustering. You can pass maximum triangle limits for cluster generation in the API.

    Seems ironic.
     
  18. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    That's exactly the sort of software I was talking about ;-)

    Note that being able to put this through a mesh shader lets you guarantee that the clusters remain coherent. Imagine what would happen if for some reason you had a stride misalignment between your clusters and your vertex shader block assignment!
     
  19. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,928
    Likes Received:
    1,626
    Mesh Shader Possibilities
    September 29, 2018
    http://www.reedbeta.com/blog/mesh-shader-possibilities/
     
    #119 pharma, Oct 1, 2018
    Last edited: Oct 1, 2018
  20. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
There are also some edge cases that pretty much mean the outputs of vertex shaders always have to be stored back to memory. Consider what happens if you're happily going along through vertices and come across an index pointing to a vertex that you shaded so long ago that the output no longer lives in the cache. In the old days, you could just go "hell with it" and put the vertex through the shader a second time, but now vertex shaders are allowed to have side effects, meaning they *must* be run exactly once. Thus everything must be saved for the duration of the draw call. If you assume a mesh with a couple hundred thousand triangles, you're talking about pushing quite a few vertex outputs off chip.
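Rough numbers, with the attribute footprint just an assumption: a mesh with a couple hundred thousand triangles has on the order of 100k unique vertices, and at, say, eight vec4 outputs per vertex (128 bytes) that's roughly 12-13 MB of shaded vertex data to keep live for the draw call, well beyond what stays on chip.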
     
    pharma likes this.