AMD Mantle API [updating]

With a bindless setup, instead of working with slots, you instead directly provide the GPU with pointers that it can follow to find the texture info. The slide is showing a potential data structure that you could set up yourself, where one piece of memory has a pointer to another piece of memory filled with info for more resources.
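To make that concrete, here is a rough C++ sketch of such a pointer-chain layout. The struct names and fields are purely illustrative (not Mantle's or GCN's actual descriptor format); the point is that a draw only has to hand the shader one pointer, and everything else is reachable from it.

```cpp
#include <cstdint>

// Hypothetical per-texture descriptor: whatever the hardware needs to sample it.
struct TextureDescriptor {
    uint64_t baseAddress;               // GPU address of the texel data
    uint32_t format;                    // e.g. an enum value for R8G8B8A8_UNORM
    uint32_t width, height, mipLevels;
};

// Hypothetical "material table": one block of GPU memory per material, itself
// pointing at an array of texture descriptors the shader can walk.
struct MaterialTable {
    const TextureDescriptor* textures;  // GPU pointer followed by the shader
    uint32_t                 textureCount;
};

// A draw then only communicates a single MaterialTable* to the shader instead
// of binding each texture to a numbered slot.
```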

So it's basically like CUDA (or OpenCL), where you just allocate memory on the host for the device and use the device pointer as an argument in your kernel.

What he means is that you could allocate a piece of memory and re-use it for many different purposes. In D3D11 the memory for a resource is tied to the ID3D11Texture2D. When you create that texture, you specify certain immutable properties like the format, the size, number of mip levels, etc. and the driver allocates the appropriate amount of memory. Now let's say at the beginning of a frame you render to a render target, but then you're done with it for the rest of the frame. Then immediately after, you want to render to a depth buffer. With full memory control, you could say 'use this block of memory for the render target and then afterwards use it as a depth buffer'. In D3D11 however you can't do this, you must create both the render target texture and the depth buffer as separate resources. This is also something that's very common on consoles, where you have direct memory access.
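A minimal sketch of the idea, with hypothetical names (this is not the actual Mantle API): one raw memory block backs a render target early in the frame and is then re-bound as a depth buffer once the render target is no longer needed.

```cpp
#include <cstddef>

// Hypothetical handles: metadata only, with no storage of their own.
struct GpuMemoryBlock { unsigned char* gpuBytes; std::size_t size; };
struct RenderTarget   { GpuMemoryBlock* backing = nullptr; /* format, size, ... */ };
struct DepthBuffer    { GpuMemoryBlock* backing = nullptr; /* format, size, ... */ };

// Stand-ins for driver calls; a real API would validate size/format compatibility.
void BindMemory(RenderTarget& rt, GpuMemoryBlock& mem) { rt.backing = &mem; }
void BindMemory(DepthBuffer& db, GpuMemoryBlock& mem)  { db.backing = &mem; }

void RenderFrame(GpuMemoryBlock& scratch, RenderTarget& rt, DepthBuffer& depth)
{
    BindMemory(rt, scratch);     // back the render target with 'scratch'
    // ... render the early pass; after this the RT contents are never read again ...

    BindMemory(depth, scratch);  // reuse the exact same bytes as a depth buffer
    // ... render the depth-only pass ...
    // In D3D11 these would have to be two separately allocated resources.
}
```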

I can now see how this would save a lot of memory by reusing already-created buffers.

The way that it currently works with D3D11 is that you compile your shaders to D3D assembly, which is basically a hardware-agnostic "virtual" ISA. In order to run these shaders on a GPU, the driver needs to compile the D3D assembly into its native ISA. Since developers can't do this conversion ahead of time, the driver has to do a JIT compile when the game loads its shaders. This makes the game take longer to load, and the driver doesn't have a lot of time to try aggressive optimizations. With a hardware-specific API you can instead compile your shaders directly into the hardware's ISA, and avoid the JIT compile entirely.

Currently it's like: HLSL -> D3D assembly -> IL/PTX -> ISA

There would still be runtime compilation of the shader to the native ISA even if it was pre-compiled to IL or PTX, but certainly with a lower compilation time than from D3D assembly.

As for patching, the driver may need to patch shaders in order to support certain functionality available in D3D. As an example, let's say that a hypothetical GPU actually performs its depth test in the pixel shader instead of having extra hardware to do it. This would mean that the driver would have to look at the depth state currently bound to the context when a draw call is issued, and patch the shader to use the correct depth-testing code. With a hardware-specific shader compiler you can instead just provide the ability to perform the depth test in the pixel shader, and totally remove the concept of depth states.

I am all for less magic happening inside the driver and more power to the application developer.

The obvious use is the one they mentioned: culling and occlusion testing. Imagine that the CPU says 'draw all of this stuff', and then the GPU goes through that list and for each one performs frustum and occlusion culling. The GPU can then alter the command buffer to skip over non-visible meshes, and then when the GPU gets around to executing that part of the command buffer it will only draw on-screen geometry.

As for occlusion queries, the main problem with them in D3D/GL is that the data can only be read by the CPU but the data is actually generated by the GPU. The GPU typically lags behind the CPU by a frame or more so that the CPU has enough time to generate commands for the GPU to consume, which means if the CPU wants to read back GPU results they won't be ready until quite a bit of time after it issued the commands. In practice this generally requires having the CPU wait at least a frame for query results. This means you can't really effectively use it for something like occlusion culling, since by the time the data is usable it's too late.
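For reference, this is roughly what the CPU-side readback looks like in D3D11 (a sketch with error handling omitted): the result simply isn't available until the GPU has caught up, so the caller either polls across frames or stalls.

```cpp
#include <d3d11.h>

// Polls an occlusion query without forcing a flush. Returns true only once the
// GPU has actually finished the work the query covers, which is typically a
// frame or more after the draw calls were submitted.
bool TryReadOcclusionQuery(ID3D11DeviceContext* ctx, ID3D11Query* query,
                           UINT64& visiblePixels)
{
    HRESULT hr = ctx->GetData(query, &visiblePixels, sizeof(visiblePixels),
                              D3D11_ASYNC_GETDATA_DONOTFLUSH);
    return hr == S_OK;   // S_FALSE: result not ready yet, try again later
}
```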

So the GPU checks an object's bounding box against the current depth buffer (or Early-Z buffer) and skips it entirely if it's not visible. Sounds much faster than what GL and D3D currently offer.

Thanks a lot for these detailed explanations MJP!
 
Currently it's like: HLSL -> D3D assembly -> IL/PTX -> ISA

There would still be runtime compilation of the shader to the native ISA even if it was pre-compiled to IL or PTX, but certainly with a lower compilation time than from D3D assembly.
It should be possible to compile to binary, i.e. detect which chip is in use and use the binary for that chip.

This already works with AMD's OpenCL (with fall-backs of various flavours for true robustness).
 
What I'm really looking forward to, with the release of BF4's Mantle support closely followed by the release of Kaveri, is AMD rigs once more becoming relevant on review sites!!!

It will also be a blast watching how the review sites handle the myriad relevant permutations that arise.

BF4 Mantle performance on any combination of Intel CPUs, AMD CPUs, AMD non-HSA APUs, AMD Kaveri, AMD GCN GPUs, Nvidia GPUs.



I think Mantle will probably undermine Intel's massive IPC advantage and favor CPUs with a larger number of cores, such as AMD's FX8000 line. Something akin to Tomb Raider's results or even the current DX11 Frostbite results, but probably even more punishing for dual-cores and less dependent on clocks.


Kaveri is a two-module/quad-core design with most of its transistors dedicated to the iGPU. It's aimed at low-power PCs and laptops. It's not in AMD's plans to use Kaveri to compete for performance crowns.
 
So the GPU checks an object's bounding box against the current depth buffer (or Early-Z buffer) and skips it entirely if it's not visible. Sounds much faster than what GL and D3D currently offer.
Nah, you can do this already (since DX10 I think) with DrawPredicated. Obviously I expect Mantle to be more general in terms of command buffer control flow (although seriously... when are people going to take a step back and realize this queue model has to change in the long run), but you can already predicate Draw/Dispatch commands on arbitrary results in DX, and I assume similar in GL.
 
Nah, you can do this already (since DX10 I think) with DrawPredicated. Obviously I expect Mantle to be more general in terms of command buffer control flow (although seriously... when are people going to take a step back and realize this queue model has to change in the long run), but you can already predicate Draw/Dispatch commands on arbitrary results in DX, and I assume similar in GL.

Then what would you replace the command queue with? Even if a GPU can autonomously fill that queue, it's still a queue. IOW, any change that I think you might want to make will still look like a CPU->GPU command queue.
 
Nah, you can do this already (since DX10 I think) with DrawPredicated. Obviously I expect Mantle to be more general in terms of command buffer control flow (although seriously... when are people going to take a step back and realize this queue model has to change in the long run), but you can already predicate Draw/Dispatch commands on arbitrary results in DX, and I assume similar in GL.

I'm not really familiar with the specifics of modern GL at the moment, but as far as DirectX goes you only have the ability to predicate rendering on either the results of an occlusion query or stream-out overflow. With occlusion queries, if you want fine-grained results you need to have lots of queries in flight and lots of draw calls. The end result is that predication is more limiting and inefficient than giving the GPU the ability to control command submission/consumption directly through shaders.
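For reference, the D3D11 predication path being discussed looks roughly like this (a sketch; error handling omitted, and check the SetPredication documentation for the exact meaning of the BOOL value):

```cpp
#include <d3d11.h>

// Wrap the occlusion-test geometry in a predicate, then mark the real draws as
// predicated so the GPU may skip them if nothing passed the depth test.
void DrawWithOcclusionPredicate(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    ID3D11Predicate* predicate = nullptr;
    D3D11_QUERY_DESC desc = {};
    desc.Query     = D3D11_QUERY_OCCLUSION_PREDICATE;
    desc.MiscFlags = D3D11_QUERY_MISC_PREDICATEHINT;  // allow it to act as a hint
    device->CreatePredicate(&desc, &predicate);

    ctx->Begin(predicate);
    // ... draw the bounding-box test geometry here ...
    ctx->End(predicate);

    // Predicated draws may be discarded when the predicate reports no visible
    // samples; since it's a hint, they can still be executed if the result
    // isn't ready yet (as discussed above).
    ctx->SetPredication(predicate, FALSE);
    // ... issue the real draw calls for the object ...
    ctx->SetPredication(nullptr, FALSE);

    predicate->Release();
}
```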

Also, I'm not sure if predicates actually work with Dispatches in DX11. I've never tried it myself, and the documentation only refers to "rendering" when talking about predicates. Has anyone tried it before?
 
I'm not really familiar with the specifics of modern GL at the moment, but as far as DirectX goes you only have the ability to predicate rendering on either the results of an occlusion query or stream-out overflow.
Correct, but you can initialize a query to an arbitrary value using a pixel shader that conditionally discards. Not ideal of course but it's possible.

Also, I'm not sure if predicates actually work with Dispatches in DX11.
You don't use queries for dispatch, you use DispatchIndirect (arguably a bit less roundabout) with 0'd params.

Then what would you replace the command queue with? Even if a GPU can autonomously fill that queue, it's still a queue.
Why does there need to be one serialization point at all in the future? It's not like work stealing systems on the CPU haven't evolved beyond a global task list... nor do CPUs bake work queues into their hardware architecture to start with. Obviously rendering pipeline commands need some ordering setup (although less than current pipelines provide in most cases - i.e. primitive ordering), but there are other ways to set that up that don't end up with yet another serialization point in a few years or slapping a proprietary CPU on the front of the graphics pipeline. If the latter is really what people want for some reason, we have perfectly capable CPUs for that task already. We just need to drop this idea of super coarse-grained, long latency "command buffers" to move forward, particularly on SoCs.
 
You don't use queries for dispatch, you use DispatchIndirect (arguably a bit less roundabout) with 0'd params.
And you can do the zero parameter trick for draw calls as well (indirect draw with 0 instance count), but neither of these methods reduce CPU draw call / dispatch cost at all. So you pay full price on the CPU for occlusion-culled objects. GPU cost of course gets lowered, since it doesn't need to render actual triangles/pixels, but the GPU front end still needs to process all the required GPU state changes and do the necessary setup work. And the GPU needs to receive the updated constant buffers and other buffer modifications that were done in order to set up data for the draw call that was ultimately skipped. Lots of unnecessary work was done.
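To spell out the trick: the argument buffer for a D3D11 DrawInstancedIndirect call just holds the DrawInstanced parameters in GPU memory, so "zeroing the params" means a culling shader writing 0 into the instance count (sketch below), while the CPU-side cost described above is unchanged either way.

```cpp
#include <cstdint>

// Layout of one set of DrawInstancedIndirect arguments in the GPU-side args
// buffer, mirroring the DrawInstanced parameters. A GPU culling pass "skips"
// an object by writing 0 into instanceCount, but the CPU has already paid the
// full cost of recording the draw and its state changes.
struct DrawInstancedIndirectArgs {
    uint32_t vertexCountPerInstance;
    uint32_t instanceCount;        // set to 0 for occluded objects
    uint32_t startVertexLocation;
    uint32_t startInstanceLocation;
};
```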

... but this is just academic talk really. GPU occlusion queries in general don't work that well because of many reasons: the test geometry needs its own draw call (you double your draw calls in the worst case, and never improve on them, unless you have more coarse test geometry, which will reduce your culling potential quite a bit). Also you pretty much need a depth pre-pass, to fill the depth buffer before you start doing your queries. The GPU predicate queries are just hints and the GPU is a high-latency device. All the pixels from the queries need to pass the full GPU pipeline (ROPs and all) in order to be visible at the front end (otherwise the query will just trivially pass - you have no control over this). You thus can't do test1->draw1, test2->draw2, etc. style rendering with GPU predicates. Thus the required depth pre-pass again doubles your draw call count and your triangle count.
Correct, but you can initialize a query to an arbitrary value using a pixel shader that conditionally discards. Not ideal of course but it's possible.
Unfortunately not even this hack is possible with the DX GPU predicate queries (D3D11_QUERY_OCCLUSION_PREDICATE), since it only contains true/false results, and it is also just a hint to the GPU (so it might cause a false positive -> draw when you don't want it). So this is definitely not useful for things that you want to work 100% correctly on every hardware/driver combination every time.
Why does there need to be one serialization point at all in the future? It's not like work stealing systems on the CPU haven't evolved beyond a global task list... nor do CPUs bake work queues into their hardware architecture to start with. Obviously rendering pipeline commands need some ordering setup (although less than current pipelines provide in most cases - i.e. primitive ordering), but there are other ways to set that up that don't end up with yet another serialization point in a few years or slapping a proprietary CPU on the front of the graphics pipeline. If the latter is really what people want for some reason, we have perfectly capable CPUs for that task already. We just need to drop this idea of super coarse-grained, long latency "command buffers" to move forward, particularly on SoCs.
I hope that future GPUs will have more flexible work schedulers. Most game/graphics engines have already moved to (micro-) task based systems. It would be great if GPUs had a similar model in the future.
 
And you can do the zero parameter trick for draw calls as well (indirect draw with 0 instance count), but neither of these methods reduce CPU draw call / dispatch cost at all.
For sure, this was just in reference to the talk about command buffer "control flow" in Mantle, etc. That would not reduce CPU overhead either, it's purely GPUs making some late decisions. As I said, in the long run I don't see the need for all of this complexity and putting an increasingly complex "processor" in front of the GPU vs. just fine-grained submission from our much-faster CPU cores. But I expect this to take its course in software/hardware design simply due to familiarity before it eventually gets back to that.

... but this is just academic talk really. GPU occlusion queries in general don't work that well because of many reasons
Sure, I think conservative CPU culling works a lot better TBH (either with small depth buffer rast or conventional methods). It's really quite cheap, and I expect it still to be a win with Mantle. CPUs do just fine generating depth buffers.

Unfortunately not even this hack is possible with the DX GPU predicate queries (D3D11_QUERY_OCCLUSION_PREDICATE), since it only contains true/false results
Used to be able to draw a 1x1 viewport surrounded by a query and conditionally discard the pixel, but that may have been in GL. This is just a stupid way of getting a bool into a "query object". I see in DX it's tied to depth/stencil test apparently, but you could still do that with a clever vertex shader... getting pretty silly of course, but my point was just that it's possible! :) In any case the drawindirect thing seems cleaner if you have arbitrary predicates in memory somewhere.

Sure it's just a hint, but that's just reality. You will always need to separate out culling sufficiently far from what you want to cull, or stall the pipeline.

Overall I agree though - using the GPU raster hardware is simply not a big enough gain to justify all of the caveats. Even with a sufficiently more powerful model (such as I expect Mantle to provide) I still probably wouldn't do it. Generally people just like to assume stuff is better on the GPU because "GPUs are fast" (heh) without really testing it out.
 
Used to be able to draw a 1x1 viewport surrounded by a query and conditionally discard the pixel, but that may have been in GL. This is just a stupid way of getting a bool into a "query object". I see in DX it's tied to depth/stencil test apparently, but you could still do that with a clever vertex shader... getting pretty silly of course, but my point was just that it's possible! :)

Sure it's just a hint, but that's just reality. You will always need to separate out culling sufficiently far from what you want to cull, or stall the pipeline.
Sorry, I wasn't talking about CPU-readable occlusion queries here. Those of course return exact pixel counts and stall the pipeline if the result is not ready. GPU predicates (new in DX10/11) are only hints and only give you true/false results (skip or not skip the draw call by the GPU). If you place the predicated draw too near the query geometry draw, it will always be rendered (so you get a "false positive"). And this of course depends on the hardware/drivers (how much space you need between the query and the predicate).

In Mantle you can push GPU query results directly to GPU buffers, so you can read them in compute shaders (and thus prepare your indirect draw calls based on that info). And likely you can also insert GPU side fences to ensure that your queries are complete when you need their results in your UAVs.
 
If you place the predicated draw too near the query geometry draw, it will always be rendered (so you get a "false positive").
Yes I'm well aware. I'm just saying that dependency does not change in any culling setup... the pipeline always has to drain/be done before you can use the result in the front-end :)

In Mantle you can push GPU query results directly to GPU buffers, so you can read them in compute shaders (and thus prepare your indirect draw calls based on that info).
Sure, but that doesn't fundamentally change the pipeline drain situation either... those query results still need to be done before you can read them. You may be able to use fences but that's just making it slow instead of useless :)

Other than for data generated "late" on the GPU (say for instance with SDSM), I still don't think it's really worth it vs CPU culling right now. And in the long run I want stuff to be a lot more fine grained such that the CPU can be issuing much smaller workloads at once instead of these large, long-latency command buffers with lots of embedded complexity.
 
Why does there need to be one serialization point at all in the future? It's not like work stealing systems on the CPU haven't evolved beyond a global task list... nor do CPUs bake work queues into their hardware architecture to start with. Obviously rendering pipeline commands need some ordering setup (although less than current pipelines provide in most cases - i.e. primitive ordering), but there are other ways to set that up that don't end up with yet another serialization point in a few years or slapping a proprietary CPU on the front of the graphics pipeline. If the latter is really what people want for some reason, we have perfectly capable CPUs for that task already. We just need to drop this idea of super coarse-grained, long latency "command buffers" to move forward, particularly on SoCs.

The analogy with CPU task systems is interesting.

The best way forward in the near term, imo, is to have multiple compute and graphics queues. This allows enqueuing different opaque objects on different queues with the same rendertarget. This way, you can balance out the GPU much better than enqueueing shadowmap+compute.

This has not been mentioned for Mantle, but seems promising.

Considering all the dependencies between graphics calls, it seems unlikely that we would be able to have work stealing graphics queues, but work stealing compute and DMA queues sound promising.

EDIT: I believe HSA enables multiple compute queues for the GPU, though I am not sure if they are allowed to steal work.

IIRC, AMD said that they decided to keep dedicated queue hw in front of GPUs because it was more efficient than sw without sacrificing flexibility.
 
Overall I agree though - using the GPU raster hardware is simply not a big enough gain to justify all of the caveats. Even with a sufficiently more powerful model (such as I expect Mantle to provide) I still probably wouldn't do it. Generally people just like to assume stuff is better on the GPU because "GPUs are fast" (heh) without really testing it out.

Other than for data generated "late" on the GPU (say for instance with SDSM), I still don't think it's really worth it vs CPU culling right now. And in the long run I want stuff to be a lot more fine grained such that the CPU can be issuing much smaller workloads at once instead of these large, long-latency command buffers with lots of embedded complexity.

Agreed. CPU side occlusion testing (using a CPU rasterized depth pyramid) is the best solution for games that want to do scene setup (and ultimately draw calls) on CPU side. It is definitely not a good idea to submit a separate draw call for each object's occlusion test to GPU ("early out" CPU test is faster).

However, CPU-side occlusion culling (/ depth buffer rendering) doesn't help much with shadow map rendering, and this is where most of the draw calls come from in games that have dynamic lighting. Lighting determination (shadow map texels required for the current frame) needs exact z-buffering results, and for exact results you'd need to rasterize a few million triangles on the CPU every frame (at full resolution). Conservative methods don't work here. And occlusion culling for shadow maps (depth pyramids generated by the CPU for each shadow source/cascade) is naturally out of the question.

GPU occlusion culling is efficient, if you dispatch a single compute shader that checks the visibility of all the objects at once. This step takes just ~0.2 ms (modern GPU, big scene with ~half a million visible objects). The biggest problem isn't the culling itself, but the method to push those visible objects to screen without any CPU intervention (because CPU has no knowledge of the culling results). Naive approach would be to submit N indirect draw calls (where N = number of potentially visible objects) and generate the parameter lists for those indirect calls on GPU. GPU culling pass would zero out the triangle counts of those calls that are occluded. However this naive approach has some critical flaws: it doesn't reduce draw call count at all and it doesn't reduce the setup cost at all. CPU still needs to setup all the state and has to modify all the constant buffers for every draw call, because it has no knowledge whether that call is going to be dropped or not. Also GPU cost for these zero triangle draw calls is not zero either (all the state setup needs to be done regardless).
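To illustrate the cheap part (the culling itself, not the draw submission problem), here is the per-object test such a culling pass performs, written as plain C++ over an assumed max-downsampled depth buffer with a 0 = near, 1 = far convention; a real implementation would be a compute shader sampling a HiZ mip, and the data layout here is made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ObjectBounds { int minX, minY, maxX, maxY; float minDepth; };  // screen rect + nearest depth
struct IndirectArgs { uint32_t indexCount, instanceCount, firstIndex, baseVertex, firstInstance; };

// Farthest (max) depth stored over the object's footprint. If the object's
// nearest point is farther than this, it lies behind every surface already in
// the depth buffer and is therefore occluded.
float FarthestStoredDepth(const std::vector<float>& depth, int w, int h, const ObjectBounds& b)
{
    float farthest = 0.0f;
    for (int y = std::max(0, b.minY); y <= std::min(h - 1, b.maxY); ++y)
        for (int x = std::max(0, b.minX); x <= std::min(w - 1, b.maxX); ++x)
            farthest = std::max(farthest, depth[static_cast<std::size_t>(y) * w + x]);
    return farthest;
}

void CullObjects(const std::vector<ObjectBounds>& objects,
                 const std::vector<float>& depth, int w, int h,
                 std::vector<IndirectArgs>& args)   // one indirect draw per object
{
    for (std::size_t i = 0; i < objects.size(); ++i)
        if (objects[i].minDepth > FarthestStoredDepth(depth, w, h, objects[i]))
            args[i].instanceCount = 0;   // occluded: the GPU skips this draw
}
```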

On a modern GPU, the shader gets a single pointer in, and it points to the start of the resource descriptor array. The layout of the resource descriptor array has been hard-coded (by the compiler) to the shader code itself, so the shader can fetch all the constant buffers, textures, etc it needs by using this single pointer (and the hard coded offsets). On all current APIs, this pointer comes from the CPU. On Mantle (if I understood the presentation correctly), you can pre-build these resource descriptor arrays and reuse them (DX doesn't support anything like this unfortunately). This means that you can generate the resource descriptor array for each different kind of draw/material combination in advance to the GPU memory, and just switch a single pointer to change the bind combination. This could be easily performed by the GPU as well. The GPU culling pipeline could setup a buffer filled with pairs of indirect draw parameters + resource table addresses (**).
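A rough C++ analogy of what "hard-coded offsets into the resource descriptor array" means (made-up names, not actual Mantle structures): the compiled shader knows that, say, descriptor 1 is its albedo texture, so switching the whole bind set is just switching the base pointer.

```cpp
#include <cstddef>
#include <cstdint>

// Made-up generic descriptor; in reality this is whatever the hardware expects.
struct ResourceDescriptor { uint64_t gpuAddress; uint32_t meta[4]; };

// Offsets the shader compiler has baked into this particular shader's code:
constexpr std::size_t kTransformCBuffer = 0;   // descriptor 0: constants
constexpr std::size_t kAlbedoTexture    = 1;   // descriptor 1: texture
constexpr std::size_t kNormalTexture    = 2;   // descriptor 2: texture

// "Rebinding" is just pointing at a different pre-built descriptor array; a
// GPU pass could swap this pointer without any CPU involvement.
const ResourceDescriptor& AlbedoDescriptor(const ResourceDescriptor* table)
{
    return table[kAlbedoTexture];
}
```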

Now there's only one missing part remaining in the puzzle: the draw call generation itself. Unfortunately current APIs (and GPUs?) do not have mechanisms for the GPU to create draw calls itself, so we are out of luck. Recent compute APIs (OpenCL 2.0 and CUDA) however have mechanisms that allow the GPU to create dispatch calls itself, so the future seems bright. We might get this quite soon.

So the options are: A) Submit a large number of indirect draw calls. Fortunately you don't need to change state between them, since a compute shader could set up the resource descriptor array pointers for all visible objects (very efficiently)... or B) Submit a single indirect draw call that draws everything (for that approach you need a few extra workarounds and compromises, but it also removes the draw call overhead completely from both the CPU and GPU side).

** I don't know whether the Mantle resource descriptor array contains any vertex and/or index buffer descriptors, since the shader code cannot access them directly. The workaround here is of course to use general purpose buffers instead, but then you lose the post-transform vertex cache. This is also one of the fixed function units I hope disappears in the future. A programmable post-transform cache would be very handy for many reasons...
 
GPU occlusion culling is efficient, if you dispatch a single compute shader that checks the visibility of all the objects at once. This step takes just ~0.2 ms (modern GPU, big scene with ~half a million visible objects). The biggest problem isn't the culling itself, but the method to push those visible objects to screen without any CPU intervention (because CPU has no knowledge of the culling results). Naive approach would be to submit N indirect draw calls (where N = number of potentially visible objects) and generate the parameter lists for those indirect calls on GPU. GPU culling pass would zero out the triangle counts of those calls that are occluded. However this naive approach has some critical flaws: it doesn't reduce draw call count at all and it doesn't reduce the setup cost at all. CPU still needs to setup all the state and has to modify all the constant buffers for every draw call, because it has no knowledge whether that call is going to be dropped or not. Also GPU cost for these zero triangle draw calls is not zero either (all the state setup needs to be done regardless).

Yeah, the current feature set available through D3D11 is still pretty limiting in this regard. I did some wacky stuff in a sample app that used a compute shader to cull meshes and batch them together (by generating a new index buffer) in order to avoid lots of DrawIndirect calls, and it wasn't exactly a fun experience. And I didn't even account for meshes that might need to use complex vertex shaders for skinning, or meshes that need pixel shaders for alpha test.

** I don't know whether the Mantle resource descriptor array contains any vertex and/or index buffer descriptors, since the shader code cannot access them directly. The workaround here is of course to use general purpose buffers instead, but then you lose the post-transform vertex cache. This is also one of the fixed function units I hope disappears in the future. A programmable post-transform cache would be very handy for many reasons...

SI doesn't use a resource descriptor for index buffers, it's all set via PM4 packets (see DRAW_INDEX/INDEX_BUFFER_SIZE/INDEX_TYPE in the SI programming guide).
 
Nothing really new there, just repeating what he ultimately would like to see happen with something like Mantle and IHV's. Can we really see Nvidia, the king of private closed APIs, working with AMD to provide drivers and support for Mantle? And can we really see AMD not charging licensing fees for something they've dedicated at least 2 years of R&D work for just so consumers can buy Nvidia cards for Mantle?

The answer I want for both is yes but reality is different in large corporate world.
 
Mantle requires a certain set of key functionality of the GPU, so it can’t be supported on older architectures before AMD’s GCN architecture
And if other vendors' GPUs don't have that "certain set of key functionality", Mantle can't be supported on them.
 
Nothing really new there, just repeating what he ultimately would like to see happen with something like Mantle and IHV's. Can we really see Nvidia, the king of private closed APIs, working with AMD to provide drivers and support for Mantle? And can we really see AMD not charging licensing fees for something they've dedicated at least 2 years of R&D work for just so consumers can buy Nvidia cards for Mantle?

The answer I want for both is yes but reality is different in large corporate world.

agree
 
Can we really see Nvidia, the king of private closed APIs, working with AMD to provide drivers and support for Mantle? And can we really see AMD not charging licensing fees for something they've dedicated at least 2 years of R&D work for just so consumers can buy Nvidia cards for Mantle.
So weird to me that Nvidia is the king of private closed APIs in a Mantle thread.
 
So weird to me that Nvidia is the king of private closed APIs in a Mantle thread.

Why is that weird? They have CUDA, PhysX, APEX (does GSYNC count?) just to name a few. All are closed APIs. This is really AMD's first seemingly closed API in recent memory? What was the last closed GPU functionality they had, npatches?
 