AMD Mantle API [updating]

Bindless textures and hardware virtual memory (*) allow rendering in larger batches, thus increasing the GPU utilization (GPU partially idles at start/end of draw/dispatch calls).
I'm gonna hit this point later because the discussion of bindless in the developer Q&A here (http://www.brightsideofnews.com/new...ed-with-qa-with-devs-from-oxide-and-dice.aspx) was pretty disappointing. It's clear that you get it, but I'm not sure the folks on the panel had thought about it too much.

With Mantle you can run multiple kernels in parallel (or kernel + graphics in parallel) in a controlled way, and thus reduce the GPU bottlenecks. For example render a shadow map (mostly ROP and geometry setup) and compute ALU heavy pass (for example lighting for the previous light source) at the same time.
Not sure if you saw, but I mentioned this in my post as well. The issue is this is the only example people have given of cases where you're completely not utilizing the ALU array... rendering depth-only. So there's a nice one-time boost we can get during shadow map rendering, but it's not a long-term performance amplifier per se. Also the more power-constrained a GPU is, the less this will help.
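For concreteness, the overlap being described amounts to submitting the two workloads on separate graphics and compute queues. A minimal sketch of that submission pattern is below; the queue/fence types are hypothetical placeholders (Mantle's real entry points aren't public yet), but the idea maps onto any API that exposes both queue types.

```cpp
// Hypothetical queue/fence interface -- placeholder names, not Mantle's real API.
struct Fence {};
struct CommandBuffer {};

struct GraphicsQueue {
    void Submit(CommandBuffer* cb, Fence* signalWhenDone) { /* hand cb to the HW graphics queue */ }
};
struct ComputeQueue {
    void Submit(CommandBuffer* cb, Fence* waitFor, Fence* signalWhenDone) { /* hand cb to the HW compute queue */ }
};

void RenderFrame(GraphicsQueue& gfx, ComputeQueue& compute,
                 CommandBuffer* shadowMapPass,        // ROP/geometry-setup heavy, ALUs mostly idle
                 CommandBuffer* lightingForPrevLight, // ALU-heavy compute, little fixed-function use
                 Fence* lightingDone)
{
    // Kick the ALU-heavy lighting for the *previous* light source on the compute
    // queue while the graphics queue rasterizes the next shadow map. The two
    // workloads stress different units, so they can overlap on the shader array.
    compute.Submit(lightingForPrevLight, /*waitFor=*/nullptr, lightingDone);
    gfx.Submit(shadowMapPass, /*signalWhenDone=*/nullptr);

    // A later pass that consumes the lighting result would wait on lightingDone
    // before reading the output buffer.
}
```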

Better predicates and storing GPU query results to GPU buffers (without CPU intervention) allow GPU optimization techniques that are not possible with PC DirectX.
They are possible (you can set up arbitrary predicate data), but they are ugly no doubt. I've asked for this in the past as well.
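For reference, the "ugly but possible" DX route looks roughly like this: a D3D11 occlusion predicate lets the GPU skip a draw with no CPU readback, but you only get the fixed predicate semantics rather than arbitrary GPU-written conditions. Minimal sketch, error handling and pipeline setup omitted:

```cpp
#include <d3d11.h>

// Render a cheap bounding proxy inside a predicate query, then issue the
// expensive draw predicated on it. The GPU skips the draw if no samples
// passed the depth test -- the CPU never reads the query back.
void PredicatedDraw(ID3D11Device* device, ID3D11DeviceContext* ctx,
                    UINT proxyIndexCount, UINT realIndexCount)
{
    ID3D11Predicate* predicate = nullptr;
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_OCCLUSION_PREDICATE;
    device->CreatePredicate(&desc, &predicate);

    // Measure visibility with a cheap proxy (e.g. bounding box, color/depth writes disabled).
    ctx->Begin(predicate);
    ctx->DrawIndexed(proxyIndexCount, 0, 0);
    ctx->End(predicate);

    // Commands are skipped when the predicate data equals the given value (FALSE),
    // so the expensive draw only executes if the proxy produced visible samples.
    ctx->SetPredication(predicate, FALSE);
    ctx->DrawIndexed(realIndexCount, 0, 0);
    ctx->SetPredication(nullptr, FALSE);

    predicate->Release();
}
```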

I was quite worried that SLI/Crossfire would die soon, as many new graphics engines will start doing scene management and rendering decisions on GPU side.
As much as I agree that SLI is totally dead in the water without explicit separate device enumeration, I'm not sure even that will save it. The market is just too microscopic to bother doing what would be quite significant work to make an engine take good advantage of it. I expect a few cases where developers are paid to do it for benchmark wins, but I don't expect much more than slightly-tweaked AFR for most games, and I'm not sure the PC market will tolerate that extra frame of latency going forward. We'll see.

Deferred antialiasing will be much more efficient, assuming the "Advanced MSAA features" in the Mantle slides means that you have direct access to GPU color/depth blocks and MSAA/layer data (including coverage sample index data). With all that data available, tiled (and clustered) deferred renderers can separate pixels (different sample counts) more efficiently and recover geometry edge information (using coverage samples) in a much more precise and efficient way.
Yeah I spoke to this in my 2010 talk. While it's neat, it's a bit tricky to decide exactly what to expose and that's why I've been asking Johan for more info on what GPU features will be exposed :) Compression is one area of high variance between IHVs, so you could not expose the specifics of GCN there and still call it portable. I also highly doubt AMD would be comfortable with pinning down the specifics of their color/depth compression forever (forward compatibility). There are somewhat portable ways that you can expose a lot of the features that deferred MSAA needs, but we'll see what they do in Mantle when the details are public (next year? early hopefully?).

With Mantle rendering a depth pass to prime the depth buffer in a Forward(+) renderer becomes a completely viable option without running into CPU bottleneck situations.
The GPU side is an issue regardless of the CPU side. You don't want to retransform/skin/rasterize your entire scene twice, even if the CPU cost were free. Honestly in the long run pre-z is going to lose out to coarser low-res (and low latency) conservative depth rasterization (or equivalent ray tracing), and in a similar vein clustered shading (i.e. the GPU pulls data from a light acceleration structure) is going to win for those who want forward rendering engines. There are cases where stuff like tiled forward (aka forward+) will make sense in the meantime, but mostly because of legacy code. Emil made a pretty compelling argument for why it makes more sense to generate the light acceleration structure before submission, even if submission is cheap.
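To be clear about what that "light acceleration structure" is: in clustered shading it's just a 3D grid over the view frustum with a light list per cell, which the shading pass later pulls from per pixel. Building it needs no geometry submission at all. A minimal CPU-side sketch (view-space point lights, logarithmic Z slices; grid sizes and the slab-only culling are illustrative simplifications):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Split the view frustum into GRID_X x GRID_Y x GRID_Z cells and record which
// point lights touch each cell. The GPU shading pass just pulls the light list
// for the cluster a pixel falls into -- no depth pre-pass required to build it.
struct PointLight { float viewPos[3]; float radius; };

constexpr int GRID_X = 16, GRID_Y = 8, GRID_Z = 24;

struct ClusterGrid {
    // Per-cluster light index lists; a real engine would pack these into a GPU
    // buffer (offset/count per cluster plus one flat index array).
    std::vector<uint32_t> lightIndices[GRID_X * GRID_Y * GRID_Z];
};

// Map view-space depth to a logarithmic Z slice (linear slices waste
// resolution up close under perspective).
static int DepthToSlice(float viewZ, float zNear, float zFar) {
    float t = std::log(viewZ / zNear) / std::log(zFar / zNear);
    return std::clamp(int(t * GRID_Z), 0, GRID_Z - 1);
}

void BinLights(const std::vector<PointLight>& lights, float zNear, float zFar,
               ClusterGrid& grid)
{
    for (uint32_t i = 0; i < (uint32_t)lights.size(); ++i) {
        const PointLight& l = lights[i];
        float zMin = l.viewPos[2] - l.radius;
        float zMax = l.viewPos[2] + l.radius;
        if (zMax < zNear || zMin > zFar) continue;

        // Conservative: add the light to every cluster in the Z slab it spans.
        // A tighter version would also test the per-cluster X/Y frustum planes.
        int s0 = DepthToSlice(std::max(zMin, zNear), zNear, zFar);
        int s1 = DepthToSlice(std::min(zMax, zFar), zNear, zFar);
        for (int z = s0; z <= s1; ++z)
            for (int y = 0; y < GRID_Y; ++y)
                for (int x = 0; x < GRID_X; ++x)
                    grid.lightIndices[(z * GRID_Y + y) * GRID_X + x].push_back(i);
    }
}
```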

Forward+ is a good solution if your triangle counts are not that high and you don't use heavy vertex animation or tessellation. In low polygon games (current gen ports for example) the depth only pass is dirt cheap (for GPU), and quad efficiency is often more than 80% (so you lose less than 20% of your lighting performance). But I just don't see it as a viable technique in the future, especially now as Mantle allows low level access to the GPU MSAA data on PC as well (this is a huge gain for the deferred renderers).
^ What he said :) Thanks for going into more detail. Fundamentally I don't think Mantle changes the arguments I made in my 2012 talk on tiled forward vs deferred (other than potentially making deferred MSAA less expensive); I was making no assumptions about CPU overhead in that analysis.

***

Now back to the bit on bindless... I really wasn't too happy with the answers given in the Q&A. I'm going to write it off to people not having thought a lot about it and being somewhat conditioned in their engine design thinking by the way that APIs have always been to this point, but they really missed the point of why people are bringing up bindless in this context.

Bindless textures basically remove the last piece of state that changes at high frequencies in engines and thus "breaks batches". sebbbi has mentioned this before, but if you want to you can completely render at least an entire pass to a set of render targets with one draw call using bindless. Thus the overhead of draw calls is largely irrelevant... even DX today has quite acceptable overhead if you're only talking about tens-of-draw-calls per frame.
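Roughly, the data layout that enables "one draw per pass" looks like the sketch below. The struct names are illustrative, not any particular API: all the per-draw state that used to force a batch break (which textures, which material constants) becomes plain data the shader pulls by index.

```cpp
#include <cstdint>
#include <vector>

struct MaterialRecord {
    uint32_t albedoTexIndex;   // index into the global bindless texture table
    uint32_t normalTexIndex;
    uint32_t paramsOffset;     // offset into one big material-constant buffer
};

struct InstanceRecord {
    float    worldTransform[12]; // 3x4 affine transform
    uint32_t materialIndex;      // which MaterialRecord this instance uses
    uint32_t meshFirstIndex;     // where this mesh's indices live in the shared IB
    uint32_t meshIndexCount;
};

// CPU side: concatenate every object in the pass into shared buffers, then
// issue a single (multi-)draw. The vertex/pixel shaders read InstanceRecord
// and MaterialRecord from buffers and fetch textures through the bindless
// table, so no per-object binds -- and no batch breaks -- are needed.
struct PassSubmission {
    std::vector<InstanceRecord> instances; // uploaded to a structured buffer
    std::vector<MaterialRecord> materials; // uploaded once, shared across passes
};
```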

And in doing this, you are giving up no flexibility vs. the low-overhead small batches Mantle approach. The results are identical and arguably it has even lower CPU overhead than Mantle. The only differences are whether you're pushing vs. pulling state and whether the effective command list/buffer is hidden behind an API or just implemented as an arbitrary user-space concept.

Now in theory there can be some additional GPU overhead for bindless, depending on how it is implemented and the GPU architecture. Pushed state is usually easier to deal with in hardware, since pulled state typically requires caches. That said, we're already long past the point where we've transitioned most of the fixed pipelines to generic cached memory accesses (and GCN more than most), so I'm not really concerned about this in the future. You might have to make your constant cache a bit bigger, but you get back the same space in the lack of other on-chip buffers that used to hold the binding table. It's really not very different.

Basically with pure bindless you can implement the "descriptor set" concept in user space, or however else you want to. In some cases it might make sense to have a simple list of them - such as in Mantle - in others it might make sense for them to be part of some deeper geometry data structure as is necessary if you want to "pull" arbitrary data and do shading more dynamically - such as with ray tracing.
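In other words, the "descriptor set" degenerates to an ordinary struct the application owns; whether it sits in a flat list or hangs off a node of some deeper geometry structure is purely an engine decision. A hypothetical sketch of both usages (nothing here is an API object):

```cpp
#include <cstdint>

// With pure bindless, a "descriptor set" is just application data: a bundle of
// indices/handles into global resource tables.
struct UserDescriptorSet {
    uint32_t textureIndices[8];   // slots into the global bindless texture table
    uint32_t samplerIndices[2];
    uint64_t constantBufferVA;    // GPU virtual address of this material's constants
};

// Mantle-style usage: keep them in a flat array, reference one per draw record.
struct DrawRecord {
    uint32_t descriptorSetIndex;
    uint32_t firstIndex, indexCount;
};

// Ray-tracing-style usage: embed them in the geometry structure itself, so a
// hit on any node can immediately pull everything needed to shade it -- no
// binding step exists at all.
struct BvhLeaf {
    float             aabbMin[3], aabbMax[3];
    uint32_t          triangleOffset, triangleCount;
    UserDescriptorSet shadingData;
};
```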

The other advantage of the "pull" bindless model is better distributed scheduling. You effectively parallelize the work of pipelining the relevant state. They touched on this briefly in the Q&A in the question about "moving bottlenecks", but the Mantle model does not scale indefinitely. Even with multithreaded command buffer *creation*, they are still consumed by the GPU front-end in a fairly sequential manner. The front-end at a minimum has to deal with setting up the pipelined state (on GCN I think this takes the form of something like a memcpy of the relevant state vector). Either this is a copy of the "descriptor set" itself - in which case you're talking some number of cycles to copy it - or it's a pointer to the descriptor set. If the latter and later stages "pull" from it, the question arises as to why we even need that concept in the API since it could as easily be a user-space data structure. I imagine for that reason that some parts of it are indeed pushed through the pipeline.

But as the AMD guy hinted at, if you want another 10x gain on the 100k batches (Dan mentioned wanting millions in a few years), you're going to be pretty close to the limit of what you can do with current GPU front-end design (i.e. a small number of cycles per state change/draw)... and parallelizing the front-end would probably start to look a lot more like the pull model anyway, bringing us back to the question of whether we really want this "pushed" state around in the first place.

So yeah, I admit that the question the guy asked was perhaps not specific enough, but I would have wanted to hear a better discussion of push vs pull in the future. I just don't think most game devs have necessarily thought that far out yet. Don't get me wrong, I'll take both lower overhead and bindless, but it's not clear that we really need to go nuts on the former given that we really need the latter regardless.

Sorry for the long post; brain dump :)
 
Not sure if you saw, but I mentioned this in my post as well. The issue is this is the only example people have given of cases where you're completely not utilizing the ALU array... rendering depth-only. So there's a nice one-time boost we can get during shadow map rendering, but it's not a long-term performance amplifier per se. Also the more power-constrained a GPU is, the less this will help.

Great post, Andrew.


The gbuffer pass has free pixel shaders too. I think a more important point is that from now on, your shader array will never be idle.
 
The gbuffer pass has free pixel shaders too. I think a more important point is that from now on, your shader array will never be idle.
G-buffer generation still needs to run pixel shaders in most cases, but indeed in some cases you won't be PS bound. That said, it's not clear to me if ROP or texture-bound cases can benefit from this overlapping on GCN. Texture in particular would be tricky, but ROP might be doable depending on the details of the reordering/scoreboarding (i.e. when the shader entirely retires).
 
I am looking forward to seeing how Mantle adopters will be using the power available to them to optimize their engine once they've had more time to play with the API.
Hi Nick!

Let me tell you that I like Mantle, but what about the hardware? If GPU utilization is higher, then Mantle code can produce a Furmark-like heavy load in a game. Are the driver guys aware of this situation?
 
Hi Nick!

Let me tell you that I like Mantle, but what about the hardware? If GPU utilization is higher, then Mantle code can produce a Furmark-like heavy load in a game. Are the driver guys aware of this situation?

I don't mean to speak for Nick, but this is what PowerTune is for.
 
Some of you may have seen that we announced our intention to support AMD’s Mantle with Star Citizen. We didn't do this because AMD sends us lots of high-end cards (although that doesn’t hurt). We are doing this because it increases the ability of a PC to get the most out of its incredibly powerful hardware. Going to the hardware without a huge inefficient API like DirectX allows us to radically increase the number of draw calls in a frame – At last week’s AMD developer conference Nitrous, which is a new company working on a next gen PC engine, demoed a scene with over 100,000 drawcalls per frame running at over 60 FPS through Mantle. To put that in context last gen stuff (and a bunch of PC games gated by DirectX) have been stuck around 2,000 - 3,000 drawcalls and next gen consoles (like PS4) can do 10,000 - 15,000 or so. We’re supporting Mantle to push PC graphics performance higher – it’s been gated too long by DirectX’s inefficiency and abstraction, which has only gotten worse as Microsoft becomes less interested in the PC as a gaming platform. I would love NVidia and Intel to have Mantle drivers (as the API is designed to be non GPU architecture specific) but if not we would support NVidia or Intel drivers that would allow us to get to the metal (GPU Hardware) efficiently and take advantage of parallelism in CPU cores (for efficient batching of data between the game and the GPU).

This is ironically the advantage the next gen consoles have like PS4 and Xbox One – they abstract the low level hardware much less, so what is essentially a mid-level gaming PC of today (which are what the PS4 and Xbox One specs are) punches above its performance weight while Windows and Direct X do a nice job of handicapping the high end PC.

I'm supporting Mantle to push the PC as a gaming platform even further and negate one of the advantages of a console over a PC. Hardly the actions of someone about to sell out! :)

https://forums.robertsspaceindustries.com/discussion/76653/star-citizen-pc-ps4-and-consoles

Regarding the second bolded part: so are they specific to GCN or not? And if they are agnostic, as it seems, could it simply be a customized OpenGL?
 
Wow, that's some serious Mantle support. Very positive stuff! And 100,000+ drawcalls per frame compared with 15,000 on the next gen consoles? How things change, eh?
 
Regarding the second bolded part: so are they specific to GCN or not? And if they are agnostic, as it seems, could it simply be a customized OpenGL?
At this point developers saying "it's portable" is basically meaningless. We need clear statements from AMD about what their plans are for standardization, development of the API, etc. or it's no different than when people were saying that CUDA was portable.

Frankly, while I respect the guys making the statements, I wouldn't consider most of them experts on low-level NVIDIA and Intel GPU architecture. Making something portable always requires a willingness to compromise or alter various specifics. Until AMD makes it clear that they are willing to do that (and they are not willing to do that for Mantle 1.0, considering games will ship using it soon), the point is moot.

Take these statements as what they are: developers expressing their desire for better multithreading and performance for small batches on the other major graphics hardware platforms, and I'm pretty sure that has been heard loud and clear. Whether Mantle or another API provides that is more politics than tech, but ultimately it doesn't really matter.
 
Well, be it from DICE or Oxide, all the talk about Mantle being portable or open is wishful thinking.
From their presentation it is clear that the core of Mantle is portable, though I don't think it makes sense for AMD to open it now. They are in a good situation thanks to their wins in the console realm; Mantle should be even more beneficial for mobile devices, and they would be dumb not to leverage it.
 
Well, be it from DICE or Oxide, all the talk about Mantle being portable or open is wishful thinking.
From their presentation it is clear that the core of Mantle is portable, though I don't think it makes sense for AMD to open it now. They are in a good situation thanks to their wins in the console realm; Mantle should be even more beneficial for mobile devices, and they would be dumb not to leverage it.

The best business decision would be exactly that: let the rest of the industry work for it, especially those that aren't in the HSA Foundation. Leverage your position in both the HSA Foundation and the gaming ecosystem, benefiting AMD first and later the rest of the HSA Foundation members.
 
At this point developers saying "it's portable" is basically meaningless. We need clear statements from AMD about what their plans are for standardization, development of the API, etc. or it's no different than when people were saying that CUDA was portable.

Frankly, while I respect the guys making the statements, I wouldn't consider most of them experts on low-level NVIDIA and Intel GPU architecture. Making something portable always requires a willingness to compromise or alter various specifics. Until AMD makes it clear that they are willing to do that (and they are not willing to do that for Mantle 1.0, considering games will ship using it soon), the point is moot.

Take these statements as what they are: developers expressing their desire for better multithreading and performance for small batches on the other major graphics hardware platforms, and I'm pretty sure that has been heard loud and clear. Whether Mantle or another API provides that is more politics than tech, but ultimately it doesn't really matter.

Does it even make sense for AMD to first offer Mantle to Intel and Nvidia?

I remember Johan talked about the need for some common standard in the mobile space during the Q&A part of the Nvidia conference with Carmack and co. On Johan's slide 33 from APU13, he mentions mobile SoC vendors, Google and Apple. Shouldn't the co-founders of the HSA Foundation be a natural first target if AMD wants to make this an industry standard? Samsung, Qualcomm, Texas Instruments, etc.?

Slide 33:
http://www.frostbite.com/2013/11/mantle-for-developers/

Another thing I wanted to ask you especially, considering your insight in your previous frametime thread:

AMD and the developers spoke a bit about lower latency, the developers being able to do the sync instead of the GPU drivers, and a lot of other things above my head. With the added control developers supposedly get, could this translate into smoother gameplay?

Microsoft does a "G-sync" kinda thing with resolution on Xbox one:
http://gearnuke.com/xbox-one-already-has-an-answer-to-nvidia-g-sync/

Could Mantle take this even one step further now that it only has to deal with a thin driver?
 
Does it even make sense for AMD to first offer Mantle to Intel and Nvidia?
It's not a case of "offering" anything... licensing wouldn't be the issue, it would be giving up control. And indeed it may not make sense for them to do that. Indeed that's what I have been expecting all along, but the game dev comments urging everyone to "adopt" Mantle itself seem to indicate they are not thinking the same thing...

Shouldn't the co-founders of the HSA Foundation be a natural first target if AMD wants to make this an industry standard?
In its current form it is not related to HSA. In fact it is far more tied to Windows via HLSL right now (vs. HSAIL or equivalent). Of course that could change, but for now the obvious targets are other DX11 cards.

AMD and the developers spoke a bit about lower latency, the developers being able to do the sync instead of the GPU drivers, and a lot of other things above my head. With the added control developers supposedly get, could this translate into smoother gameplay?
I'd need to know the context of the "lower latency" statement. Certainly any performance improvements can translate to smoother gameplay and less driver magic is certainly welcome in that context as well. To put it another way, if the game isn't smooth using Mantle it's more likely the application's fault than with DirectX :)

Microsoft does a "G-sync" kinda thing with resolution on Xbox one:
http://gearnuke.com/xbox-one-already-has-an-answer-to-nvidia-g-sync/
Yeah, not related to G-sync at all. In fact dynamic resolution rendering is neither new (http://software.intel.com/en-us/vcsource/samples/dynamic-resolution-rendering and earlier cases too), nor does it require special hardware. Windows 8.1 already includes some built-in support for dynamically-resizing the swap chain, so I'm not sure why people are even calling this a "feature" on Xbox One.
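For what it's worth, the core of dynamic resolution rendering is just rendering into a subrectangle of a fixed-size target by shrinking the viewport, then upscaling at presentation time; no special hardware or OS support is involved. A minimal D3D11-flavored sketch (the scale-selection heuristic is up to the application):

```cpp
#include <d3d11.h>

// Allocate the render target once at full size, then each frame render into a
// scaled-down viewport inside it based on how much GPU headroom recent frames
// had. The final pass samples only the used subrectangle and upscales to the
// back buffer.
void SetDynamicViewport(ID3D11DeviceContext* ctx,
                        UINT fullWidth, UINT fullHeight,
                        float resolutionScale /* e.g. 0.5f .. 1.0f */)
{
    D3D11_VIEWPORT vp = {};
    vp.TopLeftX = 0.0f;
    vp.TopLeftY = 0.0f;
    vp.Width    = fullWidth  * resolutionScale;  // render into the top-left subrect
    vp.Height   = fullHeight * resolutionScale;
    vp.MinDepth = 0.0f;
    vp.MaxDepth = 1.0f;
    ctx->RSSetViewports(1, &vp);

    // The upscale/present pass must clamp its UVs to [0, resolutionScale] so it
    // only reads the subrectangle that was actually rendered this frame.
}
```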

In terms of the real advantages of G-sync - i.e. smoothness below 60fps - that absolutely requires special hardware in both the display and GPU.
 
To tell the truth, none of us speak on behalf of AMD, so we can't say whether or not AMD's board wants to make any specific business decision.

I would eagerly call for someone close to the matter (repi ;)) to speak more about the technical side of Mantle if he could, especially any of the exposed GPU features. ;)
 
I'd need to know the context of the "lower latency" statement. Certainly any performance improvements can translate to smoother gameplay and less driver magic is certainly welcome in that context as well. To put it another way, if the game isn't smooth using Mantle it's more likely the application's fault than with DirectX :)

Thank you for the answer. :) I can't give the whole context right now, since I'm currently in the Canary Islands packing to return home tomorrow, and some of it is from video presentations.

One context for the reduced latency is the parallel dispatch (slide 16). Others are on the explicit multi-GPU slide (slide 27). You talked a bit in your thread about CPU spikes being more common, while GPU spikes were rarer due to having more predictable latencies.

http://www.frostbite.com/2013/11/mantle-for-developers/

I'm not only thinking about the latency, but the whole new Mantle pipeline in a frametime perspective. According to your thread about frametimes, you describe the need for consistency and control through the whole pipeline.

Could the control given in the Mantle pipeline (in the APU13 Q&A section, they talk a bit about being able to shift the bottlenecks in Mantle from CPU to GPU to a large degree) make it easier for game developers to predict and synchronize the game engine, frametime, and framerate for a more consistent throughput vs. the refresh rate?

Would it be easier for a developer using Mantle to target and maintain a steady framerate (say 60 fps at a consistent 16.7 ms) vs. DX or OpenGL, provided the claims (draw calls, new methods, more control, etc.) given by AMD, DICE and other developers so far are true and work as advertised?

I'm sorry if I'm formulating this a bit clumsily. As you pointed out in your thread, frametime is more important than raw framerate for smoothness, and it would be fun to hear whether you have any views on whether Mantle could contribute something on that front. :)

Edit: See also slide 4 from Eidos/Thief with Mantle in UE3:
Last bullet point
"PC gamers will get smooth gameplay! No unexpected stalls due to shader creation or state changes that triggers driver recompilation."
http://i.imgur.com/nt5217p.png
 
There are at least three areas where Mantle can help with frame latency or stuttering issues.
The first is generally higher performance, meaning a higher percentage of frames (or all frames depending on workload/HW configuration) rendered at or above the monitor's refresh rate. The more time spent above 60 Hz the more stability in frame rate.
The second is what Jurjen from Nixxes mentioned in his presentation: no stuttering due to runtime shader compilation or similar events that would usually be triggered by the DX runtime. (Some of it can be alleviated by pre-warming your shader cache in DX but in practice catching all shader permutations is difficult for an engine).
The third is multi adapter support, namely the ability for more than one GPU/APU to process graphics or compute workload for the same frame (as opposed to current Alternate Frame Rendering Multi-GPU solutions that increase input lag).
 
^ What Nick said :)

Collaborating on a single frame instead of AFR could potentially make multi-GPU more interesting to me, but it might be a significant amount of work, particularly if one wants to support arbitrary asymmetric configurations. It will be interesting to see if developers put in the effort for what has traditionally been a fairly small market. Perhaps in the context of VR stuff (which is both latency sensitive and requires very high performance) it might be worth some more effort and there's some natural parallelism between eyes to exploit there too (although one can be more clever than just rasterizing each eye independently).
 
At last week’s AMD developer conference Nitrous, which is a new company working on a next gen PC engine, demoed a scene with over 100,000 drawcalls per frame running at over 60 FPS through Mantle. To put that in context last gen stuff (and a bunch of PC games gated by DirectX) have been stuck around 2,000 - 3,000 drawcalls and next gen consoles (like PS4) can do 10,000 - 15,000 or so.

It looks amazing, I want some real benchmarks.
 