AMD Mantle API [updating]

Either Nvidia would have to be very good at doing what Oxide is doing, or its inherent overhead is low enough to give that much leeway for the analysis, or some combination of the two.
Maybe there's some data available on the performance of Star Swarm with various NVidia drivers, so we could see if there's been a change in performance.

I have no idea why Star Swarm uses so many draw calls. It may be separating work into draw calls for no good reason, and Nvidia has tuned for that. It could also be that there are certain kinds of parallelism in Nvidia's GPU state handling (e.g. a simple ping-pong state change model, where a new state can be set up in hardware across the chip while work for the existing state is still under way, and a simple flip then cuts over "instantaneously" to the new state) that enable the GPU to move the bottleneck deeper, beyond the command processor (CP).
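Purely to illustrate that speculation, here is a toy C++ sketch of a double-buffered ("ping-pong") state model. Every name here is invented for the example, and real hardware is obviously not this simple:

```cpp
// Conceptual sketch only: two copies of the GPU state registers. While the
// chip executes work against state[active], the command processor decodes
// the next state change into the other copy; a cheap flip cuts over without
// waiting for in-flight work to drain.
struct GpuState {
    // blend mode, raster state, shader bindings, ... (placeholder)
    int placeholder;
};

struct PingPongState {
    GpuState state[2];
    int active = 0;

    GpuState& next() { return state[1 - active]; } // set up upcoming state here
    void flip()      { active = 1 - active; }      // "instantaneous" cutover
};
```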

Perhaps NVidia has a near-stateless architecture, such that most work is distributed with piecemeal "state" solely for its own use? Not just bindless resources, but "bindless state".
 
The small batches problem is a giant reason why AMD used Oxide as a marketing tool for Mantle.
It just turns out that once things are opened up, AMD, while much better than under DX11, is not the best at the use case it championed.
If you look at the available OpenGL 4.4 MDI (multi-draw-indirect) benchmark results, you will notice that Nvidia clearly beats AMD. I believe the reason is that Nvidia has a longer history of rendering techniques that allow the GPU to feed itself (they had custom MDI extensions before MDI became an ARB feature). Nvidia has likely noticed that the draw call submission rate is a problem when the GPU generates a huge number of draw calls very quickly with MDI. They have been aware of this bottleneck and have had several generations to improve their front end to reduce it.
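To make that concrete, here is a minimal C++/OpenGL sketch of MDI submission. Function and buffer names are illustrative, and it assumes a GLEW-style loader; the point is that one API call replaces thousands of individual draws, and a GPU culling pass could just as well have written the command buffer:

```cpp
#include <GL/glew.h>
#include <vector>

// Matches the layout GL expects for glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    GLuint count;         // index count for this mesh
    GLuint instanceCount; // usually 1
    GLuint firstIndex;    // offset into the shared index buffer
    GLuint baseVertex;    // offset into the shared vertex buffer
    GLuint baseInstance;  // free slot for per-draw data (e.g. material ID)
};

void submitScene(const std::vector<DrawElementsIndirectCommand>& cmds)
{
    // Upload all draw commands in one go. A GPU culling pass could instead
    // write this buffer directly, which is how the GPU "feeds itself".
    GLuint cmdBuffer;
    glGenBuffers(1, &cmdBuffer);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 cmds.size() * sizeof(DrawElementsIndirectCommand),
                 cmds.data(), GL_DYNAMIC_DRAW);

    // One call replaces cmds.size() draw calls. The CPU cost is constant;
    // the per-draw cost moves to the GPU front end, which is exactly where
    // the command processor bottleneck shows up.
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr, (GLsizei)cmds.size(), 0);
}
```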

Mantle and DX 12 are the first APIs that actually make it possible for the CPU to overload the command processor. Now that the command processor is an actual bottleneck (in some cases), I believe we will see quite rapid improvements. The same happened when tessellation became a benchmark feature.
Indeed. Very small draw calls are a bad idea because there are other bottlenecks right behind the command processor. Partial wavefronts on AMD hardware, for one. Probably partial warps on Nvidia too.
Yes, partial waves/warps are a real problem (waves more than warps, because a wave holds 64 vertices and a warp only 32). Once you solve the front end bottlenecks and start rendering 500k+ unique meshes per frame (at 60 fps or more), you immediately hit the fixed function primitive/vertex rate bottleneck. The huge majority of the rendered meshes must be very simple (less than 64 vertices) to avoid that bottleneck. But then the partial vertex shader waves/warps become a real problem (*). And it doesn't stop there. 500k visible meshes at 1080p = 4 pixels/mesh on average (1920x1080 ≈ 2.07M pixels / 500k meshes) = lots of bottlenecks (macro tiles, quads, etc). Obviously in real games most of these meshes would end up being rendered to off-screen surfaces (such as shadow maps), but I am sure we will see unrealistic benchmarks that try to render them all to the same back buffer (and hit gazillions of different GPU bottlenecks).

(*) You can solve the partial vertex shader warp/wave problem by rendering multiple meshes with a single draw call. It seems that Oxide has a CPU-based solution to this problem, but you can also do this entirely on the GPU.
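As a rough illustration of the CPU-based variant (not necessarily what Oxide does), here is a sketch that rewrites the indices of many small meshes into one stream so a single draw covers them all and wavefronts pack full. All names are invented:

```cpp
#include <cstdint>
#include <vector>

struct Mesh { uint32_t firstIndex, indexCount, baseVertex, meshId; };

// CPU-side batching: bake baseVertex into the indices of many small meshes
// so one glDrawElements call renders all of them. Per-mesh transforms can be
// recovered in the vertex shader by reading meshId (stored here per index in
// a parallel buffer; other schemes work too) from an SSBO. With e.g.
// 48-vertex meshes this fills 64-wide wavefronts instead of launching one
// mostly-empty wave per mesh.
void buildMergedDraw(const std::vector<Mesh>& meshes,
                     const std::vector<uint32_t>& srcIndices,
                     std::vector<uint32_t>& dstIndices,
                     std::vector<uint32_t>& dstMeshIds)
{
    for (const Mesh& m : meshes) {
        for (uint32_t i = 0; i < m.indexCount; ++i) {
            dstIndices.push_back(srcIndices[m.firstIndex + i] + m.baseVertex);
            dstMeshIds.push_back(m.meshId);
        }
    }
    // Upload dstIndices/dstMeshIds once, then render everything with a
    // single glDrawElements(GL_TRIANGLES, dstIndices.size(), ...).
}
```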
 
Mantle and DX 12 are the first APIs that actually make it possible for the CPU to overload the command processor. Now that the command processor is an actual bottleneck (in some cases), I believe we will see quite rapid improvements. The same happened when tessellation became a benchmark feature.

Shouldn't this already be a well known problem thanks to console development that doesn't have to deal with the thick API?
 
Sounds like AMD are well aware of "Small Batches" & "Command Processor Bottlenecks"

- The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mantle Chief Architect

 
Sounds like AMD are well aware of "Small Batches" & "Command Processor Bottlenecks"
Command processor was not at all a bottleneck before GCN. Thanks to Mantle (and OpenGL 4.4 MDI) it can now be a bottleneck for GCN in some cases. Improving command processor handling of small batches is listed as a "Future HW Consideration" in those slides.
Shouldn't this already be a well known problem thanks to console development that doesn't have to deal with the thick API?
The old vec4+1 hardware in Xbox 360 is quite different from modern GPUs. The PS3 GPU didn't even have unified shaders. On modern GPUs you can eliminate a lot of state changes thanks to bindless resources, general purpose caches (allowing you to index efficiently into big buffers instead of the CPU preparing constant buffers of limited size), and many other improvements. Because binding changes are no longer needed at full frequency, draw calls become cheaper -> you can push more draws -> the command processor can become a bottleneck. Of course the rest of the GPU has also been getting wider at a rapid pace, so it is actually a valid use case to render considerably bigger numbers of objects.
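For example, here is a minimal sketch of GL_ARB_bindless_texture usage, which removes per-draw texture binds entirely (it assumes the extension is available; function and variable names are illustrative):

```cpp
#include <GL/glew.h>
#include <vector>

// With bindless textures, every texture gets a 64-bit handle that shaders
// can read out of a buffer, so glBindTexture drops out of the per-draw cost.
std::vector<GLuint64> makeResidentHandles(const std::vector<GLuint>& textures)
{
    std::vector<GLuint64> handles;
    for (GLuint tex : textures) {
        GLuint64 h = glGetTextureHandleARB(tex);
        glMakeTextureHandleResidentARB(h); // GPU may now sample via handle
        handles.push_back(h);
    }
    // Upload 'handles' into an SSBO/UBO; the shader then indexes it with a
    // material ID instead of relying on a bound texture unit.
    return handles;
}
```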
 
Doing more than one of those approximation computations per work item could be faster if a constant buffer containing the constants were used (instead of #define). That would eliminate most of those MOVs, which the compiler currently repeats for each invocation.
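A hedged guess at what that change might look like; the shader, coefficients, and names below are invented for illustration, shown as a C++ string for context:

```cpp
// With the #define form, polynomial coefficients become immediate MOVs that
// the compiler re-emits in every unrolled iteration. Sourcing them from a
// uniform (constant) buffer lets them be loaded once (scalar registers on
// GCN-style hardware) instead of being materialized per iteration.
const char* kApproxShader = R"GLSL(
layout(std140, binding = 0) uniform ApproxConstants {
    vec4 coeff; // was: #define C0 0.318309886 ... etc. (values invented)
};

float approx(float x) {
    // Same Horner evaluation; coefficients now come from the UBO rather
    // than from per-iteration immediates.
    return ((coeff.x * x + coeff.y) * x + coeff.z) * x + coeff.w;
}
)GLSL";
```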
 
I have no idea why Star Swarm uses so many draw calls. It may be separating work into draw calls for no good reason, and Nvidia has tuned for that.
Oxide's philosophy is one that emphasizes developer and artistic freedom, where existing methods of batching, combining texture resources, and minimizing state change can impinge on the ability to arbitrarily alter or add properties to arbitrary objects at arbitrary times, arbitrarily.
My suspicion on how Nvidia was able to claw back so much performance in DX11 is that most of the time this level of freedom is not utilized, leading to a large number of simple, identical calls that Nvidia could combine or pre-build.

Mantle and DX 12 are the first APIs that actually make it possible for the CPU to overload the command processor. Now that the command processor is an actual bottleneck (in some cases), I believe we will see quite rapid improvements. The same happened when tessellation became a benchmark feature.
I'm not enthused by the possibility that AMD improves its front end the way it has handled tessellation. There are still some baffling performance behaviours years into that.
 
My suspicion on how Nvidia was able to claw back so much performance in DX11 is that most of the time this level of freedom is not utilized, leading to a large number of simple, identical calls that Nvidia could combine or pre-build.
Nvidia seems to have a fast path for draw calls that do not change state. This is helpful if you are trying to push a huge number of identical draw calls on DX11. However, this is practically identical to a single multidraw call, but costs the CPU a lot more (even with perfect driver optimizations in place). This is an excellent solution for prototyping (when iteration time is more important than performance), but I wouldn't be comfortable releasing a game like that. They need to implement some kind of software batching approach (CPU or GPU) if they are going to release games using that engine for DX11 customers. Currently their tech seems practically useless for DX11 (if the final game is going to have similar content).
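A sketch of that contrast in OpenGL terms (illustrative only, and assuming an element array buffer is already bound): N identical draws versus the single multidraw the driver is effectively reconstructing:

```cpp
#include <GL/glew.h>
#include <vector>

// The pattern the post describes: N draws with zero state change between
// them. A driver can detect this and take a fast path, but the application
// still pays for N API calls on the CPU.
void drawNaive(GLsizei indexCount, int n)
{
    for (int i = 0; i < n; ++i)
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr);
}

// The equivalent the driver is effectively reconstructing: one multidraw.
// Same GPU work, but the CPU cost collapses to a single call.
void drawBatched(GLsizei indexCount, int n)
{
    std::vector<GLsizei> counts(n, indexCount);
    std::vector<const void*> offsets(n, nullptr); // all draws: index offset 0
    glMultiDrawElements(GL_TRIANGLES, counts.data(), GL_UNSIGNED_INT,
                        offsets.data(), n);
}
```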
 
Nvidia seems to have a fast path for draw calls that do not change state. [...] This is an excellent solution for prototyping (when iteration time is more important than performance), but I wouldn't be comfortable releasing a game like that.
Are we seeing this in extant games, as apparently NVidia has some kind of CPU-scaling advantage? Despite it being sub-optimal in terms of overall engine design?

Is there a chance this is a side effect of some engines being built upon NVidia as the primary target?
 
Are we seeing this in extant games, as apparently NVidia has some kind of CPU-scaling advantage? Despite it being sub-optimal in terms of overall engine design?

Is there a chance this is a side effect of some engines being built upon NVidia as the primary target?

You can see the difference between Nvidia and AMD in the graphs here:
http://www.g-truc.net/post-0666.html#menu
Under "X architectures behavior against small triangle count per draw call".

Now you can contrast this with an earlier investigation from Nvidia, on older hardware of course:
http://www.nvidia.com/docs/IO/8228/BatchBatchBatch.pdf

Reading between the lines of the GDC paper, Nvidia did pay attention to these issues. Just look at how different the behaviour is between now and back then.
AAA games are tightly tuned to bottlenecks; the presets are tuned to a specific hardware profile's bottlenecks. As you tune, you shift the optimization process (LODs, texture sizes, etc.) from local minimum to local minimum. If you only profile on Nvidia hardware, of course you'll find an Nvidia-specific local minimum (fill rate, triangle rate, z-buffer rate, geometry-shader rate, etc.).
 
Thanks Dave. I guess that was inevitable, and it's good that AMD are addressing it early on instead of dragging it out further.
 
OPEN Mantle is D E A D.

http://www.pcper.com/news/Graphics-...ight-Be-Dead-We-Know-It-No-Public-SDK-Planned

AMD's Mantle API in its "1.0" form is at the end of its life, only supported for current partners, and the publicly available SDK will never be posted.

...

AMD claims to have future plans for Mantle though it will continue to be available only to select partners with "custom needs."

So, it turns out that Mantle was never an OPEN standard (API never released) and will be a CLOSED API used only for select partners.

Mantle’s definition of “open” must widen. It already has, in fact. This vital effort has replaced our intention to release a public Mantle SDK, and you will learn the facts on Thursday, March 5 at GDC 2015.
AMD spinning closed API as open. Too funny. :D
 
It's not closed, it's a redefined open.

I do think that, at some point, they wanted it to be open. But an API whose claim to fame is closing an efficiency gap is doomed from the start to be short lived. Once the competition does the same, there's nothing left. They probably realized this quite a while ago but forgot to tell their fans (and The Scientist.)

Either way, we'll never know if DX12 was a reaction to Mantle or not, but if it was, let's remember Mantle as a catalyst of improvement.
 
We may as well wait for March 5 to see if any additional data comes out. That date is significant to almost everyone else as well.

If it goes as many think it will, then the change in direction would make sense.
That some partners could still find use for Mantle may point to custom designs whose development will be well advanced by the time the next-gen APIs AMD is pointing developers towards can launch, or possibly to an offer of more proprietary changes and tweaks outside the mainstream after that.
 
Anyone could have looked at AMD's marketshare and seen that Mantle could never coexist with DX12 and the next OGL. But Mantle served its purpose and I'm glad it existed.
 
It's not closed, it's a redefined open.

I do think that, at some point, they wanted it to be open. But an API whose claim to fame is closing an efficiency gap is doomed from the start to be short lived. Once the competition does the same, there's nothing left. They probably realized this quite a while ago but forgot to tell their fans (and The Scientist.)

Either way, we'll never know if DX12 was a reaction to Mantle or not, but if it was, let's remember Mantle as a catalyst of improvement.

Perhaps some of the non-technical events surrounding AMD and its own internal changes may have had an influence on the timing.

I do look forward to seeing what documentation will be opened up for Mantle. It might shed light on what else was going on in terms of features or goals that the Mantle effort was striving towards over the last year. More information is good, even should this be a reveal that is a bit more archeological than may have been planned at the outset.
 
So Mantle evolves into Vulkan. This is a best case scenario for AMD. Hope they can capitalize.
 