AMD Mantle API [updating]

There were some benchmarks Valve bragged about comparing L4D2 GL/DX. GL was around 12% faster on Windows. I'm guessing most of it is due to using NV GL extensions.

By reworking the GL renderer today (maybe also using modern extensions, dunno if bindless or not), as compared to the original D3D *9* version (talking about overhead..), optimized for 2009 hardware...
 
Not sure if you saw, but I mentioned this in my post as well. The issue is that this is the only example people have given of a case where you're completely not utilizing the ALU array: depth-only rendering. So there's a nice one-time boost we can get during shadow map rendering, but it's not a long-term performance amplifier per se. Also, the more power-constrained a GPU is, the less this will help.
That was just an easy example, because shadow map rendering is triangle and ROP bound, while the compute step is likely bandwidth and ALU bound. But this is not the only case where it helps to run two kernels (or a kernel + rendering) simultaneously. For example, if you have two kernels, one ALU bound and one BW bound, manually allocating the GPU resources between these two could bring a performance boost, since the bandwidth (and L2) is shared across the whole GPU, and thus the BW-hungry part can use more than half of it (let's say 65% of the BW) while the ALU-hungry one only uses 35%. Now if you allocate the CU (ALU) resources 50/50 between these two tasks, the BW-heavy task will finish in 1.3x time, and the ALU-heavy task will finish in 1.3x+0.35x = 1.65x time. That's a total time saving of 0.35x (17.5%). Of course a 50/50 split isn't even the best fit here, since you would want to allocate more CUs to the ALU-heavy task. But this is of course pure speculation, since we don't know yet how fine-grained the control over GPU resource allocation between concurrent tasks will be in Mantle.
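To make the arithmetic easier to follow, here's the same reasoning as a tiny toy model (purely my own sketch of the numbers above, nothing Mantle-specific; the 1.3x co-run time and the 50/50 CU split are just the assumed inputs from the example):

```cpp
#include <cstdio>

int main()
{
    // Assumed inputs from the example above, not measurements: each task
    // takes 1.0 units when run alone, the ALU-bound task gets 50% of the CUs,
    // and the BW-bound task finishes its co-scheduled run in 1.3 units.
    const double aluCuShare   = 0.5;
    const double bwTaskCoTime = 1.3;

    // While the BW task runs, the ALU task only progresses at its CU share,
    // then finishes the remainder at full speed once it has the whole GPU.
    double aluWorkDoneDuringCoRun = aluCuShare * bwTaskCoTime;              // 0.65
    double aluTotalTime = bwTaskCoTime + (1.0 - aluWorkDoneDuringCoRun);    // 1.65

    double sequentialTotal = 1.0 + 1.0;                                     // 2.0
    std::printf("concurrent %.2f vs sequential %.2f -> %.1f%% saved\n",
                aluTotalTime, sequentialTotal,
                100.0 * (sequentialTotal - aluTotalTime) / sequentialTotal); // 17.5%
    return 0;
}
```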
I really wasn't too happy with the answers given in the Q&A. I'm going to write it off to people not having thought a lot about it and being somewhat conditioned in their engine design thinking by the way APIs have always worked up to this point, but they really missed the point of why people are bringing up bindless in this context.

Bindless textures basically remove the last piece of state that changes at high frequency in engines and thus "breaks batches". sebbbi has mentioned this before, but if you want to, you can render at least an entire pass to a set of render targets with one draw call using bindless. Thus the overhead of draw calls is largely irrelevant... even DX today has quite acceptable overhead if you're only talking about tens of draw calls per frame.

(...)
I agree with you. Bindless is an enabler of GPU-driven rendering, not a minor performance-boosting feature. Yes, you can use it in a "wrong" way and get a minor 20% boost to your traditional CPU-driven pipeline. And that's fine if you want to submit draw calls using the CPU. But submitting (and rebuilding) 100k draw calls per frame using the CPU is just a huge waste of CPU resources. You need to use multiple CPU threads just to push draw calls. Same stuff every frame, again and again (only changing the data set by a few percent every frame).

A GPU-driven pipeline on the other hand doesn't need more than a single indirect draw call (or a fixed number of them) to render a whole scene (sidestepping the whole draw call cost issue entirely). Tiled resources ("virtual textures") and bindless resources are absolutely necessary here, because the CPU doesn't know anything about the rendered scene (the GPU does the whole scene management). Thus the CPU cannot change any resources: vertex/index buffers, constant buffers or textures. The GPU needs to pull this data on its own. Without bindless resources and virtual paging ("virtual texturing" is an outdated term, since you can use a similar technique for meshes and constant data as well), GPU-driven rendering is pretty much impossible. Software-based virtual texturing (and virtual paging in general) is of course usable on all platforms, and that's a good solution for many cases (only textures need filtering across page borders and thus some shader trickery).
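As a rough illustration of what "a single indirect draw call" means in practice, here's a minimal DX11-style sketch (gpuCullingCS, argsUAV etc. are hypothetical names; the args buffer is assumed to be created with D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS and a UAV over it, and all the usual pipeline setup and error handling is omitted):

```cpp
#include <d3d11.h>

// Sketch only: the culling/scene-management shader writes the five
// DrawIndexedInstancedIndirect arguments into argsBuffer on the GPU, so the
// CPU never touches per-object data.
void RenderSceneGpuDriven(ID3D11DeviceContext* context,
                          ID3D11ComputeShader* gpuCullingCS,
                          ID3D11UnorderedAccessView* argsUAV,
                          ID3D11Buffer* argsBuffer,
                          UINT numObjectGroups)
{
    // GPU decides what gets drawn.
    context->CSSetShader(gpuCullingCS, nullptr, 0);
    context->CSSetUnorderedAccessViews(0, 1, &argsUAV, nullptr);
    context->Dispatch(numObjectGroups, 1, 1);

    // Unbind the UAV so the same buffer can be consumed as indirect arguments.
    ID3D11UnorderedAccessView* nullUav = nullptr;
    context->CSSetUnorderedAccessViews(0, 1, &nullUav, nullptr);

    // One fixed draw call per frame, no matter how much is visible.
    context->DrawIndexedInstancedIndirect(argsBuffer, 0);
}
```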

Bindless has some advantages over hardware PRT as well. The biggest advantage is the flexible data format. Every resource descriptor can point to data with a different format (different BC formats, floats, integers, packed 11-11-10, etc). PRT on the other hand has a fixed format. If some of your texture pages would need better quality (uncompressed or a different BC mode), that is not possible. Also, bindless resource descriptors can point to resources of different sizes (with different mip counts, etc), and there's proper border wrapping/clamping for the filtering as well. These are quite handy features.

I think the question panel gave some incorrect information about the bindless texture GPU cost. Bindless textures cost basically no GPU performance at all; it's not a trade-off when you use them correctly. This is because modern hardware has scalar units in addition to the SIMD units, and the scalar units can be used to fetch wave/warp-coherent data, such as constants. The scalar units can also do any kind of wave/warp-coherent calculation, such as multiplying two constants together, performing calculations that lead to branching decisions (branching has wave/warp granularity), or calculating addresses for scalar fetches... For the scalar unit it doesn't matter whether a resource descriptor is fetched from a hard-coded address (CPU-side binding) or a calculated address (bindless). It's a single extra scalar ALU operation + a single extra scalar fetch per wave/warp (= 64 threads on AMD / 32 threads on Nvidia). I have never seen a scalar-unit-bound shader in my life... so all this is very likely masked out completely = free.

Bindless would cost performance in cases where you have wave/warp coherency problems. A single GPU data fetch instruction fetches data for all (64 or 32) threads in the wave/warp at once from a single resource. If you want each thread to fetch data from a different resource, you need to serialize the execution (a similar penalty to branch divergence, I suppose). I don't know if this is entirely true for the most recent Nvidia hardware, since the OpenGL bindless prototype performance results I investigated several years ago were running on Fermi. That prototype showed a huge performance drop with incoherent bindless resource access patterns. However, if you can guarantee that every thread in the same wave/warp fetches data from the same resource (for example the resource index comes from a constant buffer, or you have manual control over it in a compute shader), you should have (very near to) zero GPU performance cost. If wave/warp granularity is not guaranteed, the shader compiler seems to do some black magic to cover for you (but at a big cost). I have to say that I don't have a clue how this works on Intel hardware (modern AMD and Nvidia hardware seem quite similar in this regard).

100k draw calls with Mantle seems like a huge number. The interesting question is: how many CPU cores do you need to dedicate fully to rendering in order to submit that many draw calls? And in the long run, how does the performance compare to a fully GPU-driven pipeline that completely sidesteps the draw call overhead issue (the bog-standard PC DirectX 11 API is enough for HUGE object counts)?
 
^ Nixxes, Anderson, and even the chief Mantle architect mentioned in their separate presentations that Mantle is not made only for GCN [I watched the video presentations from all of them, except Oxide]; it can be adapted to other vendors. Shame that internet journalists choose to ignore that fact.
 
^ Nixxes, Anderson, and even the chief Mantle architect mentioned in their separate presentations that Mantle is not made only for GCN; it can be adapted to other vendors. Shame that internet journalists choose to ignore that fact.
I don't think that, by now, anyone doubts that Mantle, from a purely technical point of view, can be adapted to other vendors, just the way it can be done with PhysX.

Whether that matters is something else entirely.
 
For example, if you have two kernels, one ALU bound and one BW bound, manually allocating the GPU resources between these two could bring a performance boost
Sure, but it's not clear you can do this in Mantle/GCN. The only confirmed case is being able to use the ALU array for compute while it is completely unused by 3D. It may be more general than this (since the key to all of this is that compute effectively has no state), but I'm just not sure.

I think the question panel gave some incorrect information about the bindless texture GPU cost. Bindless textures cost basically no GPU performance at all; it's not a trade-off when you use them correctly.
Right, I didn't go so far as to make that claim since I haven't had time to re-check the GCN docs, but my recollection is that this is already how texture accesses work on GCN, even with binding. i.e. the driver sets up a set of surface state structures, the scalar unit loads them into the shader as constants, and the majority of the information the sampler needs actually comes from GCN registers. Thus I don't get why they care about abstracting the "descriptor set" concept, since there's no reason descriptors have to be contiguous in memory or part of the same array indexing. Seems like needless complexity to me based on legacy thinking about bind points, but I'm guessing it's mainly to avoid having to overhaul the shading language as well.

Bindless would cost performance in cases where you have wave/warp coherency problems. A single GPU data fetch instruction fetches data for all (64 or 32) threads in the wave/warp at once from a single resource. If you want each thread to fetch data from a different resource, you need to serialize the execution (a similar penalty to branch divergence, I suppose).
Right, but those are exactly the things you require bindless to do at all. i.e. it's clear that we need bindless in the future, and GCN hardware is already effectively bindless, so why do we keep trying to bolt binding back on? Anyways, enough said on that.

That prototype showed a huge performance drop with incoherent bindless resource access patterns.
I tested at one point on Kepler and it seemed to fall off directly with divergence (i.e. 1/2 speed with 2 different handles, 1/4 with 4), as expected. I didn't spend a long time with it though, so I could have made a mistake in the test.

Honestly, in most cases this isn't an issue. It's sort of like MSAA... most cases are going to hit the fast path, so as long as that is full speed, having the harder cases run slower for a few pixels is not a problem, as long as they still work and generate the correct output.

100k draw calls with Mantle seems like a huge number.
It does and it's certainly faster than DX, but the 2-10k thing is slightly unrealistic as well. I re-ran some tests and on NVIDIA w/ just the immediate context you can get ~40-80k depending on which state changes you do (including updating some dynamic constants for each w/ discard). Obviously that requires a thread that does nothing other than submit to DX in addition to the thread the driver spawns, but I imagine the 100k figure is using at least two threads as well.

So while certainly faster and impressive regardless, if that's 100% use of an 8-core machine it's somewhat less ground-breaking. Would love to see some single-core numbers with details about state changes and WDDM resource allocations (assuming Mantle's allocate maps to that) to get a proper comparison. Or better yet, just release a public beta! :)

The interesting question is: how many CPU cores do you need to dedicate fully to rendering in order to submit that many draw calls? And in the long run, how does the performance compare to a fully GPU-driven pipeline that completely sidesteps the draw call overhead issue (the bog-standard PC DirectX 11 API is enough for HUGE object counts)?
Yep, we're on the same page - would love to know the answers to these questions too. I'm glad to see that they have also addressed a lot of the fundamental overhead, though. My fear when hearing the initial Mantle rumors was that they had just managed to move the slow code to more cores, which is not exactly ideal in the long run :D

By reworking the GL renderer today (maybe also using modern extensions, dunno if bindless or not), as compared to the original D3D *9* version (talking about overhead..), optimized for 2009 hardware...
They didn't use bindless or any advanced features; it is effectively a straight port.

^ Nixxes, Anderson, and even the chief Mantle architect mentioned in their separate presentations that Mantle is not made only for GCN [I watched the video presentations from all of them, except Oxide]; it can be adapted to other vendors.
Did you read the whole last page of this thread? We spent almost the entire thing discussing these claims and their implications.
 
I don't think that, by now, anyone doubts that Mantle, from a purely technical point of view, can be adapted to other vendors, just the way it can be done with PhysX.

Whether that matters is something else entirely.

I understand the reaction of the diehards; lately VR-Zone has started running some strangely slanted articles about AMD. Why lead with the $6 million deal with DICE (which was related to BF4)? I can imagine AMD has put far more money than that into paying for the DICE developers' work on Mantle. It's not as if people in this industry spend years developing an API for free (VR-Zone seems to think DICE developed it in a month and put a team of developers on Mantle free of charge, lol). There's a difference between paying developers to collaborate on a project, working with you and coding with you, and paying them to "just include it in a game".

The VR-Zone article is short; they only posted the draw call slide. When you compare their article with, say, the hardware.fr one already posted here, they don't describe Mantle at all, and their article contains maybe a tenth of what has been shown on other sites. For a technical site, that's a bit light. And that's without mentioning that they leave out a lot of the other studios who have confirmed or said they are interested in Mantle (the list is starting to get long, and it's surely just the start; of course, saying you're really interested is a different thing from actually developing your next game with it).

"Proprietary" is used in a strange way in all those articles. In reality Mantle can only work the way it does today for AMD, for different reasons, starting with the Microsoft relationship and ending with Nvidia, who certainly dream of complete games compatible only with their hardware (maybe that's why the discussions between developers and Nvidia about something similar ended so quickly). From a technical point of view, the only part that can't be ported to another vendor is what is tied to the architecture; the rest is easy. Nvidia would just be starting halfway down the road: take Mantle and code the rest for their architecture. It's not really a full port, since the core work is already done; it's not so different from using AMD optimization extensions plus Nvidia extensions (NVAPI) when developing on OpenGL. You have a core API, and then you need to adapt it to two different extension setups. As for Intel, today they are far more interested in leveraging their own instruction sets, which they can use across their CPUs and IGPs (AVX, and GRID 2 as an example).
 
It does and it's certainly faster than DX, but the 2-10k thing is slightly unrealistic as well. I re-ran some tests and on NVIDIA w/ just the immediate context you can get ~40-80k depending on which state changes you do (including updating some dynamic constants for each w/ discard). Obviously that requires a thread that does nothing other than submit to DX in addition to the thread the driver spawns, but I imagine the 100k figure is using at least two threads as well.
There is nothing unrealistic in the figures mentioned.
Today we're seeing the vast majority of DX11 games becoming CPU-limited around 3-10K/frame @ ~60fps. Even 10K is quite rare to see at that performance level, to be honest. Your mileage will vary depending on how powerful your CPU is, but obviously there is a wide range of CPUs out there with varying degrees of performance. I would certainly not consider your personal experiment to be reflective of how effective games are at pushing draw calls in real-world scenarios.
 
Today we're seeing the vast majority of DX11 games becoming CPU-limited around 3-10K/frame @ ~60fps.
Yes, but they are usually doing a lot more on that thread than just submitting DX commands, and the commands they submit are often sub-optimal. i.e. there is often performance left on the floor that could be optimized.

Your mileage will vary depending on how powerful your CPU is, but obviously there is a wide range of CPUs out there with varying degrees of performance.
Sure, but I don't imagine the Mantle figures were given on a dual-core CPU either, right? :) I guess that's really the question - is the claim here that any application with "typical" levels of optimization (if such a thing is even possible in Mantle) will get 100K batches easily on a "wide range of CPUs" in a full game, or is it that an optimized application on a high-end CPU will? If the latter, I claim that you can do better than 3-10k with DirectX as well, at least on NVIDIA's driver (there is a fair variance between drivers from different IHVs in this respect).

In any case, I'd love to see somewhat clearer details on what is being measured in each case (which state changes and so on). My only point was that if these numbers came from benchmarking optimized Mantle tech-demo code, then the equivalent DX code does better than quoted as well.

Ultimately I guess the proof will be in the pudding when we get Mantle applications with good DX paths to compare.
 
All the performance claims I have seen so far have been made in GPU-limited situations, not in CPU-bound situations dictated by thread or core count... we are not looking at software optimized for "real" multi-threading running on 8 cores versus 2 cores. You redirect the cheaper draw call submission to allow other usage and put the freed-up CPU time to work elsewhere (hence why it is hard to draw a line in the sand about performance). Of course, by optimizing the thread allocation you also favour processors with more cores/threads, but the gain should be more noticeable on lower-end CPUs.

When you speak about the fair variance between drivers, you also touch on the point of Mantle: don't rely on just one driver. If Nvidia released something close to Mantle, then you would compete on the same level, but today, with games, they compete on driver-level versus architecture-level API optimizations. That's why you get Blizzard games that run so much faster on Nvidia hardware, or Assassin's Creed, BF and Dirt that run so fast on AMD cards. With a low-level API you remove that possibility; Mantle is also here because today marketing dictates how a graphics engine performs on those brands (or games).

If tomorrow you had an API like Mantle for Nvidia and one for AMD, all this marketing bullshit about game studios optimizing badly for or against a brand would be over, because the games would need to be developed for both APIs, and simultaneously each API would be developed for its own hardware. If they don't play this game, they shoot themselves in the foot, because they have all the tools in their hands.
 
I did extensive DX11 draw call cost analysis several years ago with my Core 2 Quad + Radeon 5950. The result was that this setup could submit around 30k draw calls (at 60 fps) in optimal conditions (simple draws, prototype code, a linear cache-optimal list of work on the CPU side, no game logic). Only a single constant buffer was updated between the calls, and all state/shaders stayed identical. Most engines batch by shader/state, so these are realistically only updated for every 10th draw call or so. I also didn't benchmark changing textures between draw calls, since our engine has been using virtual texturing for the last 4 years now (the same textures are kept for all draws, nothing changes in between). The results were identical for the single-core case and for using all four CPU cores to push data to DX deferred contexts. DirectX deferred contexts were a huge letdown and didn't bring the promised gains in multithreaded rendering (zero gains were measured). I am glad that Mantle is finally going to fix this problem.
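For reference, the inner loop in that kind of test looks roughly like this (my own reconstruction of the setup described above, not the original benchmark code; constantBuffer is assumed to be a dynamic D3D11 buffer and the rest of the pipeline state is bound once up front):

```cpp
#include <d3d11.h>
#include <cstring>

// Sketch of the measured loop: update one dynamic constant buffer with
// DISCARD, then issue an otherwise identical draw. No shader, texture or
// other state changes in between.
void SubmitDraws(ID3D11DeviceContext* context, ID3D11Buffer* constantBuffer,
                 const float* perDrawData, UINT drawCount, UINT indexCount)
{
    for (UINT i = 0; i < drawCount; ++i)
    {
        D3D11_MAPPED_SUBRESOURCE mapped;
        context->Map(constantBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
        std::memcpy(mapped.pData, perDrawData + i * 4, 4 * sizeof(float)); // e.g. one float4 per draw
        context->Unmap(constantBuffer, 0);

        context->VSSetConstantBuffers(0, 1, &constantBuffer);
        context->DrawIndexed(indexCount, 0, 0);
    }
}
```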

To my knowledge, the 10k draw call number (for exceptional developers) in the Mantle slides is an accurate estimate for a PC game. You might reach 30k draw calls with PC DirectX when you are only doing simple things (you have a very simple material/animation model in a prototype), but real engines usually use more than one constant buffer per object, and the shader count can be quite large (usually leading to many other GPU state changes as well). The funny thing here is that the 10k number was true for 7-year-old DX10 cards, it was still true for 4-year-old DX11 cards, and it still holds for brand-new DX11.2 hardware. DirectX is quite a big bottleneck in this regard, and hardware improvements (Core2->Haswell + VLIW5->GCN) haven't changed things much.

It's good to notice that AMD is trying to please both "camps": those who still want to submit draw calls mainly using the CPU in their future engines, and those who are experimenting with fully GPU-driven pipelines. There are features in Mantle for both of these "camps".

However, I noticed that the single most important feature for GPU-driven rendering development was missing from the Mantle slides. The GCN hardware surely supports this feature, as AMD was a big contributor to the brand-new OpenCL 2.0 specification, and it has this feature...

Dynamic Parallelism
Device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.

Pretty please, include this feature in Mantle. It would make me very happy :)
 
The second is what Jurjen from Nixxes mentioned in his presentation: no stuttering due to runtime shader compilation or similar events that would usually be triggered by the DX runtime. (Some of it can be alleviated by pre-warming your shader cache in DX but in practice catching all shader permutations is difficult for an engine).
Definitely. Also, developer-controlled memory management should reduce stutters. In our older game we created/deleted lots of small resources (constant buffers) during the game, and periodically the GPU driver freaked out and spent 40+ ms doing something (blocking the rendering thread completely). I suspect the driver was reordering the memory contents, trying to keep things clean. With Mantle we could have implemented the constant buffer handling in a similar way to how we did it for the consoles (sidestepping the need to create/delete any resources, and saving memory by not having to care about constant buffer object memory alignment).
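For what it's worth, the console-style constant handling I'm referring to is basically a linear/ring allocator over one big persistent allocation, something along these lines (a hypothetical sketch, not actual Mantle API; GPU fencing on wrap-around is left out):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: one big GPU-visible allocation created once, per-draw constants
// sub-allocated linearly from it every frame. No runtime resource
// creation/deletion at all.
struct ConstantRing
{
    uint8_t* base = nullptr;   // persistently mapped memory, allocated up front
    size_t   size = 0;
    size_t   head = 0;

    // Returns an offset into the allocation that the command buffer can
    // reference directly. Real code must fence against the GPU before
    // wrapping over data that is still in flight.
    size_t Allocate(size_t bytes, size_t alignment)
    {
        size_t offset = (head + alignment - 1) & ~(alignment - 1);
        if (offset + bytes > size)
            offset = 0; // wrap around
        head = offset + bytes;
        return offset;
    }
};
```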
The third is multi adapter support, namely the ability for more than one GPU/APU to process graphics or compute workload for the same frame (as opposed to current Alternate Frame Rendering Multi-GPU solutions that increase input lag).
This is something that I have personally hoped to have in DirectX for a long, long time. I don't want to hurt my single-card performance (and console performance) by rejecting all data-reuse optimizations in order to make AFR work for the 1% minority. With direct control of both cards (and all the data movement between them), our games should finally start seeing frame rate improvements for dual-card setups. Even our old DX9 games didn't scale properly, because back then we used R2VB extensively (BTW an awesome feature extension by ATI) to mutate our vertex buffers (and that vertex data was used across multiple frames).
 
You might reach 30k draw calls with PC DirectX when you are only doing simple things (you have a very simple material/animation model in a prototype), but real engines usually use more than one constant buffer per object, and the shader count can be quite large (usually leading to many other GPU state changes as well).
I can vary all of these things in my testing easily. On the high end you can definitely get 80k if you update one or two dynamic constant buffers, and even if you additionally change some textures and shaders you can still get 20k without too much trouble. Like I hinted though, this varies a fair bit between drivers from different IHVs, and you'll note I quoted NVIDIA numbers :)

I'm not claiming that this is achievable on all GPUs/CPUs or anything, but I doubt the Mantle numbers were taken on low-end CPUs either, and they definitely require GCN :) I'm just curious if this was truly comparing full games on similar hardware (or tech demos to tech demos) and not games with varying levels of optimization that have been targeted at a wide range of hardware vs. Mantle running on something high-end.

Don't get me wrong, I'm not trying to downplay the improvement (and obviously any improvement is very welcome!). I'm just looking for more details to understand how it might manifest itself across various classes of hardware. Particularly it would be really nice to see a single core Mantle to DX comparison on equivalent state.

DirectX is quite a big bottleneck in this regard, and hardware improvements (Core2->Haswell + VLIW5->GCN) haven't changed things much.
No argument there, but there have been a few minor improvements. For instance, one can update constants in a small number of big batches and then use one or two simple binds with offsets to set the constants for a given draw (note that my tests were not doing this, as I wrote that code before the feature existed, but it makes a non-trivial difference). You will effectively end up doing that in Mantle anyway, so it's not as if it's more work to do it in DX as well. Frostbite does make use of this on Windows 8+ I believe, so it should be a reasonable comparison.
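For clarity, the path I mean is the D3D11.1 VSSetConstantBuffers1 family, roughly like this (assumed names, no error handling; the offsets and counts are specified in 16-byte constants and have to be multiples of 16 constants, i.e. 256-byte aligned):

```cpp
#include <d3d11_1.h>

// Sketch: all per-draw constants for a batch are written into one big dynamic
// buffer up front, then each draw just rebinds the same buffer at a different
// offset instead of mapping a separate buffer per draw.
void DrawWithConstantOffsets(ID3D11DeviceContext1* context,
                             ID3D11Buffer* bigConstantBuffer,
                             UINT drawCount, UINT indexCount)
{
    const UINT constantsPerDraw = 16; // 16 constants = 256 bytes per draw slot
    for (UINT i = 0; i < drawCount; ++i)
    {
        UINT firstConstant = i * constantsPerDraw;
        UINT numConstants  = constantsPerDraw;
        context->VSSetConstantBuffers1(0, 1, &bigConstantBuffer,
                                       &firstConstant, &numConstants);
        context->DrawIndexed(indexCount, 0, 0);
    }
}
```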

Pretty please, include this feature in Mantle. It would make me very happy :)
Yeah although I have to say... the CL version is a huge disappointment compared to CUDA. Spawn/fork without sync/join is closer to a fancy dispatch indirect than true nested parallelism :S Sync is the hard part of the two. Oh well... there's always OpenCL 3.0 I guess ;)
 
I want that API so I can resume working on my 3D engine. It wasn't worth working around OpenGL and D3D limitations, so I just halted development. Now that there's something that will let me do what I want (control the GPU almost like I can control the CPU), things are going to get quite fun! :)

So where's the SDK URL?!?
 
^ Nixxes, Anderson, and even the chief Mantle architect mentioned in their separate presentations that Mantle is not made only for GCN [I watched the video presentations from all of them, except Oxide]; it can be adapted to other vendors. Shame that internet journalists choose to ignore that fact.
Considering AMD is ignoring their own older architectures, there's probably a catch somewhere.
 
It could be that it requires a GPU whose page tables have been designed to align with the x86 page tables, and the hardware might need to be designed and validated to be able to pull commands from buffers in system memory space.
Prior GPUs weren't able to do so, or if there was some hardware capability, it was never exposed.

That leaves GCN as the architecture that is able to do so, currently.
Intel is promising GPU page tables that are closer to x86 in Gen 8.
The exact time frame for Nvidia isn't clear.
 
It may just be the descriptor sets/bindless stuff that requires GCN. It's not clear they could have supported that before (although I haven't read the 4xxx series specs in detail).
 
That's great news. I honestly expected Mantle support from major studios, but not from Rebellion. It seems like a very good sign for general adoption.
 
I'm honestly surprised by the rate of public adoption from studios so far. This is really great news! These engines span many types of games, so getting the most out of them without having to modify every single game is a big drawcard for studios.
 