I don't know what RISC cores you're referring to. I don't know the architecture of Maxwell's command processing. Is there parallelism there for command processing?
Are we seeing a combination of a parallelised command processor + coalescing?
Internally, the front ends run microcode on custom RISC or RISC-like processors. This was stated explicitly back in the VLIW days in a few AMD presentations, and it is also mentioned here:
http://www.beyond3d.com/content/reviews/52/7.
Later resources like the Xbox SDK discuss a microcode engine, though they also mention a prefetch processor that feeds into it. How new that is, I'm not sure.
There are vendors of inexpensive, licensable, and customizable RISC cores for this sort of thing. The other command front ends likely have something similar, possibly with support for the graphics-specific functionality stripped out.
Various custom units probably have them as well, like the DMA engines. UVD and TrueAudio are the most obvious examples of licensed Tensilica cores, though I forget who is the go-to company for the sort of cores used in the command processors.
It looks like the processor reads in a queue command, pulls in the necessary program from microcode, runs it, then fetches the next command. Within a queue, this seems pretty serial. There is parallelism for compute, at least between queues, but not within one.
The ACEs can be used freely in parallel, and they are a subset of command processing. I guess the more complex graphics state hinders doing the same for graphics, although AMD seems to have been moving towards tidying that up if preemption is coming.
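To make that concrete, here's a toy model of the serial-within-a-queue, parallel-across-queues behaviour described above. The packet format, opcodes, and microcode routines are invented for illustration and don't correspond to AMD's actual packet spec.

```cpp
// Toy model: each queue is drained strictly in order (fetch a packet, look up
// its microcode routine, run it, fetch the next), while separate queues, e.g.
// ones owned by different ACEs, can make progress independently.
#include <cstdio>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <thread>

struct Packet { std::string opcode; int payload; };

// "Microcode store": each opcode maps to the small routine the front end runs.
static const std::map<std::string, std::function<void(int)>> kMicrocode = {
    {"DISPATCH",   [](int groups) { std::printf("dispatch %d workgroups\n", groups); }},
    {"WRITE_DATA", [](int value)  { std::printf("write %d to memory\n", value); }},
};

// Serial within one queue: no packet starts before the previous one finishes.
void drain_queue(const std::string& name, std::queue<Packet> q) {
    while (!q.empty()) {
        Packet p = q.front();
        q.pop();
        std::printf("[%s] ", name.c_str());
        kMicrocode.at(p.opcode)(p.payload);
    }
}

int main() {
    std::queue<Packet> q0, q1;
    q0.push({"DISPATCH", 64});
    q0.push({"WRITE_DATA", 1});
    q1.push({"DISPATCH", 128});
    q1.push({"WRITE_DATA", 2});

    // Parallelism between queues, modelled here as two threads.
    std::thread t0(drain_queue, "queue0", q0);
    std::thread t1(drain_queue, "queue1", q1);
    t0.join();
    t1.join();
    return 0;
}
```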
Nvidia's solution was not described so plainly, but it has to do the same things, so there's some kind of simple processor hiding inside the front end. AMD and Nvidia at this time do make non-RISC cores, but simple cores have sufficed until recently.
I have no idea how you make the leap to an order of magnitude.
I admit that was probably an overestimate for rhetorical effect.
The 980's submission time is 3.3 ms, while the 290X's is 4.8 ms and 9.0 ms for DX12 and Mantle, respectively.
Somehow, 4.2 ms of Oxide's work (the gap between the 290X's Mantle and DX12 figures) is being fit into a fraction of that 3.3 ms, and a large portion of that 3.3 ms is going to be devoted to the actual submission process rather than to analyzing batch contents.
Either Nvidia is very good at doing what Oxide is doing, or its inherent overhead is low enough to leave that much leeway for the analysis, or some combination of the two.
"only allows one to be used by the developer". Let's see if there really is a second processor and if it's any use for games... It's why I used the word "apparently".
I'm pretty sure there is. Whether it can be readily exposed to games, I'm not sure. It would have benefits in a system that is split into two guest VMs.
So it seems to me that AMD's recently moved in the wrong direction with regard to driver interaction with GPU command processing. I'm not making excuses, merely pointing out that there's plenty of room to manoeuvre.
It is unfortunate that this is still such an acute problem, given how important freely mixing different workloads is to AMD's vision of the future.
It would be interesting to compare CPU load in a D3D12 benchmark like this with the GPU framerate held the same on both AMD and NVidia (a locked 20 fps, say), so that the application's draw-call rate is known and we can observe whether one driver is doing more work on the CPU.
If the AI load could be ramped up, it could also be approached from the other direction: keep loading the CPU until the frame rate suffers. It doesn't seem like anything happens with all the core cycles that get freed up.
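A minimal sketch of the locked-framerate comparison, assuming the cap is enforced in the application: submit_frame() is a stand-in for the benchmark's per-frame draw-call submission, and the timing is plain wall-clock on the submission thread rather than a proper per-thread CPU counter, so treat it as illustrative only.

```cpp
// Cap presentation at 20 fps and log how much of each 50 ms frame budget the
// CPU-side submission path consumes; with the draw-call rate pinned, a smaller
// "busy" share on one vendor suggests its driver does less CPU work.
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

// Placeholder for one frame's worth of draw-call submission.
void submit_frame() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
    constexpr auto kFrameBudget = std::chrono::milliseconds(50);  // 20 fps cap
    for (int frame = 0; frame < 20; ++frame) {
        const auto start = Clock::now();
        submit_frame();
        const auto busy = Clock::now() - start;

        std::printf("frame %2d: %6.2f ms busy of %.0f ms budget\n", frame,
                    std::chrono::duration<double, std::milli>(busy).count(),
                    std::chrono::duration<double, std::milli>(kFrameBudget).count());

        // Sleep off the rest of the frame so the frame rate stays locked.
        if (busy < kFrameBudget)
            std::this_thread::sleep_for(kFrameBudget - busy);
    }
    return 0;
}
```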