AMD Mantle API [updating]

AFAICT, Mantle only handles CPU side command submission and will not expose GDS, for instance.
GDS is exposed by an extension in OpenCL on GCN. I can't think of a reason why Mantle wouldn't too.

Why do you come to the conclusion that "Mantle only handles CPU side command submission"?

OCL2.0 has primitives for Queues. I wonder if AMD will map them to RAM or some on die SRAM.
Putting queued data in RAM versus on-die storage amounts to a huge difference in throughput. It's the difference between a GPU implementing an alternative rendering pipeline and a demonstration that such an alternative pipeline is possible. The latter is only interesting for research purposes if you have access to the former.
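To make the stakes concrete, here's roughly what one of those OCL 2.0 queue primitives (a pipe) looks like in kernel code - just a sketch; where the implementation puts the pipe's backing store is exactly the open question:

```c
// Sketch (OpenCL 2.0 C) of a producer/consumer pair over a pipe. Whether the
// runtime backs the pipe with off-die RAM or some on-die SRAM is exactly the
// question above -- the kernel code looks the same either way.
__kernel void producer(__global const int *src, write_only pipe int p)
{
    int v = src[get_global_id(0)];
    if (write_pipe(p, &v) != 0) {
        // pipe full: a real kernel would retry or spill to global memory
    }
}

__kernel void consumer(read_only pipe int p, __global int *dst)
{
    int v;
    if (read_pipe(p, &v) == 0)
        dst[get_global_id(0)] = v;
}
```

The pipe's packet size and capacity are fixed host-side at clCreatePipe time, which is why where the implementation chooses to put that buffer matters so much.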

It's like the early days of GPU compute, in which OpenGL pixel shaders were wrangled to do something that wasn't pixel shading. It works, yes. But it's not the same as full blooded compute.

Consoles have it easy: they can assume only one game is running at a time. A PC GPU has to be open to the possibility of more than one workload using it, so virtualizing on-die SRAM becomes tricky. It would be great if something like it were exposed, though.
I don't understand why it's tricky - it's a block of memory with a different policy (compared with L2 cache, say). And until WDDM/D3D catches up and embraces such memory, there's no need.

Mantle, at the very least, brings back some of the excitement that Larrabee engendered. I'm still mad at Intel for killing that off.
 
GDS is exposed by an extension in OpenCL on GCN. I can't think of a reason why Mantle wouldn't too.

Can you remind me what that extension is? Last time I asked, the only things I heard were that certain kinds of atomics were placed in GDS, and I'd expect the OpenCL 2.0 pipes would also be placed in GDS. But that's not really exposing GDS - both of those things are very restrictive uses of GDS, so I don't think they count.

The issue is that really using GDS fully would require a global barrier in OpenCL. That's a far bigger extension, and I haven't seen that either.
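To be concrete about what I mean: the only stand-in available today is a software barrier built from plain global atomics, and it's only valid when every workgroup is resident at once (persistent-threads style) - roughly:

```c
// Rough sketch of the software stand-in for a global barrier: plain global
// atomics plus busy-waiting. Only valid if ALL workgroups are resident on the
// GPU simultaneously (persistent-threads style), and the counter must be
// reset between uses -- which is why a proper extension would be a big deal.
void global_barrier(volatile __global int *arrived, int num_groups)
{
    barrier(CLK_GLOBAL_MEM_FENCE);          // whole workgroup gets here first
    if (get_local_id(0) == 0) {
        atomic_inc(arrived);                // announce this workgroup
        while (atomic_add(arrived, 0) < num_groups)
            ;                               // spin until all groups have arrived
    }
    barrier(CLK_GLOBAL_MEM_FENCE);          // release the rest of the group
}
```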

Am I overlooking something?
 
GDS is exposed by an extension in OpenCL on GCN. I can't think of a reason why Mantle wouldn't too.
AFAIK, GDS on OpenCL exposes only counters, not block RAM.
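From memory (so the details may be off), that's the cl_ext_atomic_counters_32 extension - an opaque counter object you can only increment or decrement, nothing like general block RAM:

```c
// Sketch from memory of the AMD counter extension (cl_ext_atomic_counters_32):
// an opaque, GDS-backed counter you can only increment/decrement -- handy for
// append buffers, but nothing like general-purpose block RAM.
#pragma OPENCL EXTENSION cl_ext_atomic_counters_32 : enable

__kernel void compact_nonzero(__global const int *in, __global int *out,
                              counter32_t n_out)
{
    int v = in[get_global_id(0)];
    if (v != 0) {                       // keep non-zero elements
        uint slot = atomic_inc(n_out);  // GDS-backed append index
        out[slot] = v;
    }
}
```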
Why do you come to the conclusion that "Mantle only handles CPU side command submission"?
Primarily from the optics of the reveal. They highlighted that they can get more draw calls, true parallel command submission. It does not appear that they will be exposing GPU side stuff, at least in the beginning. Or I am reading too much into it.

Putting queued data in RAM versus on-die storage amounts to a huge difference in throughput. It's the difference between a GPU implementing an alternative rendering pipeline and a demonstration that such an alternative pipeline is possible. The latter is only interesting for research purposes if you have access to the former.

I don't understand why it's tricky - it's a block of memory with a different policy (compared with L2 cache, say). And until WDDM/D3D catches up and embraces such memory, there's no need.

I guess the tricky points come when the GPU becomes a system-level co-processor. How do you handle multiple kernels using GDS, when some of them might be context-switched out at any time? What if you were doing producer-consumer and one of them gets context-switched? I am not sure what the right way of handling such situations would be.
 
Can you remind me what that extension is? Last time I asked, the only things I heard were that certain kinds of atomics were placed in GDS,
Those.

and I'd expect the OpenCL 2.0 pipes would also be placed in GDS.
Unless GDS expands into the MB range, tens of KB really can't do much when you have thousands of work items in flight at any point in time: 4 or 8 bytes of globally shared on-die memory per work item isn't much use for a pipeline buffer (you'll only get indices to off-die data, or very compact structures, in the queue - no good if a workgroup produces 100KB of data).
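In other words, the best a GDS-sized pipe can do is carry handles while the actual payload stays off-die - a sketch of the shape (BLOB_FLOATS and the scratch layout are made up for illustration):

```c
// Sketch: with only a handful of on-die bytes per item in flight, the pipe can
// carry 4-byte handles at best; the ~100KB-per-workgroup blob has to live in an
// off-die scratch buffer. (BLOB_FLOATS and the scratch layout are made up.)
#define BLOB_FLOATS (100 * 1024 / 4)    // ~100KB of floats per workgroup

__kernel void producer(write_only pipe uint handles,
                       __global float *scratch,        // off-die payload area
                       __global const float *src)
{
    uint slot = get_group_id(0);
    __global float *blob = scratch + slot * BLOB_FLOATS;
    for (uint i = get_local_id(0); i < BLOB_FLOATS; i += get_local_size(0))
        blob[i] = src[slot * BLOB_FLOATS + i];          // "produce" the blob
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        (void)write_pipe(handles, &slot);               // only the handle is queued
}
```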

But that's not really exposing GDS - both of those things are very restrictive uses of GDS, so I don't think they count.
Which is why I'm alluding to GDS plus MB of on-die memory as being the killer feature of XBone.

The issue is that really using GDS fully would require a global barrier in OpenCL. That's a far bigger extension, and I haven't seen that either.

Am I overlooking something?
With producer-consumer you don't use global barriers. I think taking the G, "global", in GDS literally (as a global analogue of LDS) is starting off on the wrong foot. E.g. if you have an algorithm that depends on an intermediate kernel that does pruning on a data structure produced by one kernel and consumed by another, you run these kernels simultaneously and use GDS atomics to manage the queue, getting performance from on-die storage of the intermediate data. ('Tis a pity that this doesn't work on current GCN cards, but there it is.)
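Roughly the shape I mean - a sketch only; on current cards the counters would be plain global atomics rather than GDS-resident, the ring would be off-die, and back-pressure/termination handling is elided:

```c
// Sketch of the plumbing: head/tail are atomically managed counters (the bit
// you'd want in GDS), the ring of intermediate records is the bit you'd want
// in large on-die memory (off-die on today's cards). Back-pressure, wrap-around
// and termination handling are elided to keep the shape visible; both kernels
// are assumed to be resident and running concurrently.
#define RING_SLOTS 4096

__kernel void producer(volatile __global int *head,
                       __global float4 *ring,
                       volatile __global int *ready,        // per-slot publish flag
                       __global const float4 *input, int n)
{
    int i = get_global_id(0);
    if (i >= n) return;
    float4 rec = input[i] * 2.0f;               // stand-in for real "produce" work
    int slot = atomic_inc(head) % RING_SLOTS;   // reserve a slot
    ring[slot] = rec;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    atomic_xchg(&ready[slot], 1);               // publish it
}

__kernel void consumer(volatile __global int *tail,
                       __global const float4 *ring,
                       volatile __global int *ready,
                       __global float *out, int n)
{
    int mine = atomic_inc(tail);                // claim the next record index
    if (mine >= n) return;
    while (atomic_add(&ready[mine % RING_SLOTS], 0) == 0)
        ;                                       // wait until it has been published
    float4 r = ring[mine % RING_SLOTS];
    out[mine] = r.x + r.y + r.z + r.w;          // stand-in for real "consume" work
}
```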

GPU workload management (amongst three kernels, say) is then driven by queue-derived metrics.

I have exactly this problem on a pair of kernels I'm working on right now - my intermediate data is too large for LS, but an on-die queue of even just a MB would be perfect with these kernels working as producer-consumer. Especially as the producer has a throughput of 1/100th of the consumer, roughly. But I would also need each work item to be able to read/write its blob within the queue. You could argue that last point is exactly why we need a multi-MB GDS. The problem is you then lose fast global atomics (requiring a trade-off of banking versus logic to perform the gamut of atomic operations on multiple MB of data). So an architecture of GDS + large on-die RAM is the sweet spot. Exactly what we see in XBone.

Honestly I'm not sure what kind of throughput-centric algorithms could make meaningful use of global barriers coupled with a small GDS (as it currently is) that wouldn't be better as producer-consumer (though small GDS + large on-die memory is where the fun starts). What am I missing?

If there was a way for GCN on desktop to lock L2 cache lines, to provide a fixed, large, on-die memory, that would be cool. It wouldn't be multi-MB, but it would be a start.

I'm intrigued by the problem you're considering where 64KB of GDS with support for a global barrier would be an effective solution. A single multi-modal (producer-consumer) persistent kernel dealing in small primitives is the only thing I can think of with current architectures. L2 caching on Kepler should give you something as fast as GDS - I honestly don't know what kind of performance GCN L2 cache would give you in this scenario.
 
Primarily from the optics of the reveal. They highlighted that they can get more draw calls, true parallel command submission. It does not appear that they will be exposing GPU side stuff, at least in the beginning. Or I am reading too much into it.
Sorry, I took your original statement to mean something incorrect, I suspect: as a general dismissal of intra-GPU capabilities rather than solely a reference to spawning work.

I guess the tricky points come when the GPU becomes a system-level co-processor. How do you handle multiple kernels using GDS, when some of them might be context-switched out at any time? What if you were doing producer-consumer and one of them gets context-switched? I am not sure what the right way of handling such situations would be.
Well I dare say, that's what L2 (ultimately backed by off die memory) is for. You could stripe GDS across the L2 partitions to maximise responsiveness during a swap (?). The GPU can track which kernels have GDS instructions (or which pages of kernel instructions have GDS instructions), so it can pre-swap GDS<->L2 as kernels switch. If multiple live kernels are GDS heavy, well, you might find your algorithm works as well using global atomics rather than GDS atomics.

If you're referring to multiple-contexts sharing a large on-die memory, you have the same. Only paged. And slow if heavily contended.

Whitepaper said:
The tessellator in GCN is up to 4× faster than the earlier generation, with larger parameter caches that can spill to the coherent L2 cache instead of the much slower off-chip memory
 
I think ESRAM in XBone makes that particular comparison foolish. Unless 290X is hiding its own ESRAM, PC space with Mantle is going to struggle against that particular architectural feature.
The bandwidth provided by the eSRAM is why I didn't put the Xbox One at the level of Cape Verde, which is where its other resource counts almost put it.
Microsoft is pretty keen on telling the world how this architecture allows it to reach parity with the PS4.

For most of the workload the consoles are tasked with, the 290X is going to be very, very good at it.
For things that it isn't good at, a gaming rig will have a CPU that will outclass the console CPUs, and depending on the particular task can outclass the resources they will be able to devote to compute.

That does leave a subset of tasks that could hop the PCIe bus, if this isn't an APU+dGPU system that AMD hopes will be more common.

I'm not entirely sure where GPU compute fits in the picture for Mantle, since the big marketing references were for things like draw calls.
As far as Microsoft or Sony's reaction on compute, they had absolutely no reason to be surprised.
Sea Islands introduced the expanded ACEs and their user-level compute queues. Bonaire at least internally has it, as does Kabini, and from the looks of things Kaveri has it as well.
Neither console maker would have reason to think that AMD would add the hardware to its non-console chips and then never expose it.


Only if the consoles aren't running game code.
In terms of CPU, GPU, and disk I/O, a good rig has a lot of brute force and TDP to burn.
Even if efficiencies on the consoles were massive and Mantle didn't come around to smooth cases where modern desktop hardware has trouble, that buys them maybe one upgrade cycle in PC terms, not 7-10 years as the console makers intend.
 
Algorithms that break up into producer-consumer stages are going to change gaming on XBone, because the inter-stage buffering is going to be effectively free.
The 290X's off-chip bandwidth exceeds the likely on-chip ESRAM bandwidth of Xbone by a fairly large margin. Now I'm all for big on-chip caches going forward, but anything the ESRAM can do can indeed be brute forced through GDDR on a high end GPU. It'll use more power, but that's not a large concern for discrete GPUs at the moment.

If you're concerned about CPU->GPU transfers over PCI-E, fine, but it's tough to make arguments about Xbone's ESRAM being very useful for that considering the size of it and the fact that the GPU tends to run enough behind the CPU to make fine-grained interaction problematic. Haswell's cache is big enough that it might be more viable, but those sorts of usages still have yet to be proven.

The low-level (not the directly-on-the-metal one) programming paradigm of current GPUs is quite similar across architectures; binary command lists, for example, are entirely incompatible in their content, but every architecture uses them, so offering the ability to manage machine programs and state is still conceptually very general. And it's something DirectX doesn't support - even compiled shaders remain abstract virtual-machine code until they hit the driver.
I don't think you're drawing a meaningful distinction between what the "portable" APIs do and what you're assuming (or know?) that Mantle does. To be clearer, the UMD in DirectX is already supposed to fulfill exactly the "minimal layer to encode command buffer" purpose that you describe. So if Mantle is going to do something much superior, what changes is it going to make to the programming model that will allow that UMD layer to execute much more efficiently than it does today?

Basically I see a few "big ticket" opportunities:
1) Multithreaded submit can obviously be made better than it is today by removing some stuff that makes it awkward (Map/Discard, creation/destruction of textures on the fly, etc). I imagine this is part of where their "9x" number comes from, but honestly if all you're doing is moving relatively slow code to more cores, that's not a particularly compelling story in the long term.
2) State blocks can be made to match a particular architecture more directly. There are still cases where drivers have to handle non-1:1 mappings of state and in some cases state being baked into shader code that are not ideal and necessitate checking those special cases all over the place. Specializing this for one architecture definitely helps, but it makes it less portable of course...
3) Move fine grained resource stuff (creation/destruction, hazard tracking, etc) out of the driver and into user space. This is likely the biggest potential for real improvement, but it's less an API issue than an OS issue. If you go down this road, you start to have to lie to the OS memory manager and that can have a variety of consequences to the user experience. i.e. ultimately this one needs to be solved in the OS too.

These things are really only going to improve the CPU overhead. AMD hasn't really said that they expect GPU-side performance improvements so it's possible that they indeed do not expose additional features there, but a lot of people have been talking as if they expect that side to go faster too. So one of the two groups is wrong :) Furthermore, there's nothing really on the GPU side that can't be done with GL/DX extensions so that's hardly an argument for needing a new API.
 
I would wait for November before starting the condescending talk.

From whom? I'm just waiting until tomorrow morning, until I've enough sobriety to contribute to this conversation (in whatever tiny way I can, of course)! These guys' opinions are exactly what you and I are looking for, so don't knock it; just appreciate it, mate, and feel privileged that you're part of it ;)
 
Honestly I'm not sure what kind of throughput-centric algorithms could make meaningful use of global barriers coupled with a small GDS (as it currently is) that wouldn't be better as producer-consumer (though small GDS + large on-die memory is where the fun starts). What am I missing?

I'd much rather AMD beef up the L2 in front of the memory controllers and stack eDRAM behind them. Even 290X, with its 512-bit bus, will only have 1MB of L2.

At 20 nm (next year?), 8MB of SRAM should be about 16 mm², cheap for a 350 mm² chip. Going that way is much, much better than a SW-managed scratchpad, imho. If we're going to have lots of SRAM, why not use it as a cache?
 
Well I dare say, that's what L2 (ultimately backed by off die memory) is for. You could stripe GDS across the L2 partitions to maximise responsiveness during a swap (?). The GPU can track which kernels have GDS instructions (or which pages of kernel instructions have GDS instructions), so it can pre-swap GDS<->L2 as kernels switch. If multiple live kernels are GDS heavy, well, you might find your algorithm works as well using global atomics rather than GDS atomics.

If you're referring to multiple-contexts sharing a large on-die memory, you have the same. Only paged. And slow if heavily contended.

Then why not cut out the middleman, and just beef up the L2 in size and use a large Crystalwell-style eDRAM, instead of a SW scratchpad?
 
Battlefield 4 and Frostbite 3 Will Support Both AMD Mantle and NVIDIA NVAPI APIs For PC Optimizations
AMD's Mantle API is currently being integrated into the Frostbite 3-based Battlefield 4, which is without a doubt the biggest title coming out this year after GTA V. Such is its fame that AMD even bundled their latest top-end Radeon R9 290X graphics card with the new title, and those of you lucky enough to pre-order the GPU now will be able to redeem the game at no additional cost. So back to the API talk: currently developers have to operate through the DirectX and OpenGL APIs to make games work, but these don't fully unleash the hardware capabilities of a PC, nor do they offer ease of development to developers.

The AMD Mantle API is being developed exclusively for GCN-enabled Radeon graphics cards. This would allow developers to dig deep into the metal to bring console-level optimizations through ease of programming and faster optimization across a coherent GCN chip architecture. This means that we would see better performance across the entire GCN-enabled AMD graphics card lineup, ranging from the top Radeon R9 290X to the bottom R7 250X.

But Frostbite 3 is more than that, as Johan Andersson, the lead behind the team at DICE working on AMD's Mantle API, said on his Twitter profile that Battlefield 4 will also feature NVIDIA's NVAPI support, as it did in Battlefield 3. While the optimizations may not be as great with NVIDIA's API as with AMD's Mantle, it's still worth noting that at least DICE is supporting both the Red and Green team graphics cards, which means PC optimizations at both ends. In addition, you will have the option to select between Mantle and DirectX 11 if you are using a GCN-enabled GPU. Frostbite 3 is on the road to becoming one of the new mammoth tech engines in the gaming industry, powering a portfolio of 15 AAA titles...
 
The bandwidth provided by the eSRAM is why I didn't put the Xbox One at the level of Cape Verde, which is where its other resource counts almost put it.
I'm talking about latency, not bandwidth. Think of this as a general purpose memory that, amongst other things, can do colour/z/stencil buffer caching, on a huge scale :cool:

In traditional forward GPUs, the combination of batched pixels and on-die colour/z/stencil buffer cache allows the ROPs to keep up in high fill-/blend-rate scenarios. If these (read/modify/)write operations weren't block-cached on-die, then GPUs would need far far more off-die bandwidth and far more ROPs to maintain fillrates.

In other words, ROPs are latency sensitive. It's just that pixel export needs only the tiniest amount of cache to make this operation comfortably fast.

This is why I've been making the comparison with Crystalwell. It's bandwidth/latency in combination that opens up new algorithms.

For most of the workload the consoles are tasked with, the 290X is going to be very, very good at it.
Until the second generation of games.

This will have an adoption curve similar to how it took developers a while to get used to being forced to write multi-threaded code for PS3/XB360.
 
The 290X's off-chip bandwidth exceeds the likely on-chip ESRAM bandwidth of Xbone by a fairly large margin. Now I'm all for big on-chip caches going forward, but anything the ESRAM can do can indeed be brute forced through GDDR on a high end GPU. It'll use more power, but that's not a large concern for discrete GPUs at the moment.
Not at the same latency it won't. XBone does the large, fully-programmable GDS that RecessionCone wants.

In compute, LS (or Larrabee's L2 cache slices) provide a combination of latency and bandwidth that no amount of brute-force off-die bandwidth can replicate.

XBone's on-die memory is global rather than local. It has an intermediate latency/bandwidth profile (compared with LS and GDDR5 in 290X). It sits nicely in the middle, while 290X, if it only has L2 cache, won't have enough of it to make a damn difference.
 
I'd much rather AMD beef up the L2 in front of the memory controllers and stack eDRAM behind them. Even 290X, with its 512-bit bus, will only have 1MB of L2.

At 20 nm (next year?), 8MB of SRAM should be about 16 mm², cheap for a 350 mm² chip. Going that way is much, much better than a SW-managed scratchpad, imho. If we're going to have lots of SRAM, why not use it as a cache?
This is why I'm mad at Intel for canning Larrabee for consumer graphics. Well, other things too.
 
NVAPI is a small but good utility library to control display setup and get access to some 10.1 functionality on 10.0 devices. It is _not_ a graphics API. It is more comparable to AMD's AGS library, which we also use.
 
Weren't 10.1 features what NVIDIA was paying developers to disable? So while paying UBI to disable 10.1, they were adding it through a proprietary API to their own cards.
 
The graphics API for NVIDIA NV1 is NVAPI. I guess it has gone through a few revisions / transformations. ;)
 