AMD Mantle API [updating]

Yes, if Nvidia created their own low-level API and developers started to support both Mantle and "Glide", the age of DirectX dominance on the PC would truly start to fade.

Now that's a good thought.

....
AMD could leave some openings for expanding the API when some future, different GPU architecture arrives. There is no point in creating Mantle if it will become obsolete in 5-6 years.

I'm not disagreeing with you, just pointing out the irony. There was a time when GLIDE and OpenGL ruled--3dfx vs. nVidia, respectively (and D3d had not yet been born or else was too immature to matter.) Most people railed against this and heralded D3d as the API that all IHVs could participate in, compete in, and support. Would you want to go back to the former standard? I'm not saying I wouldn't...and not saying I would, either...;)

Yes, and if D3d development had ended with DX7 then some other API would likely have usurped it years ago. But it didn't; D3d kept on progressing because of Microsoft's support and influence. There's no reason Mantle cannot continue to develop just as D3d has done for all of these years.

Here's the thing: is Microsoft getting tired of managing D3d? I think there are signs that Microsoft is getting tired of doing a lot of things it has traditionally done (which I think is a mistake, but that's another story), and D3d might just be one of those things the company would be more than happy to hand off to AMD--I say AMD because nVidia's relationship with Microsoft has been rocky for many a year and I doubt it will ever improve. It seems like most of the D3d advances have come from ATi/AMD, anyway, over the last decade. (nVidia is still deep in its proprietary cups with things like PhysX, CUDA, etc.)

However, even though nVidia stated publicly years ago that it did not officially approve of "the direction" for 3d gaming that Microsoft was charting with D3d--after it put nV3x behind it, nVidia has never had any trouble fully supporting D3d in the years since--inspired by AMD or not, as the case may be. I think that more than answers the question as to whether nVidia could adapt to Mantle as surely as it has adapted to D3d.

So, what if there's an under-the-table understanding between AMD and Microsoft that AMD will slowly take charge of the API side of the business over the next few years? After all, AMD is in the ideal spot to do so, supplying the silicon for both consoles as it does. Microsoft has two of the three positions cornered--the PC and the xBone--versus the PS4, so even if the PS4 does well it does not mean Microsoft won't do better. Ah...speculation at this point is premature!

But if you are a developer and you want to support xBone, PS4, and the PC through a single API, and at a low level nearer the hardware, what else is there except Mantle at the moment? That said, is Mantle any good? As others have asked, how are the developer tools and so on? I think that the idea of Mantle for AMD has legs--but whether the actual Mantle code does is a horse of a different color, certainly.
 
Speaking from my own ignorance and noobism:

Mantle to me seems to be the base to-the-metal driver layer. On top of it AMD has D3D and OGL, so they have decided to polish it and publish it as an API. Probably because the XBone uses it, or something very similar.

If I am not wrong, Nvidia uses something similar in its drivers: a common base with D3D/OGL on top.

What could Mantle do that is not possible with OGL extensions?

Will we see DirectX 12 as a high-level library on top of to-the-metal APIs?
 
In which case, I don't understand why you are arguing for a sw managed cache over a larger L2?
I'm arguing for a wodge of globally accessible on-die fast RAM. It's coming anyway, I just expect it'll be a while and until then XBone will have an advantage with whatever developers can find to do with it.

Large L2 or something else? Unfortunately the evidence (in graphics/games) is still thin on the ground, particularly as Intel ran off with Larrabee. Why did MS go the way it did?

Was it because it's more similar than different to XB360's EDRAM? Was it AMD's preference or advice? A sort of trial by AMD to see what happens? 32MB L3 too complex/costly for this launch timeframe? etc.

I'm not defending the choice of architecture for this wodge in XBone. It's there and unignorable. Crystalwell is a bold step, one that's theoretically always there "making everything faster" even when not coded to be dependent upon a wodge of on-die memory.
 
Where would you put the L3 on a GPU: break up the L2 into multiples and have the L3 be central, sitting before the memory controllers in some kind of banked configuration, or put separate caches inside each memory controller?

The last seems the easiest to me; it makes it easy to scale out the uarch, but you won't get the power savings you would if the L3 were closer to the execution units.
 
I'm arguing for a wodge of globally accessible on-die fast RAM. It's coming anyway, I just expect it'll be a while and until then XBone will have an advantage with whatever developers can find to do with it.

Large L2 or something else? Unfortunately the evidence (in graphics/games) is still thin on the ground, particularly as Intel ran off with Larrabee. Why did MS go the way it did?

Was it because it's more similar than different to XB360's EDRAM? Was it AMD's preference or advice? A sort of trial by AMD to see what happens? 32MB L3 too complex/costly for this launch timeframe? etc.
IIRC, the ESRAM can be configured as a hw managed or a sw managed cache, so those tradeoffs don't really apply.
 
I'm talking about latency, not bandwidth. Think of this as a general purpose memory that, amongst other things, can do colour/z/stencil buffer caching, on a huge scale :cool:
There are hints of a future DF article that might go into detail on how the eSRAM is used in a compute scenario.
This might give some answers to my questions concerning the eSRAM's latency numbers.
 
Not at the same latency it won't. XBone does the large, fully-programmable GDS that RecessionCone wants.
Sure, but there aren't a lot of graphics algorithms that are going to be able to rely heavily on that. For 3D stuff, you almost need something like pixel sync to make it useful (i.e. to make the latency relevant). In compute you could argue that if it's coherent enough for atomics and such, there could be cases that could benefit as well, but people typically try to avoid writing latency-sensitive GPU code because it also tends to be code that doesn't scale well to wider architectures. I agree that there's an area in the middle there with some potentially valuable algorithms, but I don't think the Xbone is going to really pull any super-fancy tricks in practice that won't be implementable on other platforms.

IIRC, the ESRAM can be configured as a hw managed or a sw managed cache, so those tradeoffs don't really apply.
Are you sure it can work as a HW cache (with what coherency)? That'd be fairly big news if so, and IMHO make it a lot more useful.
 
It's certainly good news that the transaction-level API has materialized. AMD has been talking about this goal for a couple of years now as part of its heterogeneous computing strategy, if I am not mistaken.

It will be interesting to see how the iGPU resources can finally be exploited in tandem with a dGPU. But perhaps this won't really take off before the memory bandwidth gap of APUs is taken care of. The potential of interconnected DRAM ICs seems huge and tangible to me. High-cost mass production has already been accomplished, AFAIK. I wonder if an investment collaboration between AMD and GLOBALFOUNDRIES regarding TSVs could really put them in an even more advantageous position there.

What are the chances of an open transaction-level API? But don't we all like to imagine there are no countries...
 
If anyone is interested, these are my draw-call performance tests from the CE3 SDK:

Video - http://youtu.be/GrSpm2AZWVU (draw calls are listed as DP 3rd row from the top)

Results, not from video, but from a little more precise testing:
300 draw calls - 105 fps
2100 draw calls - 104 fps
3000 draw calls - 103 fps
4000 draw calls - 101 fps
5000 draw calls - 91 fps
6000 draw calls - 83 fps
7000 draw calls - 75 fps
9000 draw calls - 65 fps
13000 draw calls - 49 fps
17000 draw calls - 41 fps
20000 draw calls - 37 fps
on stock i5 2500k and GTX 560.
 
If anyone is interested, these are my draw-call performance tests from the CE3 SDK:

Video - http://youtu.be/GrSpm2AZWVU (draw calls are listed as DP 3rd row from the top)

Results, not from video, but a little more precise testing:
300 draw calls - 105 fps
2100 draw calls - 104 fps
..
on stock i5 2500k and GTX 560.

Will be interesting to see those performance figures on a GCN card before and after Mantle!
 
If anyone is interested, these are my draw-call performance tests from the CE3 SDK:

Video - http://youtu.be/GrSpm2AZWVU (draw calls are listed as DP 3rd row from the top)

Results, not from video, but a little more precise testing:
300 draw calls - 105 fps
2100 draw calls - 104 fps
..
That's not representative of modern OpenGL/Direct3D. You can easily draw 20k unique objects with <5 ms of overhead when using indirect drawing and bindless resources.
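
For what it's worth, here's a minimal, untested sketch of the kind of submission path I mean, assuming GL 4.3+ with ARB_multi_draw_indirect and a loader like GLEW (the bindless texture handles from ARB_bindless_texture would be fetched in the shader, e.g. via gl_DrawID from ARB_shader_draw_parameters, which isn't shown). The function and struct names are just illustrative:

[code]
// Untested sketch: N objects packed into shared vertex/index buffers are
// submitted with ONE API call instead of N glDrawElements calls, so the
// per-draw CPU/driver overhead is paid once.
#include <GL/glew.h>
#include <vector>

struct DrawElementsIndirectCommand {
    GLuint count;          // indices for this object
    GLuint instanceCount;  // 1 = still one unique object per command
    GLuint firstIndex;     // offset into the shared index buffer
    GLint  baseVertex;     // offset into the shared vertex buffer
    GLuint baseInstance;   // handy as a per-draw material/texture index
};

// Assumes the VAO, shared buffers and program are already bound.
void submitScene(const std::vector<DrawElementsIndirectCommand>& cmds)
{
    GLuint indirectBuf;
    glGenBuffers(1, &indirectBuf);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 cmds.size() * sizeof(DrawElementsIndirectCommand),
                 cmds.data(), GL_STATIC_DRAW);

    // One CPU call, cmds.size() draws on the GPU.
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr,               // offset 0 into bound buffer
                                (GLsizei)cmds.size(),
                                0);                    // tightly packed commands
}
[/code]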
 
This is not about unique objects, instancing and what not. This is about the number of actual draw calls, which require "some" housekeeping in the UMD for each and every single one (which requires CPU time, which happens to be the topic of this discussion, not the complexity of the scene).
 
@Still I don't see how you can have fewer DrawInstancedIndirect() than DrawInstanced() calls. Each indirect call just handles one draw. If you had 20k normally, you have 20k indirect. Bindless resources aren't available in DirectX, even though GCN would allow you to use them extensively--with luck you might even pass them from fixed-function stage to fixed-function stage, or store them in any place you want.
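
To put the same point in code: a rough, untested D3D11 sketch (names like argsBuffer are illustrative) showing that going indirect doesn't reduce the number of context calls, since D3D11 has no multi-draw-indirect entry point:

[code]
// Rough sketch: 20k objects submitted "indirectly" in D3D11 is still 20k
// context calls, each paying its own user-mode-driver housekeeping on the CPU.
#include <d3d11.h>

void submitSceneIndirect(ID3D11DeviceContext* ctx,
                         ID3D11Buffer* argsBuffer,  // packed draw arguments
                         UINT drawCount)
{
    // Each argument record is five 32-bit values: IndexCountPerInstance,
    // InstanceCount, StartIndexLocation, BaseVertexLocation, StartInstanceLocation.
    const UINT strideBytes = 5 * sizeof(UINT);

    for (UINT i = 0; i < drawCount; ++i)
        ctx->DrawIndexedInstancedIndirect(argsBuffer, i * strideBytes);
}
[/code]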
 
If you come across the source, I'd be interested in reading it.
I haven't seen that being offered, specifically.
This is also not taking into account that upgrading GPUs and non-traditional processors to a peer-like status in AMD's heterogeneous platform strains a number of assumptions made when using the terms of what has been a traditionally CPU-based discussion.
 
300 draw calls - 9.5ms
2100 draw calls - 9.6ms
3000 draw calls - 9.7ms
4000 draw calls - 9.9ms
5000 draw calls - 11.0ms
6000 draw calls - 12.1ms
7000 draw calls - 13.3ms
9000 draw calls - 15.4ms
13000 draw calls - 20.4ms
17000 draw calls - 24.4ms
20000 draw calls - 27.0ms

I converted that to usable time units for you, since I'm allergic to FPS.
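
For anyone who wants to reproduce the conversion (and the implied marginal cost per extra draw call), a trivial sketch using the endpoints of the table above:

[code]
// frame time (ms) = 1000 / fps; marginal cost per draw call derived from the
// 300-call and 20000-call data points posted above.
#include <cstdio>

int main()
{
    const double msLow  = 1000.0 / 105.0;   // 300 draw calls   -> ~9.5 ms
    const double msHigh = 1000.0 / 37.0;    // 20000 draw calls -> ~27.0 ms
    const double perCallUs = (msHigh - msLow) * 1000.0 / (20000 - 300);

    std::printf("%.1f ms -> %.1f ms, ~%.2f us per extra draw call\n",
                msLow, msHigh, perCallUs);  // ~0.89 us on this CPU/driver
    return 0;
}
[/code]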
 