Algorithms that break up into producer-consumer stages are going to change gaming on XBone, because the inter-stage buffering will be effectively free.
The 290X's off-chip bandwidth exceeds the likely on-chip ESRAM bandwidth of Xbone by a fairly large margin (roughly 320 GB/s of GDDR5 versus the ~109-204 GB/s quoted for the ESRAM). Now I'm all for big on-chip caches going forward, but anything the ESRAM can do can indeed be brute-forced through GDDR on a high-end GPU. It'll use more power, but that's not a large concern for discrete GPUs at the moment.
If you're concerned about CPU->GPU transfers over PCI-E, fine, but it's tough to argue that Xbone's ESRAM is very useful there given its size (32 MB) and the fact that the GPU tends to run far enough behind the CPU to make fine-grained interaction problematic. Haswell's cache is big enough that it might be more viable, but those sorts of usages have yet to be proven.
The low-level (not the directly on-the-metal one) programming paradigm of current GPUs is quite similar across architectures; binary command lists, for example, are entirely incompatible in their content, but every architecture uses them, so exposing the ability to manage machine programs and state is still conceptually very general. And that's something DirectX doesn't support; even compiled shaders remain abstract virtual-machine bytecode until they hit the driver.
I don't think you're drawing a meaningful distinction between what the "portable" APIs do and what you're assuming (or know?) that Mantle does. To be clear, the UMD in DirectX is already supposed to fulfill exactly the "minimal layer to encode command buffers" purpose that you describe. So if Mantle is going to do something much superior, what changes is it going to make to the programming model that will allow that UMD layer to execute much more efficiently than it does today?
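To make the comparison concrete, here's a toy sketch of what that "minimal layer to encode command buffers" amounts to: take API-level calls, validate a little, and pack architecture-specific command words into a buffer. All types, opcodes, and functions here are invented for illustration, not any real driver's format; the point is just that the job itself is thin, and whatever Mantle does is presumably a leaner version of this same job rather than a different kind of work.

```cpp
// Hypothetical UMD-style encoding layer. HW_* opcodes and the word layout
// are made up; real command formats are architecture-specific and opaque.
#include <cstdint>
#include <initializer_list>
#include <vector>

enum HwOpcode : uint32_t { HW_SET_PIPELINE = 0x01, HW_DRAW = 0x02 };

struct CommandBuffer {
    std::vector<uint32_t> words;
    void emit(HwOpcode op, std::initializer_list<uint32_t> args) {
        words.push_back(op);
        words.insert(words.end(), args.begin(), args.end());
    }
};

// API-facing call: translate portable arguments into packed hardware words.
// The UMD in DirectX already sits in exactly this position.
void DrawIndexed(CommandBuffer& cb, uint32_t pipelineHandle,
                 uint32_t indexCount, uint32_t firstIndex) {
    cb.emit(HW_SET_PIPELINE, { pipelineHandle });
    cb.emit(HW_DRAW, { indexCount, firstIndex });
}
```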
Basically I see a few "big ticket" opportunities:
1) Multithreaded submit can obviously be made better than it is today by removing some of the stuff that makes it awkward (Map/Discard, creation/destruction of textures on the fly, etc.); see the first sketch after this list. I imagine this is part of where their "9x" number comes from, but honestly, if all you're doing is moving relatively slow code onto more cores, that's not a particularly compelling story in the long term.
2) State blocks can be made to match a particular architecture more directly. There are still cases where drivers have to handle non-1:1 mappings of state, and in some cases state gets baked into shader code; neither is ideal, and both necessitate checking for those special cases all over the place (see the second sketch below). Specializing this for one architecture definitely helps, but of course it makes the API less portable...
3) Move fine-grained resource stuff (creation/destruction, hazard tracking, etc.) out of the driver and into user space (see the third sketch below). This is likely the biggest potential for real improvement, but it's less an API issue than an OS issue. If you go down this road, you start having to lie to the OS memory manager, and that can have a variety of consequences for the user experience, i.e. ultimately this one needs to be solved in the OS too.
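On point 1, here's a minimal D3D11 sketch of today's multithreaded submit path, assuming device and resource setup happen elsewhere. Recording on deferred contexts parallelizes fine; it's operations like Map with WRITE_DISCARD that force hidden buffer renaming and tracking inside the runtime/driver:

```cpp
// D3D11 multithreaded recording. Assumes `perFrameCB` was created with
// D3D11_USAGE_DYNAMIC / D3D11_CPU_ACCESS_WRITE somewhere else.
#include <d3d11.h>

ID3D11CommandList* RecordOnWorkerThread(ID3D11Device* device,
                                        ID3D11Buffer* perFrameCB) {
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // Map/Discard on a deferred context: the driver must "rename" the
    // buffer (allocate fresh backing memory) because it can't know when
    // the GPU finishes with the old one. This hidden allocation and
    // tracking is the kind of cost a thinner API could hand to the app.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    deferred->Map(perFrameCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    // ... fill mapped.pData with this thread's constants ...
    deferred->Unmap(perFrameCB, 0);

    // ... record draws on `deferred` ...

    ID3D11CommandList* list = nullptr;
    deferred->FinishCommandList(FALSE, &list); // bake into a command list
    deferred->Release();
    return list; // main thread submits via ExecuteCommandList(list, FALSE)
}
```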
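On point 2, a hypothetical driver-side translation (all names invented) showing the non-1:1 problem. Most state maps straight into register fields, but a feature the hardware doesn't implement in fixed-function state, alpha test in this example, has to be folded into the shader, and that special case then has to be rechecked on every state change:

```cpp
// Invented driver internals illustrating state that doesn't map 1:1.
#include <cstdint>

struct ApiState { bool alphaTestEnabled; uint32_t blendOp; };
struct HwRegisters { uint32_t blendControl; };

// Not representable in registers on this (imaginary) GPU: pick a shader
// variant with a discard appended instead. This is the check that ends up
// scattered all over a portable driver.
uint32_t SelectShaderVariant(const ApiState& s, uint32_t baseShader) {
    return s.alphaTestEnabled ? (baseShader | 0x80000000u) : baseShader;
}

// The easy, 1:1 part: API enum -> packed register field.
HwRegisters TranslateState(const ApiState& s) {
    HwRegisters hw = {};
    hw.blendControl = s.blendOp & 0x7;
    return hw;
}
```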
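On point 3, a sketch (again hypothetical, not any real API) of user-space resource management via sub-allocation: "creation" becomes a pointer bump and hazard tracking becomes the app's own fence bookkeeping. Note the OS problem: the whole arena is one opaque allocation the video memory manager can no longer page or prioritize per-resource, which is exactly the lying-to-the-OS issue above.

```cpp
// App-side linear arena carved out of one big driver allocation.
#include <cassert>
#include <cstdint>
#include <deque>

struct Block { uint64_t offset, size, retireFence; };

class LinearArena {
    uint64_t head_ = 0, capacity_;
    std::deque<Block> pending_;  // blocks possibly still in use by the GPU
public:
    explicit LinearArena(uint64_t capacity) : capacity_(capacity) {}

    // "Create" a resource: a pointer bump. No driver call, no kernel
    // transition, no per-resource OS bookkeeping.
    uint64_t alloc(uint64_t size, uint64_t currentFence) {
        assert(head_ + size <= capacity_);
        uint64_t off = head_;
        head_ += size;
        pending_.push_back({off, size, currentFence});
        return off;
    }

    // "Destroy"/recycle: hazard tracking is now the app's job. A block is
    // reusable only once the GPU fence has passed the frame that used it.
    void retire(uint64_t completedFence) {
        while (!pending_.empty() &&
               pending_.front().retireFence <= completedFence)
            pending_.pop_front();
        if (pending_.empty()) head_ = 0;  // everything drained; rewind
    }
};
```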
These things are really only going to reduce CPU overhead. AMD hasn't really said that they expect GPU-side performance improvements, so it's possible they indeed don't expose additional features there, but a lot of people have been talking as if they expect that side to go faster too. So one of the two groups is wrong.
Furthermore, there's nothing on the GPU side that can't be done with GL/DX extensions, so that's hardly an argument for needing a new API.