Just how limited is DirectX compared to a console API?

Ron Burgundy

Newcomer
Hello. I'm sure you have had a few threads on this exact topic, but I haven't come across a definitive answer to this subject. There are many things I am curious about in terms of API control and how much of it is lost with DirectX.

There are a lot of people out there who throw around the optimization card. But just how far can full hardware control take performance over D3D11? I understand that single-spec coding is a huge factor too.

And if you could, please elaborate on some of the things a low level API will allow the developer to do.

Thanks.
 
Look at the Microsoft DirectX 12 presentations. They talk about the Windows driver model and how they managed to drastically lower the API/driver overhead. Console APIs obviously are even lower level than that, but the presentations give you a good picture of the driver/API overhead on PC.
 

Thanks for the reply. Erm.. do you know which specific presentation outlines the prior D3D bottlenecks? I heard that there are a few factors other than the API too. That even DX11 could perform better on single-spec hardware. And that the actual performance increase of a low level API compared to a high level API is around 40% on the same hardware. Is that 40% on the GPU side? And then there is memory management. How does a low level API affect the use of memory?

Questions lol.
 
Low level APIs usually have fully manual resource management. You get a big contiguous chunk of memory and need to implement your own memory allocators to split that memory according to your needs. Also, if you need dynamic resource updates, you need to manually double buffer your resources (so that you do not accidentally modify a resource that is currently being read by the GPU). This is faster and lets you control your memory layout better. Often with manual memory management you can pack resources tighter. The driver might, for example, allocate a whole page (4096 bytes or more) for each resource to combat memory fragmentation (it is easy to swap whole pages in/out). That is quite a bit of overhead for small resources such as constant buffers.
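To make that concrete, here is a minimal C++ sketch of the kind of thing a title does on top of one big mapped chunk: a bump allocator plus two per-frame regions for double buffering. All names and sizes here are made up for illustration; the real console APIs expose their own mapping, alignment and lifetime rules.

    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    // Bump allocator over one big chunk of GPU-visible memory.
    // No per-resource page padding; resources are packed back to back.
    class LinearAllocator
    {
    public:
        LinearAllocator(uint8_t* base, size_t size) : base(base), size(size), offset(0) {}

        void* allocate(size_t bytes, size_t alignment)   // alignment must be a power of 2
        {
            size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
            assert(aligned + bytes <= size);
            offset = aligned + bytes;
            return base + aligned;
        }

        void reset() { offset = 0; }                     // throw the whole frame away at once

    private:
        uint8_t* base;
        size_t   size;
        size_t   offset;
    };

    static uint8_t chunk[2 * 1024 * 1024];               // pretend this is the mapped chunk

    // Two per-frame regions: while the GPU reads frame N's data, the CPU fills
    // frame N+1's region, so dynamic resources are never overwritten mid-read.
    LinearAllocator frameAllocators[2] = {
        LinearAllocator(chunk, sizeof(chunk) / 2),
        LinearAllocator(chunk + sizeof(chunk) / 2, sizeof(chunk) / 2),
    };

    void beginFrame(int frameIndex)
    {
        frameAllocators[frameIndex & 1].reset();
    }

    void* allocateDynamicConstants(int frameIndex, size_t bytes)
    {
        return frameAllocators[frameIndex & 1].allocate(bytes, 256);  // typical constant buffer alignment
    }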
 
I see. So with D3D you can't specifically manage memory blocks? Does this essentially mean that there is a lot of memory that could possibly be inaccessible because of D3D?

Also, what would this mean for performance in the average game? Perhaps less memory needed to run games on a console?
 
So the myth of "programming to the metal" isn't real, and it's not about coding with the most obscure functions of a chip but just about special memory management?

That's tough to accept; I always had that view of low level programming, like there was some mystery to it.
 

If you don't have low level access, there are still things you can do. For example, for a row-major array foo[x][y] it's faster to iterate over the last index (foo[x][y+1], walking along a row) than over the first one (foo[x+1][y]), because consecutive elements of a row sit next to each other in memory and stay in cache, while striding across rows keeps hitting memory.
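A tiny C++ sketch of that point; both functions do the same work and only the loop order differs.

    // Row-major 2D array: foo[x][y] and foo[x][y+1] are neighbours in memory,
    // while foo[x][y] and foo[x+1][y] are a whole row apart.
    const int X = 1024;
    const int Y = 1024;
    static float foo[X][Y];

    float sumCacheFriendly()
    {
        float sum = 0.0f;
        for (int x = 0; x < X; ++x)
            for (int y = 0; y < Y; ++y)   // inner loop walks contiguous memory, stays in cache
                sum += foo[x][y];
        return sum;
    }

    float sumCacheHostile()
    {
        float sum = 0.0f;
        for (int y = 0; y < Y; ++y)
            for (int x = 0; x < X; ++x)   // inner loop strides by a full row, keeps missing cache
                sum += foo[x][y];
        return sum;
    }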

So the more you keep things in registers and cache, and the less you hit memory, the better. And if you can't manage that, hitting memory is still far better than hitting the hard drive.

There are also special processing units, SIMD for instance, which can handle math on 4 ints at a time. So instead of performing an operation between two arrays one element at a time and incrementing by 1, you can load 4 ints at once from both arrays and increment by 4.
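For example, with SSE2 intrinsics in C++ (a sketch; a real version would also handle the leftover elements when the count isn't a multiple of 4):

    #include <emmintrin.h>   // SSE2 intrinsics

    // out[i] = a[i] + b[i], processing 4 ints per iteration instead of 1.
    // Assumes count is a multiple of 4 for brevity.
    void addArraysSimd(const int* a, const int* b, int* out, int count)
    {
        for (int i = 0; i < count; i += 4)
        {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
        }
    }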

One way to look at it is that it's really just about reducing wasted cycles and letting the faucet run with the fewest impediments. For example, it's better to work with allocations that are powers of 2, so the processor only needs bit shifts for division and multiplication instead of running costlier operations.
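A small illustration of that, assuming a power-of-two block size (compilers do this themselves when the divisor is a compile-time constant, but it still matters when the size is chosen at run time):

    // With a power-of-two block size, divide and modulo turn into a shift and a mask.
    const unsigned BLOCK_SIZE  = 256;               // must be a power of 2
    const unsigned BLOCK_SHIFT = 8;                 // log2(BLOCK_SIZE)
    const unsigned BLOCK_MASK  = BLOCK_SIZE - 1;

    unsigned blockIndex(unsigned byteOffset)  { return byteOffset >> BLOCK_SHIFT; }  // byteOffset / BLOCK_SIZE
    unsigned blockOffset(unsigned byteOffset) { return byteOffset &  BLOCK_MASK;  }  // byteOffset % BLOCK_SIZE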

Reducing branches is a big one, but honestly I think this is generally crazy hard to do. I recall getting heavily penalized for creating objects at run-time, so I created all objects beforehand and iterated through them to check whether each object was enabled; if it was, I ran the update and the render for that object. Too many for loops and too many if/else statements.
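Roughly this shape (the names are made up, it's just the pattern): everything is allocated once up front, and "spawning" is just flipping a flag.

    #include <vector>

    struct GameObject
    {
        bool  enabled = false;
        float x = 0.0f, y = 0.0f;

        void update(float dt) { x += dt; }   // placeholder logic
        void render() const   { /* draw at (x, y) */ }
    };

    std::vector<GameObject> pool(1024);      // created once, before the game loop starts

    void spawn(float x, float y)             // no run-time allocation, just reuse a free slot
    {
        for (GameObject& obj : pool)
        {
            if (!obj.enabled) { obj.enabled = true; obj.x = x; obj.y = y; return; }
        }
    }

    void updateAndRender(float dt)
    {
        for (GameObject& obj : pool)
        {
            if (!obj.enabled)                // the per-object branch mentioned above
                continue;
            obj.update(dt);
            obj.render();
        }
    }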

But honestly thinking about it, there was probably a better way to do this but even now I'm not sure, I'm just an amateur. I made particle effects the same way LOL, they were moving textures exploding in random directions.

Memory management is the biggest one, because each game is unique in the problems it's trying to solve, so how the API handles moving data around is a big deal. But as important as memory management is, taking advantage of the hardware that's available matters just as much.

Looking back, I wish I had understood more about the hardware back when I was working on the game; regardless, it was still a great learning experience.
 
I was only roughly describing how memory management in general differs between high level and low level APIs. That is a surprisingly big part of the driver overhead on PC.

On PC DirectX (11.2) you can't even use all the features of modern GPUs. OpenGL 4.4, OpenCL 2.0 and CUDA support a much bigger set of Kepler and GCN features. Some features are ARB and some need vendor specific extensions. Examples: Kepler dynamic parallelism (the GPU can feed itself by spawning more compute kernels from kernels), a similar OpenCL 2.0 feature (GCN), multidraw (the GPU executes an array of draw calls; parameters and the draw call count come directly from a GPU buffer, allowing the GPU itself to set up a huge number of draw calls quickly without a CPU roundtrip), and bindless resources (the GPU doesn't need the CPU to bind resources to resource slots; it can directly read resource descriptors from programmable memory locations and use them to sample textures and read/write data to buffers). There are countless smaller vendor specific extensions to the shader language as well (such as AMD ballot and NVIDIA warp vote, which allow more efficient cooperation between threads).
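For instance, the multidraw path in OpenGL 4.3+ looks roughly like this (a sketch; it assumes a 4.3+ context and a function loader such as glad are already set up, and that the CPU or a compute shader has already filled the indirect buffer with commands):

    #include <glad/glad.h>   // assumes an OpenGL 4.3+ context and loader are already initialized

    // Matches the layout OpenGL expects for indexed indirect draws.
    struct DrawElementsIndirectCommand
    {
        GLuint count;          // indices per draw
        GLuint instanceCount;
        GLuint firstIndex;
        GLuint baseVertex;
        GLuint baseInstance;
    };

    // One CPU call submits a whole array of draws whose parameters live in GPU memory.
    void submitDraws(GLuint indirectBuffer, GLsizei drawCount)
    {
        glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                    nullptr,      // offset 0 into the bound indirect buffer
                                    drawCount,
                                    sizeof(DrawElementsIndirectCommand));
    }

With ARB_indirect_parameters the draw count itself can also come from a GPU buffer, which is what makes fully GPU-driven submission possible.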

If you want some low level GCN shader core optimizations, please read these:
http://michaldrobot.com/2014/05/12/low-level-optimizations-for-gcn-digital-dragons-2014-slides/
http://michaldrobot.com/2014/04/01/gcn-execution-patterns-in-full-screen-passes/

These presentations should at least give some insight about the possible performance gains when optimizing your shader code (or execution pattern) for a single target architecture.

Low level programming is not mysterious. I personally feel that the modern high level APIs such as OpenGL and DirectX 11 have become so big and are filled with so many obscure features (especially OpenGL with its full backwards compatibility) that a modern low level API is actually less mysterious to use. I understand low level APIs better, especially when it comes to getting the best possible performance out of them.
 

Yes. That is what I read somewhere too: that Kepler and GCN have GPU features that DX11 can't utilize and doesn't even know exist. Am I to understand that GNM on PS4 utilizes all of these GPU features, to boot? Same with Xbox's DX11.x?

So what are the main differences between coding for PC (DX11) and for console? I suspect a lot of the control is taken away on PC? And to what degree could performance increase going from a high level API to a low level one?
 
Many thanks for sharing, sebbbi. Those links are quite interesting, although some of the terms, I gotta say, are out of my reach.

Judging from other posts of yours, some of the methods you (and maybe other developers) have tried are going to lay the foundations for better code on modern consoles and popularise these techniques. At least, that's what I've understood and what I hope I read correctly. What I noticed in the articles you linked to is that they are focused on GCN instructions, basically.

Aside from that, do keep in mind that I always thought low level programming, coding to the metal, was about using the most obscure GPU ... AND CPU functions, but maybe CPU tricks are not needed anymore and it's all about the GPU (unlike during the Cell era, where people had a lot of expectations of the Cell).

My only fear with to-the-metal programming and some of the methods the new techniques suggest is backwards compatibility. :/ I want future consoles to be 100% backwards compatible, to build up a library of games as solid as the PC's, to be played across entire generations.
 