DirectX 12: The future of it within the console gaming space (specifically the XB1)

I've read that AMD's inferior CPU cores perform closer to Intel when programs are designed for more parallel computing.

I have a problem with "inferior" here. If the core were indeed inferior, it should perform even worse in a multithreaded environment, because you cannot get a 100% core-per-core advantage in any multithreaded algorithm unless the threads are totally independent (which kind of defeats the purpose of multithreading).
The correct term here would probably be: Intel is more tolerant of bad code (and bad coders).
On the other hand, the purpose of DX12 should be to unlock the GPU's power, not to improve the CPU situation, because the CPU should have been taken out of the whole D3D picture years ago. But it seems like that won't happen in DX12 either; it still leaves quite a lot of tasks on the CPU.
 
Also it brings up culling, as more draw calls to build something will allow for fine-grained culling. I didn't even know! I always thought the GPU would just discard the triangles it can't see; I didn't know it was the whole mesh or none of the mesh.

It's not. It's triangles, quads and fragments that get rejected by the GPU, not meshes. By default that's against the view frustum. You can add more rejection by using a z-buffer; depth coverage then also leads to triangles, quads and fragments being rejected.

How do batches and culling work? If you batch 100 trees in the area but 50 are covered by a mountain are you still forced to draw all 100?

You calculate the bounding volume of a mesh and clip it against the view frustum, on the CPU or with compute. The ones out of view get skipped entirely. If they are instanced, then you (would) need to rewrite the instance-parameter array to exclude them from the run on the GPU.

If the volumes can be made smaller, the culling is more effective at pushing fewer triangles, at the expense of more draws.
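To make that concrete, here is a minimal CPU-side sketch of the bounding-volume test described above, assuming bounding spheres and a frustum stored as six inward-facing planes (the types and names are illustrative, not from any particular engine):

```cpp
// Minimal sketch of the CPU-side culling described above (illustrative types,
// not from any particular engine). Frustum planes use inward-pointing normals,
// so a point inside the frustum has a positive signed distance to every plane.
#include <array>
#include <cstddef>
#include <vector>

struct Plane  { float nx, ny, nz, d; };   // signed distance = n.p + d
struct Sphere { float x, y, z, radius; }; // bounding sphere of one mesh

using Frustum = std::array<Plane, 6>;

bool SphereInFrustum(const Sphere& s, const Frustum& f)
{
    for (const Plane& p : f)
    {
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius)
            return false;                 // fully behind one plane -> out of view
    }
    return true;                          // inside or intersecting the frustum
}

// Builds the list of meshes that actually get a draw call this frame;
// out-of-view meshes are skipped entirely.
std::vector<std::size_t> CullMeshes(const std::vector<Sphere>& bounds, const Frustum& f)
{
    std::vector<std::size_t> visible;
    visible.reserve(bounds.size());
    for (std::size_t i = 0; i < bounds.size(); ++i)
        if (SphereInFrustum(bounds[i], f))
            visible.push_back(i);
    return visible;
}
```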
 
The correct term here would probably be: Intel is more tolerant of bad code (and bad coders).
So in your expert opinion, would an SR "core" be equivalent to an HSW one, modulo the fact that the latter is tolerant of bad code (which would be...)? Would a module be equivalent to an HSW core? What about an IVB one? Are there any relevant measurements to back this up?

Ponder upon the following: we have two workers, one can put in one nail per hour, another four nails per hour. Design A requires four nails, therefore the first worker will yield one finished A every four hours, whereas the second will yield one finished A every hour. Now, here comes brilliant design B that only requires one nail - what's the yield for our workers now? (One per hour versus four per hour: the same 4:1 ratio as before.) Does that mean the first worker is less tolerant of bad designs? Did some intrinsic fastness of the first worker get unleashed? Did the bottleneck move?
 
Ponder upon the following: we have two workers, one can put in one nail per hour, another four nails per hour.

Not a good example. The ALUs are reasonably fast on both CPUs. Usually fast single-threaded performance means the work is being done by the out-of-order execution, the memory controller, op dispatch, branch prediction, etc.
In fact the SIMD is wider and the FPU is faster in Haswell, AFAIR.
 
Not a good example. The ALUs are reasonably fast on both CPUs. Usually fast single-threaded performance means the work is being done by the out-of-order execution, the memory controller, op dispatch, branch prediction, etc.
In fact the SIMD is wider and the FPU is faster in Haswell, AFAIR.

It's a pretty good example, as it conveys a very precise point. I think you need to ponder more, as you did not get it. That being said, having strong OoOE, low-latency caches, strong memory disambiguation or non-abject instruction latencies does not mean you are "more tolerant of bad coders"; it means you have a better CPU. Period. Those are integral parts of CPU architecture, and are non-trivial to get right; they're not an added bonus for catering to "bad coders". The statement that the ALUs are "reasonably fast" is vacuous and irrelevant: nobody cares about naked ALUs. Furthermore, it is also misleading in the sense that BD (and its heirs) have fewer integer execution resources per "core" than the Intel equivalents (or even K8L or K8); it takes a module to get in the vicinity. Otherwise stated, if one copies and pastes a reasonably fast ALU and nothing else but the bare minimum, does one get an HSW competitor that does not cater to bad coders?
 
How do batches and culling work? If you batch 100 trees in the area but 50 are covered by a mountain are you still forced to draw all 100?
There are multiple different ways to do batching.

Traditionally the most common way has been to merge multiple small meshes together into a single mesh (single vertex buffer). When you combine the meshes of 100 trees into a single mesh, you either draw that mesh (containing the 100-tree forest) completely or cull it. The advantage of this method is the reduction of draw calls (the 100-tree forest is a single draw call). Also each tree can have unique geometry.
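As a rough illustration of that merge step (assuming a trivial vertex layout, and that positions have already been transformed into a shared space before merging; the names are made up for the sketch):

```cpp
// Rough sketch of the merge: concatenate the vertex buffers of many small meshes
// and rebase their indices into the shared buffer, so the whole batch is drawn
// (or culled) as one unit. Vertex layout and names are made up for the example;
// positions are assumed to already be in a common space.
#include <cstdint>
#include <vector>

struct Vertex { float px, py, pz; float u, v; };

struct Mesh
{
    std::vector<Vertex>   vertices;
    std::vector<uint32_t> indices;
};

Mesh MergeMeshes(const std::vector<Mesh>& meshes)
{
    Mesh merged;
    for (const Mesh& m : meshes)
    {
        const uint32_t baseVertex = static_cast<uint32_t>(merged.vertices.size());
        merged.vertices.insert(merged.vertices.end(), m.vertices.begin(), m.vertices.end());
        for (uint32_t i : m.indices)
            merged.indices.push_back(baseVertex + i);   // rebase into the shared buffer
    }
    return merged;   // e.g. 100 trees -> one vertex/index buffer -> one draw call, all or nothing
}
```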

Geometry instancing is another way to solve the same problem. It allows you to render the same mesh (with the same textures) multiple times at different locations using just a single draw call. The problem with geometry instancing is that the mesh needs to be the same for each instance. In modern games there tend to be a lot of different variations of object meshes. Also for each variation there are multiple different LODs (LODs are different meshes). This means that geometry instancing is useful only in limited cases (you have lots of identical objects with identical LOD). Because the object LOD (= mesh) changes based on the camera distance, you need to refresh your geometry instancing buffers every frame. Usually this task is done by the CPU. It adds some CPU cost. Another downside of geometry instancing is that it limits your ability to render geometry strictly front to back. It will thus add some overdraw in complex scenes (adding GPU cost).
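A minimal sketch of that per-frame CPU work, assuming three LODs of the same mesh and leaving the actual (API-specific) instanced draw calls out; the LOD distances and all names are illustrative:

```cpp
// Sketch of the per-frame CPU work: pick a LOD per object from camera distance and
// rebuild one packed instance array per LOD bucket. Each bucket would then be
// uploaded and drawn with a single instanced draw call (API-specific, omitted).
// LOD count/distances and all names are illustrative.
#include <cmath>
#include <cstddef>
#include <vector>

struct InstanceData { float world[16]; };   // per-instance transform (4x4 matrix)

struct Object
{
    InstanceData instance;
    float x, y, z;                          // position used for LOD selection
};

constexpr std::size_t kLodCount = 3;
constexpr float kLodDistances[kLodCount] = { 20.0f, 60.0f, 1e30f };

std::size_t SelectLod(const Object& o, float camX, float camY, float camZ)
{
    const float dx = o.x - camX, dy = o.y - camY, dz = o.z - camZ;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    for (std::size_t lod = 0; lod < kLodCount; ++lod)
        if (dist < kLodDistances[lod])
            return lod;
    return kLodCount - 1;
}

// One packed instance array per LOD, refreshed every frame as objects move between LODs.
std::vector<std::vector<InstanceData>> BuildInstanceBuffers(
    const std::vector<Object>& objects, float camX, float camY, float camZ)
{
    std::vector<std::vector<InstanceData>> buckets(kLodCount);
    for (const Object& o : objects)
        buckets[SelectLod(o, camX, camY, camZ)].push_back(o.instance);
    return buckets;   // buckets[lod].size() becomes the instance count of that draw call
}
```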

Multi-draw indirect is basically an improved version of geometry instancing. It allows you to change the vertex count and the vertex start index per draw. This means that you can draw different meshes. Multi-draw doesn't allow you to change state or bindings (textures) between the draws. However, with bindless texturing (or with PRT / virtual texturing) you can dynamically index into texture descriptor arrays in the shader to access a different texture based on the draw id. The biggest limitation still stays: you can't change the shader between the draws. However, if you are using g-buffering, you don't usually have that many different shaders writing to the g-buffer. Deferred lighting is done after the object rendering, nowadays usually in a full-screen compute shader pass. Techniques such as clustered deferred rendering allow you to have per-pixel lighting shader permutations in the lighting step (bucketing pixels into different bins based on their lighting model).
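To illustrate, here is a hedged sketch of how such an indirect argument buffer might be filled on the CPU. The command struct mirrors OpenGL's DrawElementsIndirectCommand layout; the MeshRange type and the visibility input are assumptions for the example:

```cpp
// Sketch of filling a multi-draw-indirect argument buffer on the CPU: one entry
// per visible mesh, each with its own index count and start offsets, so meshes
// with different geometry share a single multi-draw call. The command struct
// mirrors OpenGL's DrawElementsIndirectCommand layout; MeshRange and the
// visibility input are assumptions for the example.
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawElementsIndirectCommand
{
    uint32_t count;          // index count of this mesh
    uint32_t instanceCount;  // 1 here; could be 0 to skip a culled mesh instead of omitting it
    uint32_t firstIndex;     // start offset in the shared index buffer
    int32_t  baseVertex;     // start offset in the shared vertex buffer
    uint32_t baseInstance;   // doubles as a "draw id" to index bindless texture/material arrays
};

struct MeshRange { uint32_t indexCount, firstIndex; int32_t baseVertex; uint32_t materialId; };

std::vector<DrawElementsIndirectCommand> BuildIndirectArgs(
    const std::vector<MeshRange>& meshes, const std::vector<bool>& visible)
{
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(meshes.size());
    for (std::size_t i = 0; i < meshes.size(); ++i)
    {
        if (!visible[i])
            continue;                                        // culled: emit no entry
        cmds.push_back({ meshes[i].indexCount, 1u,
                         meshes[i].firstIndex, meshes[i].baseVertex,
                         meshes[i].materialId });             // shader reads materialId via baseInstance
    }
    return cmds;   // upload and submit with one multi-draw indirect call
}
```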
 
I have a problem with "inferior" here. If the core were indeed inferior, it should perform even worse in a multithreaded environment.
In many cases it does versus its more modern competition, and the cases where it doesn't typically require double the core count or more, and a significant expenditure of die area and power.
With Intel's process lead and much broader design progression, there are any number of implementations with core counts below, at, and far above what AMD can offer.

The correct term here would probably be: Intel is more tolerant of bad code (and bad coders).
Intel does very well with good code and programmers as well.

But it seems like that won't happen in DX12 either; it still leaves quite a lot of tasks on the CPU.
It seems like there are things GPUs are not good at.
 
it means you have a better CPU

Not if your CPU performance suffers more from multi-threaded code.
It's kind of off topic, but imho GPU is "a better CPU. Period."

Intel does very well with good code and programmers as well.

Probably, but putting a lot of effort into speeding up single-threaded execution tells me that the target audience is different.

It seems like there are things GPUs are not good at.

Not in the case of a graphics API.
 
I guess that's to be expected. So they get the interview, but now it sounds like they aren't going to broadcast it until after a specific date. I hope they ask the right questions, or just outright say what developers need to hear, if this interview is really about helping developers understand DX12.
 
^ From reading their post history, the X1 DX12 team asked them to be allowed to appear on the show.

Edit: Doh - I never hit enter.

BREAKING NEWS: The Xbox One DX12 team want to come on TiC & talk I assume about how DX12 truly affects XBOX One RT
 
^ From reading their post history, the X1 DX12 team asked them to be allowed to appear on the show.

Edit: Doh - I never hit enter.
Right. However, if they have to sign NDAs because the interview contains information under NDA, I expect that interview to air when the NDA expires for everyone. I guess we could also take a stance similar to Ryan's case for Anandtech: releasing certain results but nothing else.

But thank you for sharing this, I do look forward to that interview, I hope a lot of this gets cleared up.

And I stand corrected.
 
Intel does very well with good code and programmers as well.
AVX2 is a good example of this. And I personally love the AVX-512 instruction set; there are so many goodies in it. AMD is still half-rate in AVX (and only supports AVX1).
It seems like there are things GPUs are not good at.
Yes. However the GPU is VERY good at viewport culling and occlusion culling. This means that the GPU needs to eventually be able to submit all the draw calls (because it knows the visible set, the CPU does not). Multi-draw indirect and CUDA/OpenCL device side compute kernel enqueue are big steps in the right direction. But we are still missing graphics pipeline enqueue (and it needs to support dynamic shader selection). There isn't actually that much missing in the current GPUs to handle that.
 
And only now, in 2015, for some strange reason these features suddenly appear.
There are a lot of reasons why "now" is a good time. For some reason people have gotten the impression that because today we can do a pretty nice, portable low-overhead API, that this was always the case. It absolutely wasn't... GPU hardware itself has only recently gotten to the level of generality required to support something like DX12. It would not have worked nearly as nicely on a ~DX10-era GPU...

Another huge shift in the industry lately is the centralization of rendering technology into a small number of engines, written by a fairly small number of experts. Back when every studio wrote its own rendering code, something like DX12 would have been far less viable. There are still people who do not want these lower-level APIs of course, but most of them have already moved to using Unreal/Unity/etc :)

More than that, I've seen numerous developers trying to persuade MSFT that the change is needed, but it didn't happen up until now.
The notion that it hasn't been blatantly clear to everyone for a long time that this was a problem is silliness. Obviously it has been undesirable from day one and it's something that has been discussed during my entire career in graphics. Like I said, it's a combination of secondary factors that make things like DX12 practical now, not some single industry event.

yet they still manage to drive forward a much-needed initiative like Mantle, and then everyone else kicks them in the shins, takes the project and runs off with it. Up next: Nvidia and Intel take Adaptive Sync via DisplayPort and run off with it.
That's a twisted view of the world. Frankly it's not terribly difficult to make an API that maps well to one piece of hardware. See basically every console API ever. If you feel bad that the other "modern" APIs are apparently ripping off Mantle (and just trivially adding portability in an afternoon ;)) then you should really feel sorry for Sony and others in the console space who came up with these ideas much earlier still (see libgcm). But the whole notion is stupid - it's a small industry and we all work with each other and build off good ideas from each other constantly.

Ultimately if Mantle turns out to be a waste of engineering time (which is far from clear yet), AMD has no one to blame but themselves in terms of developing and shipping a production version. If their intention was really to drive the entire industry forward they could have spent a couple months on a proof of concept and then been done with it.

The adaptive sync thing is particularly hilarious as that case is a clear reaction to what NVIDIA was doing... but building it on top of DP is precisely because this stuff was already in eDP! It's also highly related to other ideas that have been cooking there for some time (panel self refresh). Don't get fooled by marketing "we did it first" nonsense.

So much goes on behind the scenes of API development that I'm not sure we will know the full story, or whether, given the number of players and viewpoints, we even can.
Yep, as you say I'm not even sure the notion of a "full story" is well-defined, despite human nature wanting a tidy narrative :) Certainly multiple parties have been working on related problems for quite a long time here, and no one party has perfect information. As folks may or may not remember from the B3D API interview (http://www.beyond3d.com/content/articles/120/3), these trends were already clearly established in 2011. Even if Mantle was already under development then (and at least it seems like Mike didn't know about it if it was), I certainly didn't know about it :) Yet - strangely enough - we were still considering all of these possibilities then.

A lot of things can be done to shift things one way or another, and each party can cite their fraction of the overall story to justify their desired narrative.
Exactly, which is why it's ultimately a useless exercise. All of the statements can even be correct, because that's how the industry actually works: nothing ever comes from a vacuum.

It's kind of off topic, but imho GPU is "a better CPU. Period."
Where's the eye roll emoticon when I need it...

Yes. However the GPU is VERY good at viewport culling and occlusion culling. This means that the GPU needs to eventually be able to submit all the draw calls (because it knows the visible set, the CPU does not).
I'm not sure I'm totally bought in to either sentence there :) The GPU is pretty good at the math for culling, but it's not great with the data structures. You can argue that for the number of objects you're talking about you might as well just brute force it, but as I showed a few years ago with light culling, that's actually not as power-efficient. Now certainly GPUs will likely become somewhat better at dealing with less regular data structures, but conversely as you note, CPUs are becoming better at math just as quickly, if not *more* quickly.

For the second part of the statement, maybe for a discrete GPU. I'm not as convinced that latencies absolutely "must be high" because we "must buffer lots of commands to keep the GPU pipeline fed" in the long run. I think there is going to be pressure to support much more direct interaction between CPU/GPU on SoCs for power efficiency reasons and in that world it is not at all clear what makes the most sense. Ex. we're going to be pretty close to a world where even a single ~3Ghz Haswell core can generate commands faster than the ~1Ghz GPU frontend can consume them in DX12 already so looping in the GPU frontend is not really a clear win if we end up in a world with significantly lower submission latencies.

It may go in the direction that you're saying and we certainly should pursue both ways, but I disagree that it's a guaranteed endpoint.
 
I'm not sure I'm totally bought in to either sentence there :) The GPU is pretty good at the math for culling, but it's not great with the data structures. You can argue that for the number of objects you're talking about you might as well just brute force it, but as I showed a few years ago with light culling, that's actually not as power-efficient. Now certainly GPUs will likely become somewhat better at dealing with less regular data structures, but conversely as you note, CPUs are becoming better at math just as quickly, if not *more* quickly.

For the second part of the statement, maybe for a discrete GPU. I'm not as convinced that latencies absolutely "must be high" because we "must buffer lots of commands to keep the GPU pipeline fed" in the long run. I think there is going to be pressure to support much more direct interaction between CPU/GPU on SoCs for power efficiency reasons and in that world it is not at all clear what makes the most sense. Ex. we're going to be pretty close to a world where even a single ~3Ghz Haswell core can generate commands faster than the ~1Ghz GPU frontend can consume them in DX12 already so looping in the GPU frontend is not really a clear win if we end up in a world with significantly lower submission latencies.

It may go in the direction that you're saying and we certainly should pursue both ways, but I disagree that it's a guaranteed endpoint.
Agreed. The hardware and the APIs certainly should pursue both ways. Let the programmer choose where to run each part of the code.

Brute-force light tile culling is quite different from GPU-side viewport and occlusion culling. In tiled light culling you are doing an awful lot of duplicate work. You would never do it like this on the CPU.

In contrast, GPU-side viewport culling does exactly the same amount of work as CPU-side viewport culling. Most developers no longer use complex pointer-indirection-based structures such as octrees to cull their objects, as vectorized (AVX) brute-force culling on the CPU is much faster (considerably better data access patterns). DICE had a nice paper about this several years ago (they claimed a 3x perf boost compared to the complex algorithm). Our GPU implementation can cull a 2 million object scene in around 0.2 milliseconds (including both occlusion and viewport culling at sub-object granularity). We used to use a whole CPU core (16.6 ms) for this task (and we had a much smaller data set). The nice thing about GPU culling is that it can directly read GPU data (such as the depth buffer) with zero additional latency, making it possible to do very precise culling. The precision improvement (sub-object granularity) actually saves more than 0.2 ms of GPU time in rasterization, meaning that the GPU culling is practically "free".
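As an illustration of the depth-based part of that culling, here is a simplified occlusion test written for the CPU; a real implementation would run it in a compute shader against a hierarchical-Z pyramid instead of scanning full-resolution depth, and the screen-space projection of the bounding volume is assumed to be computed elsewhere:

```cpp
// Simplified occlusion test against the depth buffer, on the CPU for clarity; a real
// implementation would run in a compute shader against a hierarchical-Z pyramid rather
// than scanning full-resolution depth. Convention: larger depth value = farther away.
// The screen-space rectangle of the object's bounding volume is assumed to be computed
// elsewhere.
#include <algorithm>
#include <cstddef>
#include <vector>

struct ScreenRect { int x0, y0, x1, y1; };   // projected bounds of the object, in pixels

bool IsOccluded(const std::vector<float>& depth, int width, int height,
                ScreenRect r, float objectMinDepth)
{
    r.x0 = std::max(r.x0, 0);           r.y0 = std::max(r.y0, 0);
    r.x1 = std::min(r.x1, width - 1);   r.y1 = std::min(r.y1, height - 1);

    // Conservative: the object is hidden only if, at every covered pixel, the geometry
    // already in the depth buffer is strictly closer than the object's closest point.
    for (int y = r.y0; y <= r.y1; ++y)
        for (int x = r.x0; x <= r.x1; ++x)
            if (depth[static_cast<std::size_t>(y) * width + x] >= objectMinDepth)
                return false;           // existing geometry is not in front here -> may be visible
    return true;
}
```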

The energy efficiency of the GPU crunching through a big array of tightly packed 16-bit bounding sphere data is certainly better than that of the CPU crunching through the same array with AVX (OoO machinery and all that). And the CPU AVX version is certainly more energy efficient than a scalar CPU version with a pointer soup (cache-miss hell, with lots of partially used cache lines loaded and lots of page misses).
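A small sketch of the "flat array" style being contrasted with the pointer soup: the same plane test as in the earlier sphere-culling sketch, but over structure-of-arrays data with a branch-free inner loop that a compiler can auto-vectorize. A production version might store quantized 16-bit values and/or run as a compute shader; plain floats are used here for clarity:

```cpp
// The same plane test as in the earlier sphere-culling sketch, restructured as flat
// structure-of-arrays data with a branch-free inner loop that compilers can
// auto-vectorize (SSE/AVX). A production version might store quantized 16-bit values
// and/or run this as a compute shader; plain floats are used here for clarity.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Plane { float nx, ny, nz, d; };   // inward-pointing normals, as before

struct SphereSoA
{
    std::vector<float> x, y, z, r;       // one entry per object, tightly packed
};

// Writes 1 for spheres at least partially inside all six planes, 0 otherwise.
void CullBruteForce(const SphereSoA& s, const Plane planes[6], std::vector<uint8_t>& visible)
{
    const std::size_t n = s.x.size();
    visible.resize(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        uint8_t inside = 1;
        for (int p = 0; p < 6; ++p)
        {
            const float dist = planes[p].nx * s.x[i] + planes[p].ny * s.y[i]
                             + planes[p].nz * s.z[i] + planes[p].d;
            inside &= static_cast<uint8_t>(dist >= -s.r[i]);   // no early-out: stays vectorizable
        }
        visible[i] = inside;
    }
}
```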
 
Another huge shift in the industry lately is the centralization of rendering technology into a small number of engines, written by a fairly small number of experts. Back when every studio wrote its own rendering code, something like DX12 would have been far less viable. There are still people who do not want these lower-level APIs of course, but most of them have already moved to using Unreal/Unity/etc :)

That is kind of a misleading statement, if not a myth.

Almost all the studios which survived until now still have their own engines! Basically all console newcomers (RAD, for example) still build their own new engines, which have some chance of being ported to PC (see Resident Evil).
If I had to make a non-conservative guess, we still have around 50-70 AA+ custom engines with 5+ games shipped on them over the last decade, of course. And a great deal of them are stuck at DX9/OGL2 and will never move forward.
There are more studios, and a roughly constant or slightly rising number of engines (new ones appear because of console start-ups, old ones vanish because of bankruptcy, but the market volume grows). That changes the statistics, but it's not that we now have only a handful of engines and that everything would be fine if we only had to "break" one or two providers and their engines.

It's always possible to move forward; it was always possible to make "cool" APIs. I conjecture the problem is _not_ the game developers; the problem is that the API providers don't know where something experimental will lead, and because it's business they are reluctant to throw resources behind a crazy idea - they don't want market segmentation and whatnot business-based concerns. Which is especially cynical, because MS could in theory just hire enough programmers to rewrite every existing game engine for DXnext; they would lose so little monetary value in comparison to the money they do have.

Ultimately, the reason for DX12 coming 12 years too late is business, nothing else IMHO. And Mantle broke away from that: an API for game developers, not for money-making, or Windows surfaces, or whatever else MS has to juggle. I don't care, I want an RT-capable thin GPU access layer, and a company with the API monopoly that reacts to the customers' (my) needs.
 
It would not have worked nearly as nicely on a ~DX10-era GPU.

Why?

Another huge shift in the industry lately is the centralization of rendering technology into a small number of engines, written by a fairly small number of experts.

No, it was the case last gen. This gen will move away from that, back onto the "own engine" path once again. And we can already see it. Example: UE is no longer sold for seven-digit sums, but is "free-to-play".
My opinion is that the "centralization of rendering technology" happened last gen because a lot of people feared the PS3 hardware, and that "centralization" promised them it would take care of it. Bottom line - it never delivered.

The notion that it hasn't been blatantly clear to everyone for a long time that this was a problem is silliness.

I doubt that. I have preached it to developers since 2006, and it was met with "Where's the eye roll emoticon" each time I tried. So I'm having kind of a déjà vu here. :)

Ex. we're going to be pretty close to a world where even a single ~3Ghz Haswell core can generate commands faster than the ~1Ghz GPU frontend can consume them

Yes, that could happen, which is why we should remove the frontend and program the GPU directly.
Right now the situation is indeed silly: the GPU is a real general-purpose processor, it can run any code, but we cannot program it freely because it is managed by its fixed-function hardware. If we could program the FFP from the GPU that would be nice, but we cannot, because the FFP is managed by yet more hardware: the CP, which is a totally standard CPU - and we cannot program even that directly, because it is driven by the byte code generated by the driver.
 
Multi-draw indirect and CUDA/OpenCL device side compute kernel enqueue are big steps in the right direction. But we are still missing graphics pipeline enqueue (and it needs to support dynamic shader selection). There isn't actually that much missing in the current GPUs to handle that.

While shader changes are not possible yet, the NV_command_list extension adds flexibility to multi-draw indirect and allows for efficient culling
https://github.com/nvpro-samples/gl_occlusion_culling
https://github.com/nvpro-samples/gl_cadscene_rendertechniques

Even with shader changes on the CPU, it is possible to cull the scene in one go and create command buffers in "per-shader" sections. I will present details at NVIDIA's GTC, but essentially the code is available in those samples. I also think that enhancing the GPU's work creation adds something new and powerful, and other APIs not moving in this direction seems like a missed opportunity.
 
While shader changes are not possible yet, the NV_command_list extension adds flexibility to multi-draw indirect and allows for efficient culling
This forthcoming extension looks nice :)
Even with shader changes on the CPU, it is possible to cull the scene in one go and create command buffers in "per-shader" sections. I will present details at NVIDIA's GTC, but essentially the code is available in those samples. I also think that enhancing the GPU's work creation adds something new and powerful, and other APIs not moving in this direction seems like a missed opportunity.
You can do one MDI call per shader type. This gives you a constant number of MDI calls. If you have too many shader types, some MDI calls will often be empty (when a shader is not used by the currently visible set of geometry). Empty MDI calls do not cost much (if you only do a dozen of them in total). You could also use predicated rendering to skip over the unused ones. A single culling shader can feed all the MDI calls (binning each visible object into a different bin based on its shader id).
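A minimal sketch of that binning step, using a simplified version of the indirect-argument struct from the earlier sketch; the shaderId field and the submission comments are illustrative assumptions:

```cpp
// Sketch of the binning step: one pass over the visible set distributes indirect draw
// arguments into one bin per shader/pipeline, and each bin is then submitted as its own
// multi-draw-indirect call. DrawArgs is a simplified indirect-argument struct and
// shaderId is an illustrative per-object field.
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawArgs { uint32_t count, instanceCount, firstIndex, baseVertex, baseInstance; };

struct VisibleObject { uint32_t shaderId; DrawArgs args; };

// bins[s] holds the indirect arguments of all visible objects that use shader s.
std::vector<std::vector<DrawArgs>> BinByShader(const std::vector<VisibleObject>& visibleSet,
                                               std::size_t shaderCount)
{
    std::vector<std::vector<DrawArgs>> bins(shaderCount);
    for (const VisibleObject& o : visibleSet)
        bins[o.shaderId].push_back(o.args);
    return bins;
    // Submission loop (API-specific, not shown): for each shader s, bind shader s and
    // issue one multi-draw with bins[s]. Empty bins can be skipped on the CPU, left as
    // cheap empty MDI calls, or skipped with predicated rendering when binning runs on
    // the GPU.
}
```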
 