DirectX 12: The future of it within the console gaming space (specifically the XB1)

The question I ask is:
If you only have one GCP, can you max it out? Do any of you have knowledge of such cases?
If so, as Pixel states, why hasn't AMD introduced a second GCP on its newest cards?
Honestly, I see the Xbox GPU being bottlenecked by other things before a single GCP is maxed out.
The PS4 also has a second GCP (apparently without compute command capability), and Sony did not build its console around any kind of DX12, so I see this as a probable benefit for OS interactions rather than anything else.
 
Maybe this can help:

Oxide Games has emailed us this evening with a bit more detail about what's going on under the hood, and why Mantle batch submission times are higher. When working with large numbers of very small batches, Star Swarm is capable of throwing enough work at the GPU such that the GPU's command processor becomes the bottleneck. For this reason the Mantle path includes an optimization routine for small batches (OptimizeSmallBatch=1), which trades GPU power for CPU power, doing a second pass on the batches in the CPU to combine some of them before submitting them to the GPU. This bypasses the command processor bottleneck, but it increases the amount of work the CPU needs to do (though note that in AMD's case, it's still several times faster than DX11).

http://www.anandtech.com/show/8962/the-directx-12-performance-preview-amd-nvidia-star-swarm/6
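To make the quoted optimization a little more concrete, here is a minimal, purely illustrative sketch (the names and data layout are hypothetical, not Oxide's actual code) of a CPU pre-pass that merges consecutive small batches sharing the same state and contiguous index ranges, so the GPU's command processor sees fewer, larger batches:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one batch of work queued for the GPU.
struct Batch {
    int      stateId;     // pipeline/material state this batch requires
    uint32_t firstIndex;  // start of its index range
    uint32_t indexCount;  // number of indices to draw
};

// CPU-side pass: merge adjacent small batches that share state and cover
// contiguous index ranges, trading a little CPU time for fewer submissions.
std::vector<Batch> CombineSmallBatches(const std::vector<Batch>& in,
                                       uint32_t smallThreshold)
{
    std::vector<Batch> out;
    for (const Batch& b : in) {
        if (!out.empty() &&
            out.back().stateId == b.stateId &&
            out.back().indexCount < smallThreshold &&
            b.indexCount < smallThreshold &&
            out.back().firstIndex + out.back().indexCount == b.firstIndex) {
            out.back().indexCount += b.indexCount;  // merge into previous batch
        } else {
            out.push_back(b);  // keep as its own submission
        }
    }
    return out;
}
```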
 
Thanks Starx! Really helpful!

But Star Swarm is a demo made to push draw calls to the limit. It's a benchmark made for that specific task, not a real game!
Besides, the card in those benchmarks is a lot more powerful than the console GPUs, so it can handle a lot more work!
But based on the Star Swarm results it seems that only very specific games with thousands of small objects on screen could really benefit from that! And they are not really that common. Maybe AMD just went with a faster/improved GCP on newer cards?
 
Thanks Starx! Really helpful!

But Star Swarm is a demo made to push draw calls to the limit. It's a benchmark made for that specific task, not a real game!
Besides, the card in those benchmarks is a lot more powerful than the console GPUs, so it can handle a lot more work!
The GCP is a separate workload item from the card's actual muscle. Higher-end cards did not ship with more powerful GCPs for GCN 1.0, AFAIK. And there is some minor confusion about where the bottleneck lies in this case for draw calls: under serially submitted APIs the CPU becomes the bottleneck; it cannot submit enough draw calls to keep the GPU busy, so it is nearly impossible for it to flood the GCP. With DirectX 11 this is even less likely.

The demo shows how a workload that brings a high-end Core i7 to its knees on DX11 with serial draw calls does not do so under Mantle/DX12/Vulkan, because those APIs support lower overhead and multithreaded draw call submission. It wasn't so much a hat tip to GPU performance.
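For what it's worth, here is a minimal sketch of that submission model, under stated assumptions (the command lists and allocators already exist and are not in flight on the GPU, and RecordSlice is a hypothetical helper that records one slice of the scene): each thread records its own D3D12 command list, and a single ExecuteCommandLists call hands everything to the GPU, which is what removes the serial CPU wall imposed by DX11's immediate context.

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>

// Hypothetical helper: records one slice of the scene's draws into the given list.
void RecordSlice(ID3D12GraphicsCommandList* cl, int firstObject, int count);

// Each worker thread records its own command list in parallel; the main thread
// then submits them all with one call. Allocators/lists are assumed to be
// created already and not referenced by work still executing on the GPU.
void SubmitFrame(ID3D12CommandQueue* queue,
                 std::vector<ID3D12CommandAllocator*>& allocators,
                 std::vector<ID3D12GraphicsCommandList*>& lists,
                 int objectCount)
{
    const int numThreads = static_cast<int>(lists.size());
    const int perThread  = objectCount / numThreads;

    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; ++i) {
        workers.emplace_back([&, i] {
            // No shared immediate context to serialize on, unlike D3D11.
            allocators[i]->Reset();
            lists[i]->Reset(allocators[i], nullptr);
            RecordSlice(lists[i], i * perThread, perThread);
            lists[i]->Close();
        });
    }
    for (auto& w : workers) w.join();

    // One submission hands all the recorded work to the GPU's command processor.
    std::vector<ID3D12CommandList*> raw(lists.begin(), lists.end());
    queue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
}
```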
 
Actually, I would think that reducing overdraw by ordering your draw calls in a way that breaks batching could increase performance: you spend more draw calls, but spend them more efficiently.
Agreed; actually, earlier in this thread we had this exact discussion, quite a few pages back. I think we were posting around here:
#833 or here #836

Forumaccount also put in his two cents back then:
A typical game will not be running large numbers of small draws. Indeed, it's a stupid idea especially on GCN. The only reason they'd do it is to handle these CPU benchmarks. You shouldn't be surprised to find programs written in unique ways exposing bounds that are completely uninteresting to nearly everyone else.
 
The GCP is a separate workload item from the card's actual muscle. Higher-end cards did not ship with more powerful GCPs for GCN 1.0, AFAIK. And there is some minor confusion about where the bottleneck lies in this case for draw calls: under serially submitted APIs the CPU becomes the bottleneck; it cannot submit enough draw calls to keep the GPU busy, so it is nearly impossible for it to flood the GCP. With DirectX 11 this is even less likely.

The demo shows how a workload that brings a high-end Core i7 to its knees on DX11 with serial draw calls does not do so under Mantle/DX12/Vulkan, because those APIs support lower overhead and multithreaded draw call submission. It wasn't so much a hat tip to GPU performance.

Thanks for the clarification about GCP performance not changing with more powerful cards. I always thought GCP performance was tied to the card's overall performance.

But if that is so, then we also know that console games in general are not GCP limited, as PC versions get higher framerates on more powerful rigs, meaning the bottleneck is elsewhere.

If the GCP were the limiting factor, a better card would not bring any performance improvement.

Did I misunderstand?
 
Thanks for the clarification about GCP performance not changing with more powerful cards. I always thought GCP performance was tied to the card's overall performance.

But if that is so, then we also know that console games in general are not GCP limited, as PC versions get higher framerates on more powerful rigs, meaning the bottleneck is elsewhere.

If the GCP were the limiting factor, a better card would not bring any performance improvement.

Did I misunderstand?
I wouldn't link GCP performance to the performance of the GPU.
The GCP is meant to schedule the work being sent to it; times when the GCP is overloaded are perhaps a different discussion point. But as with all things, developers will trade off their available resources (in this case CPU) to reduce the load on the GCP (if that is actually a factor), or increase pressure on the GCP to relieve pressure on the CPU.
There's an interplay of resource management between the CPU and GPU (when it comes to this). We've never really run into this problem before because I don't think DX11 APIs were capable of flooding the GCP before the CPU crapped out. But with low overhead and multithreaded draw calls we can for the first time, so I guess it depends on what the developer is going to try to do with their next set of games?

I think that instead of looking at straight numbers, it's best to look at what more draw calls can unlock for a game, like more and different kinds of interaction. If I understand it, the act of reducing draw calls also limits dynamism in a game: it's easy to reduce draw calls in a scene, but at the cost of the scene being rather static. With lots of draw calls available, you can have deforming terrain, or armor, or different ways to layer items on top of each other in real time, which is the sort of thing that is normally optimized out if you are tight on draw call budget.

I think the assumption that a second GCP is needed only if the first one maxes out is likely incorrect. I also think it can be completely wrong to assume that just because you can flood a GCP, the GPU will schedule that work better than well-optimized, batched draw calls; in other words, that there is _no_ performance detriment to multithreaded draw call submission at the GPU level versus well-optimized serially submitted draws, when in reality that may not be true at all.

The second assumption I wouldn't make is that the second GCP is only useful after the first is maxed out; I think that's likely also incorrect.

The third assumption I wouldn't make is that having a second GCP will schedule better overall than a single one. If that were true, and our resident leaker is correct, then there would be no reason for MS to keep it held back, unless it actually hurts performance, which would be a legitimate reason to keep it locked until games were designed to pass some draw call threshold where two GCPs are more beneficial than one (despite the downside).
 
The second graphics command processor was not exposed to games, but that is not the same thing as not being used. There are use cases for maintaining quality of service for the application/system portions.
The compute front ends might have similar reasons, but other GCN GPUs have had more than one before the consoles.
 
So it seems like MS just added a second graphics command processor so that draw calls are not the bottleneck (again). But why the hell a second compute command processor? I can't imagine a situation where the compute command processor would become the bottleneck.

The second graphics command processor is likely used for the same thing it's used for on the PS4: keeping system drawing decoupled from the game, allowing the system to draw smoothly even if the game is under load.
 
What discussion Shifty Geezer?
Quoting you in the "Xbox One November SDK leaked" thread, after seeing this.

"And there we have two GCPs in PS4! All this time with people saying there's only one in PS4 and XB1 is unique, but they're both the same (shocker). That's another big clue that it's for system drawing, and time to put this to bed unless there's a shocking revelation in the future."
 
Yeah this is the next step... I just got done working on reducing draw calls as much as possible. Now the next iteration of my stuff needs to throw all that out and use some extremely large number of draw calls due to the flexibility it gives (when done on GPU). In the future I hope the concept of a draw call goes away... these days we have less and less need to use them to break up blocks of fixed-function state.
The idea of a draw call isn't going away as you'll always need to submit work and it needs to be called something. All else being equal fewer draw calls are better than more as they use less CPU power and system bandwidth. I hope developers don't use a lot of draw calls just because they think they can.
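As a small, purely illustrative aside on that point (D3D12 shown, names hypothetical): instancing is one of the standard ways to keep draw call counts down, since a single call can cover many copies of the same mesh.

```cpp
#include <d3d12.h>

// Hypothetical example: per-instance transforms are assumed to live in a
// buffer the vertex shader indexes with SV_InstanceID.
void DrawManyTrees(ID3D12GraphicsCommandList* cmdList,
                   UINT indexCountPerTree, UINT treeCount)
{
    // One draw call covers every instance; submission cost no longer scales
    // with the number of trees on screen.
    cmdList->DrawIndexedInstanced(indexCountPerTree, treeCount, 0, 0, 0);
}
```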
 
The idea of a draw call isn't going away as you'll always need to submit work and it needs to be called something. All else being equal fewer draw calls are better than more as they use less CPU power and system bandwidth. I hope developers don't use a lot of draw calls just because they think they can.
Do you agree there is now leg room to draw front to back instead of being forced to batch? Do you think it's a worthwhile avenue of pursuit?
 
Do you agree there is now leg room to draw front to back instead of being forced to batch? Do you think it's a worthwhile avenue of pursuit?
Wow, can you expand on this? I was under the assumption that we drew back to front to minimize artifacts and errors, and that it was much simpler. Our APIs have depth checks to see which objects are background versus foreground. Is front to back only limited by draw calls?
 
Wow, can you expand on this? I was under the assumption that we drew back to front to minimize artifacts and errors, and that it was much simpler. Our APIs have depth checks to see which objects are background versus foreground. Is front to back only limited by draw calls?
For opaques you can submit geometry in whatever order you want and still get correct results (assuming z-buffering is enabled). However, to take advantage of Hi-Z and early-Z (perf and efficiency enhancers) you need to draw opaques front to back. In D3D11 and earlier you would become CPU limited if you used too many draw calls. So instead of drawing front to back, you would order your draw calls in such a way that you could batch them to minimize draw calls so you didn't become CPU limited, but in exchange you wouldn't draw your geometry in an optimal order. With more draw calls available you don't have to batch and can draw in a more optimal order. In truth though, I've become enamored with what sebbbi and the Assassin's Creed Unity team described in their GPU-driven pipeline presentation... I think that might be the future of rendering for a while to come.

edit - also here is something for you to read: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Depth_in-depth.pdf
edit2 - also in regards to being cpu bound you might want to read this old document: https://www.nvidia.com/docs/IO/8228/BatchBatchBatch.pdf
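To illustrate the sorting side of the answer above, here is a minimal sketch with hypothetical names, assuming a simple forward renderer: opaque draws are sorted nearest-first so the early draws prime the depth buffer and Hi-Z/early-Z can reject the occluded pixels of later ones.

```cpp
#include <algorithm>
#include <vector>

struct OpaqueDraw {
    float viewDepth;   // distance from the camera along the view axis
    int   pipeline;    // pipeline state / material this draw needs
    int   meshId;      // which geometry to draw
};

void SortFrontToBack(std::vector<OpaqueDraw>& draws)
{
    // Nearer objects first: they fill the depth buffer early, so later,
    // farther draws are rejected by early-Z instead of being shaded.
    std::sort(draws.begin(), draws.end(),
              [](const OpaqueDraw& a, const OpaqueDraw& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```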
 
I thought front to back was limited due to all the costly rays needed to calculate occlusion and transparency? Would a second GCP affect that? My understanding of the GCP is that it can use resources either not allocated to another job, or sub in jobs while the first is stalled on a RAM lookup or some such.
 
For those with good D3D12 engines running (not many, I would think) it should be easy to change the sorting order; there should already be code to depth sort (although back-to-front for translucent elements).
I'd definitely benchmark this to see what's faster.

That said, the cost of changing pipeline state objects may not have changed much, in which case depth sorting (instead of a depth pass?) might not be that good.
On one hand you may be able to remove the depth pass; on the other you still have to pay the cost of pipeline state object changes... (So it would also depend on how many different pipeline state objects you have; with physically based BSDFs, their number may be reduced significantly.)
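One way to explore that trade-off, sketched below with hypothetical names: bucket opaque draws into coarse front-to-back depth ranges and sort by pipeline state object within each bucket, keeping most of the early-Z benefit while leaving PSO changes relatively rare. The right bucket count would have to come out of the benchmarking suggested above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct DrawItem {
    float    viewDepth;  // distance from the camera
    uint32_t psoId;      // pipeline state object this draw needs
};

// Coarse depth bucket in the high bits (front to back), PSO id in the low
// bits, so draws within a bucket group by pipeline state.
uint64_t MakeSortKey(const DrawItem& d, float maxDepth, uint32_t depthBuckets)
{
    uint32_t bucket = static_cast<uint32_t>(
        std::min(d.viewDepth / maxDepth, 1.0f) * (depthBuckets - 1));
    return (static_cast<uint64_t>(bucket) << 32) | d.psoId;
}

void SortDraws(std::vector<DrawItem>& items, float maxDepth)
{
    std::sort(items.begin(), items.end(),
              [&](const DrawItem& a, const DrawItem& b) {
                  return MakeSortKey(a, maxDepth, 16) <
                         MakeSortKey(b, maxDepth, 16);
              });
}
```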
 