DirectX 12: The future of it within the console gaming space (specifically the XB1)

That depends on what your overall system looks like, what your task looks like, and what you mean by "IPC."
Nowadays it's hard to determine IPC from benchmarks alone. For example, a reviewer might incorrectly conclude that 1.7 GHz ULV Haswell IPC must be way better than Jaguar's, because it fares so well in single-threaded benchmarks, while failing to realize that the ULV Haswell actually runs the single-threaded application at 3.3 GHz. In comparison, a 1.7 GHz Jaguar always runs at 1.7 GHz. If the application fully uses all the cores and the GPU, the Haswell also drops down to 1.7 GHz. The revised Puma core (successor to Jaguar) introduced proper turbo as well (newer Intel Atoms have it too). It's pretty hard to conclude anything about the IPC of these mobile chips, since both the GPU load and the CPU load affect dynamic single-threaded performance that much.

Big turbo clocks (almost 2x in ULV Haswells) are great for PCs, since many productivity applications are still single-threaded (and do not stress the GPU simultaneously). Jaguar just can't handle these heavy productivity applications acceptably (by today's high standards). Console games, however, are designed to utilize all the available cores of that particular console and all the available GPU cycles. This leaves much less turbo headroom (if any) for properly optimized AAA games. Turbo would obviously help when porting games from other platforms (as the game was not designed with that particular hardware configuration in mind). On Intel CPUs you also have Hyper-Threading, which improves parallel scaling beyond the physical core count (a dual-core ULV is pretty good at running code designed for a quad core, for example).
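
To make the turbo point concrete, here's a minimal sketch (the IPC figures are hypothetical placeholders, not measurements) of how a single-threaded benchmark score conflates IPC with the clock actually achieved during the run:

// Hedged illustration of the point above: a single-threaded benchmark score is
// roughly proportional to IPC * achieved clock, so a core with a large turbo
// range can look far "better in IPC" than its base clock would suggest.
// The IPC values below are made-up placeholders, not measurements.
#include <cstdio>

int main() {
    struct Chip { const char* name; double ipc; double clock_ghz; };
    const Chip chips[] = {
        { "Jaguar @ 1.7 GHz (no turbo)",          1.0, 1.7 },  // hypothetical baseline IPC
        { "ULV Haswell @ 3.3 GHz (1T turbo)",     1.6, 3.3 },  // hypothetical IPC, turbo clock
        { "ULV Haswell @ 1.7 GHz (CPU+GPU load)", 1.6, 1.7 },  // same core, turbo headroom gone
    };
    for (const Chip& c : chips) {
        // "Score" ~ instructions retired per second = IPC * frequency.
        std::printf("%-38s -> relative score %.2f\n", c.name, c.ipc * c.clock_ghz);
    }
    return 0;
}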
 
It's pretty hard to conclude anything about the IPC of these mobile chips, since both the GPU load and the CPU load affect dynamic single-threaded performance that much.
Well... it can be measured directly with stuff like VTune or similar, of course, at least for anything x86. But ultimately the IPC of a chip isn't really as interesting as the realized performance (at a given power level, depending on the application).
 
Well... it can be measured directly with stuff like VTune or similar, of course, at least for anything x86. But ultimately the IPC of a chip isn't really as interesting as the realized performance (at a given power level, depending on the application).
Yes. However, these tools tend to be a little bit too technical for non-coders to use. And I fully agree that raw (theoretical) IPC alone is not that interesting anymore, as all PC CPUs are power constrained. Intel made a very good decision when they started to focus heavily on performance/watt (race to sleep) before their competition did.
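
For anyone who does want to measure it directly, here's a rough sketch of the underlying idea using raw hardware counters via Linux's perf_event_open interface (not VTune, and Linux/x86 specific; counter access may require a permissive perf_event_paranoid setting):

// Hedged sketch: measuring IPC of a code region via raw hardware counters on
// Linux/x86, as an alternative to GUI tools like VTune.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static int open_counter(uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
    int instr_fd  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycles_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (instr_fd < 0 || cycles_fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(instr_fd,  PERF_EVENT_IOC_RESET,  0);
    ioctl(cycles_fd, PERF_EVENT_IOC_RESET,  0);
    ioctl(instr_fd,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_ENABLE, 0);

    // Stand-in workload: any single-threaded code region of interest.
    volatile double x = 1.0;
    for (int i = 0; i < 100000000; ++i) x += 1.0 / (i + 1);

    ioctl(instr_fd,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_DISABLE, 0);

    long long instructions = 0, cycles = 0;
    read(instr_fd,  &instructions, sizeof(instructions));
    read(cycles_fd, &cycles,       sizeof(cycles));
    std::printf("instructions=%lld cycles=%lld IPC=%.2f\n",
                instructions, cycles, cycles ? double(instructions) / cycles : 0.0);
    return 0;
}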
 
Adding the Metro Eurogamer Interview here and quoting out the bits that involve X1 and DX12/DX11.1 to separate the discussions (from DF thread) into their proper places.

Source: http://www.eurogamer.net/articles/d...its-really-like-to-make-a-multi-platform-game

Relevant quotes:
Digital Foundry: DirectX 11 vs GNMX vs GNM - what's your take on the strengths and weakness of the APIs available to developers with Xbox One and PlayStation 4? Closer to launch there were some complaints about XO driver performance and CPU overhead on GNMX.

Oles Shishkovstov: Let's put it that way - we have seen scenarios where a single CPU core was fully loaded just by issuing draw-calls on Xbox One (and that's surely on the 'mono' driver with several fast-path calls utilised). Then, the same scenario on PS4, it was actually difficult to find those draw-calls in the profile graphs, because they are using almost no time and are barely visible as a result.

In general - I don't really get why they choose DX11 as a starting point for the console. It's a console! Why care about some legacy stuff at all? On PS4, most GPU commands are just a few DWORDs written into the command buffer, let's say just a few CPU clock cycles. On Xbox One it easily could be one million times slower because of all the bookkeeping the API does.

But Microsoft is not sleeping, really. Each XDK that has been released both before and after the Xbox One launch has brought faster and faster draw-calls to the table. They added tons of features just to work around limitations of the DX11 API model. They even made a DX12/GNM style do-it-yourself API available - although we didn't ship with it on Redux due to time constraints.

Digital Foundry: What's your take on DirectX 12 and Mantle? Is it all about making PC games development tie in more closely with Xbox One and PlayStation 4?

Oles Shishkovstov: Aside from them being much more close to the (modern) metal, those APIs are a paradigm-shift in API design. DX11 was 'I will keep track of everything for you'. DX12 says 'now it's your responsibility' - so it could be a much thinner layer. As for Mantle, it is a temporary API, in my honest opinion.

Digital Foundry: To what extent will DX12 prove useful on Xbox One? Isn't there already a low CPU overhead there in addressing the GPU?

Oles Shishkovstov: No, it's important. All the dependency tracking takes a huge slice of CPU power. And if we are talking about the multi-threaded command buffer chunks generation - the DX11 model was essentially a 'flop', while DX12 should be the right one.

Digital Foundry: Microsoft returning the Kinect GPU reservation in the June XDK made a lot of headlines - I understand you moved from a 900p to a 912p rendering resolution, which sounds fairly modest. Just how important was that update? Has its significance been over-played?

Oles Shishkovstov: Well, the issue is slightly more complicated - it is not like 'here, take that ten per cent of performance we've stolen before', actually it is variable, like sometimes you can use 1.5 per cent more, and sometimes seven per cent and so on. We could possibly have aimed for a higher res, but we went for a 100 per cent stable, vsync-locked frame-rate this time. That is not to say we could not have done more with more time, and per my earlier answer, the XDK and system software continues to improve every month.
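
To illustrate the "few DWORDs written into the command buffer" model he describes, here's a toy command encoder (this is not GNM, D3D, or the XDK - just a hedged sketch of the idea): the console-style fast path is little more than appending a small packet to a buffer, with no per-draw validation or resource tracking.

// Toy sketch of the "few DWORDs per draw" model. Not GNM, D3D or the XDK --
// just an illustration of why a thin, console-style encoder costs only a
// handful of CPU cycles per draw, versus a driver that validates and tracks
// state around every call.
#include <cstdint>
#include <cstdio>
#include <vector>

enum : uint32_t { OP_SET_PIPELINE = 1, OP_SET_VERTEX_BUFFER = 2, OP_DRAW = 3 };

struct CommandBuffer {
    std::vector<uint32_t> dwords;  // what the GPU front-end eventually parses

    void SetPipeline(uint32_t pipelineHandle) {
        dwords.push_back(OP_SET_PIPELINE);
        dwords.push_back(pipelineHandle);
    }
    void SetVertexBuffer(uint64_t gpuAddress) {
        dwords.push_back(OP_SET_VERTEX_BUFFER);
        dwords.push_back(static_cast<uint32_t>(gpuAddress));        // low 32 bits
        dwords.push_back(static_cast<uint32_t>(gpuAddress >> 32));  // high 32 bits
    }
    void Draw(uint32_t vertexCount, uint32_t firstVertex) {
        // The entire draw is three DWORDs appended to memory the CPU already owns.
        dwords.push_back(OP_DRAW);
        dwords.push_back(vertexCount);
        dwords.push_back(firstVertex);
    }
};

int main() {
    CommandBuffer cb;
    cb.SetPipeline(42);
    cb.SetVertexBuffer(0x200000000ull);
    for (uint32_t i = 0; i < 10000; ++i)
        cb.Draw(36, i * 36);  // no bookkeeping, hazard tracking or validation here
    std::printf("recorded %zu DWORDs for 10000 draws\n", cb.dwords.size());
    return 0;
}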
 
If simplified draw calls reduce CPU overhead and therefore memory contention, then that should increase GPU performance, based on his comments about main memory contention limiting both CPU and GPU performance.

Could some of these resolution bumps that the Xbox One is seeing actually be due in part to a less bottlenecked GPU, thanks to more main memory access?
 
He restates what we already knew about the coherent-memory PR awesome sauce we've heard over and over and over.

" The problem with unified memory is memory coherence. Even on consoles, where we see highly integrated SoCs (system on chips), we have the option to map the memory addresses ranges basically 'for CPU', 'for GPU' and 'fully coherent'. And being fully coherent is really not that useful as it wastes performance [bandwidth]. As for the traditional PC? Going through some kind of external bus just to snoop the caches - it will be really slow."
 
As for the traditional PC? Going through some kind of external bus just to snoop the caches - it will be really slow."
Just because coherence on the consoles is slow (i.e. external bus) does not mean it has to be that way. On an integrated CPU/GPU it really does not need to be any more expensive than multi-core coherence (which is not to say "free", but certainly not "really slow"); the GPU is just another participant. Even for discrete it's not as if we can't do cache coherence across sockets, although the trade-off may not be worth it for that particular case.

And as usual, coherence is not meant for thrashing a cache line around to different places at high frequencies... that is obviously always going to be slow. It *is* for efficiently migrating working sets when work gets rescheduled/stolen/forked, and compared to the alternative of always flushing that data out to DRAM and reading it back in, it is *far* more efficient!
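
As a hedged back-of-the-envelope illustration of that last comparison (every bandwidth and latency figure below is a placeholder picked only to show the shape of the argument, not a measurement of any real chip):

// Hedged sketch for the working-set-migration argument above. The numbers are
// placeholders, not measurements of any console or PC part.
#include <cstdio>

int main() {
    const double working_set_bytes = 256.0 * 1024.0;   // 256 KiB handed to another core/agent

    // Alternative 1: flush dirty lines to DRAM, then the consumer reads them back.
    const double dram_bw_bytes_per_us = 20e9 / 1e6;     // ~20 GB/s effective (placeholder)
    const double flush_plus_reload_us =
        2.0 * working_set_bytes / dram_bw_bytes_per_us; // write out + read back

    // Alternative 2: consumer misses and data is forwarded cache-to-cache on chip.
    const double c2c_bw_bytes_per_us = 100e9 / 1e6;     // on-die transfer rate (placeholder)
    const double snoop_overhead_us   = 1.0;             // lumped protocol latency (placeholder)
    const double coherent_migration_us =
        working_set_bytes / c2c_bw_bytes_per_us + snoop_overhead_us;

    std::printf("flush + reload via DRAM : ~%.1f us\n", flush_plus_reload_us);
    std::printf("coherent migration      : ~%.1f us\n", coherent_migration_us);
    return 0;
}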
 
Just because coherence on the consoles is slow and bad (i.e. external busses?)
The GPU has a bare-minimum cache hierarchy that statically links L2 cache slices to physical memory channels.
It's trivially coherent within the GPU, by architecturally disallowing multiple coherent copies.
The L1s that might have the same data in more than one place are not coherent. Accesses flagged to be coherent are mandatory misses to the L2 (which either has one copy or nothing) or main memory.

The L2 is not coherent to anything beyond the GPU, and the Onion+ bus enforces coherence by bypassing the GPU's cache hierarchy.


At least on an integrated CPU/GPU it really does not need to be any more expensive than multi-core coherence (which is not to say "free", but certainly not "really slow" in the general sense); the GPU is just another participant.
It would require a more robust memory subsystem for the GPU, as the console GPUs currently cannot participate as peers because their memory subsystem is incapable of performing the actions that are part of such a protocol. While they are at it, they'd probably have to replace their aged on-die interconnect, since even the bare-bones level of coherence with Onion+ is straining the CPU memory path.
 
It would require a more robust memory subsystem for the GPU, as the console GPUs currently cannot participate as peers because their memory subsystem is incapable of performing the actions that are part of such a protocol.
Sure, I'm not saying it would be trivial to do in the consoles or anything; I'm just rejecting the assertion that coherence fundamentally must be "really slow" (i.e. his last comment about the PC). I think it's clear that isn't the case.
 
Sure, I'm not saying it would be trivial to do in the consoles or anything; I'm just rejecting the assertion that coherence fundamentally must be "really slow" (i.e. his last comment about the PC). I think it's clear that isn't the case.

I think he is stating that for consoles it's a performance waster, and for the PC (I'm guessing traditional discrete setups) it's slow.
 
I think he is stating that for consoles it's a performance waster, and for the PC (I'm guessing traditional discrete setups) it's slow.

I was elaborating that it was slow on the consoles because they are architecturally incapable of performing the operations expected of a peer in a coherent cache hierarchy.

If it's a question of a GPU that did have caches with coherence states and proper integration into the protocol, it may then be a question of how much of a performance impact is considered close enough.
If the GPU is still architected to emphasize bandwidth and access coalescing, we'd have to look into what kind of long pipelines and queues are there that need to be modified to be snooped or required to drain. It would probably be slower in responding than a CPU peer, but whether it's close enough is fuzzy.

Some likely forward directions for heterogeneous systems at AMD, with directories and region-based coherence checks, hint that the answer for them is to accept that the GPU and CPU sides will remain far enough apart that it's not close enough, at least several gens out.
 
I was elaborating that it was slow on the consoles because they are architecturally incapable of performing the operations expected of a peer in a coherent cache hierarchy.

If it's a question of a GPU that did have caches with coherence states and proper integration into the protocol, it may then be a question of how much of a performance impact is considered close enough.
If the GPU is still architected to emphasize bandwidth and access coalescing, we'd have to look into what kind of long pipelines and queues are there that need to be modified to be snooped or required to drain. It would probably be slower in responding than a CPU peer, but whether it's close enough is fuzzy.

Some likely forward directions for heterogeneous systems at AMD, with directories and region-based coherence checks, hint that the answer for them is to accept that the GPU and CPU sides will remain far enough apart that it's not close enough, at least several gens out.

I was assuming Andy was referring to the 4A dev comments. The 4A devs don't seem to call out the consoles for being "slow" but rather wasteful, while the PC is singled out for being slow.
 
I read an old PowerPoint file from Microsoft about WDDM XP/v1.0/v2.0/v2.1, and it was interesting to see that Microsoft, AMD, Nvidia and Intel were talking about "Fine grained context switching" and "True preemptive multi-tasking" about 7-8 years ago. Also, IIRC DX12 requires WDDM 2.0 to work. So, considering AMD's roadmap for their dGPUs, are we going to see "Fine grained context switching" and "True preemptive multi-tasking" as DX12 feature levels? Or does AMD's roadmap have nothing to do with WDDM 2.0?

[Image: hsa-feature-roadmap.jpg - AMD HSA feature roadmap]
 
In circumstances using cartoonishly huge numbers of extremely small draw calls, yes.

In applications that aren't exotic tech demos, no.
Hmmmm, possibly. That's what I expected from tech demos, but in actual games some of that benefit should remain intact, although maybe it doesn't affect the Xbox One that much, and maybe it does.

The electricity bill doesn't worry me that much, but frames per second I do worry about. I really hate extremes, but in this case a power-consumption cut and an FPS boost at the same time is something I'd love to see happening.
 
DirectX 12 reduces power consumption by 50% and boosts frames per second by 60%. :oops: I wonder if that applies to the Xbox One as well.

It's very difficult to extrapolate from a deeply power-constrained test what a platform with almost 10x the power budget will do.
The workload is artificial, in that it is purposefully designed to highlight a heavily draw call-dependent workload that can be readily scaled and frame-limited. Because they aren't trying to simulate a fully fledged game, other sources of activity don't factor into system load.
Real workloads aren't going to be structured so that nothing obscures the measured tradeoff, and it's worth noting that the DX12 fixed-frame-rate demo only marginally lowers GPU power consumption while significantly lowering CPU power consumption.

This is for a 15W platform, where both the CPU and GPU should be very capable of exceeding the budget if it weren't for power-limiting.
The improvement in power in this case, besides some GPU efficiencies, is by no longer heavily loading one CPU core that is capable of eating a huge chunk of power budget all by itself.

For the console, a single Jaguar core cannot eat anything close to ~100W all by itself. The power improvement from easing CPU loading comes out of the maximum power that one or a small number of Jaguar cores can pull, and that's a lot when your ceiling is 15W, not so much when it's 100-120W.
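
A quick hedged arithmetic sketch of that scaling argument (the wattage numbers are illustrative placeholders, not measurements):

// Hedged sketch of the scaling argument above: the same absolute CPU-side power
// saving is a big share of a 15 W tablet budget and a small share of a ~100 W
// console budget. Numbers are illustrative, not measurements.
#include <cstdio>

int main() {
    const double cpu_power_saved_w = 4.0;               // hypothetical saving from lighter CPU load
    const double budgets_w[] = { 15.0, 100.0, 120.0 };  // tablet vs. console-class power ceilings

    for (double budget : budgets_w) {
        std::printf("%.0f W budget: %.1f W saved is %.0f%% of the ceiling\n",
                    budget, cpu_power_saved_w, 100.0 * cpu_power_saved_w / budget);
    }
    return 0;
}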
 
DirectX 12 reduces power consumption by 50% and boosts frames per second by 60%. :oops: I wonder if that applies to the Xbox One as well.
No.
In that specific scenario it's either reducing power consumption by 50% or boosting FPS by 60% at the same consumption, not both.
 
DirectX 12 reduces power consumption by 50% and boosts frames per second by 60%. :oops: I wonder if that applies to the Xbox One as well.
You can't reduce power draw and improve performance*. Performance was limited in the DX11 version because the CPU was maxed out, burning full watts. In the DX12 version, the CPU is running at less than maximum load, meaning less power draw. If you then add more stuff to push it back up to maximum load, it'll be using just as much power as before. Consoles are always hitting max load (in the big AAA titles), so there'll be no power savings.

* You could go partial, so less CPU use than 100% and still get better performance, but you know what I mean. ;)
 
It's very difficult to extrapolate from a deeply power-constrained test what a platform with almost 10x the power budget will do.
The workload is artificial, in that it is purposefully designed to highlight a heavily draw call-dependent workload that can be readily scaled and frame-limited. Because they aren't trying to simulate a fully fledged game, other sources of activity don't factor into system load.
Real workloads aren't going to be structured so that nothing obscures the measured tradeoff, and it's worth noting that the DX12 fixed-frame-rate demo only marginally lowers GPU power consumption while significantly lowering CPU power consumption.

This is for a 15W platform, where both the CPU and GPU should be very capable of exceeding the budget if it weren't for power-limiting.
The improvement in power in this case, besides some GPU efficiencies, is by no longer heavily loading one CPU core that is capable of eating a huge chunk of power budget all by itself.

For the console, a single Jaguar core cannot eat anything close to ~100W all by itself. The power improvement from easing CPU loading comes out of the maximum power that one or a small number of Jaguar cores can pull, and that's a lot when your ceiling is 15W, not so much when it's 100-120W.
Many thanks for the insight, as usual. You certainly know your stuff and have a good taste for these things.

I would just add that I simply hope those kinds of techniques were far more commonly used (or useful) on the Xbox One, and even the Wii U or PS4.

Other than that, I'm not sure any clear-cut distinction can actually be made in that regard, as the power consumption differs, but not so greatly compared to a mobile device.

You can't reduce power draw and improve performance*. Performance was limited in the DX11 version because the CPU was maxed out, burning full watts. In the DX12 version, the CPU is running at less than maximum load, meaning less power draw. If you then add more stuff to push it back up to maximum load, it'll be using just as much power as before. Consoles are always hitting max load (in the big AAA titles), so there'll be no power savings.

* You could go partial, so less CPU use than 100% and still get better performance, but you know what I mean.
Hmmmmm, interesting. I think in this case the power consumption should be used smartly, as some kind of dynamic-resolution feature. I would offer developers my congratulations for that outcome (if it were realised), as the culmination of good work in that department, but it seems like they use automatic "max performance settings" all of the time instead of a "smart power use" feature, which I think is being ignored nowadays, and it isn't noticeable in a very silent console like the Xbox One. But it's there.

I am afraid to say that while I get what you mean (like 99%), I am not sure I understand your word 'partial'. Do you mean that going partial could possibly mean better performance with less CPU use? Is that even possible? Some kind of dynamic resolution?

Cheers!
 