DirectX 12: Its future in the console gaming space (specifically the XB1)

I think there's a good chance that the same system/game split exists for both consoles.

I see benefits to having a reserved graphics front end because it isolates the system/app side from the game. This keeps the game from doing something that spams the GPU and causes the OS to suffer, and provides a predictable and complete performance budget for the game.
Having the system's graphics context already initialized would also avoid some heavier context switching and state initialization whenever switching between the game and system partitions.

In a virtualized system, having two front ends may enable less overhead in virtualization, especially since GPU virtualization is not as mature as it is for the CPUs.
 
Couldn't agree more with that list. Many very important features listed there.

3.10... "ballotAMD" :). For the people who don't see it right away (it wasn't mentioned in this article either), this allows you to do efficient intra-wave prefix sums (without atomics/serialization). Super useful for many things.
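Since the intra-wave prefix sum trick may not be obvious, here's a tiny host-side C++ sketch of the arithmetic a ballot-style vote enables (the names and the 64-lane width are purely illustrative, not any particular shading language): every lane contributes one bit to a shared mask, and an exclusive prefix sum of a predicate is then just a popcount of the bits below your lane.

```cpp
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Illustrative only: given the 64-bit mask a ballot-style instruction would
// return (bit i set if lane i's predicate is true), a lane's exclusive prefix
// count is a popcount of the bits below it. No atomics, no serialization.
static int exclusivePrefixCount(uint64_t ballotMask, unsigned lane) {
    const uint64_t lowerLanes = ballotMask & ((1ull << lane) - 1ull);
    return static_cast<int>(std::bitset<64>(lowerLanes).count());
}

int main() {
    // Example wave: lanes 0, 3, 5 and 9 voted "true".
    const uint64_t mask = (1ull << 0) | (1ull << 3) | (1ull << 5) | (1ull << 9);
    for (unsigned lane : {0u, 3u, 5u, 9u, 12u}) {
        std::printf("lane %2u -> exclusive prefix %d\n",
                    lane, exclusivePrefixCount(mask, lane));
    }
    // e.g. lane 9 sees three earlier "true" lanes, so it could write to slot 3
    // of a compacted output buffer without touching an atomic counter.
    return 0;
}
```

That popcount-of-lower-lanes pattern is the building block behind fast in-wave stream compaction and scans, which is why exposing it as an intrinsic (rather than going through LDS atomics) matters so much.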

Quote from the OpenGL 5 candidate feature list:

A good example of this is shadow map rendering. It is bound by fixed-function hardware (ROPs and primitive engines) and uses a very small amount of ALU (simple vertex shader) and very little bandwidth (compressed depth buffer output only; it reads size-optimized vertices that don't have UVs or tangents). This means that all the TMUs and the huge majority of the ALUs and bandwidth are just idling around while shadows get rendered. If you, for example, execute your compute shader based lighting simultaneously with shadow map rendering, you get it practically for free. The funny thing is that if this gets common, we will see games that throttle more than Furmark, since current GPU cooling designs just haven't been designed for constant near-100% GPU usage (all units doing productive work all the time).
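As a purely illustrative aside on what that overlap looks like from the application side: with an explicit API in the style of the announced D3D12 (a hedged sketch, not the console API; `device`, `shadowCmdList` and `lightingCmdList` are assumed to exist elsewhere), the shadow pass and the compute lighting simply go to two different queues and the hardware is free to run them together.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Minimal sketch: one direct (graphics) queue for the shadow-map pass and one
// compute-only queue for the lighting pass, so the compute work can occupy the
// ALUs while the shadow pass mostly exercises the ROPs/primitive units.
// In a real application the queues would be created once at startup;
// error handling is omitted for brevity.
void SubmitOverlappedWork(ID3D12Device* device,
                          ID3D12GraphicsCommandList* shadowCmdList,
                          ID3D12GraphicsCommandList* lightingCmdList)
{
    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;

    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute/copy only
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // Submit both workloads; with no fence between them the hardware is free
    // to run the compute lighting alongside the shadow rendering.
    ID3D12CommandList* shadow[]   = { shadowCmdList };
    ID3D12CommandList* lighting[] = { lightingCmdList };
    gfxQueue->ExecuteCommandLists(1, shadow);
    computeQueue->ExecuteCommandLists(1, lighting);

    // A fence is only needed where a real dependency exists, e.g. compute work
    // submitted later that consumes the freshly drawn shadow map.
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    gfxQueue->Signal(fence.Get(), 1);    // shadow map done
    computeQueue->Wait(fence.Get(), 1);  // subsequent compute submissions wait here
}
```

Whether a given GPU actually co-issues the two streams, and how well, is entirely down to the hardware scheduler; the API only stops forcing the serialization.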
I can't wait for these new techniques to come out and be used in games, where shadows are usually very expensive and often poor looking too. (It's almost impossible to find a game that casts shadows correctly; just compare driving a car at night in real life with night racing in a game.)

I hope we can emulate shadows in real life. For now, shadows kinda make me sad.

Like the poem goes :smile::

Home is behind
The world ahead
And there are many paths to tread
Through shadow
To the edge of night
Until the stars are all alight

Mist and shadow
Cloud and shade
All shall fade
All shall fade
 
In a virtualized system, having two front ends may enable less overhead in virtualization, especially since GPU virtualization is not as mature as it is for the CPUs.

hmm... GPU virtualization should be way easier than CPU virtualization. Each kernel automatically has a full context of its own, and the only place where you need arbitration/virtualization is mainly memory access/globals. I see it as much less troublesome than the CPU case. Where do you see problems?
 
From what I can see in this quote (from EG/DF):



XB1 benefits from what sebbbi said in his post right now, but the difference is that in this example the system is using the ROPs for rendering and the title is using the CUs for synchronous compute operations.



But I have no idea about the two graphics command processors on XB1. PS4 has two graphics command processors like XB1, but one of them is exclusive to the system (VShell) and has no compute capabilities, unlike the other one, which is exclusive to games. On XB1, however, it seems that both graphics command processors are the same, so it should be possible to use both of them for games (while the system doesn't need them).



Actually I posted the official slide, not him. ;)

Given that Sony explicitly noted the difference between the pipes and MS didn't, that implies that for the Xbone they are identical and could be active simultaneously. If that is true, and if a substantial portion of the cycles can benefit from this dual-use scenario, then ole Albert's comments make a lot more sense (i.e. scenarios where a system with less raw flops outperforms in real-world output).
 
hmm... GPU virtualization should be way easier than CPU virtualization. Each kernel automatically has a full context of its own, and the only place where you need arbitration/virtualization is mainly memory access/globals. I see it as much less troublesome than the CPU case. Where do you see problems?

The GPU as a device has historically not been well-virtualized, which is why only some of the most recent architectures have started touting virtualized grid products.
The compute portion of the GPUs is stateless, but there is global state for the device and fixed-function contexts, including the control registers and system variables for the device.
The command processors and their system state are the bottleneck.

Nvidia's cloud cards have multiple GPUs on a board, with one user per chip.
AMD has a straightforward passthrough mode and a more complex method of hosting up to four users per card.
Managing multiple clients for a single device apparently incurs significant penalties in performance and provides non-uniform latencies due to having to go through a virtual device driver. The restricted 1:1 passthrough allows for more direct access.

The game and app partitions would appear to be different clients to the GPU, but you'd want the responsiveness and performance of a passthrough mode. If there are dedicated front ends for each partition, then there's direct access for each side.
 
Given that Sony explicitly noted the difference between the pipes and MS didn't, that implies that for the Xbone they are identical and could be active simultaneously. If that is true, and if a substantial portion of the cycles can benefit from this dual-use scenario, then ole Albert's comments make a lot more sense (i.e. scenarios where a system with less raw flops outperforms in real-world output).

I see no reason why the HP pipe and the standard pipe on the PS4 cannot be running at the same time; in fact, it seems like the entire point of having two pipes is to run them at the same time, to provide the system with low-latency rendering capability whilst the game is still running on the other pipe.

Also, the pipes are not the same on the XB1; read the snippet of the interview I posted, it states that one pipe is for rendering system content and the other is for title content.
 
There isn't a technical limitation which would prevent that. It's only that the currently defined system reservation possibly provides only one of them to the title. However, it's also believed the current system reservation parameters are being revised; just not sure when or how.
 
There isn't a technical limitation which would prevent that. It's only that the currently defined system reservation possibly provides only one of them to the title. However, it's also believed the current system reservation parameters are being revised; just not sure when or how.

The entire point of the second pipe is that the games don't use it. Its existence is purely to render system content; if the game suddenly starts using it then it serves no real point, as the system then has to wait for the second pipe's content to finish, which would cause a delay in system content being rendered, exactly what they are trying to avoid.
 
The compute portion of the GPUs is stateless, but there is global state for the device and fixed-function contexts, including the control registers and system variables for the device.
The command processors and their system state are the bottleneck.
...
The game and app partitions would appear to be different clients to the GPU, but you'd want the responsiveness and performance of a passthrough mode. If there are dedicated front ends for each partition, then there's direct access for each side.

I see your point about GPU registers that are set in KMD only, but still, from a UMD point of view, as long as:

* your memory addresses are virtualized
* your command buffer queue(s) is in UMD and properly isolated
* GPU front-end is capable of not mixing different queues' data accesses (i.e. fingerprinting with a sort of virtual GPU id)
* your GDS/LDS is subject to virtual memory addressing constraints.

...where would you escape? Or am I missing your point?
 
I see no reason why the HP pipe and the standard pipe on the PS4 cannot be running at the same time; in fact, it seems like the entire point of having two pipes is to run them at the same time, to provide the system with low-latency rendering capability whilst the game is still running on the other pipe.

Also, the pipes are not the same on the XB1; read the snippet of the interview I posted, it states that one pipe is for rendering system content and the other is for title content.

Read it again:

The two render pipes can allow the hardware to render title content at high priority while concurrently rendering system content at low priority. The GPU hardware scheduler is designed to maximise throughput and automatically fills "holes" in the high-priority processing.

It's talking about high/low priority processing/rendering, not high/low priority pipes. So we have three possibilities here:

1) Pipes work together on high/low priority processing.
2) One of them is exclusively for high priority processing and the other one is exclusively for low priority processing (with the system being one of them).
3) One of them is exclusively for title (with high priority processing) and the other one is exclusively for system (with low priority processing).

Edit: I was reading an older thread about PS4 and it seems to me that their vision is a bit different. The graphics command processors on PS4 can work concurrently, but the HP pipe uses reserved CUs for this purpose and the standard pipe uses the other CUs. This is very different from filling the holes in higher-priority processing on XB1. Maybe it's possible to have such functionality on PS4 too (I don't know), but its current functionality seems different from XB1's. Or maybe I'm wrong.
 
The entire point of the second pipe is that the games don't use it. Its existence is purely to render system content; if the game suddenly starts using it then it serves no real point, as the system then has to wait for the second pipe's content to finish, which would cause a delay in system content being rendered, exactly what they are trying to avoid.

And perhaps that was exactly their reasoning when they originally defined the system reservation. We don't know. My point is, they could just as easily decide to live with a potential 33ms delay and provide access to those hardware resources to the title for a given system state, the same as they have been doing in every previous generation of console. There is nothing which prevents that.
 
XB1 benefits from what sebbbi said in his post right now, but the difference is that in this example the system is using the ROPs for rendering and the title is using the CUs for synchronous compute operations.

As I see it, both sebbbi's and the Digital Foundry examples are referring to the use of one graphics command processor (for the ROPs/shadow work) and one Compute Command Processor for the compute-based work, and so it should be no surprise that the XB1 is capable of that since, as sebbbi said, all modern GPUs have that hardware capability; it's just not exposed in DX11 (but likely would be in DX11.X).

So doing this shouldn't require two graphics command processors anyway (if that's what's being suggested).

Obviously, utilising spare compute resources via the compute queues when the rendering side of things is leaving CUs idle is a good thing, but is there a suggestion that there is also some benefit to having multiple graphics command processors for rendering the game as well? If so, do we have any examples of how that would work?
 
Read it again:
Edit: I was reading an older thread about PS4 and it seems to me that their vision is a bit different. The graphics command processors on PS4 can work concurrently, but the HP pipe uses reserved CUs for this purpose and the standard pipe uses the other CUs. This is very different from filling the holes in higher-priority processing on XB1. Maybe it's possible to have such functionality on PS4 too (I don't know), but its current functionality seems different from XB1's. Or maybe I'm wrong.

So this would mean that X1 could reduce or eliminate the GPU reservation for OS/system tasks?

This could have something to do with the awaited 10% increase in GPU power (due to the Kinect reservation)?
 
* your memory addresses are virtualized
Client accesses to IO memory and mapped registers would eventually go to the exact same address and control registers for the actual device and global context. That would require some intelligence on the GPU side (uncertain) or a virtual driver time-sharing things to handle it.
* GPU front-end is capable of not mixing different queues' data accesses (i.e. fingerprinting with a sort of virtual GPU id)
My suspicion is that it's not, given how inflexible virtualized GPU solutions are.
GPU command processors are not on a very rapid evolutionary path.
Given the high performance penalty of non-passthrough AMD GPU virtualization, it's likely halting, flushing queues, and spinning up another context.

* your GDS/LDS is subject to virtual memory addressing constraints.
Perhaps in flat addressing mode they do. Allocations that are physically in on-die memory have their own internal addressing and bounds registers to protect access.
That doesn't help as much if the front end that sets their base registers and bounds is not similarly aware or doesn't host duplicated structures to house the control values of separate clients, however.
Duplicate front ends that don't need to talk to each other would be easier to isolate.
 
As I see it, both sebbbi's and the Digital Foundry examples are referring to the use of one graphics command processor (for the ROPs/shadow work) and one Compute Command Processor for the compute-based work, and so it should be no surprise that the XB1 is capable of that since, as sebbbi said, all modern GPUs have that hardware capability; it's just not exposed in DX11 (but likely would be in DX11.X).

So doing this shouldn't require two graphics command processors anyway (if that's what's being suggested).

Obviously, utilising spare compute resources via the compute queues when the rendering side of things is leaving CUs idle is a good thing, but is there a suggestion that there is also some benefit to having multiple graphics command processors for rendering the game as well? If so, do we have any examples of how that would work?

There is a lot of room for tasks parallelism in a GPU but the idea of submitting draws from multiple threads in parallel simply doesn’t make any sense from the GPU architectures at this point. Everything will need to be serialized at some point and if applications don’t do it, the driver will have to do it. This is true until GPU architectures add support for multiple command processors which is not unrealistic in the future.

http://www.g-truc.net/doc/Candidate features for OpenGL 5.pdf (page 32)

I thought it was talking about graphics command processors; you can read the whole page and correct me. I am here to learn.

Also, the Digital Foundry example talks about synchronous compute operations. Is it possible to use the asynchronous compute engines/queues for synchronous compute operations? Is there no difference between them?!

So this would mean that X1 could reduce or eliminate the GPU reservation for OS/system tasks?

This could have something to do with the awaited 10% increase in GPU power (due to the Kinect reservation)?

As far as I know that's the plan:

Andrew Goossen: One thing to keep in mind when looking at comparative game resolutions is that currently the Xbox One has a conservative 10 per cent time-sliced reservation on the GPU for system processing. This is used both for the GPGPU processing for Kinect and for the rendering of concurrent system content such as snap mode. The current reservation provides strong isolation between the title and the system and simplifies game development (strong isolation means that the system workloads, which are variable, won't perturb the performance of the game rendering). In the future, we plan to open up more options to developers to access this GPU reservation time while maintaining full system functionality.

http://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview
 
I found an interesting patent from AMD (30 May 2013) which explains more or less what we have been discussing about command processors. After reading this patent, I think both graphics command processors on XB1 can work together on high/low priority processing and there will be no problem with context switching or QoS on XB1.

http://www.google.com/patents/US20130135327
 
Is there a suggestion that there is also some benefit to having multiple graphics command processors for rendering the game as well? If so, do we have any examples of how that would work?
Maybe put tasks like occlusion queries in a high priority queue. This assumes they can be rendered out of order relative to some other graphics tasks, which is likely game-specific.
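Purely as an illustration of what "a high priority queue" could mean through an explicit PC-style API (a hedged D3D12-flavoured sketch; the struct and function names here are assumptions, and none of this is the console toolchain), queue priority is just a field in the queue description:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical setup: the bulk of the frame goes to a normal-priority direct
// queue, while small latency-sensitive graphics work (e.g. occlusion queries
// the next frame's culling depends on) goes to a high-priority direct queue.
struct RenderQueues {
    ComPtr<ID3D12CommandQueue> mainQueue;      // scene rendering
    ComPtr<ID3D12CommandQueue> highPrioQueue;  // occlusion queries, UI, etc.
};

RenderQueues CreateQueues(ID3D12Device* device)
{
    RenderQueues q;

    D3D12_COMMAND_QUEUE_DESC mainDesc = {};
    mainDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
    mainDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    device->CreateCommandQueue(&mainDesc, IID_PPV_ARGS(&q.mainQueue));

    D3D12_COMMAND_QUEUE_DESC hpDesc = {};
    hpDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
    hpDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    device->CreateCommandQueue(&hpDesc, IID_PPV_ARGS(&q.highPrioQueue));

    return q;
}
```

Whether two such queues map onto two hardware command processors or are simply multiplexed by the scheduler, and how aggressively the high-priority work displaces the other stream, is exactly the part that isn't visible at this level.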
 
I found an interesting patent from AMD (30 May 2013) which explains more or less what we have been discussing about command processors. After reading this patent, I think both graphics command processors on XB1 can work together on high/low priority processing and there will be no problem with context switching or QoS on XB1.

http://www.google.com/patents/US20130135327

Interesting read, thanks. This is all way over my head and I only skimmed the patent, but it seemed to me to be talking about multiple jobs being sent through the compute pipeline rather than the graphics pipeline. This section in particular caught my eye:

Since graphics pipeline 162 is generally a fixed function pipeline, it is difficult to save and restore its state, and as a result, the graphics pipeline 162 is difficult to context switch. Therefore, in most cases context switching, as discussed herein, does not pertain to context switching among graphics processes. An exception is for graphics work in shader core 122, which can be context switched.

I can understand how all the SPs in the shader core can be fully utilised by a combination of the one graphics pipeline task and multiple compute pipeline tasks, which is what I took sebbbi's example to refer to and is also what I think Sony have been referring to in the past with asynchronous compute. But I don't think this patent is referring to multiple simultaneous graphics pipeline tasks, unless I'm reading it incorrectly.

3dcgi said:
Maybe put tasks like occlusion queries in a high priority queue. This assumes they can be rendered out of order relative to some other graphics tasks, which is likely game-specific.

Thanks. So traditionally, I guess, a task would travel through the different blocks of the graphics pipeline one at a time in sequence, meaning that while the first blocks are active, the later blocks (e.g. ROPs) would be idle? And so the idea of multiple graphics command processors would, in theory, be to keep as many of the fixed-function elements of the graphics pipeline busy simultaneously as possible?
 
The Eurogamer article is talking about synchronous compute operations on the compute units (shader cores). The wording on Eurogamer leads me to think that they can do such a thing on XB1 because of the 2nd graphics command processor, not the ACEs:

To facilitate this, in addition to asynchronous compute queues, the Xbox One hardware supports two concurrent render pipes. The two render pipes can allow the hardware to render title content at high priority while concurrently rendering system content at low priority. The GPU hardware scheduler is designed to maximise throughput and automatically fills "holes" in the high-priority processing. This can allow the system rendering to make use of the ROPs for fill, for example, while the title is simultaneously doing synchronous compute operations on the Compute Units.
http://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview

I searched the internet heavily to find a good explanation of the ACEs and whether they are good for graphics (shader compute for graphics purposes) or not. From sebbbi's post it seems that it is possible to do such a thing on current PC GPUs, and both XB1 and PS4 are based on PC GPUs, so all of them should have the same ability around this concept:

Quote from the OpenGL 5 candidate feature list:

A good example of this is shadow map rendering. It is bound by fixed-function hardware (ROPs and primitive engines) and uses a very small amount of ALU (simple vertex shader) and very little bandwidth (compressed depth buffer output only; it reads size-optimized vertices that don't have UVs or tangents). This means that all the TMUs and the huge majority of the ALUs and bandwidth are just idling around while shadows get rendered. If you, for example, execute your compute shader based lighting simultaneously with shadow map rendering, you get it practically for free. The funny thing is that if this gets common, we will see games that throttle more than Furmark, since current GPU cooling designs just haven't been designed for constant near-100% GPU usage (all units doing productive work all the time).

Then again, the article (a part of which sebbbi quoted in his post) suggests that this functionality isn't possible on current PC GPUs.

Southern Islands also has 2 Asynchronous Compute Engines (ACE). These engines allow efficient multi-tasking with independent scheduling and workgroup dispatch. These engines can run in parallel with the Draw Engine without any form of contention. OpenCL 1.2 exposes them with something called device partitioning. Sea Islands raised the number of ACE to 8.

There is a lot of room for tasks parallelism in a GPU but the idea of submitting draws from multiple threads in parallel simply doesn’t make any sense from the GPU architectures at this point. Everything will need to be serialized at some point and if applications don’t do it, the driver will have to do it. This is true until GPU architectures add support for multiple command processors which is not unrealistic in the future.

For example, having multiple command processors would allow rendering shadows at the same time as filling G-Buffers or shading the previous frame. Having such drastically different tasks live on the GPU at the same time could make a better usage of the GPU as both tasks will probably have different hardware bottleneck.
http://www.g-truc.net/doc/Candidate features for OpenGL 5.pdf (page 32)

While the article mentions the ACEs, it says that adding support for multiple command processors would make it possible to submit draws from multiple threads in parallel (so I conclude that the article is talking about graphics command processors). At the end it gives us an example. Maybe each of these examples (which look the same to me) has some differences in detail from the others; we need more informed people to discuss it further. I just tried to explain my confusion to you.
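For what it's worth, the "everything will need to be serialized at some point" part of that quote is easy to picture with an explicit API: command lists can be recorded on any number of CPU threads, but they still funnel into a single ordered submission on a single queue, i.e. one graphics command processor. A hedged D3D12-style sketch (recording details such as PSOs and barriers omitted; `device` and `gfxQueue` are assumed to exist elsewhere):

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch only: each worker thread records its own command list (CPU-side work,
// fully parallel). Submission, however, is a single call on a single queue,
// feeding one graphics command processor; that is the serialization point the
// quoted article is talking about.
void RecordAndSubmitFrame(ID3D12Device* device, ID3D12CommandQueue* gfxQueue)
{
    const unsigned kWorkers = 4;
    std::vector<ComPtr<ID3D12CommandAllocator>>    allocators(kWorkers);
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(kWorkers);
    std::vector<std::thread> threads;

    for (unsigned i = 0; i < kWorkers; ++i) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocators[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i].Get(), nullptr,
                                  IID_PPV_ARGS(&lists[i]));
        threads.emplace_back([&, i] {
            // Each thread would record its slice of the frame's draws here.
            lists[i]->Close();
        });
    }
    for (auto& t : threads) t.join();

    // The serialization point: one ordered submission to one queue.
    std::vector<ID3D12CommandList*> submit;
    for (auto& l : lists) submit.push_back(l.Get());
    gfxQueue->ExecuteCommandLists(static_cast<UINT>(submit.size()), submit.data());
}
```

A second graphics front end, as speculated for the XB1, would in principle allow two such independent submission streams, which is precisely what a single queue cannot express.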

Interesting read, thanks. This is all way over my head and I only skimmed the patent, but it seemed to me to be talking about multiple jobs being sent through the compute pipeline rather than the graphics pipeline. This section in particular caught my eye

You're welcome. ;)

Both the graphics and compute pipelines are able to use the shader cores for compute. Compute work on the graphics command processor is semi-synchronous with the graphics pipeline.

Although only a small amount of data may be provided as an input to graphics pipeline 162, this data will be amplified by the time it is provided as an output from graphics pipeline 162. Graphics pipeline 162 also includes DC 166 for counting through ranges within work-item groups received from CP pipeline 124 a. Compute work submitted through DC 166 is semi-synchronous with graphics pipeline 162.
Also, the patent talks about simultaneously launching two or more tasks to resources within the APD.

A disruption in the QoS occurs when all work-items are unable to access APD resources. Embodiments of the present invention facilitate efficiently and simultaneously launching two or more tasks to resources within APD 104, enabling all work-items to access various APD resources. In one embodiment, an APD input scheme enables all work-items to have access to the APD's resources in parallel by managing the APD's workload. When the APD's workload approaches maximum levels, (e.g., during attainment of maximum I/O rates), this APD input scheme assists in that otherwise unused processing resources can be simultaneously utilized in many scenarios. A serial input stream, for example, can be abstracted to appear as parallel simultaneous inputs to the APD.
http://www.google.com/patents/US20130135327
 