DirectX 12: The future of it within the console gaming space (specifically the XB1)

Thanks. So traditionally, I guess, a task would travel through the different blocks of the graphics pipeline one at a time in sequence, meaning that while the first blocks are active the later blocks (e.g. the ROPs) would be idle? And so the idea of multiple graphics command processors would, in theory, be to keep as much of the fixed function graphics pipeline busy simultaneously as possible?
Not quite. In a typical pipeline the oldest draw command enters the top of the pipe and is immediately followed by the second draw. Multiple draws can be in flight at the same time, so depending on the size of a draw command the ROPs will be working on either the same draw as the geometry blocks or a different one.

The front end of a GPU can have a backlog of work, as the game/driver is often ahead of the GPU. If the game decides to perform a high-priority operation it might wish to skip to the front of the line.

It's a different situation from async compute which can more efficiently fill the shaders because it bypasses the fixed function graphics pipeline.

Edit: Note that I'm briefly describing a second graphics queue or command processor. If portions of the graphics pipe are duplicated the results might be different.
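To make that concrete, here is a minimal sketch of creating a second, high-priority direct (graphics) queue with the D3D12 API this thread is discussing. The priority is only a hint to the scheduler; whether hardware actually lets work jump the line is implementation-specific, and 'device' is assumed to exist.

Code:
// Minimal sketch (D3D12, assuming an existing ID3D12Device* device):
// a normal graphics queue plus a second, high-priority one that the
// game can use for work it wants to "skip to the front of the line".
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> CreateDirectQueue(ID3D12Device* device, INT priority)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;  // graphics-capable queue
    desc.Priority = priority;
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Usage:
//   auto mainQueue = CreateDirectQueue(device, D3D12_COMMAND_QUEUE_PRIORITY_NORMAL);
//   auto hiQueue   = CreateDirectQueue(device, D3D12_COMMAND_QUEUE_PRIORITY_HIGH);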
 

Cheers, that's a bit clearer now.
 
While the article mentions ACEs, it says that adding support for multiple command processors will make it possible to submit draws from multiple threads in parallel (so I conclude that the article is talking about graphics command processors). At the end it gives us an example. Maybe each of these examples (which look the same to me) differs from the others in some details; we need informed people to discuss it further. I just tried to explain my confusion to you.

Thanks, yes, when you put it that way the article does seem to be talking about multiple graphics command processors (not possible with current PC hardware), as opposed to sebbbi's example of using the graphics pipeline in combination with the compute pipeline (which would be possible with current PC hardware if it were exposed by the API).
 

So, the article was talking about using idle fixed function hardware in the graphics pipeline for purposes other than rendering the main frame (by using more than one graphics command processor), while sebbbi's example was about using compute shaders for lighting (compute) alongside shadow map rendering (which uses the fixed function hardware)? Right?

Assuming that my understanding is correct, the first question is which solution is better, and by what margin? And the next question: would the first example be doable just by adding more graphics command processors, or does the current fixed function hardware's design problem with context switches remain to be solved?

Thanks.
 

If that's correct then I don't see why it would be an either/or situation. Surely you could have multiple graphics command processors submitting jobs to the graphics pipeline and multiple ACEs submitting jobs to the compute pipeline at the same time?

Although if one had to say which is going to bring more benefit, it seems likely that multiple compute tasks would be the winner, given the nature of the huge homogeneous shader array, the relative ease of context switching, and the simple fact that GPUs developed that option first, suggesting it's the lower hanging fruit.
 
If that's correct then I don't see why it would be an either/or situation. Surely you could have multiple graphics command processors submitting jobs to the graphics pipeline and multiple ACEs submitting jobs to the compute pipeline at the same time?

I don't know whether it's possible to submit jobs to each fixed function hardware block in the graphics pipeline separately (without interfering with the other blocks' jobs) or not. So I thought maybe just adding more graphics command processors isn't enough to do such a thing.

Although if one had to say which is going to bring more benefit, it seems likely that multiple compute tasks would be the winner, given the nature of the huge homogeneous shader array, the relative ease of context switching, and the simple fact that GPUs developed that option first, suggesting it's the lower hanging fruit.

Please correct me if I'm wrong, but as far as I've learned, compute shaders can be used for some graphics-related jobs like lighting (and of course GPGPU), but it seems that even using compute shaders for this purpose needs to be synchronous with the graphics pipeline (I'm not sure, but for example you can see DICE's technique here). ACE jobs/tasks, on the other hand, are asynchronous with the graphics pipeline, so they are good for jobs that aren't directly related to graphics, like "post-processing effects, decompression, collision detection, ray casting for audio, physics simulation". My understanding is that only the compute pipe of the graphics command processor, which is semi-synchronous with the graphics pipeline, can be used for compute shaders (for graphics purposes).

So considering all I said, and assuming that I'm correct, what the article suggests is that you will have more freedom over GPU resources and can use them as you want. If graphics is the main goal you can go for that, and if GPGPU or asynchronous compute is the problem, you can use that (or even use both at the same time). You can bend your resources as you want; there will be no barrier.

I reached this point of view by reading this article:
http://gearnuke.com/mantle-to-bring-ps4-asynchronous-fine-grain-compute-to-pc/
Both Cerny and Katsman are talking about asynchronous compute running in parallel with the normal graphics work.
 
Async compute is useful for graphics. You can run these jobs while unrelated graphics work is done using the graphics pipe. At some point the game might synchronize if the compute work isn't done when its output is needed.

So async compute starts and finishes asynchronously compared to normal graphics rendering, but that doesn't mean the application can't use it for graphics.

DX11 doesn't support async compute so it wasn't available to DICE when they developed BF3.
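To make the synchronization point above concrete, here's a hedged sketch under D3D12, where async compute is exposed. The queue, fence, and command-list variables are assumed to exist; the key point is that the compute queue signals a fence when the job is done, and the graphics queue only waits on it, GPU-side, at the point where it consumes the results.

Code:
// Sketch: async compute job overlapping graphics work, with a GPU-side
// fence wait only where the output is consumed ('device', 'computeQueue',
// 'graphicsQueue' and the command lists are assumed to exist).
ComPtr<ID3D12Fence> fence;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
const UINT64 kComputeDone = 1;

// Kick the compute job (e.g. a particle simulation) on the compute queue.
ID3D12CommandList* compute[] = { computeList.Get() };
computeQueue->ExecuteCommandLists(1, compute);
computeQueue->Signal(fence.Get(), kComputeDone);

// Graphics keeps rendering without waiting (e.g. shadow maps)...
ID3D12CommandList* shadows[] = { shadowList.Get() };
graphicsQueue->ExecuteCommandLists(1, shadows);

// ...and only the pass that reads the compute output waits. This is a
// GPU-side wait; the CPU never blocks here.
graphicsQueue->Wait(fence.Get(), kComputeDone);
ID3D12CommandList* lighting[] = { lightingList.Get() };
graphicsQueue->ExecuteCommandLists(1, lighting);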
 
Yup. Async compute provides a large number of opportunities to better utilize a GPU's compute units. There are numerous points in a rendering pipeline where the compute units go underutilized, the best example being shadow map rendering. For example, while rendering shadow maps you could be running particle simulations.

Being able to kick off a compute job while other tasks process can effectively make the compute job close to 'free'. The only catch is you have to ensure the GPU will sync the compute job before it tries to use the results, but that's pretty trivial.

Something I'm keen to investigate is that most post processing occurs as compute shaders, which could be run in parallel with the z-prepass and shadow map generation of the following frame.
 
You'd get even bigger gains if you run your whole lighting pipeline asynchronously (deferred rendering with compute shader based lighting). G-buffer rendering is also often fill bound (two render targets require 2 ROP cycles/pixel, four RTs require 4 ROP cycles/pixel), or in some cases (super tight g-buffer setups) the g-buffer stage can even be primitive setup bound. You should do asynchronous lighting during the g-buffer rendering as well.

Your example of interleaving operations of two frames is the straightforward "easy" case. However, if you can't afford the extra frame of latency, you can split the render target into large tiles (for example four tiles). This way you can be lighting and post processing tile N while you simultaneously rasterize tile N+1's g-buffer.
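A rough sketch of that tiled scheme, under the same D3D12 assumptions as the earlier snippets. 'RecordGBufferPass' and 'RecordLightingDispatch' are hypothetical placeholders for engine code, and command allocator/reset management is omitted for brevity.

Code:
// Light tile N on the compute queue while tile N+1's g-buffer is being
// rasterized on the graphics queue.
const int kNumTiles = 4;
ComPtr<ID3D12Fence> tileFence;   // value i+1 means tile i's g-buffer is done
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&tileFence));

for (int tile = 0; tile < kNumTiles; ++tile)
{
    // Graphics queue: rasterize this tile's slice of the g-buffer.
    RecordGBufferPass(gfxList.Get(), tile);          // hypothetical
    ID3D12CommandList* g[] = { gfxList.Get() };
    graphicsQueue->ExecuteCommandLists(1, g);
    graphicsQueue->Signal(tileFence.Get(), tile + 1);

    if (tile > 0)
    {
        // Compute queue: light the previous tile as soon as its g-buffer
        // is complete; this overlaps with the rasterization above.
        computeQueue->Wait(tileFence.Get(), tile);
        RecordLightingDispatch(cmpList.Get(), tile - 1); // hypothetical
        ID3D12CommandList* c[] = { cmpList.Get() };
        computeQueue->ExecuteCommandLists(1, c);
    }
}
// Finally, light the last tile the same way once its fence value lands.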
 
Point is: anything already running on XB1 is probably capable of any feature DX12 will allow other PCs to do (unless the feature is terribly inefficient on the XB1 hardware, but then DX12 won't help either). And XB1 probably does it more "efficiently" than any PC running DX12, being more bare-bones than DX12.

But then, if the API running on XB1 is shitty as hell, I could envision DX12 helping, but that's assuming Microsoft is more fail than we can possibly imagine in developing the XB1.

They already said at GDC that not every feature is available on Xbox One right now, but they are coming in future software updates. Some are already in the current Xbox API, but it is still DX11.1/11.2 based. That means maybe less overhead, but not at a DX12 level. Also, no multithreaded draw calls so far.
 
What sort of DX11 overheads does XB1 have, and why?

Features that aren't on XB1:

1) Pipeline State Objects (PSOs). (See the sketch after this list.)

2) Resource Binding.

These features will be available later on XB1.
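For illustration, a minimal sketch of what a PSO looks like in PC D3D12 (using the d3dx12.h CD3DX12_* helpers; 'rootSignature', 'vsBlob', 'psBlob' and 'inputElements' are assumed to exist). All the state that DX11 sets through separate rasterizer/blend/depth objects is baked into one immutable object up front, so the driver can validate and compile it once instead of at draw time.

Code:
// Sketch: building a Pipeline State Object.
D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
desc.pRootSignature        = rootSignature.Get();
desc.VS                    = { vsBlob->GetBufferPointer(), vsBlob->GetBufferSize() };
desc.PS                    = { psBlob->GetBufferPointer(), psBlob->GetBufferSize() };
desc.RasterizerState       = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
desc.BlendState            = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
desc.DepthStencilState     = CD3DX12_DEPTH_STENCIL_DESC(D3D12_DEFAULT);
desc.SampleMask            = UINT_MAX;
desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
desc.NumRenderTargets      = 1;
desc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count      = 1;
desc.InputLayout           = { inputElements, _countof(inputElements) };

ComPtr<ID3D12PipelineState> pso;
device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));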
 
I asked some questions of Christophe Riccio, and fortunately he answered them:

mosen said:
Hi,

In your article (page 32) you wrote:

“Southern Islands also has 2 Asynchronous Compute Engines (ACE). These engines allow efficient multi-tasking with independent scheduling and workgroup dispatch. These engines can run in parallel with the Draw Engine without any form of contention. OpenCL 1.2 exposes them with something called device partitioning. Sea Islands raised the number of ACE to 8.

There is a lot of room for tasks parallelism in a GPU but the idea of submitting draws from multiple threads in parallel simply doesn't make any sense from the GPU architectures at this point. Everything will need to be serialized at some point and if applications don’t do it, the driver will have to do it. This is true until GPU architectures add support for multiple command processors which is not unrealistic in the future. ”

And you said at the end of the next paragraph that“Having such drastically different tasks live on the GPU at the same time could make a better usage of the GPU as both tasks will probably have different hardware bottleneck.”.

Can you explain more about your vision? Were you trying to say we could use idle ALUs/resources for free, like what is now possible with async compute (Mantle/BF4)? Or will there be new possibilities/rendering techniques, alongside more efficiency, from having more command processors on GPUs?

And in your vision, does adding more command processors to GPUs need fundamental changes to the fixed function hardware in the graphics pipeline, or not?

I would appreciate it if you could answer my questions.

And here is his response:

Christophe Riccio said:
I don't think it would allow new rendering techniques, but it could allow things we can already do in CrossFire / SLI, like rendering two frames simultaneously. It could also allow rendering independent passes simultaneously, which could provide better utilization of the hardware, as each rendering pass typically has a different GPU hardware bottleneck; for example, if we could do the shadow rendering and some of the shading simultaneously.

I am not completely sure about the consequences for the fixed function hardware. There are already different tasks live in a GPU at the same time. Most probably the multiple command processors would have to share the pool of graphics contexts (8 in Southern Islands, if I remember correctly). I am more concerned here by the CPU-side usage: we could create 2 OpenGL contexts and actually submit different commands simultaneously on different threads.


Overall, just duplicating the command processor seems an expensive idea to me in terms of transistor count, and an idea that doesn't really scale across GPU architectures / low-end vs high-end GPUs. I would rather the IHVs focus on things that will have a real impact on rendering, like programmable blending and multi draw indirect. I also think graphics programmers should think twice when they want to submit multiple command buffers simultaneously, because it doesn't make any sense from a hardware design point of view unless we add multiple command processors in the future.


Thanks,
Christophe
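For context, this is essentially the split D3D12 ended up exposing: recording happens in parallel, one thread per command list, while submission stays serialized on the single queue, just as Riccio describes. A hedged sketch, with 'RecordPass' standing in for hypothetical engine code and the lists assumed to be created from their own allocators:

Code:
// Sketch: record command lists on worker threads, then submit once, in order.
#include <thread>
#include <vector>

void RecordPass(ID3D12GraphicsCommandList* list, int passIndex); // hypothetical

const int kNumThreads = 4;
ID3D12GraphicsCommandList* lists[kNumThreads] = { /* created elsewhere */ };

std::vector<std::thread> workers;
for (int i = 0; i < kNumThreads; ++i)
    workers.emplace_back([&lists, i] { RecordPass(lists[i], i); });
for (auto& t : workers)
    t.join();

// Submission is still a single, ordered call on the one graphics queue:
// the expensive recording/validation ran on all cores, but the GPU
// consumes one serialized stream, matching the hardware design.
ID3D12CommandList* submit[kNumThreads];
for (int i = 0; i < kNumThreads; ++i)
    submit[i] = lists[i];
graphicsQueue->ExecuteCommandLists(kNumThreads, submit);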
 
Boyd Multerer:
Part of it is the obvious one where everyone’s still getting to know this hardware and they’ll learn to optimise it. Part of it is less obvious, in that we focused a lot of our energy on framerate. And I think we have a consistently better framerate story that we can tell.

The GPUs are really complicated beasts this time around. In the Xbox 360 era, getting the most performance out of the GPU was all about ordering the instructions coming into your shader. It was all about hand-tweaking the order to get the maximum performance. In this era, that's important, but it's not nearly as important as getting all your data structures right so that you're getting maximum bandwidth usage across all the different buffers. So it's relatively easy to get portions of the GPU to stall. You have to have it constantly being fed.

More on:
http://www.totalxbox.com/74852/feat...softs-boyd-multerer-on-creating-the-xbox-one/
 
http://channel9.msdn.com/Events/Build/2014/3-564

This video is the best starting point for understanding DX12. It sounds like it's been significantly overhauled. How much of that will lead to better performance on X1, who knows. The question is how similar the API on X1 is to DX12. Without some insight from devs (which we won't get because of NDAs) there's really no way of knowing.
 
I asked some questions of Christophe Riccio, and fortunately he answered them:
Originally Posted by Christophe Riccio
"... but it could allow thing we could do in cross fire / SLI like rendering simultaneously two frames at a time. Also it could allows to rendering independent rendering passes simultaneously which could provide a better utilization of the hardware as typically each rendering pass has different GPU hardware bottlenecks. For example if we could do the rendering of the shadows and the some shading simultaneously. "

I'm starting to believe that the second graphics command processor will be something very interesting! Who knows whether we will have to wait for the full DX12 update for its implementation...
 

I hope so too. If people really feel that consoles held back PC graphics and gaming potential over the last couple of years, starting off at DX11.2 and having no growth or change in functionality from there is likely going to feel even worse.
 

From what I've read, the second command processor at the moment is there so that "the system has its own rendering pipe". Maybe that'll change in the future, maybe it won't, but from what we know at the moment it isn't used by the game.
 