CPU and GPU submissions buffered? *spawn*

McHuj

What I don't understand is why the CPU and GPU aren't double buffered across frames: the CPU processes frame N while the GPU renders frame N-1, so that both have the full frame time to do their work. Obviously that introduces a frame of latency, but at a high frame rate I don't know whether that would be perceptible. Maybe?
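Something like this is what I have in mind. Just a toy sketch in plain C++ (no real graphics API; the 16 ms sleep stands in for the GPU's work, and kMaxInFlight = 1 is the classic double buffer):

    // "CPU" thread builds frame N while the "GPU" thread consumes frame N-1.
    #include <chrono>
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    int main()
    {
        std::mutex m;
        std::condition_variable cv;
        int submitted = 0;              // frames the CPU has handed off
        int completed = 0;              // frames the GPU has finished
        const int kMaxInFlight = 1;     // 1 = classic double buffer

        std::thread gpu([&] {
            for (int n = 0; n < 10; ++n) {
                {   // wait until the CPU has submitted frame n
                    std::unique_lock<std::mutex> lock(m);
                    cv.wait(lock, [&] { return submitted > n; });
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(16)); // "render" frame n
                std::lock_guard<std::mutex> lock(m);
                completed = n + 1;
                cv.notify_all();
            }
        });

        for (int n = 0; n < 10; ++n) {
            {   // don't run more than kMaxInFlight frames ahead of the GPU
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return n - completed <= kMaxInFlight; });
            }
            std::printf("CPU building frame %d\n", n);   // update + record draw calls here
            std::lock_guard<std::mutex> lock(m);
            submitted = n + 1;
            cv.notify_all();
        }
        gpu.join();
        return 0;
    }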
 
We are slightly OT, but this is worth spinning off. I would also like to hear from anyone who can provide insight into how swap chains or buffering are accounted for in processing time.
 
That is probably the most popular way of handling GPU submission on consoles, since it's simple and lets you fully parallelize the CPU and GPU. Some games use more complex setups where the CPU will send partial submissions to the GPU (D3D drivers often do this on Windows). Other games might even add an additional frame of latency in order to give the CPU more than a single frame's worth of time to complete its work.

The situation iroboto describes (CPU and GPU working in lockstep) isn't commonly used in games as far as I know, since it means one processor just stalls while the other is working. Even with both processors working in parallel you can still have one bottleneck the other, since you'll typically have one wait for the other if it's running slow. In the ideal case both processors complete within the target frame time (16.6ms for 60fps, 33.3ms for 30fps) and you only wait for VSYNC (if you're using it).
 
Thanks MJP. It didn't occur to me that this was happening. So basically, by the time the CPU finishes running through everything in Render() and all the draw calls have been made, it just goes back into Update() and continues forward. If the GPU completes the work in time, it writes into one of the two or three back buffers, points the display at it, and swaps as necessary. And while it has been doing this, the CPU is still moving forward.

If the slideshow is a result of the CPU being too slow to feed the GPU, what do we see as players if the GPU is far behind the CPU, like multiple frames behind? Do we get the weird... speed-up effect?
 
By default PC DirectX is allowed to buffer up to 3 frames. Draw calls just add commands to the GPU command buffer and immediately return, until the GPU buffers are full or the maximum latency is exceeded. If the buffers are full or the maximum latency is exceeded, the draw calls will block until GPU execution has proceeded. If you do timing on the CPU side (in the main render thread), you will notice quite noisy results because of this. Also, many PC GPU drivers have a separate thread for processing draw calls, translating them and doing resource management for GPU memory; this adds some fluctuation because of thread contention (assuming your game is using enough threads).
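For reference, that cap is something a PC application can change itself through DXGI. A rough sketch (error handling omitted; 'device' is assumed to be your existing ID3D11Device):

    #include <d3d11.h>
    #include <dxgi.h>

    // Lower the number of frames the runtime is allowed to queue ahead.
    // The default is 3; 1 trades some throughput for lower input latency.
    void SetFrameLatency(ID3D11Device* device, UINT frames)
    {
        IDXGIDevice1* dxgiDevice = nullptr;
        if (SUCCEEDED(device->QueryInterface(__uuidof(IDXGIDevice1),
                                             reinterpret_cast<void**>(&dxgiDevice))))
        {
            dxgiDevice->SetMaximumFrameLatency(frames);
            dxgiDevice->Release();
        }
    }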
 
I recall Turn 10 used a double-buffering technique in Forza 2 to push it up to 60 fps. I think they did a presentation/slide on it, but I couldn't seem to find it now. I'm under the impression that this is fairly common in game engines today.
 
Thanks for the responses, everyone. To follow up on McHuj's question: is the lag perceptible when triple buffered? At 60 fps you are visually delayed by about 50 ms (three buffered frames at roughly 16.7 ms each), and at 30 fps by nearly 100 ms.

Sebbbi, in a game like Trials Fusion, where inputs and frame rate are critical to completing some courses, did you do something in your player input processing to remove some of that buffering delay?

Are collision detection and audio noticeably affected, from the player's point of view, when triple buffered?
 
Whether input lag is noticeable depends on the game. It's easy to see on the PC because of the variable performance you can play with. In an FPS, or anything that requires precise aiming/positioning in tandem with the input device (mouse), you will notice the tracking delay, and in a competitive environment people will elect to turn buffering off. But in general you can definitely get away with it in most genres without it being noticeable to the player.
 
Adding to this: circa DirectX 3, many games did accidentally render in lockstep. There were APIs to wait on the GPU, and many games did.
When DX5 was released, they took out all of the synchronization primitives, because the result of synchronizing was a lot of idle GPU time.

I doubt there are many games that intentionally sync the GPU and CPU anymore. You can still do it "accidentally" by locking a surface or using one of the readback APIs, but analysis tools are much better now, so identifying these is much easier.
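To make the "accidental" case concrete, this is roughly what it looks like in D3D11 terms (just a sketch; ctx, renderTarget and stagingTex are assumed to already exist, with stagingTex created as a D3D11_USAGE_STAGING copy with CPU read access):

    #include <d3d11.h>

    void ReadBackThisFrame(ID3D11DeviceContext* ctx,
                           ID3D11Texture2D* renderTarget,
                           ID3D11Texture2D* stagingTex)
    {
        ctx->CopyResource(stagingTex, renderTarget);

        // This Map blocks until every queued command touching the resource has
        // actually executed, collapsing the CPU/GPU pipeline back into lockstep.
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(ctx->Map(stagingTex, 0, D3D11_MAP_READ, 0, &mapped)))
        {
            // ... inspect mapped.pData ...
            ctx->Unmap(stagingTex, 0);
        }
        // Deferring the Map by a frame or two, or polling with
        // D3D11_MAP_FLAG_DO_NOT_WAIT, avoids the stall at the cost of
        // reading slightly stale data.
    }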
 
I think there are some - those that make heavy use of GPGPU need a way to synchronize the CPU and GPU.

Using a GPGPU technique to process the data in place of the CPU each frame doesn't change the fact that each frame's data needs to be ready before rendering, so I don't think they'd conflict with each other. It might create a resource issue though; I'm not sure how smart the schedulers are currently.
 
The chances of that are virtually nil these days. With multiple CPU cores, multiple threads, even multiple GPU threads, nothing is going to be sitting around waiting. Multiple jobs will be available while waiting for something else to finish. Any stalls will be bugs, not design choices. We've even had devs on this board talk about frame N+1 type calculations, so in any given frame's rendering period you might be processing the current frame's requirements, processing the next frame's requirements, rendering the current frame, and even rendering parts of the next frame.
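The frame N / frame N+1 overlap can be sketched with nothing more than standard C++ futures (real engines use their own job systems, but the shape is the same; simulate() and recordAndSubmitDrawCalls() are made-up placeholders):

    #include <cstdio>
    #include <future>

    struct SimResults { int frame; };   // stand-in for animation/culling output

    SimResults simulate(int frame)      // gameplay, animation, culling for one frame
    {
        return SimResults{ frame };
    }

    void recordAndSubmitDrawCalls(const SimResults& r)
    {
        std::printf("submitting draw calls for frame %d\n", r.frame);
    }

    int main()
    {
        // Kick off frame 0's simulation, then keep one frame of sim in flight.
        std::future<SimResults> next = std::async(std::launch::async, simulate, 0);
        for (int frame = 0; frame < 10; ++frame)
        {
            SimResults current = next.get();   // results for frame N are ready
            // Start simulating frame N+1 on another core...
            next = std::async(std::launch::async, simulate, frame + 1);
            // ...while this thread records and submits frame N's draw calls.
            recordAndSubmitDrawCalls(current);
        }
        return 0;
    }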
 
If dealing with standard API stuff, there are a number of places where they'd have to make such a trip, such as handling occlusion queries (edit: and then routing the results back), which requires CPU intervention through commands put through the runtime and driver. Because of the massive and unpredictable latencies, it simply does not get done within the current frame, or possibly for multiple frames.
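As a concrete PC-side example of that round trip, an occlusion query in D3D11 is issued in one frame and then only polled, never waited on, some frames later (a sketch; device/ctx and the per-frame query objects are assumed to exist):

    #include <d3d11.h>

    ID3D11Query* CreateOcclusionQuery(ID3D11Device* device)
    {
        D3D11_QUERY_DESC desc = {};
        desc.Query = D3D11_QUERY_OCCLUSION;
        ID3D11Query* query = nullptr;
        device->CreateQuery(&desc, &query);
        return query;
    }

    // Frame N, around the object's bounding-box draw:
    //   ctx->Begin(query);  ... draw ...  ctx->End(query);

    // Frame N+1 or later: poll without flushing or stalling. If the result
    // isn't back yet, keep using the previous frame's visibility answer.
    bool TryGetVisibility(ID3D11DeviceContext* ctx, ID3D11Query* query, bool& visible)
    {
        UINT64 samplesPassed = 0;
        if (ctx->GetData(query, &samplesPassed, sizeof(samplesPassed),
                         D3D11_ASYNC_GETDATA_DONOTFLUSH) != S_OK)
            return false;                    // not ready yet
        visible = (samplesPassed > 0);
        return true;
    }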

More integrated architectures or low-level APIs would have lower latency, or remove the outside intervention entirely.

In relative terms, though, it wouldn't be considered heavy, at least in terms of frequency.
For CPU-to-CPU communication, getting data to use would in the worst case cost main memory latency, so over a hundred cycles, with in-cache access taking a handful of cycles. Even that is used judiciously.
Doing the same thing with the latest APUs, by sending a command to a GPU buffer to make the results of compute available without using Onion+, would, according to Vgleaks, have a worst case of tens of thousands of GPU cycles.
The predictability of the GPU's queueing is not that great at present, though. That could still make the case for buying 33 ms or so by working on previous-frame data.
With Onion+, a bandwidth-restricted amount of data can be sent from the GPU to main memory and then back to a requesting CPU after some multiple hundreds of cycles. That is a minority of the data being processed.

The amount of synchronization between the two sides would be commensurate with how debilitating using it would be.
The best GPGPU methods are painful at present, and are used sparingly. They just aren't horrific anymore.

If long-running compute that handles itself mostly on the GPU with occasional runs through Onion+ can be done, it might lead to a somewhat freer interplay with the CPU because it should remove much of the multi-frame queueing latencies that can accumulate if the GPU is under load. Presentations from Sucker Punch on the PS4 indicate this is still troublesome. For PC drivers, it might be an application killer, since such a kernel isn't one that would conclude in time for a driver's timeout/freakout limit.
 