The graphics processor can write directly to system memory, bypassing its own L1 and L2 caches, which greatly simplifies data sync between graphics processor and central processor. Mr. Cerny claims that a special bus with around 20GB/s bandwidth is used for such direct reads and writes. To support the case where developer wants to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, Sony has added a “volatile” bit in the tags of the cache lines. Developers can then selectively mark all accesses by compute as “volatile”, and when it is time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time. In general, the technique radically reduces the overhead of running compute and graphics together on the GPU. The original AMD GCN architecture allows one source of graphics commands, and two sources of compute commands. For PS4, Sony worked with AMD to increase the limit to 64 sources of compute commands. If a game developer has some asynchronous compute you want to perform, he should put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system. Sony believes that not only games, but also various middleware will use GPU computing, which is why requests from different software clients need to be properly blended and then properly prioritized.
“If you look at the portion of the GPU available to compute throughout the frame, it varies dramatically from instant to instant. For example, something like opaque shadow map rendering doesn't even use a pixel shader, it is entirely done by vertex shaders and the rasterization hardware – so graphics are not using most of the 1.8TFLOPS of ALU available in the CUs. Times like that during the game frame are an opportunity to say, 'Okay, all that compute you wanted to do, turn it up to 11 now',” said Mr. Cerny.