AMD Kaveri APU features the Onion+ bus like the PlayStation 4

S.G., I have to say that, looking at the Onion & Onion+ bus BWs, it seems very clear to me that Sony does not believe that much in this CPU-GPU cooperation.

I would expect most would draw the exact opposite conclusion, especially since AMD themselves, who most certainly do believe very much in CPU<->GPU cooperation, are using the same implementation for their most recent design. Are you basing this solely on the amount of bandwidth available?
 
The point is: will the PS4's 10GB/s peak theoretical BW for the coherent Onion+ bus be enough for this? Especially considering that when Onion+ is in use, it will leave only 10GB/s of BW for Onion?

If we look at the numbers in the example of X1 memory usage on VGleaks (surely taken from MS docs), it seems to me that PS4 developers will have a hell of a lot of issues there.

Onion+'s shared BW, in combination with the low-end CPU, could be "the Mother of all PS4 Bottlenecks".

So let me get this right: we are now supposed to be worried about a 10GB/s bidirectional bus that lets the CPU & GPU work together on compute being a bottleneck?



[hUMA diagrams: huma-without.jpg, huma-with.jpg, huma-diagram.jpg]






Familiar Architecture, Future-Proofed

So what does Cerny really think the console will gain from this design approach? Longevity.

Cerny is convinced that in the coming years, developers will want to use the GPU for more than pushing graphics -- and believes he has determined a flexible and powerful solution to giving that to them. "The vision is using the GPU for graphics and compute simultaneously," he said. "Our belief is that by the middle of the PlayStation 4 console lifetime, asynchronous compute is a very large and important part of games technology."



Cerny envisions "a dozen programs running simultaneously on that GPU" -- using it to "perform physics computations, to perform collision calculations, to do ray tracing for audio."

But that vision created a major challenge: "Once we have this vision of asynchronous compute in the middle of the console lifecycle, the question then becomes, 'How do we create hardware to support it?'"

One barrier to this in a traditional PC hardware environment, he said, is communication between the CPU, GPU, and RAM. The PS4 architecture is designed to address that problem.

"A typical PC GPU has two buses," said Cerny. "There’s a bus the GPU uses to access VRAM, and there is a second bus that goes over the PCI Express that the GPU uses to access system memory. But whichever bus is used, the internal caches of the GPU become a significant barrier to CPU/GPU communication -- any time the GPU wants to read information the CPU wrote, or the GPU wants to write information so that the CPU can see it, time-consuming flushes of the GPU internal caches are required."

Enabling the Vision: How Sony Modified the Hardware

The three "major modifications" Sony did to the architecture to support this vision are as follows, in Cerny's words:

"First, we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory, bypassing its own L1 and L2 caches. As a result, if the data that's being passed back and forth between CPU and GPU is small, you don't have issues with synchronization between them anymore. And by small, I just mean small in next-gen terms. We can pass almost 20 gigabytes a second down that bus. That's not very small in today’s terms -- it’s larger than the PCIe on most PCs!


"Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

Thirdly, said Cerny, "The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands -- the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system."

"The reason so many sources of compute work are needed is that it isn’t just game systems that will be using compute -- middleware will have a need for compute as well. And the middleware requests for work on the GPU will need to be properly blended with game requests, and then finally properly prioritized relative to the graphics on a moment-by-moment basis."

http://www.gamasutra.com/view/feature/191007/inside_the_playstation_4_with_mark_.php?page=2
 
S.G., I have to say that, looking at the Onion & Onion+ bus BWs, it seems very clear to me that Sony does not believe that much in this CPU-GPU cooperation.
You strike me as completely jumping to conclusions. How much BW is needed between CPU and GPU to enable good cooperation? That depends entirely on what data is being passed. It could well be that for 99% of workloads where you want close cooperation, you only need to send small amounts of data between processors. I dunno. If you think it's inadequate, you must have an idea of how much data is being passed, so what's your theory? I certainly don't see reason to think that CPU<>GPU needs copious amounts of BW. If one of our devs weighs in and says that the BW will be limiting, then I'll start to worry. Until then, someone somewhere needs to go into detail about how CPU<>GPU interaction works, the type of data that is passed, and the sort of requirements that has before I can even begin to try and hazard a guess as to how well designed the SoCs are.
 

I'd be surprised if it will be a concern.

The data sets will be small in GPU terms given the fact that the same set of data will also be used on the CPU. A GPGPU task which requires a lot of data to be read or written by the GPU is unlikely to have a component which will benefit from work being done on the CPU.

A lot of GPGPU tasks will also be generating inputs to further GPU render or GPGPU tasks and won't require coherency on the CPU.

I'm not even sure I can think of many things which would benefit. It would have to be an array of data which can be processed in parallel. Wear patterns on tyres for a racing game, spring arrays for fluid surfaces, volumetric effect areas such as wind, height map deformation all spring to mind.
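
For what it's worth, one of those array-style jobs (the height map deformation case) is just an embarrassingly parallel loop; a minimal CPU-side sketch, with all sizes and the crater formula invented:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Hypothetical heightmap deformation: every cell is updated independently,
    // which is exactly the shape of work that maps onto GPU compute threads.
    void deform(std::vector<float>& height, int w, int h,
                float cx, float cy, float radius, float depth) {
        for (int y = 0; y < h; ++y) {        // on a GPU, each (x, y) would be one thread
            for (int x = 0; x < w; ++x) {
                float dx = x - cx, dy = y - cy;
                float d = std::sqrt(dx * dx + dy * dy);
                if (d < radius)
                    height[y * w + x] -= depth * (1.0f - d / radius);
            }
        }
    }

    int main() {
        int w = 256, h = 256;
        std::vector<float> height(w * h, 10.0f);
        deform(height, w, h, 128.0f, 128.0f, 32.0f, 2.0f);
        std::printf("centre height after impact: %.2f\n", height[128 * w + 128]);
    }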
 
I'm not saying it won't be an issue, nor am I saying it will be an issue. I'm saying that what you do (for both consoles) depends on a wide range of factors, which include the amount of bandwidth you need and how tight the coupling is between the CPU and the GPU for the shared data. It would be interesting to know what kind of latencies the L2 snooping buses have; this may affect what type of access you want as well. [My guess would be that accessing the L2 via this bus would be somewhere in the realm of 50-100 cycles alone, but this could be way off the mark.]

You're probably not far off. It's already a good idea to try to keep jobs with spatial coherence on one of the two quad-core CPU modules, to avoid copying data between the discrete caches there.
 
S.G., I have to say that, looking at the Onion & Onion+ bus BWs, it seems very clear to me that Sony does not believe that much in this CPU-GPU cooperation.

This seems to me another key point where the two console manufacturers have followed two different paths/philosophies.

What do you believe the coherent bandwidth is for the Jaguar modules?
Would you consider it a decent assumption that, if there is data meant to be system coherent going over Onion, it is likely to be used by the CPUs?

If that is the case, how far is Onion from the maximum bandwidth the CPUs can use?
 
My understanding:

The advantage is that the GPU and CPU can see the same kind of data, as opposed to the approach in the past, where CPU packed data into a format that the GPU understands (say, texture) and then unpack the result after the GPU does the processing. Not sure if anything is "made possible" this way, since it's just computation, but perhaps they would run faster now.
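
A tiny illustration of that packing/unpacking contrast (CPU-only code; the Particle struct is invented and a trivial multiply stands in for the GPU pass):

    #include <cstdio>
    #include <vector>

    // Illustrative only: a game-side particle as the CPU sees it.
    struct Particle { float x, y, z; float speed; };

    // Old-style path: repack into a flat float array (a "texture"), process,
    // then unpack the results again.
    void old_path(std::vector<Particle>& ps) {
        std::vector<float> packed;
        packed.reserve(ps.size() * 4);
        for (const auto& p : ps) packed.insert(packed.end(), {p.x, p.y, p.z, p.speed});
        for (std::size_t i = 3; i < packed.size(); i += 4) packed[i] *= 1.1f;  // "GPU" pass
        for (std::size_t i = 0; i < ps.size(); ++i) ps[i].speed = packed[i * 4 + 3];
    }

    // Shared-format path: the same structures are touched in place,
    // with no conversion step on either side.
    void shared_path(std::vector<Particle>& ps) {
        for (auto& p : ps) p.speed *= 1.1f;  // the "GPU" pass reads the CPU's own layout
    }

    int main() {
        std::vector<Particle> a(4, Particle{0, 0, 0, 2.0f}), b = a;
        old_path(a);
        shared_path(b);
        std::printf("old: %.2f  shared: %.2f\n", a[0].speed, b[0].speed);
    }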

Thanks - I understand all that; perhaps I should have been more clear. I am wondering out loud what sort of efficiencies might result as a product of coherent access, which might make effects possible that otherwise might not be. In other words, will coherency simply make the job of programming easier, will it allow us to do more processes due to efficiency, or are we simply saving time for the developer?
 
My previous posts came about after reading the "Durango Memory System Example" published by VGleaks.

http://www.vgleaks.com/durango-memory-system-example/

I will show where my thinking came from:

-) Coherent Read/Write 25GB/s (out of 30GB/s peak)

And then:

-) CPU Module 0 (4 GB/s R; 4 GB/s W; 1.5 GB/s Write Combined)
out of (20.8 GB/s R; 20.8 GB/s W peak)

-) CPU Module 1 (4 GB/s R; 4 GB/s W; 1.5 GB/s Write Combined)
out of (20.8 GB/s R; 20.8 GB/s W peak)

-) Audio, Camera, USB etc. (3 GB/s W; 3 GB/s R)
out of (9 GB/s R; 9 GB/s W peak)

So:
-) Northbridge <-> RAM 26GB/s Read & Write (out of 68GB/s peak).
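
As a quick sanity check on those client figures (my arithmetic, not from the leak): 2 modules x (4 + 4 + 1.5) GB/s + (3 + 3) GB/s for Audio/Camera/USB = 25 GB/s, which lines up with the 25GB/s coherent Read/Write total at the top.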

Then I put this example of X1 memory usage against the PS4 data:

-) X1 Coherent Read/Write 25GB/s (out of 30GB/s peak)
-) PS4 Onion+ 10GB/s peak (!)

-) X1 Northbridge <-> RAM 26GB/s Read & Write (out of 68GB/s peak).
-) PS4 Onion 10GB/s peak (!) (as the other 10GB/s are shared with Onion+).

This is why I have raised my assumption that Sony seems not to give much importance to overall CPU-GPU cooperation.

I hope this is a design choice and not just the necessity of taking AMD components as they are, because on PC things are radically different.

In truth, I am far more afraid of Onion's 10GB/s BW than of the coherent 10GB/s of Onion+.
 
Correct me if I am wrong, because I'm going from memory, but the ability of a Fusion APU's GPU to write into the cacheable memory of the CPU has existed since Llano.
 
Shared address space means the CPU and GPU can reference the same data without having to move that data from the CPU's domain to a GPU-safe location.

Moving the virtual memory support to match x86 and enhancing the IOMMU allow for a GPU side that is more capable of functioning as a proper memory client while also still being isolated enough so that it doesn't wreck the system.

For those that remember the AMD Barcelona TLB bug, it should point to how much confidence AMD needs to have in its implementation to bring hardware even remotely close to being a peer in the same memory pool as the CPUs and the OS.
Screwing up in that space, even a little, is a game over and historically GPUs have had nowhere near enough rigor to be trusted while at the same time the system didn't yet have good enough means to isolate itself with low enough overhead to be performant.
That means removing some significant bandwidth costs and latency penalties, and a big chunk of the complication and unreliability.

Coherence is something that makes sharing possible without invoking OS and driver code and spending milliseconds wiping down the GPU just to find out what the non-garbage value of a memory location is.
VGleaks showed an L2 flush could take anywhere from over 500 cycles to thousands of cycles. That sounds pretty heavy.
Let's note that this is considered an improvement.
As big as the numbers are, sharing and synchronization primitives are now operating in the relative (wide-ranging) neighborhood of the sort of latencies hardware pipelines actually run at, instead of significant fractions of a second.
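
Putting rough numbers on that (my arithmetic, assuming the ~800MHz GPU clock): a couple of thousand cycles is roughly 2000 / 800MHz = 2.5 microseconds, versus driver-level flushes measured in milliseconds, i.e. a sizeable slice of a 16.7ms (60fps) frame.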

Coherent sharing between CPU caches has performance overhead.
Coherent sharing between the GPU and CPU still has a lot more, but it might make it practical for the subset of algorithms that have chunks of significant parallelism that are hidden between stretches of more constrained or complex code.
Non-coherent has much more overhead because the data is effectively garbage until you basically stall everything and purge the GPU to be safe.

There's a succession of increasingly more garbage options as you go to unshared memory, and then physically separate and unshared, etc. I'm not getting into the difficulties software faces trying to handle increasingly garbage options.

To see what happens when a system doesn't take care to check whether its data is garbage, look at AMD's current Eyefinity Xfire woes; they give an idea of how much this platform still has to prove.
Making pervasive use of the GPU beyond its traditional role is a comprehensive problem spanning system, hardware, software, and development architecture.


CPU coherence, and the whole span of an industry that uses it, is known to have the potential to be not-garbage.
GPU evolution has painfully dragged compute to the point that AMD is either confident or desperate enough to assert that it is not garbage. I suppose we'll find out.
Even if this turns out to be a functional paradigm, which I think at least in some cases it can be, it remains to be seen how much Sony, AMD, or $HSAmember can prove that it is compelling.

If people are wondering why I used the word "garbage" a lot, well...
 

Great response, this clears a lot up for me. I was wondering what "the catch" might be, so to speak, and you nailed it with the overhead comments.
 
Then I put this example of X1 memory usage against the PS4 data:

-) X1 Coherent Read/Write 25GB/s (out of 30GB/s peak)
-) PS4 Onion+ 10GB/s peak (!)
It's 10GB/s in both directions - or 20GB/s in Read/Write terms if you simplify things.

-) X1 Northbridge <-> RAM 26GB/s Read & Write (out of 68GB/s peak).
-) PS4 Onion 10GB/s peak (!) (as the other 10GB/s are shared with Onion+).
Onion and Onion+ are the same, coherent bus. Onion+ is a virtual bus created by a hardware "flag".
The PS4 CPU has 20GB/s of "non-vegetable" bandwidth.
 
Hmm... there should be 2 concepts:

(1) The ability to share addresses between the CPU and GPU. This is sufficient to allow the CPU and GPU to cooperate without excessive copying, for the most part. To maximize parallelism, programmers typically don't want to shackle one core to another, so each core will be given a fairly standalone data chunk to run at full speed. Incoherent memory access is OK here; in fact, it should be faster this way.

(2) The ability to make sure the values read from these addresses are consistent (between the CPU and GPU). This is typically helpful for implementing synchronization primitives between the cores, sending input values (or input pointers) to another core, or receiving results (or result pointers) from another core. I suspect it's not used during the actual work, except during scheduling or somewhat serialized operations where you want to pass dependent data between jobs.
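
A small CPU-only sketch of how those two concepts divide up in practice (invented workload; two threads stand in for the two processors): each worker crunches its own standalone chunk with no sharing at all, and coherence is only leaned on for the cheap "done" signal at the end.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<float> data(1 << 16, 1.0f);
        std::atomic<int> done{0};           // concept (2): coherent completion signal

        auto worker = [&](std::size_t begin, std::size_t end) {
            // Concept (1): a standalone chunk, processed with no cross-core traffic.
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= 2.0f;
            done.fetch_add(1, std::memory_order_release);  // publish "my half is ready"
        };

        std::thread a(worker, 0, data.size() / 2);
        std::thread b(worker, data.size() / 2, data.size());

        // The consumer only needs coherence here, at the checkpoint.
        while (done.load(std::memory_order_acquire) < 2) { /* spin */ }
        std::printf("first=%.1f last=%.1f\n", data.front(), data.back());

        a.join();
        b.join();
    }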
 
It's 10GB/s in both directions - or 20GB/s in Read/Write terms if you simplify things.


Onion and Onion+ are the same, coherent bus. Onion+ is a virtual bus created by a hardware "flag".
The PS4 CPU has 20GB/s of "non-vegetable" bandwidth.

Exactly. The point is that Onion+ shares the 10GB/s BW of Onion, leaving Onion with "only" 10GB/s R/W (so, for example, 5GB/s R and 5GB/s W).

This "only" 10GB/s R/W for Onion could be a concern. I'm just asking the developers here.
 
Should be fine. ^_^

It will slow things down if you use coherent access too much. The programmer would want the GPU and CPU to run independently on their own workload as much as possible, with checkpoints and synchronization barriers sprinkled sparingly throughout the app/game.
 
Exactly. The point is that Onion+ shares the 10GB/s BW of Onion, leaving Onion with "only" 10GB/s R/W (so, for example, 5GB/s R and 5GB/s W).

This "only" 10GB/s R/W for Onion could be a concern. I'm just asking the developers here.

Onion+ and Onion run over the same bus. If either is running at full tilt, the other gets nothing.

Unless the CPU section has more than 20GB/s of coherent bandwidth, what does a wider Onion bus give?
 
In the PS4 model, the developer is supposed to delegate more work to the GPU anyway. I would expect more memory activities there for both graphics and compute type work.

EDIT: If the developer finds that he's doing constant, lockstep data sharing between the CPU and GPU, he may refactor the code to minimize these dependencies.
 
Thanks - I understand all that; perhaps I should have been more clear. I am wondering out loud what sort of efficiencies might result as a product of coherent access, which might make effects possible that otherwise might not be. In other words, will coherency simply make the job of programming easier, will it allow us to do more processes due to efficiency, or are we simply saving time for the developer?

Hmm, I feel it's a little bit of both... it's easier to code for with coherent access, but at the same time easier doesn't mean it's easy, because one still needs to write extra code to make it work. There is still a lot of synchronization work needed to make sure this runs well, so I'd probably still prefer a beefy CPU.
 