How does PS4/XB1/AMD HSA RAM access actually work?

Shifty Geezer

Following on from other discussions, would some knowledgeable person care to explain exactly how memory is addressed on these devices? In split RAM pools it was obviously a case of GPU reads GPU RAM, CPU reads CPU RAM. What's not clear (for some of us ;)) about the HSA unified memory is which components can access which RAM, with what caveats. It would appear that it's not a case of complete, arbitrary RAM address access by either component over their own buses.
 
I've been wondering about the same since that article was posted. From my limited perspective I've come to see it as a cache coherence problem (http://en.wikipedia.org/wiki/Cache_coherence). There are two ways to "map" the memory.

First, like on a PC, you have a "ships in the night" scenario: the CPU can access all of the memory (kinda like memory-mapping PCIe GPU memory into a process address space) and the GPU can access the same memory, but they don't know what each other is doing. Any kind of synchronization will have to be made explicit. Any data in caches (either CPU or GPU) might become stale and no longer reflect what the other client has written to RAM after the cache was loaded. This is the "fast" mode, typical of GPUs.

Second, like a multi-core system, a coherent representation of memory is maintained on both devices. If a GPU compute program writes to memory, and that write is still in the GPU cache, the CPU won't blindly load the "bad" data in RAM; it will "ask" the GPU whether it has ("owns") that data and to pretty please send it. This is the "slow" mode, typical of multi-core/multi-socket CPUs. Every memory access has to synchronize with the coherent partners (GPU, both Jaguar modules); the devices can't assume they "own" the memory or that the data in RAM is "current".
Is this correct: "now you can use synchronization primitives between the devices, for example both the CPU cores and the GPU cores all reading and writing the same list of particles at the same time"?
 
Can there not be an intermediate mode, with the ability to flag chunks of address space as "currently in work", in which case a synchronization process is initiated, or "not currently in work", in which case any device can read/write from that space in the fast manner described above, ignoring coherency issues from other devices accessing the RAM? What address space granularity would be needed (not a flag for each bit, obviously, but what size chunks?), what would the transistor budget be for that sort of table, and is that feasible? Or is that in any way how HSA/HUMA/UMA or whatever is intended to function?
 
Second, like a multi-core system, a coherent representation of memory is maintained on both devices.
Are there even two devices? Is that the topology we have? I kinda assumed the idea of HSA would have involved a single memory controller to a single RAM pool. The idea of the CPU and GPU discretely working from the same RAM, and bumping into each other as they go, seems an awkward complication whose value I don't understand.
 
Perhaps the high level API will maintain the illusion of a unified and uniform memory. But developers can use the low level API to have more fine-grained control of the caches.
 
Can there not be an intermediate mode, with the ability to flag chunks of address space as "currently in work", in which case a synchronization process is initiated, or "not currently in work", in which case any device can read/write from that space in the fast manner described above, ignoring coherency issues from other devices accessing the RAM? What address space granularity would be needed (not a flag for each bit, obviously, but what size chunks?), what would the transistor budget be for that sort of table, and is that feasible? Or is that in any way how HSA/HUMA/UMA or whatever is intended to function?

I assume there is some kind of page-like mechanism, not unlike ye old graphics aperture mechanism, which does just mark certain areas as fast or coherent. Granularity should be quite coarse. Marking an area of say 32MB-2GB as shared and then allocating whatever thingies you need to be shared between the GPU and CPU into that window. You wouldn't need more than 4-8 such windows, and could possibly work with just one. Transistor overhead? Well, I don't know, I assume quite small. I hope the local heavyweights chime in with a clearer picture, I'm not an expert on this stuff. I don't even know how that graphics aperture thing worked, lol.
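For what it's worth, here's a toy sketch in C of the coarse "window" table being imagined here. The sizes, field names and lookup are guesses, not how any documented aperture hardware actually works:

```c
/* Toy model of the coarse "window" idea described above: a handful of
 * address ranges, each flagged as coherent or fast/non-coherent. Field
 * names and sizes are invented for illustration only. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t base;      /* start of the window in the physical address space */
    uint64_t size;      /* e.g. 32 MB .. 2 GB, coarse on purpose             */
    bool     coherent;  /* true: snoop CPU caches; false: full-speed access  */
} MemWindow;

/* 4-8 windows would already cover the use cases described above. */
static MemWindow windows[8];
static int       window_count;

/* Decide, per access, whether the coherency machinery must get involved.
 * With this few entries the hardware state needed is tiny. */
bool address_is_coherent(uint64_t addr)
{
    for (int i = 0; i < window_count; ++i) {
        if (addr >= windows[i].base &&
            addr <  windows[i].base + windows[i].size)
            return windows[i].coherent;
    }
    return false;   /* default: treat unmapped ranges as non-coherent */
}
```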


Are there even two devices? Is that the topology we have? I kinda assumed the idea of HSA would have involved a single memory controller to a single RAM pool. The idea of the CPU and GPU discretely working from the same RAM, and bumping into each other as they go, seems an awkward complication whose value I don't understand.

That's the way, e.g., NUMA multi-socket servers work, where each socket has a multi-core CPU with a memory controller. I don't know where the single memory controller IP "owning" the GDDR5 on the PS4 is located, or the topology of a typical commercial APU.

The value is in the shared view, which the hardware automatically maintains synchronized. It's the difference between a board with 2 sockets and two separate servers. It's a cheap and convenient way to achieve parallelism and synchronize work. On the 2-socket server the coherency mechanism "hardware accelerates" synchronization. In the two-separate-servers case you must use explicit synchronization, via message passing/remote procedure calls, with its inherent overhead and extra development work (and bugs!). That's a more awkward complication ;). Obviously you still want to work with 170GB/s instead of 20GB/s, so you gotta be careful you didn't allocate your backbuffer in the "slow" memory window. The article linked earlier did point out how just allocating a smaller buffer used during frame rendering in the coherent window cost them a lot of time.

But I'm already talking out of my depth, please someone come help explain this thingy.
 
Can there not be an intermediate mode, with the ability to flag chunks of address space as "currently in work", in which case a synchronization process is initiated, or "not currently in work", in which case any device can read/write from that space in the fast manner described above, ignoring coherency issues from other devices accessing the RAM? What address space granularity would be needed (not a flag for each bit, obviously, but what size chunks?), what would the transistor budget be for that sort of table, and is that feasible? Or is that in any way how HSA/HUMA/UMA or whatever is intended to function?

HSA promises a unified and coherent memory space, with a very weak consistency model.
A memory value, if updated, should eventually become visible at some point to the rest of the system, emphasis on eventually.
The memory pipelines are not tightly coupled in the way that the x86 multicore fabric is.

Everything up to that final point in time, unless you are in that specific work item on the GPU, is borderline a free-for-all.
The values you see at a specific location, or the values visible at other locations written by the same program or by multiple other HSA coprocessors, may not be the same as what another client might be observing.

So memory is coherent, but while in the thick of things, don't hope for updates to necessarily make sense.
Fences and load-acquire and store-release operations basically checkpoint the points in the process where things absolutely must start making sense. The CPUs do have some requirement for synchronization operations in a more limited subset of cases.
This basically is as you say: do a bunch of work, then signal when the mess has settled.
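A rough C11 analogy of that "do a bunch of work, then signal" pattern, using CPU-side fences to stand in for the HSA-level release/acquire checkpoints described above (the buffer and flag names are made up for illustration):

```c
/* Sketch of the "do a bunch of work, then signal when the mess has settled"
 * pattern: intermediate stores are a free-for-all, the fence is the
 * checkpoint where things must start making sense to other agents. */
#include <stdatomic.h>

#define N 1024
static int        results[N];
static atomic_int done;

/* Producer (think: a work item filling a buffer). */
void produce(void)
{
    for (int i = 0; i < N; ++i)
        results[i] = i * i;                      /* nobody is entitled to a
                                                    sensible view of these yet */

    atomic_thread_fence(memory_order_release);   /* checkpoint: publish it all */
    atomic_store_explicit(&done, 1, memory_order_relaxed);
}

/* Consumer (think: another core or HSA agent picking the results up). */
int consume_sum(void)
{
    while (atomic_load_explicit(&done, memory_order_relaxed) == 0)
        ;                                        /* wait for the signal        */
    atomic_thread_fence(memory_order_acquire);   /* only now must it make sense */

    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += results[i];
    return sum;
}
```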


The base HSA model doesn't forbid devices with stronger models, which can just skip some of the extra steps in places where they are strong. However, a wide umbrella is necessary for a platform that includes devices with very primitive memory subsystems or different architectures.

I haven't found a description of GCN's coherence protocol, and it is doubly complicated because GCN is internally already weak and considers coherence optional. The cache setup is a de facto coherence setup where there's a common L2 everyone writes to, which is the next tiny step up from not being coherent at all.

I can't speak to Kaveri, but the other upcoming GPUs like the one in Orbis have already indicated a pretty onerous flush requirement for coherent traffic (which Mark Cerny said was optimized with an option to just avoid the GPU cache hierarchy). That's indicative of a memory hierarchy on the GPU side that is only just barely capable of being considered coherent with the CPUs.

Even if Kaveri improves on this a little, the GPU subsystem is much longer latency and has a massive amount of in-flight traffic. Sharing would need to be done carefully. Care is needed when contending for data between CPUs, but the interface and GPU pipeline latencies are such that the costs will be massively higher.
 
Are there even two devices? Is that the topology we have? I kinda assumed the idea of HSA would have involved a single memory controller to a single RAM pool. The idea of the CPU and GPU discretely working from the same RAM, and bumping into each other as they go, seems an awkward complication whose value I don't understand.
I'm not an expert in this area, but here's my interpretation and high-level view. CPUs need low-latency access to memory and GPUs need high bandwidth. Having two memory controllers makes it possible to customize for each workflow.

Coherency saps bandwidth so it's more performant to only be coherent when necessary. So the solution is to have unified memory with coherency on an as needed basis. AMD and ARM share the as needed viewpoint. I'm not sure about others.
 
Possible XB1 shared memory at OS/Hyper-V level

As someone hinted here in an earlier post, it's possible that the shared memory is implemented by the underlying OS and APIs.

An example of this, in the case of the 3 OSes running on Hyper-V in the XB1, is explained in the patent "Shared memory between child and parent partitions": http://www.google.com/patents/US8463980?dq=ininventor:%22Bradley+Stephen+Post%22&hl=en&sa=X&ei=jkryUYOoDsm7iAfckoDoBQ&ved=0CHwQ6AEwCQ


The idea is that the memory is shared between the PARENT partition and the CHILD partitions across USER/KERNEL mode.

There is also the concept of a virtualized GPU that will make use of this shared memory, and thus the DX/AMP/WinRT APIs can leverage it.

It will be interesting to see if the XB1 does implement this patent, and if they can actually make it perform well!

P.S. I'm betting that this is the architecture of the XB1's 3 OSes.
 
AMD's roadmap has unified-memory discrete GPUs coming. I would assume the unified memory works like a multi-core CPU system, where memory can be shared and cached data flagged as out of date or in use. This will just mean the OS will have to be hUMA compliant to direct tasks to the respective processors, with memory fed in this way. Programmers can probably do API-level GPGPU work like they always have, while the OS manages the memory.

We don't know if the PS4 or the One is HSA compliant or has hUMA.
 
Following on from other discussions, would some knowledgeable person care to explain exactly how memory is addressed on these devices? In split RAM pools it was obviously a case of GPU reads GPU RAM, CPU reads CPU RAM. What's not clear (for some of us ;)) about the HSA unified memory is which components can access which RAM, with what caveats. It would appear that it's not a case of complete, arbitrary RAM address access by either component over their own buses.
In the case of the XB1, the GPU can presumably address all RAM in the machine, the DDR3 as well as the eSRAM, while the XB1 CPU is limited to the DDR3 (at least directly). What is accessed how (coherence/caching behaviour) is mainly governed by setting up the page tables (or some apertures) accordingly.

MS claims everything on the XB1 works with virtual addresses (also the GPU), and a flag for each page governs how the coherency is handled. When allocating memory, one has to specify how that should be handled. Memory usually accessed by the CPU (with the usual caching behaviour, not write-combining regions) will be allocated and set up for coherent access. If the GPU reads/writes such a location, the CPU caches will be snooped to ensure consistency. Buffers for common GPU tasks get flagged for noncoherent access, which omits the snooping of the CPU caches and enables the full bandwidth (otherwise one has the limited bandwidth of the "Onion" link, as AMD/Sony call it). The CPU will never snoop the GPU caches. What effectively happens is that during the address translation it is determined which path the access takes and which coherence checks are done.

The same is probably done to direct an access to the eSRAM or the DDR3 interface. It is probably handled by some aperture in the address range (inaccessible by the CPU and therefore always incoherent with the CPU). That way the GPU doesn't have to know if something is located in the eSRAM or the DDR3 (the dev still has to manage that, but the shader code is agnostic regarding the physical location; it is basically transparent to the shader). It is even possible that a buffer is just partially in the eSRAM. If some buffer is (partially) moved from DDR3 to eSRAM, the copied pages just have to be remapped to another physical location; the virtual address may even stay the same (but no idea if that is supported).

At least that would be logical to me and is in large parts supported by the leaked Durango documentation.
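To illustrate the idea (this is not the actual Durango page table format, and the field names are invented), the per-page decision could be modelled roughly like this:

```c
/* Very rough sketch of the per-page decision described above: address
 * translation yields flags that pick the access path. Bit layout and names
 * are invented; this is not real Durango hardware documentation. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t phys_addr;    /* physical address after translation              */
    bool     cpu_coherent; /* snoop CPU caches on GPU access ("Onion"-style)  */
    bool     in_esram;     /* physical location: eSRAM vs DDR3                */
} Translation;

typedef enum { PATH_DDR3_SNOOPED, PATH_DDR3_DIRECT, PATH_ESRAM } AccessPath;

/* The shader is agnostic: it only ever sees the virtual address. The path,
 * and whether the CPU caches get snooped, falls out of the page flags. */
AccessPath route_gpu_access(Translation t)
{
    if (t.in_esram)
        return PATH_ESRAM;          /* CPU can't see this, so never snooped */
    return t.cpu_coherent ? PATH_DDR3_SNOOPED   /* limited "Onion" bandwidth */
                          : PATH_DDR3_DIRECT;   /* full DDR3 bandwidth       */
}
```

The point being that the routing and coherence checks hang off the translation, so the same shader code works regardless of where the data physically lives.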
 
It is probably handled by some aperture in the address range (inaccessible by the CPU and therefore always incoherent with the CPU).

It doesn't have to be coherent just because it is accessible to the CPU.

Speculation: I expect the ESRAM to be mapped to a range of the physical address space. The northbridge arbitrates requests from either the CPU or GPU and fetches/stores data from either ESRAM or the DDR3 bus based on physical address.

Cheers
 
It doesn't have to be coherent just because it is accessible to the CPU.
That's not what I wrote.
I wrote that it is always incoherent to the CPU because it is inaccessible by the CPU (speaking of accesses to the eSRAM). That's a slight difference :rolleyes:. If the CPU can't access it, there can't be copies in the CPU caches, therefore one doesn't need to snoop them, hence incoherent access is enough. And the northbridge handling this coherency stuff isn't even in the path of the GPU accessing the eSRAM.
The cases of coherent or incoherent access for the DDR3 accessible by both the CPU and GPU I mentioned earlier in my post. ;)
 
That's not what I wrote.
I wrote that it is always incoherent to the CPU because it is inaccessible by the CPU (speaking of accesses to the eSRAM). That's a slight difference :rolleyes:. If the CPU can't access it, there can't be copies in the CPU caches, therefore one doesn't need to snoop them, hence incoherent access is enough. And the northbridge handling this coherency stuff isn't even in the path of the GPU accessing the eSRAM.
The cases of coherent or incoherent access for the DDR3 accessible by both the CPU and GPU I mentioned earlier in my post. ;)

And hence the move engines, yes?
 
That's not what I wrote.
I wrote that it is always incoherent to the CPU because it is inaccessible by the CPU (speaking of accesses to the eSRAM). That's a slight difference :rolleyes:.

I know that's not what you wrote.

Why even discuss coherency if the ESRAM can't be accessed by the CPUs.

Are you suggesting they 'solved' the coherency issue by not having the CPUs access the ESRAM at all? That would be silly.

Cheers
 
This AMD Fusion presentation explains it really well.

http://amddevcentral.com/afds/assets/presentations/1004_final.pdf

Basically there appear to be three different types of memory access: local, uncached, and cached. Each has limitations or boosts in performance depending on whether the CPU or GPU does the reading/writing. Also, it appears interleaving is different or optimised for the different use cases.

Even though memory is unified, there still appears to be a split between CPU system memory and GPU local memory. I believe page tables aren't shared, so the OS needs to update what the GPU can see by updating its TLB; pages in system memory need to be locked for the GPU to use them.
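As a purely hypothetical sketch of those three access types expressed as an allocation-time choice (the enum values and allocator are invented for illustration, not the API from the slides):

```c
/* Hypothetical allocation-time choice between the three access types
 * mentioned in the presentation. Not a real API; the malloc stand-in just
 * keeps the sketch self-contained and runnable. */
#include <stdlib.h>

typedef enum {
    MEM_LOCAL,      /* GPU-optimal ("Garlic"-style): fast for the GPU,
                       uncached/slow for the CPU                           */
    MEM_UNCACHED,   /* CPU write-combined; GPU reads without snooping      */
    MEM_CACHED      /* CPU-cached, coherent ("Onion"-style): GPU accesses
                       snoop the CPU caches                                */
} MemKind;

/* The kind chosen here would decide which page flags and interleaving the
 * buffer gets, and therefore which client it is fast for. */
void *alloc_shared(size_t bytes, MemKind kind)
{
    (void)kind;             /* a real allocator would pick pages accordingly */
    return malloc(bytes);   /* stand-in so the sketch compiles and runs      */
}
```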
 
I know that's not what you wrote.

Why even discuss coherency if the ESRAM can't be accessed by the CPUs.

Are you suggesting they 'solved' the coherency issue by not having the CPUs access the ESRAM at all? That would be silly.
No it isn't. It guarantees that the GPU can always use the full bandwidth of the eSRAM, because one does not need any checking of the CPU caches limiting the throughput (as is possible for DDR3 accesses to pages flagged for coherent access). The eSRAM can be flexibly used by the GPU, but only the GPU (the CPU has only indirect access through transferring some content between DDR3 and eSRAM via the DMA/move engines). Hence, there doesn't need to be any coherency check with the CPU.

I really think an explanation of how the caching/coherency check behaviour and the eSRAM/DDR3 access work is allowed to mention this.

This amd fusion presentation explains it really well.

http://amddevcentral.com/afds/assets/presentations/1004_final.pdf
As said above, MS claims Durango doesn't necessarily make this system/local memory distinction with a static split. It can determine it on a per-page basis. So it has probably evolved a bit.
 
Gipsel said:
As said above, MS claims Durango doesn't necessarily make this system/local memory distinction with a static split. It can determine it on a per-page basis. So it has probably evolved a bit.
The above paper is the same: you allocate memory and determine at allocation time whether it's system or local. There are page-locking semantics, so yeah, it's probably a flag in the page table and done on a page basis. The difference seems to be with how the memory is laid out in each page.

For local memory they use heavy interleaving (bandwidth optimized), and for system memory it's laid out to improve latency and coherency. That's probably why Garlic can't access Onion memory pages and vice versa directly; they probably go through very different memory controller/cache coherency logic.
 
No it isn't. It guarantees that the GPU can always use the full bandwidth of the eSRAM because one does not need any checking of the CPU caches limiting the throughput (as possible for DDR3 accesses to pages flagged for coherent access). The eSRAM can be flexibly used by the GPU, but only the GPU (the CPU has only indirect access through transferring some content between DDR3 und eSRAM through the DMA/move engines). Hence, there doesn't need to be any coherency check with the CPU.

You have the same problem with DDR3, although to a lesser degree. The GPU can access DDR3 memory, which might be cached by the CPUs, to the tune of 68GB/s, more than the CPU block can provide (~30GB/s in the diagrams I've seen).

Let's assume the GPU accesses the ESRAM the same way it does main memory, in 64 byte chunks. That means two accesses per cycle to the ESRAM, four if it can read and write to different banks at the same time. The CPU block runs at twice the GPU frequency, so that's one or two tags per cycle that need to be checked, easy.
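A quick back-of-envelope check of those numbers (the 800 MHz / 1.6 GHz clocks are my own assumptions; the argument only relies on the CPU block running at twice the GPU frequency):

```c
/* Back-of-envelope version of the argument above. Clock values are assumed
 * for illustration, not confirmed specs. */
#include <stdio.h>

int main(void)
{
    const double gpu_hz = 800e6;           /* assumed GPU clock           */
    const double cpu_hz = 2.0 * gpu_hz;    /* "twice the GPU frequency"   */
    const double chunk  = 64.0;            /* bytes per access, as assumed */

    for (int per_gpu_cycle = 2; per_gpu_cycle <= 4; per_gpu_cycle += 2) {
        double bw_gb_s = per_gpu_cycle * chunk * gpu_hz / 1e9;
        double tags_per_cpu_cycle = per_gpu_cycle * gpu_hz / cpu_hz;
        printf("%d x 64B per GPU cycle -> %.1f GB/s, %.1f tag checks per CPU cycle\n",
               per_gpu_cycle, bw_gb_s, tags_per_cpu_cycle);
    }
    return 0;
}
```

Which prints 102.4 GB/s with one tag check per CPU cycle, or 204.8 GB/s with two, matching the "one or two tags per cycle" point.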

If you ever end up in a situation where the GPU is hampered by bandwidth from the CPU block, you're doing something wrong.

And again, you don't need to cut the CPUs off from the ESRAM to guarantee the GPU full speed access: mark the pages mapping to the ESRAM memory region as uncacheable for the CPU. There, problem solved.

Cheers
 
In the Durango documentation at vgleaks, MS explicitly stated that the eSRAM is free from possible contention from CPU accesses. In other words, the CPU can't access it. That's the simplest solution. You save the connection and also the check of yet another flag in the page table entry (or against the eSRAM aperture, depending on implementation). Of course MS could have done it somehow differently, but apparently they didn't.
 