Xbox One (Durango) Technical hardware investigation

hUMA as a "term" appeared relatively recently, after the VGLeaks articles. Previously you may have known it as HSA cache coherency or CPU/GPU cache coherency.

It would appear to me that these two quotes throw the entire idea of the Xbone being HSA Coherent / GPU<->CPU cache coherent or hUMA out the window completely, but I could be wrong.

There are two types of coherency in the Durango memory system:

Fully hardware coherent
I/O coherent

The two CPU modules are fully coherent. The term fully coherent means that the CPUs do not need to explicitly flush in order for the latest copy of modified data to be available (except when using Write Combined access).

The rest of the Durango infrastructure (the GPU and I/O devices such as Audio and the Kinect Sensor) is I/O coherent. The term I/O coherent means that those clients can access data in the CPU caches, but that their own caches cannot be probed.

The CPU requests do not probe any other non-CPU clients, even if the clients have caches. (For example, the GPU has its own cache hierarchy, but the GPU is not probed by the CPU requests.) Therefore, I/O coherent clients must explicitly flush modified data for any latest-modified copy to become visible to the CPUs and to the other I/O coherent clients.
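To make the practical consequence of that concrete, here's a minimal sketch of the two directions, assuming hypothetical helper names (gpu_flush_caches and the buffer setup are illustrative stand-ins, not any real Durango API):

```cpp
// Illustrative model of "fully coherent" vs "I/O coherent" as described above.
// gpu_flush_caches() is a hypothetical stand-in for whatever cache maintenance
// the GPU/driver actually performs; none of this is a real Durango API.
#include <atomic>
#include <cstdint>

static uint32_t shared_buffer[256];          // ordinary cacheable system memory
static std::atomic<bool> gpu_done{false};

void gpu_flush_caches() { /* stand-in for a GPU cache writeback/invalidate */ }

// CPU -> GPU direction: the GPU is I/O coherent, so its reads can snoop the
// CPU caches. The CPU just writes and kicks off the job; no explicit flush.
void cpu_produce() {
    shared_buffer[0] = 42;
    // ... submit GPU work that reads shared_buffer ...
}

// GPU -> CPU direction: CPU requests never probe the GPU caches, so the GPU
// side must explicitly flush modified data before signalling completion.
void gpu_finish_job() {
    // ... GPU has written results into shared_buffer through its own caches ...
    gpu_flush_caches();                                // make the data visible
    gpu_done.store(true, std::memory_order_release);   // then signal the CPU
}
```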
 
By memory I mean the cached memory values in the GPU as well. The CPU cannot share a consistent view of the memory without being able to see what's in the GPU's caches as well.


"...if one processor makes a change, then the other processor sees that changed data...." ....

XboxOneDev : it's pretty clear that this was the exact implementation in the xbox one memory system..



pretty clear to me!
 
It seems as though people are clinging to this syntax without actually considering how the technology operates.
It shouldn't be considered an all-or-nothing feature. One assumes that any unified memory architecture can have different features and provide different functionality, to different degrees of effectiveness, while still serving the same overall purpose. The VGLeaks coverage suggests the XB1 has plenty of the hUMA functionality even if it isn't 100% what AMD considers to be their hUMA design. IMO, when I posted the link in this thread, it was in response to those suggesting the XB1 could be VI. Lack of the latest memory architecture nicely rules that out (among other obvious reasons). Lack of a hUMA label doesn't necessarily mean a lack of ability to share data easily between CPU and GPU, and people shouldn't get hung up on a label when trying to understand the platform's capabilities. The presence of eSRAM may well preclude a true hUMA design as AMD defines it, but with no negative consequences for the UMA operation of the CPU and GPU cores.
 
"...if one processor makes a change, then the other processor sees that changed data...." ....

XboxOneDev : it's pretty clear that this was the exact implementation in the xbox one memory system..



pretty clear to me!

We're talking about coherency between specific devices, such as the GPU and the CPU, not between the two processor modules (i.e. the two halves of the CPU).

It seems pretty clear to me that the VGLeaks document flies straight in the face of what this 'dev' is telling us.
 
Well, according to some documentation, there is 1 TB of virtual memory address space, and a page can be in ESRAM, DRAM, or unmapped. So if you tell the CPU to fetch something from a virtual address, it shouldn't matter whether it's actually in DRAM or ESRAM. You could actually pass pointers between the CPU and GPU.
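As a sketch of what that buys you (the allocation and submission helpers here are hypothetical stand-ins, not the actual XDK API):

```cpp
// Illustrative only: with a single shared virtual address space, a pointer
// means the same thing to CPU and GPU code, regardless of whether the page
// behind it currently lives in DRAM or ESRAM. submit_to_gpu() is a made-up
// stand-in for a real command-submission call.
#include <cstddef>

struct GpuCommand {
    const float* input;   // same virtual address the CPU uses
    float*       output;
    std::size_t  count;
};

void submit_to_gpu(const GpuCommand&) { /* stand-in for real work submission */ }

void dispatch(float* data, std::size_t n) {
    // CPU fills the buffer through its normal cached path...
    for (std::size_t i = 0; i < n; ++i)
        data[i] = static_cast<float>(i);

    // ...then hands the GPU the *pointer*, not a copy. The page tables decide
    // whether those pages are mapped to DRAM, ESRAM, or are currently unmapped.
    GpuCommand cmd{data, data, n};
    submit_to_gpu(cmd);
}
```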
 
Well, according to some documentation, there is 1 TB of virtual memory address space, and a page can be in ESRAM, DRAM, or unmapped. So if you tell the CPU to fetch something from a virtual address, it shouldn't matter whether it's actually in DRAM or ESRAM. You could actually pass pointers between the CPU and GPU.

That's all fine and dandy, but if you want full hUMA you also need the cache coherency, which, going by my quotes above, the XBONE does not have. All you've described is the x86-style virtual address mapping that GCN does.
 
That's all fine and dandy, but if you want full hUMA you also need the cache coherency, which, going by my quotes above, the XBONE does not have. All you've described is the x86-style virtual address mapping that GCN does.

You don't need the CPU to be able to read and write the GPU caches for coherency. Fusion has cache coherency between the GPU and CPU; I don't recall an AMD APU that doesn't. And that's facilitated by the GPU snooping the CPU cache.

And virtual address mapping is how both the CPU and GPU are given a complete and consistent view of memory when we're talking about HSA.
 
You don't need the CPU to be able to read and write the GPU caches for coherency. Fusion has cache coherency between the GPU and CPU; I don't recall an AMD APU that doesn't. And that's facilitated by the GPU snooping the CPU cache.

And virtual address mapping is how both the CPU and GPU are given a complete and consistent view of memory when we're talking about HSA.

Then why isn't every single one of AMD's APUs hUMA? They say that only Kaveri is, but if this were true, shouldn't any of the Fusion APUs qualify?

Edit: AMD seems to say anything before Kaveri doesn't have full cache coherency.
 
Then why isn't every single one of AMD's APUs hUMA? They say that only Kaveri is, but if this were true, shouldn't any of the Fusion APUs qualify?

Edit: AMD seems to say anything before Kaveri doesn't have full cache coherency.

Because hUMA is a specific form of cache coherency. It maintains coherency by passing pointers instead of copying data.

And eSRAM is not a cache, it's a scratchpad. eSRAM is incoherent, and so is memory served by the Garlic bus.

Furthermore, HSA doesn't require that all memory be coherent or virtually mapped, only the memory that needs to be shared by the CPU and GPU.
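A rough sketch of that split, with made-up allocator names standing in for the different memory pools (not real platform calls):

```cpp
// Illustrative only: HSA only needs the memory actually shared between CPU
// and GPU to be coherent. GPU-only working data can live in incoherent,
// scratchpad-style storage. Both allocators below are hypothetical labels
// backed by plain malloc just so the sketch compiles.
#include <cstdlib>
#include <cstddef>

void* alloc_coherent(std::size_t bytes)   { return std::malloc(bytes); } // snooped, CPU+GPU shareable pool
void* alloc_scratchpad(std::size_t bytes) { return std::malloc(bytes); } // eSRAM/Garlic-style: fast but incoherent

void example(std::size_t n) {
    // Data the CPU and GPU hand back and forth goes in the coherent pool;
    // pointers can be exchanged directly.
    float* shared  = static_cast<float*>(alloc_coherent(n * sizeof(float)));

    // GPU-only intermediates (render targets, scratch buffers) can sit in the
    // incoherent pool; results must be explicitly resolved/copied back into
    // coherent memory before the CPU reads them.
    float* scratch = static_cast<float*>(alloc_scratchpad(n * sizeof(float)));

    // ... GPU works in 'scratch', then resolves into 'shared' ...
    std::free(scratch);
    std::free(shared);
}
```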
 
Because hUMA is a specific form of cache coherency. It maintains coherency by passing pointers instead of copying data.

And eSRAM is not a cache, it's a scratchpad. eSRAM is incoherent, and so is memory served by the Garlic bus.

Furthermore, HSA doesn't require that all memory be coherent or virtually mapped, only the memory that needs to be shared by the CPU and GPU.

But still, you can actually pass pointers between the CPU and GPU, regardless of whether they point to DRAM or ESRAM.
 
Maybe people are just confused because MS renamed hUMA in their docs just like they did every other AMD technology!

But actually, from what I understand the CPU can't access the ESRAM at all in Xbox One. That alone would preclude the design actually being called "hUMA" from my understanding. Whether that has any practical performance impact or not is probably a different question.

It doesn't need to be an all-or-nothing proposition. Each individual device can have storage that isn't coherent; CPUs and GPUs have plenty of buffers and queues that aren't fully coherent, or are coherent at a coarser level than per memory access.
To say otherwise would rule out the PS4 anyway, because it sports color and Z caches that are not coherent.

HSA's definition of coherence is not very precise, and it probably doesn't have to be, because it has a very, very weak memory model for non-CPU devices. The heterogeneous components at the lowest level of compliance just need to have data visible to the system at or before a semaphore or lock variable is updated. The GPU needs to flush its caches and write back to memory, invalidating any shared CPU cache lines in the process.

This level of operation is what the PS4 has been described as using as well, with some shortcuts in the GPU cache invalidation process.
If there is a difference between the two consoles, it's a finer distinction than what has been disclosed.
There would need to be some little asterisk in AMD's requirements that filters one and not the other.

It may not even be coherence, as there were other hUMA features that could be skipped or not enabled for the platform, like GPU page fault handling, or the removal of pinned memory. The other possibility is that not enough leakers know what any of that is to even leak it.
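For what that lowest level of compliance looks like from software, here's a minimal sketch (gpu_writeback_and_invalidate is a hypothetical stand-in for the real cache maintenance, not an actual API):

```cpp
// Sketch of the "data visible at or before the semaphore update" rule.
// gpu_writeback_and_invalidate() is a made-up stand-in; the point is only
// the ordering: cache maintenance first, then the release of the semaphore.
#include <atomic>

static float results[1024];
static std::atomic<int> semaphore{0};

void gpu_writeback_and_invalidate() { /* stand-in for GPU cache flush/invalidate */ }

// Conceptually the GPU side of the handoff.
void gpu_side() {
    // ... compute writes into results[] through the GPU caches ...
    gpu_writeback_and_invalidate();                 // make the data visible
    semaphore.store(1, std::memory_order_release);  // then publish
}

// The CPU only has to watch the semaphore; once it observes 1, the results
// are guaranteed to be visible, per the weak HSA-style model described above.
void cpu_side() {
    while (semaphore.load(std::memory_order_acquire) != 1) { /* spin */ }
    float first = results[0];
    (void)first;
}
```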
 
If it turns out that c't was correct, all it could mean is that the PS4's version of heterogeneous memory access is more in line with how AMD sees things. The XB1 is designed to leverage and support the trajectory of the DirectX feature set, not how AMD would do things.
 
It may not even be coherence, as there were other hUMA features that could be skipped or not enabled for the platform, like GPU page fault handling, or the removal of pinned memory. The other possibility is that not enough leakers know what any of that is to even leak it.

Maybe hUMA is the internal AMD name for 'the amendments Sony paid us to make to their APU cache'?
 
The Orbis changes are primarily shortcuts that don't change the overall process.
One possible difference is if there is a system-level forcing of the volatile bit on compute accesses to system memory, and a guaranteed automatic writeback done in hardware.

Flushing the GPU cache can accomplish this as well, but if there needs to be an explicit instruction in one implementation or a corner case where it is not guaranteed, that might be enough to make a distinction on one corner of the functionality.
 
The Orbis changes are primarily shortcuts that don't change the overall process.
One possible difference is if there is a system-level forcing of the volatile bit on compute accesses to system memory, and a guaranteed automatic writeback done in hardware.

Flushing the GPU cache can accomplish this as well, but if there needs to be an explicit instruction in one implementation or a corner case where it is not guaranteed, that might be enough to make a distinction on one corner of the functionality.

So that could be an aspect of what the XB1 guy was talking about here, in a quote originally extracted by warchild and resurrected by dobwal:

We have to invest a lot in coherency throughout the chip, so there's been I/O coherency for a while, but we really wanted to get the software out of the mode of managing caches and put in hardware coherency, for the first time on the mass scale in the living room, on the GPU.
 
This goes to my question about there being a problem with coherence. If it is automatic, the disclosed behavior doesn't seem to differ. hUMA and the overarching HSA umbrella are more descriptors of an end result, not a specific implementation.
 
If it turns out that c't was correct, all it could mean is that the PS4's version of heterogeneous memory access is more in line with how AMD sees things. The XB1 is designed to leverage and support the trajectory of the DirectX feature set, not how AMD would do things.

The problem is that the article in question didn't make that distinction, and by not doing so it created the impression of a performance deficiency in the X1. It was the headline of this article and what the journalist (if you want to call him that) wanted to draw attention to. So I would say that it was intentional on the magazine's part to spin this bit of information against the other platform.

This is why company figures have to be careful about who they sit down and grant interviews with. I can imagine that the AMD rep's answers were a lot more thorough than what the article represents, but it appears the person he spoke to probably went on a cherry-picking expedition.
 
The Orbis changes are primarily shortcuts that don't change the overall process.
One possible difference is if there is a system-level forcing of the volatile bit on compute accesses to system memory, and a guaranteed automatic writeback done in hardware.

Flushing the GPU cache can accomplish this as well, but if there needs to be an explicit instruction in one implementation or a corner case where it is not guaranteed, that might be enough to make a distinction on one corner of the functionality.

From the Gamasutra interview with Cerny.

"Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

It seems a possible distinction is that the PS4 offers finer granularity in its cache flushes, so that only data intended to be passed to the CPU is flushed while GPU-exclusive data is retained. Absent this, graphics performance would take a hit each time data has to be passed to the CPU, as the GPU waits for unnecessarily flushed cache data (unnecessary, that is, for graphics-only operations) to be pulled back from main memory (hopefully, in the XBOne's case, from the ESRAM). Is my understanding correct?
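As a toy model of why that granularity matters (purely illustrative; this is not how GCN hardware is actually programmed), compare a full flush against a selective flush of only the compute-tagged lines:

```cpp
// Toy model of the per-line 'volatile' tag idea from the Cerny quote.
// A full flush writes back every dirty line, evicting graphics data that
// then has to be refetched; a selective flush only writes back the lines
// compute marked as volatile, leaving the graphics working set resident.
#include <vector>
#include <cstddef>

struct CacheLine {
    bool dirty        = false;
    bool volatile_tag = false;   // set on lines touched by compute work
};

// Returns how many lines had to be written back to memory.
std::size_t full_flush(std::vector<CacheLine>& l2) {
    std::size_t written = 0;
    for (auto& line : l2)
        if (line.dirty) { line.dirty = false; ++written; }
    return written;
}

std::size_t selective_flush(std::vector<CacheLine>& l2) {
    std::size_t written = 0;
    for (auto& line : l2)
        if (line.dirty && line.volatile_tag) { line.dirty = false; ++written; }
    return written;
}
```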
 
Flushing a cache selectively doesn't change the overall result. The case where the whole cache is marked volatile is the same as a total flush, and in the case where there is mixed data, flushing data that isn't coherent doesn't change the outcome, just the performance cost to ongoing graphics work to some variable degree.
 