Xbox One (Durango) Technical hardware investigation

I've been mulling this over, and I'm uncertain if that isn't mixing two different levels of abstraction.
The WDDM description is for a general model provided by the system to software, and it would be applicable to a wide range of hardware--from discrete cards to APUs like Kaveri. Because it applies just as much to Kaveri as others, I don't know if we can use the description of WDDM to find a distinction between Durango and Orbis.
The hardware basis is still highly common between the two, as even Trinity supports IOMMU functionality.

The IoMmu mode may expose more of the IOMMU to software, but that may not mean the IOMMU goes unused in the GpuMmu case. GpuMmu has much more documented about it, including what services the system provides for abstracting away certain resource management actions. How those services are implemented may include hardware methods for accelerating the mapping and transfer of allocations, which an IOMMU can help with. Certain elements like user-space queues have come about for AMD hardware because the GPU is now properly interfaced with the virtual memory system, which the IOMMU enables.

While the PS4's structure is not as well-documented, some level of abstraction and GPU memory management should be there as well. Various game presentations point out the GPU and system memory allocations, so not everything is treated as flatly as IoMmu mode would suggest.
 
In the first line it says that in the GpuMmu model, the GPU has its own MMU:

In the GpuMmu model, the graphics processing unit (GPU) has its own memory management unit (MMU) which translates per-process GPU virtual addresses to physical addresses.

Each process has separate CPU and GPU virtual address spaces that use distinct page tables. The video memory manager manages the GPU virtual address space of all processes and is in charge of allocating, growing, updating, ensuring residency and freeing page tables. The hardware format of the page tables, used by the GPU MMU, is unknown to the video memory manager and is abstracted through device driver interfaces (DDIs). The abstraction supports a multilevel level translation, including a fixed size page table and a resizable root page table.
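
To make the multilevel-translation part a bit more concrete, here is a minimal sketch of what such a per-process GPU page-table walk could look like. The two-level layout, 4 KB pages, 512 entries per table, and the valid bit are assumptions for illustration only; the actual hardware page-table format is exactly what WDDM hides behind the DDIs.

[CODE]
/* Minimal sketch of a multilevel GPU page-table walk, assuming 4 KB pages,
 * 512 entries per table, and a two-level layout (resizable root table plus
 * fixed-size leaf tables). The real format is abstracted behind the DDIs. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12u                   /* assumed 4 KB pages */
#define ENTRIES    512u                  /* assumed entries per table */
#define PTE_VALID  0x1u

typedef struct {
    uint64_t entries[ENTRIES];           /* leaf PTEs: physical page | flags */
} gpu_page_table_t;

typedef struct {
    gpu_page_table_t *tables[ENTRIES];   /* per-process resizable root table */
} gpu_root_table_t;

/* Translate a per-process GPU virtual address; returns 0 if the page is
 * not mapped (the video memory manager would have to make it resident). */
uint64_t gpu_mmu_translate(const gpu_root_table_t *root, uint64_t gpu_va)
{
    uint64_t page   = gpu_va >> PAGE_SHIFT;
    uint64_t offset = gpu_va & ((1u << PAGE_SHIFT) - 1u);
    uint64_t l1     = page / ENTRIES;    /* index into the root table */
    uint64_t l0     = page % ENTRIES;    /* index into the leaf table */

    if (l1 >= ENTRIES || root->tables[l1] == NULL)
        return 0;

    uint64_t pte = root->tables[l1]->entries[l0];
    if (!(pte & PTE_VALID))
        return 0;

    return (pte & ~(uint64_t)((1u << PAGE_SHIFT) - 1u)) | offset;
}
[/CODE]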

Xbox One GPU has its own MMU (not IOMMU):

MMU hardware maps guest virtual addresses to guest physical addresses to physical addresses for virtualization and security. The implementation sizes caching of fully translated page addresses and uses large pages where appropriate to avoid significant performance impact from the two-dimensional translation. System software manages physical memory allocation. System software and hardware keep page tables synchronized so that CPU, GPU, and other processors can share memory, pass pointers rather than copying data, and a linear data structure in a GPU or CPU virtual space can have physical pages scattered in DRAM and SRAM. The unified memory system frees applications from the mechanics of where data is located, but GPU-intensive applications can specify which data should be in SRAM for best performance.

The GPU graphics core and several specialized processors share the GPU MMU, which supports 16 virtual spaces. PCIe input and output and audio processors share the IO MMU, which supports virtual spaces for each PCI bus/device/function. Each CPU core has its own MMU (CPU access to SRAM maps through a CPU MMU and the GPU MMU).

http://www.computer.org/csdl/mags/mi/preprint/06756701.pdf
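
As a rough illustration of the two-dimensional translation and the caching of fully translated page addresses the paper mentions, here is a hedged sketch. The flat stand-in table walks, the cache size, and the mappings are things I made up for the example, not the actual hardware structures.

[CODE]
/* Hedged sketch of two-dimensional (nested) translation: guest virtual ->
 * guest physical -> system physical, with a small cache of fully translated
 * pages so a hit skips both walks. Walks, cache size, and mappings are
 * illustrative assumptions only. */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12u
#define TLB_SIZE   64u                       /* assumed cache size */

typedef struct {
    bool     valid;
    uint64_t gv_page;                        /* guest virtual page number   */
    uint64_t sp_page;                        /* system physical page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SIZE];

/* Stand-ins for the two page-table walks; a real MMU walks multilevel tables. */
static uint64_t walk_guest_tables(uint64_t guest_va) { return guest_va; }
static uint64_t walk_host_tables(uint64_t guest_pa)  { return guest_pa + (1ull << 30); }

uint64_t translate_2d(uint64_t guest_va)
{
    uint64_t page   = guest_va >> PAGE_SHIFT;
    uint64_t offset = guest_va & ((1u << PAGE_SHIFT) - 1u);
    tlb_entry_t *e  = &tlb[page % TLB_SIZE];

    if (e->valid && e->gv_page == page)               /* hit: skip both walks */
        return (e->sp_page << PAGE_SHIFT) | offset;

    uint64_t guest_pa  = walk_guest_tables(guest_va); /* first dimension  */
    uint64_t system_pa = walk_host_tables(guest_pa);  /* second dimension */

    e->valid   = true;                                /* cache the full     */
    e->gv_page = page;                                /* end-to-end result  */
    e->sp_page = system_pa >> PAGE_SHIFT;
    return (e->sp_page << PAGE_SHIFT) | offset;
}
[/CODE]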

Kaveri and Trinity utilize IOMMUv2 (which is necessary for HSA) and IOMMUv1, respectively. There is no official information about the PS4's virtual memory system, but its similarity to Kaveri in other areas suggests that it uses a similar solution (IOMMUv2). XB1 clearly supports a GPU MMU.
 
...oh, come on. How do you think GCN can have virtual addressing for its shaders without an MMU? An IOMMU interface is clearly not suitable for that.

What those documents want to dissect is the difference between having GPU addressing mapped into CPU space using the IOMMU, where pointers have to be 'fixed' because the IOMMU mapping uses physical addresses in the high 'physical' memory area, and using full page-table (virtual address) mapping, where the physical address is obviously resolved by an MMU.
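
To show what I mean by 'fixing' pointers versus letting an MMU resolve virtual addresses, here is a purely illustrative sketch (not any real driver API). In the first scheme every pointer embedded in the buffer has to be relocated by the device offset before the GPU can chase it; in the second the same pointer value is valid on both sides.

[CODE]
/* Illustrative contrast (no real driver API): relocating ("fixing") embedded
 * pointers when the GPU sees a buffer at a different address, versus handing
 * the same pointer over when CPU and GPU share one virtual address space. */
#include <stdint.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
    float        value;
} node_t;

/* Scheme A: the GPU maps the buffer at cpu_address + device_delta, so every
 * pointer stored inside it must be rewritten by that delta before submission. */
void fix_up_pointers(node_t *list, ptrdiff_t device_delta)
{
    node_t *n = list;
    while (n != NULL) {
        node_t *cpu_next = n->next;              /* keep the CPU view to walk on */
        if (n->next != NULL)
            n->next = (node_t *)((uintptr_t)n->next + (uintptr_t)device_delta);
        n = cpu_next;
    }
}

/* Scheme B: full page-table (virtual address) mapping -- the GPU's MMU
 * resolves the same virtual address, so the list is passed along unchanged. */
const node_t *share_pointer(const node_t *list)
{
    return list;                                 /* no fix-up pass required */
}
[/CODE]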
 
If it's a solved problem in GCN, then why did AMD publish a proof-of-concept GPU MMU design paper like this in 2014?

http://research.cs.wisc.edu/multifacet/papers/hpca14_gpummu_appendix.pdf

Where did you get this information from? I can't find anything about a GCN MMU unit. But I found a brief explanation of Kepler and its MMU here (as the world's first truly virtualizable GPU).
 
I was thinking about PRT. Look, for example, here: http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/6

If a texture is missing, and you are using a page table, then you must have (some form of) an MMU. Maybe it's not as polished as the CPU one, but how else would you manage PRT's page table?


...maybe I misunderstood it, but the Onion bus, per AMD's description, allows the GPU to access CPU memory, so some form of bookkeeping must be kept somehow.
 
The Onion/Onion+ buses access CPU memory through the IOMMU. Read up on the IoMmu model.
 
OK, but what about PRT? As far as I understand, it must have some page table management, no? Otherwise, how would you make something that behaves like PRT without (some kind of) virtual memory?
I mean, your shader accesses some texture that is not there - somehow, the GPU must have some awareness of that.
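
Here is a rough CPU-side sketch of the idea I'm getting at: each PRT texture carries a page-table-like residency map, a sample checks the entry for its tile first, and a miss returns a fallback texel and records the tile for streaming. The tile size, texture size, and feedback stub are my own assumptions; on GCN the shader actually gets the residency status back from the texture fetch itself.

[CODE]
/* Rough CPU-side sketch of a partially resident texture lookup: a per-texture
 * residency map plays the role of the page table, a sample checks its tile's
 * entry first, and a miss returns a fallback texel and records the tile so the
 * application can stream it in. Tile size, texture size, and the feedback stub
 * are assumptions for illustration only. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TILE_SIZE      64u                       /* assumed 64x64 texel tiles */
#define TILES_PER_SIDE 128u                      /* assumed 8192x8192 texture */

typedef struct {
    bool            resident[TILES_PER_SIDE][TILES_PER_SIDE]; /* the "page table" */
    const uint32_t *tile_data[TILES_PER_SIDE][TILES_PER_SIDE];
    uint32_t        fallback_texel;              /* e.g. from an always-resident mip */
} prt_texture_t;

/* Stub: a real system would append the request to a feedback buffer. */
static void request_tile_stream_in(uint32_t tx, uint32_t ty)
{
    printf("stream in tile (%u, %u)\n", (unsigned)tx, (unsigned)ty);
}

uint32_t prt_sample(prt_texture_t *tex, uint32_t x, uint32_t y)
{
    uint32_t tx = x / TILE_SIZE, ty = y / TILE_SIZE;

    if (!tex->resident[ty][tx]) {                /* "the texture is not there" */
        request_tile_stream_in(tx, ty);          /* ask for it to be paged in  */
        return tex->fallback_texel;              /* keep rendering meanwhile   */
    }
    return tex->tile_data[ty][tx][(y % TILE_SIZE) * TILE_SIZE + (x % TILE_SIZE)];
}
[/CODE]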
 
In the first line it says that in the GpuMmu model, the GPU has its own MMU:
In the case of the AMD GPUs we are discussing, I think they all have one.
WDDM is presenting an abstraction to the software, but does not dictate the physical implementation. That there is an option for using the virtual versus physical sub-mode stems from differing vendor implementations.
Since AMD's GPUs have featuresets that align so well with the more independent mode, I suspect AMD didn't excise the existing hardware from Durango.

Xbox One GPU has its own MMU (not IOMMU):
It has an Onion bus, and at some point it is updating to DX12, which indicates to me that there is a very high amount of IP commonality. What the platform chose to expose or diagram at a high level does not preclude it being used. The system services do not need to expose how they go about their business to the software above the driver.
 
IOMMUv2 is an AMD term. GPUMMU is an MS term in this context. One should not assume that MS and AMD terms can be mixed and matched so easily, especially when they readily use different terms in labelling figures in overviews of the same hardware.

The GCN whitepaper states that GCN incorporates an IOMMU. The GPUMMU or IOMMU terms, when used by MS in reference to WDDM 2.0, may be used to differentiate solutions, such as an SOC with unified memory versus a setup composed of discrete parts and/or separate memory pools. The XB1 isn't your average AMD SOC: it doesn't employ a unified memory setup, as eSRAM is only visible to the GPU but is part of the GPU's virtual memory system, while the audio setup may be labelled as IOMMU since its memory management involves sharing a common address space with the CPU.
 
In the case of the AMD GPUs we are discussing, I think they all have one.
WDDM is presenting an abstraction to the software, but does not dictate the physical implementation. That there is an option for using the virtual versus physical sub-mode stems from differing vendor implementations.
Since AMD's GPUs have featuresets that align so well with the more independent mode, I suspect AMD didn't excise the existing hardware from Durango.

If AMD GPUs had their own MMU, then there would be no need for them to use an IOMMU alongside it. A GPU MMU has far more capabilities than an IOMMU:

Although current GPUs have virtual memory support, it is incompatible with the CPU virtual memory. Current GPU virtual memory, including IOMMU implementations [38], use a separate page table from the CPU process which is initialized by the GPU driver when the kernel is launched. Additionally, GPU virtual memory does not support demand paging or on the fly page table modifications by the operating system. This lack of compatibility increases the difficulty of writing truly heterogeneous applications.

The proof-of-concept GPU MMU design analyzed in this paper shows that decreasing the complexity of programming the GPU without incurring significant overheads is possible, opening the door to novel heterogeneous workloads.

http://research.cs.wisc.edu/multifacet/papers/hpca14_gpummu_appendix.pdf

Current heterogeneous systems use rigid programming models that require separate page tables, data replication, and manual data movement between the CPU and GPU. This is especially problematic for pointer-based data structures (e.g., linked lists, trees)1 . Recent work tries to address this using various smarter memory management schemes [20, 21, 25, 26]. Furthermore, latest CUDA releases permit limited CPU/GPU virtual address sharing [57]. However, none solve the problem using as general and flexible an approach as unified address spaces. A critical step to unified address spaces is to implement address translation in GPUs. As a first step, Intel and AMD equip today’s GPUs with Input Output Memory Management Units [1, 2, 23] (IOMMUs) with their own page tables, TLBs, and PTWs. These IOMMUs have large TLBs and are placed in the memory controller, making GPU caches virtually-addressed.

http://www.cs.rutgers.edu/~abhib/bpichai-asplos14.pdf
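
As a small illustration of the "data replication and manual data movement" those papers describe (the structures and the flattening step are mine, purely for illustration): without a shared address space the GPU cannot chase CPU pointers, so a linked list has to be flattened into an index-linked buffer before it can be copied over.

[CODE]
/* Sketch of the "data replication and manual data movement" problem: without
 * a shared address space the GPU cannot chase CPU pointers, so a linked list
 * gets flattened into an index-linked buffer that must then be copied to
 * device-visible memory. Structures and the flattening step are illustrative. */
#include <stdint.h>
#include <stdlib.h>

typedef struct cpu_node {
    struct cpu_node *next;      /* CPU virtual address: meaningless to the GPU */
    float            value;
} cpu_node_t;

typedef struct {
    int32_t next_index;         /* -1 terminates: valid in any address space */
    float   value;
} gpu_node_t;

/* Walk the CPU list in order and emit an index-linked copy the GPU can walk. */
gpu_node_t *flatten_for_gpu(const cpu_node_t *head, size_t count)
{
    gpu_node_t *flat = malloc(count * sizeof *flat);
    if (flat == NULL)
        return NULL;

    size_t i = 0;
    for (const cpu_node_t *n = head; n != NULL && i < count; n = n->next, ++i) {
        flat[i].value      = n->value;
        flat[i].next_index = (n->next != NULL) ? (int32_t)(i + 1) : -1;
    }
    return flat;                /* still needs a copy into GPU-visible memory */
}
[/CODE]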

Also, there is no sign of a GPU MMU in AMD GPUs in their documentation:

http://developer.amd.com/wordpress/media/2012/10/488821.pdf
http://developer.amd.com/wordpress/media/2012/10/34434-IOMMU-Rev_1.26_2-11-09.pdf

A GPU MMU has obvious superiority over the IOMMU model. WDDM does not dictate the physical implementation, but it could have different support modes for different physical implementations.

It has an Onion bus, and at some point it is updating to DX12, which indicates to me that there is a very high amount of IP commonality. What the platform chose to expose or diagram at a high level does not preclude it being used. The system services do not need to expose how they go about their business to the software above the driver.

The XB1 has an Onion bus, but it uses a Chive bus instead of Onion+ (an AMD spokesman confirmed this). Also, the XB1 paper has very good detail about the GPU MMU and its main functionality. There is also good information about the XB1 IO MMU and its usage (which has no relation to the GPU). So I think it's better to talk about what we know or what we can learn from these papers and documents, instead of wasting our time on things that are invisible (and in my opinion nonexistent), like a GPU MMU in AMD GPUs, which I couldn't find any reference to.

IOMMUv2 is an AMD term. GPUMMU is an MS term in this context. One should not assume that MS and AMD terms can be mixed and matched so easily, especially when they readily use different terms in labelling figures in overviews of the same hardware.

The GCN whitepaper states that GCN incorporates an IOMMU. The GPUMMU or IOMMU terms, when used by MS in reference to WDDM 2.0, may be used to differentiate solutions, such as an SOC with unified memory versus a setup composed of discrete parts and/or separate memory pools. The XB1 isn't your average AMD SOC: it doesn't employ a unified memory setup, as eSRAM is only visible to the GPU but is part of the GPU's virtual memory system, while the audio setup may be labelled as IOMMU since its memory management involves sharing a common address space with the CPU.

An IOMMU and an MMU are two different things, and it would be foolish of MS to use the terms arbitrarily. The XB1 has "Unified, but not uniform, main memory" and supports "Universal host-guest virtual memory management". XB1 CPU access to SRAM maps through a CPU MMU and the GPU MMU.

System software manages physical memory allocation. System software and hardware keep page tables synchronized so that CPU, GPU, and other processors can share memory, pass pointers rather than copying data, and a linear data structure in a GPU or CPU virtual space can have physical pages scattered in DRAM and SRAM. The unified memory system frees applications from the mechanics of where data is located, but GPU-intensive applications can specify which data should be in SRAM for best performance.
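
As a toy illustration of the "physical pages scattered in DRAM and SRAM" point (the pool tags, sizes, and placement are my own assumptions, not the actual page-table format): a linear virtual range can be backed page by page from either pool, and only the page-table entries know which.

[CODE]
/* Toy illustration of a linear GPU virtual range whose pages are scattered
 * across DRAM and eSRAM: only the page-table entries know which pool backs
 * each page. Pool tags, sizes, and placement are assumptions. */
#include <stdint.h>
#include <stdio.h>

enum pool { POOL_DRAM, POOL_ESRAM };

typedef struct {
    enum pool pool;             /* which physical pool backs this page */
    uint64_t  phys_page;        /* page number within that pool        */
} pte_t;

int main(void)
{
    /* A 64 KB linear buffer (16 x 4 KB pages): hot first half steered into
     * eSRAM, the rest left in DRAM -- the shader just sees one linear range. */
    pte_t table[16];
    for (unsigned i = 0; i < 16; ++i) {
        table[i].pool      = (i < 8) ? POOL_ESRAM : POOL_DRAM;
        table[i].phys_page = 1000u + i;          /* arbitrary placement */
    }
    for (unsigned i = 0; i < 16; ++i)
        printf("VA page %2u -> %s page %llu\n", i,
               table[i].pool == POOL_ESRAM ? "eSRAM" : "DRAM ",
               (unsigned long long)table[i].phys_page);
    return 0;
}
[/CODE]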
 
If AMD GPUs had their own MMU, then there would be no need for them to use an IOMMU alongside it. A GPU MMU has far more capabilities than an IOMMU:
With AMD's shared memory, the IOMMU serves as the physical gatekeeper for the necessary page translations. It is there whether it's referenced or not in a high-level WDDM document that has to apply to multiple GPU vendors. GCN architecturally relies on the IOMMU.

A GPU MMU has obvious superiority over the IOMMU model. WDDM does not dictate the physical implementation, but it could have different support modes for different physical implementations.
The GpuMmu model has more fully featured abstractions for handling residency, formats, and allocation via OS facilities. This is rich in features, but the management process includes transfers to and from privileged memory and internal sleight of hand.

The IoMmu model provides a fully shared page model with the sort of flexibility and low-level exposure one can get when running CPU code. Superiority for the solutions is going to depend on what the goals are.
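
To make the contrast concrete, here is a hedged, purely illustrative sketch of the two flows. None of these calls are real WDDM DDIs or driver entry points; they are hypothetical stand-ins for the kind of bookkeeping each model implies for the application or runtime.

[CODE]
/* Hedged, purely illustrative contrast of the two flows. None of these calls
 * are real WDDM DDIs or driver entry points; they are hypothetical stand-ins
 * for the kind of bookkeeping each model implies. */
#include <stddef.h>
#include <stdio.h>

static void make_resident(void *p, size_t n) { printf("make resident %p (%zu bytes)\n", p, n); }
static void evict_if_needed(void *p)         { printf("maybe evict   %p\n", p); }
static void submit_gpu_work(void *p)         { printf("submit work   %p\n", p); }

/* GpuMmu-style flow: an OS-managed allocation has to be made resident (which
 * may involve moving data into privileged memory) before the GPU touches it,
 * and it may be evicted again afterwards. */
void run_with_managed_residency(void *allocation, size_t size)
{
    make_resident(allocation, size);
    submit_gpu_work(allocation);
    evict_if_needed(allocation);
}

/* IoMmu-style flow: CPU and GPU share page tables, so an ordinary pointer is
 * handed straight over with no residency bookkeeping in the application. */
void run_with_shared_pages(void *cpu_pointer)
{
    submit_gpu_work(cpu_pointer);
}
[/CODE]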

The XB1 has an Onion bus, but it uses a Chive bus instead of Onion+ (an AMD spokesman confirmed this). Also, the XB1 paper has very good detail about the GPU MMU and its main functionality.
Onion with or without the + has the IOMMU involved.
Apparently the GPUMMU in Durango physically exists in a similar spot and performs all the duties that an IOMMU does in all the other GCN implementations.
(edit: perhaps more correctly, the IOMMU and the GPU memory subsystem, which all APUs maintain)
Maybe in a chip that didn't have a separate MMU for a secondary IO cluster, there wouldn't be a need for disambiguation.
 
WRT buses: I pulled this from the SDK. Not sure if this helps or makes things more confusing; I've not fully wrapped my head around this MMU business yet, and I prefer not to, lol.

[attached image: SDK excerpt on the buses]
 
Wow, clickbait title; they went from a job posting to that conclusion? Do you read these before you post them?
That article links to a job description that doesn't support the conclusion of the article. The job is for a person that will support developers in performing optimizations. It's not about improving xbox one performance on a general level. It's someone who can help devs if they're having problems.
OK, you are right, maybe I shouldn't have posted it that way; it doesn't say much about the console. Anyway, for anyone interested, here is part of the text from the actual job offer:

Candidate Requirements:
- 3+ years’ experience in game development with console experience strongly preferred.
- Thorough knowledge of C and C++.
- Experience in one or more of the following: DirectX, GPU performance optimization, HLSL, texture formats and compression.
- Past accomplishments in areas of problem solving, presentation skills and effective one-on-one communication are key to success in this role.
- Some travel; typically two or three trips per year entailing ~10% of work time.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, gender, sexual orientation, gender identity or expression, religion, national origin, marital status, age, disability, veteran status, genetic information, or any other protected status.
 