Xbox One (Durango) Technical hardware investigation

Thanks xD

Sorry for my English. I don't mean that "this article confirms how the Jaguar CPU doesn't decode the micro-ops", but rather that it is "not including a decoded micro-op cache and opting for a simpler loop buffer instead".

I see no proof the XB1 has what you claim. Can you point it out?

Also, I noticed that you linked to something on misterxmedia's blog. I would be careful with doing that; it is not a reliable source.
 
If you zoom in on the core of the GPU there are different ALUs.

As stated earlier, the ALUs are rotated. The images posted on a certain website showing close-ups are very poor evidence of anything.

I have an analogy for you: if you were to take an apple weighing 100 g or so and compared it to another 100 g apple that was cut directly in half, straight down the centre, would you assume the apple in two pieces had 'more apple' than the complete one?
 
What about the weight of the leftover bits stuck on the knife? ;p
 
The Xbox architects are on record as saying they did not mess with the standard Jaguar cores in the APU.
The leaked SDK flat out describes a standard Jaguar core.

AMD talks about the presence of Jaguar in its semicustom designs. Neither console has been described as trying anything significant with the cores.
 
The actual quote from Nick, take it for what it is:

Digital Foundry: Is it essentially the Jaguar IP as is? Or did you customise it?

Nick Baker: There had not been a two-cluster Jaguar configuration before Xbox One so there were things that had to be done in order to make that work. We wanted higher coherency between the GPU and the CPU so that was something that needed to be done, that touched a lot of the fabric around the CPU and then looking at how the Jaguar core implemented virtualisation, doing some tweaks there - but nothing fundamental to the ISA or adding instructions like that.
 
Which notably states that the work stuck to the periphery of the cores. Changing how the clusters link is an interconnect issue, and virtualization is handled predominantly in the virtual memory system. This could touch on the TLBs, if that, but it would be outside of trying to re-architect the decoders and execution pipeline.

I'm not sure what could be gained, particularly since doing so would mean effectively designing a separate core, which would reset progress in a multi-year development effort.
The upside depends on there being some glaring problem with how AMD designed its OoO engine, since nothing like the execution units or feature set was changed.
 
Something is really strange about the power consumption of the X1. Most games (Forza 5, Forza Horizon 2, Diablo 3, Destiny, Watch Dogs) seem to be capped at 90-100W. The only game I've seen so far that uses more is Ryse, at 110-115W (all without Kinect connected).
Diablo 3, for example, seems to consume 85-93W. What does Ryse do that the others don't? Ryse was developed on one of the oldest SDKs, so that should be more of a disadvantage. Also, Kinect is used for some things (it was not connected), but those resources couldn't be used by Ryse anyway.
Or is it just the older OS that is delivered with Ryse, which has poor energy-saving mechanisms?
 
Can you post a chart? And what is the method of checking power consumption? I only ask because I own most of those titles and could verify the results myself.
 
Oh, just forget it. There was a difference I did not consider: Ryse & Forza 5 are the games I've got on disc, so the disc drive might account for the extra power consumption. But then it only accesses the drive when starting a game, not when you are actually playing, though maybe it is just in standby or something like that.
 
oh okay. No worries.

Although really, would love to figure out how people are measuring their power consumption per device. I really need to track down what's getting my energy bill up.
 
So BW is a big thing, like, innit.

MS thought it worthy of mentioning that disabling Kinect for a game not only freed up part of a core, but also that it freed up about 1GB/s of additional main memory BW for games to use.

But the thing is ... MS didn't use the fastest memory that they could. DDR3 2400 is a proper legit spec, and 2400 doesn't seem to have a significant price premium to 2133. I know MS doesn't shop for memory at Amazon or New Egg, but this is interesting.

https://pcpartpicker.com/trends/memory/

Too much of a risk long term?
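
For scale, a rough peak-bandwidth comparison, assuming the stock 256-bit DDR3 interface (32 bytes per transfer):

\[
2133\ \text{MT/s} \times 32\ \text{B} \approx 68.3\ \text{GB/s}, \qquad 2400\ \text{MT/s} \times 32\ \text{B} \approx 76.8\ \text{GB/s}
\]

That would have been roughly 8.5 GB/s of extra peak bandwidth, several times the ~1 GB/s reclaimed by disabling Kinect.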
 
A 1 USD price difference at 10 million units sold (for the console) would be 10 million USD.

Even those graphs show that at the minimum price (which is closer to what MS may get, but still not indicative of what they actually pay), DDR3 2133 is more than 5 USD cheaper, often appearing to be 7 to 10 USD cheaper. There are occasions where DDR3 2400 is cheaper, but they are significantly rarer, and dips to lower prices at the low points happen more frequently for DDR3 2133. So at a minimum, at 10 million units, that's greater than 50 million USD in savings.
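
Writing that back-of-the-envelope number out, taking the 5-10 USD retail spread above at face value as a per-console figure:

\[
\$5\text{–}\$10 \times 10{,}000{,}000\ \text{consoles} \approx \$50\text{M}\text{–}\$100\text{M}
\]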

I certainly miss the days when memory was cheap. I got my 32 gigs of 1866 memory for ~100 USD back in 2012 before memory prices skyrocketed. It's rare to see 16 gigs drop that low (the average is 150-200 for 1866) nowadays. Not looking forward to eventually upgrading, as the memory is going to be massively more expensive than what I've got now. :(

Regards,
SB
 
But the thing is ... MS didn't use the fastest memory that they could. DDR3 2400 is a proper legit spec, and 2400 doesn't seem to have a significant price premium to 2133. I know MS doesn't shop for memory at Amazon or New Egg, but this is interesting.

Do you have a link to documentation or a date when DDR3 2400 was made a JEDEC standard speed?
It wasn't at the time, I don't think. Wikipedia still doesn't list it as going above 2133.
 
I could swear that a couple of months ago the Wiki page had DDR3 2400 as a JEDEC standard, but sure as hell it ain't there now...

Edit: nope, doesn't seem to be there in the history either...?
 
Going through the SDK, I found this interesting enough to share:
SIMD
The SIMD instruction set is extensive, and supports 32-bit and 64-bit integer and float data types. Operations on wider data types occupy multiple processor pipes, and therefore run at slower rates - for example, 64-bit adds are one-eighth rate, and 64-bit multiplies are 1/16 rate. Transcendental operations, such as square root, reciprocal, exponential, logarithm, sine, and cosine, are non-pipelined and run at 1/4 rate. These operations should be used sparingly on Xbox One because they are more expensive relative to arithmetic operations than they are on Xbox 360.

This might shed some light on why some engines may be running particularly badly on Xbox One.
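
To make the relative costs concrete, here is a tiny cost-model sketch built only from the rates quoted in that SDK excerpt, treating a 32-bit float add/mul as one issue slot. The "inner loop" is a made-up example, not anything from the SDK:

```cpp
// Tiny cost model built only from the rates quoted in the SDK excerpt above.
// Costs are expressed in full-rate (32-bit float add/mul) issue slots per lane.
#include <cstdio>

int main() {
    const double kFullRate       = 1.0;   // 32-bit float add/mul
    const double k64BitAdd       = 8.0;   // "64-bit adds are one-eighth rate"
    const double k64BitMul       = 16.0;  // "64-bit multiplies are 1/16 rate"
    const double kTranscendental = 4.0;   // sqrt/rcp/exp/log/sin/cos at 1/4 rate

    printf("relative costs: fp32 %.0f, 64-bit add %.0f, 64-bit mul %.0f, transcendental %.0f\n",
           kFullRate, k64BitAdd, k64BitMul, kTranscendental);

    // Hypothetical per-pixel loop body: a normalize (rsqrt + 3 muls) plus a sin().
    double per_iter_naive   = kTranscendental + 3 * kFullRate + kTranscendental;
    // Same math with the sin() hoisted out of the loop and amortized to ~0.
    double per_iter_hoisted = kTranscendental + 3 * kFullRate;

    printf("per-iteration cost: naive %.0f slots, with sin hoisted %.0f slots\n",
           per_iter_naive, per_iter_hoisted);
    return 0;
}
```

Nothing exotic, but it shows why shaving even one transcendental per iteration moves the needle more than it would with full-rate math.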
 
Someone can likely give more concrete numbers for Xenos, but the increasing cost of transcendental instructions has precedent in the removal of the 5th unit in the move from the VLIW5 GPUs to VLIW4. GCN continued the trend, so Durango shouldn't be unique in carrying that warning.

The actual impact on overall performance was muted for a number of the operations because even the dedicated unit in VLIW5 required the inputs to be adjusted or normalized into a range that the approximations could handle.
That setup work was needed with or without the dedicated unit, so the drop in throughput for the transcendental in the midst of all that padding had to be viewed in context.
This and other related posts in the thread discussing the transition showed that the overall increase in ALU count led to something of a wash, or a modest penalty, in the heaviest usage cases for transcendental ops.
https://forum.beyond3d.com/threads/amd-r9xx-speculation.47074/page-171#post-1370614
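
As a rough illustration of the "setup work" being described, here is a sketch of the kind of range reduction that has to surround a hardware sin approximation either way. The polynomial and constants are placeholders, not the actual VLIW5/GCN sequence:

```cpp
// Illustrative only: a "special function" unit (stood in for here by a short
// polynomial) only handles sin over a small interval near zero, so ordinary
// full-rate ALU work has to reduce the argument into that range first, and
// that work exists whether the transcendental itself is fast or slow.
#include <cmath>
#include <cstdio>

// Kernel valid only for small |r| (roughly |r| <= pi/4).
float sin_kernel(float r) {
    float r2 = r * r;
    return r * (1.0f - r2 * (1.0f / 6.0f - r2 * (1.0f / 120.0f)));  // r - r^3/6 + r^5/120
}

float sin_full_range(float x) {
    // Range reduction: fold x toward zero by multiples of 2*pi. This is plain
    // mul/round/mad work surrounding the "special" operation.
    const float inv_two_pi = 0.15915494f;
    const float two_pi     = 6.2831853f;
    float k = std::nearbyint(x * inv_two_pi);
    float r = x - k * two_pi;
    // A real implementation folds further into [-pi/4, pi/4] and selects
    // between sin/cos kernels; omitted to keep the sketch short.
    return sin_kernel(r);
}

int main() {
    // x chosen so the reduced argument is small enough for the toy kernel.
    float x = 100.0f;
    std::printf("approx %f vs libm %f\n", sin_full_range(x), std::sin(x));
    return 0;
}
```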
 
One thing that we haven't considered to date is the fact that the XB1 GPU uses a GPU MMU instead of the IOMMUv2 that can be found in Kaveri and the PS4. WDDM 2.0 virtual memory supports two models:

IoMmu model

In the IoMmu model each process has a single virtual address space that is shared between the CPU and graphics processing unit (GPU) and is managed by the OS memory manager.

To access memory, the GPU sends a data request to a compliant IoMmu. The request includes a shared virtual address and a process address space identifier (PASID). The IoMmu unit performs the address translation using the shared page table.

It means that for GPU-intensive applications in the IoMmu model, the GPU needs to bypass the IOMMU and use the Garlic bus (in Kaveri).

[MSDN diagram: IoMmu model]
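
A toy model of what the quoted IoMmu path amounts to: the GPU presents a PASID plus a shared virtual address, and the IOMMU walks the same per-process table the CPU uses. All names and structures here are made up for illustration, not WDDM or AMD data layouts:

```cpp
// IoMmu model, toy version: one shared page table per process, selected by
// PASID; the GPU uses the same virtual addresses as the CPU.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_map>

using VirtAddr = std::uint64_t;
using PhysAddr = std::uint64_t;
constexpr std::uint64_t kPageMask = 0xFFF;  // 4 KiB pages for the sketch

struct SharedPageTable {                     // shared by CPU and GPU
    std::unordered_map<VirtAddr, PhysAddr> pages;  // VA page -> PA page
};

std::optional<PhysAddr> iommu_translate(
        const std::unordered_map<std::uint32_t, SharedPageTable>& processes,
        std::uint32_t pasid, VirtAddr va) {
    auto proc = processes.find(pasid);       // pick the address space by PASID
    if (proc == processes.end()) return std::nullopt;
    auto pte = proc->second.pages.find(va & ~kPageMask);
    if (pte == proc->second.pages.end()) return std::nullopt;   // IO page fault
    return pte->second | (va & kPageMask);
}

int main() {
    std::unordered_map<std::uint32_t, SharedPageTable> processes;
    processes[42].pages[0x7f0000000000] = 0x180000000;  // one mapped page
    if (auto pa = iommu_translate(processes, 42, 0x7f0000000123))
        std::printf("GPU request -> PA %#llx\n", (unsigned long long)*pa);
    return 0;
}
```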



GpuMmu model

In the GpuMmu model, the graphics processing unit (GPU) has its own memory management unit (MMU) which translates per-process GPU virtual addresses to physical addresses.

Each process has separate CPU and GPU virtual address spaces that use distinct page tables. The video memory manager manages the GPU virtual address space of all processes and is in charge of allocating, growing, updating, ensuring residency and freeing page tables. The hardware format of the page tables, used by the GPU MMU, is unknown to the video memory manager and is abstracted through device driver interfaces (DDIs). The abstraction supports a multilevel translation, including a fixed size page table and a resizable root page table.

I'm not so sure about the accuracy of what I'm going to say, but it seems to me that one of the differences between these models is that in the "IoMmu model" the GPU only has access to the virtual address space shared between the CPU and GPU, but in the "GpuMmu model" the GPU can have its own virtual addresses:

Similarly, in a linked display adapter configuration, the user mode driver may explicitly map GPU virtual address to specific allocation instances and choose for each mapping whether the mapping should be to self or to a specific peer GPU. In this model, the CPU and GPU virtual addresses assigned to an allocation are independent. A user mode driver may decide to keep them the same in both address spaces or keep them independent.

It gives the GPU the ability to "eliminate the need for the video memory manager to inspect and patch every command buffer before submission to a GPU engine" while it works on its own virtual addresses in GPU-intensive applications through the Garlic bus (the XB1 case):

Under Windows Display Driver Model (WDDM) v1.x, the device driver interface (DDI) is built such that graphics processing unit (GPU) engines are expected to reference memory through segment physical addresses. As segments are shared across applications and over committed, resources gets relocated through their lifetime and their assigned physical addresses change. This leads to the need to track memory references inside command buffers through allocation and patch location lists, and to patch those buffers with the correct physical memory reference before submission to a GPU engine. This tracking and patching is expensive and essentially imposes a scheduling model where the video memory manager has to inspect every packet before it can be submitted to an engine.

As more hardware vendors move toward a hardware based scheduling model, where work is submitted to the GPU directly from user mode and where the GPU manages the various queue of work itself, it is necessary to eliminate the need for the video memory manager to inspect and patch every command buffer before submission to a GPU engine.

https://msdn.microsoft.com/en-us/library/windows/hardware/dn932167(v=vs.85).aspx
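
And a matching toy version of the GpuMmu side: a per-process, multi-level GPU page table where a command buffer can carry GPU virtual addresses directly, and relocating an allocation only means rewriting a page-table entry rather than patching the command buffer. The field widths and two-level layout are invented; the real format is driver-defined behind the DDIs, which is the point of the abstraction described above:

```cpp
// GpuMmu model, toy version: one GPU address space per process with a
// two-level page table; the video memory manager updates PTEs when an
// allocation moves, and the GPU VA baked into the command buffer stays valid.
#include <array>
#include <cstdint>
#include <cstdio>
#include <memory>

constexpr int kPageBits = 16;               // 64 KiB GPU pages (illustrative)
constexpr int kL1Bits = 10, kL2Bits = 10;

struct L2Table { std::array<std::uint64_t, 1u << kL2Bits> pte{}; };  // PA or 0
struct L1Table { std::array<std::unique_ptr<L2Table>, 1u << kL1Bits> dir; };

struct GpuAddressSpace {                    // one per process
    L1Table root;
    void map(std::uint64_t va, std::uint64_t pa) {
        auto l1 = (va >> (kPageBits + kL2Bits)) & ((1u << kL1Bits) - 1);
        auto l2 = (va >> kPageBits) & ((1u << kL2Bits) - 1);
        if (!root.dir[l1]) root.dir[l1] = std::make_unique<L2Table>();
        root.dir[l1]->pte[l2] = pa;         // a "move" only rewrites this entry
    }
    std::uint64_t translate(std::uint64_t va) const {
        auto l1 = (va >> (kPageBits + kL2Bits)) & ((1u << kL1Bits) - 1);
        auto l2 = (va >> kPageBits) & ((1u << kL2Bits) - 1);
        if (!root.dir[l1] || !root.dir[l1]->pte[l2]) return 0;  // GPU page fault
        return root.dir[l1]->pte[l2] + (va & ((1ull << kPageBits) - 1));
    }
};

int main() {
    GpuAddressSpace proc;
    const std::uint64_t va = 0x123450000;   // GPU VA referenced by a command buffer
    proc.map(va, 0x80000000);               // initial placement
    std::printf("before move: %#llx\n", (unsigned long long)proc.translate(va + 0x10));
    proc.map(va, 0xC0000000);               // allocation relocated: remap only
    std::printf("after  move: %#llx\n", (unsigned long long)proc.translate(va + 0x10));
    return 0;
}
```

The same remapping trick is what the tiled-resources section below relies on: the tile pool can move around in memory while its GPU virtual addresses stay put.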

This section seems to be interesting too:
[MSDN diagram: tiled resources (1)]


To solve this problem the video memory manager does two things. First, it maps a GPU virtual address to all of the tile pool elements belonging to a process inside the process privileged address space. As tile pool moves around in memory, the video memory manager automatically keeps those GPU virtual address pointing to the right location for the tile pool using the same simple mechanism it does for any other allocation type.


[MSDN diagram: tiled resources (2)]


More about the XB1 GPU MMU here:

http://www.computer.org/csdl/mags/mi/preprint/06756701.pdf
 