Xbox One (Durango) Technical hardware investigation

I don't remember seeing any research showing Forward+ performing better than Deferred at 1xMSAA.
In my tests, I've found a few cases where F+ was faster, e.g.:
-very simple lighting, where the G-buffer creation pass was just overhead (e.g. outdoors with one sun light)
-lighting of transparent objects (with deferred, people usually fall back to some per-object light-list solution, while F+ can reuse the same tiles and doesn't waste calculations on lights that are out of radius) -- see the sketch below
-simple shading (with a G-buffer you effectively become bandwidth bound; spending ALUs on F+ instead seems to scale better).

But in a lot of other cases it's just slower (tested on ATI and NV).
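
To make the transparent-object point concrete, here's a minimal sketch of the tile culling step in plain Python (toy tile/light data and a coarse screen-space test of my own, not any particular engine's code; real implementations do this per tile in a compute shader). The point is that the same culled per-tile lists feed both the opaque and the transparent shading passes.

Code:
from dataclasses import dataclass

@dataclass
class Light:
    x: float
    y: float
    radius: float

@dataclass
class Tile:
    x0: float
    y0: float
    x1: float
    y1: float

def tile_overlaps(tile, light):
    # Clamp the light centre to the tile rectangle, then compare to the radius.
    cx = min(max(light.x, tile.x0), tile.x1)
    cy = min(max(light.y, tile.y0), tile.y1)
    return (cx - light.x) ** 2 + (cy - light.y) ** 2 <= light.radius ** 2

def build_tile_light_lists(tiles, lights):
    """One culled light list per tile; reused for opaque *and* transparent surfaces."""
    return [[l for l in lights if tile_overlaps(t, l)] for t in tiles]

# 2x2 tiles over a 32x32 screen, two lights
tiles = [Tile(0, 0, 16, 16), Tile(16, 0, 32, 16), Tile(0, 16, 16, 32), Tile(16, 16, 32, 32)]
lights = [Light(8, 8, 5), Light(30, 30, 10)]
print([len(lst) for lst in build_tile_light_lists(tiles, lights)])  # [1, 0, 0, 1]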

One can argue for "tiling" of course, but we all know how popular that was on X360 where MSAA was really for free.
Free on X360? Not sure if that's sarcasm.
 
Yeah and on PS4, PC, Mobile etc... It's part of the UE4 source code and in no way exclusive to X1 as Cyan's post was making it sound.
To be honest, it never ever crossed my mind to even hint at the fact that it wouldn't run on other platforms. Like Pixel said, my intention behind the post was to emphasize that Lionhead could use that technology with many first parties.

Talking of which, I wonder if the technology from Lionhead called Mega Mesh is going to make the cut on the Xbox One!

They modelled, rendered and lit a world made of 100 billion polygons with the Mega Meshes technology -- on an X360. :O :smile2:

 
Since over on the PS4 tech thread we are on the topic of L2 SRAM cache cross-cluster access times: is that 2MB SRAM on the Xboxen situated between the two clusters for cross-cluster operations? Its placement between the cores would improve access times for data shared between the two clusters.

What did the vgleaks Durango docs say about that small SRAM situated between the two clusters?
[Image: XB1SOC-2.jpg -- Xbox One SoC die shot]
 
What did the vgleaks Durango docs say about that small SRAM situated between the two clusters

Don't know what they say, but IMHO this is what they could/should keep there: page tables (plus the hypervisor, for security, maybe).
You have 3 VMs, and you need to page-walk very quickly, given the TLB thrashing you're likely to have there...
L2 half rate? Yeah, that could be a good explanation for the horrid latencies... but I doubt the Jaguar blocks are heavily customized. If it's full speed elsewhere, it will be full speed there also, I bet.
 
What did the vgleaks Durango docs say about that small SRAM situated between the two clusters?
I don't think there was any info on the mysterious SRAM pool prior to the revelation of the die-shot and there's still nothing about it since.
 
Since we are on the topic of L2 SRAM cache cross-cluster access times: is that 2MB SRAM on the Xboxen situated between the two clusters for cross-cluster operations? Its placement between the cores would improve access times for data shared between the two clusters.
The hierarchy as disclosed does not mention this additional SRAM block, and it's not a clear win.
No matter how close it is, putting it on the cross-cluster miss path would add latency in any case where the cache probe has to move on to the other L2/L1.

What did the vgleaks Durango docs say about that small SRAM situated between the two clusters?
Nothing, which might be because that documentation wasn't leaked. Another possibility is that it isn't discussed in dev docs because that storage is not programmer-relevant.
 
Fascinating, thanks. The conclusion you guys have logically come to makes sense, that this could be for TLB/page tables for the VMs. I'm not familiar with TLBs. You wouldn't want to consume Jaguar L2 cache space for that and take it away from developers. This being a dev doc leak means there would be no need to mention the cache if it's not programmer-relevant, much like there's no info in those docs about the embedded 8GB NAND for OS/system/background/non-gaming functionality (which facilitates multitasking), which devs also have no access to and no need to know about.
 
Don't know what they say, but IMHO this is what they could/should keep there: page tables (plus the hypervisor, for security, maybe).
You have 3 VMs, and you need to page-walk very quickly, given the TLB thrashing you're likely to have there...
This requires a number of assumptions.
The first is that AMD's cache hierarchy would distinguish page table data from other cached data. The normal behavior is that page table data is readily cached, and moves up and down the hierarchy.
Since the OS needs to be able to change page table bits, it certainly helps to be able to access that memory normally.
AMD has experience with what happens when TLB management and caching don't play nice, so mucking with this is higher-risk, and it hasn't been mentioned that Jaguar has changed this part of the system architecture.

This also requires that the extra SRAM be cache-coherent with the modules, since page table traffic and updates are just memory traffic.
How this SRAM would know that memory filling or spilling from the modules is for page data without housing and looking up the relevant page directory and table data for every access is another challenge.
This storage isn't significantly larger than the Jaguar L2s, so it would be quickly thrashed without this filtering.

The actual page tables for an AMD64 address space and a multi-gigabyte physical memory space can readily exceed on-die storage as well.
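
To put a rough number on that, assuming plain x86-64 4-level paging (4KB tables holding 512 eight-byte entries, nothing Durango-specific):

Code:
ENTRY_BYTES = 8
ENTRIES_PER_TABLE = 512
PAGE_4K = 4 << 10
GiB = 1 << 30

mapped = 8 * GiB                              # one VM mapping the full 8GB
pages = mapped // PAGE_4K                     # 2,097,152 4KB pages
leaf_tables = -(-pages // ENTRIES_PER_TABLE)  # 4,096 page tables (ceiling division)
print((leaf_tables * ENTRIES_PER_TABLE * ENTRY_BYTES) >> 20)  # 16 (MB) of leaf PTEs alone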

L2 half rate? Yeah, that could be a good explanation for the horrid latencies... but I doubt the Jaguar blocks are heavily customized. If it's full speed elsewhere, it will be full speed there also, I bet.
Half-speed L2 arrays might add several core clock cycles of latency.
Numbers ranging from 188 to 192 don't seem like a big thing to quibble over.
 
Yield redundancy for remapping of bad cells.

Why is this in the PS4 tech thread anyway?

Seems reasonable, but isn't that kind of redundancy usually found within the banks themselves rather than in a completely separate area on the die? I'm not familiar with chip design beyond what I've read in generalist articles on CPU design, so it's genuine curiosity rather than a criticism of the idea. Would using 'spare' die space for redundancy allow you to remove the redundancy from the main bank and produce a more easily manufacturable shape? I believe the more square the chip, the more you can get per wafer, so perhaps an off-bank redundancy scheme made for more efficient manufacture?

We may never know as I don't think we ever got details about the die layout for either the PS3 or the 360 from the manufacturers. We're so far off topic here though, should we move discussion to the Durango thread?
 
Yield redundancy for remapping of bad cells.
That doesn't seem likely, considering the physical distances involved. Also, I don't think ~2MB of yield-redundancy SRAM would be an effective use of die space; that sounds way over the top, really. If you have upwards of 2MB's worth of bad SRAM on a die, the rest of the die would be utterly ruined by flaws anyhow...
 
That doesn't seem likely, considering the physical distances involved. Also, I don't think ~2MB of yield-redundancy SRAM would be an effective use of die space; that sounds way over the top, really. If you have upwards of 2MB's worth of bad SRAM on a die, the rest of the die would be utterly ruined by flaws anyhow...

They said during the Hot Chips event that the ESRAM was self-healing, so its (the ESRAM's) existence and even its size have little impact on yields. That's definitely not for redundancy. And its location means it's obviously for CPU usage (pMax suggests page tables and such; 3dilettante, however, suggests caution for various reasons before jumping to that conclusion).
 
This requires a number of assumptions.
The first is that AMD's cache hierarchy would distinguish page table data from other cached data. The normal behavior is that page table data is readily cached, and moves up and down the hierarchy.
Since the OS needs to be able to change page table bits, it certainly helps to be able to access that memory normally.
AMD has experience with what happens when TLB management and caching don't play nice, so mucking with this is higher-risk, and it hasn't been mentioned that Jaguar has changed this part of the system architecture.

This also requires that the extra SRAM be cache-coherent with the modules, since page table traffic and updates are just memory traffic.
How this SRAM would know that memory filling or spilling from the modules is for page data without housing and looking up the relevant page directory and table data for every access is another challenge.
This storage isn't significantly larger than the Jaguar L2s, so it would be quickly thrashed without this filtering.

The actual page tables for an AMD64 address space and a multi-gigabyte physical memory space can readily exceed on-die storage as well.
http://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview
Perhaps this sheds light on the topic.
<><><><><><><><><><><><><
Digital Foundry: You're running multiple systems in a single box, in a single processor. Was that one of the most significant challenges in designing the silicon?

Nick Baker: There was a lot of bitty stuff to do. We had to make sure that the whole system was capable of virtualisation, making sure everything had page tables, the IO had everything associated with them. Virtualised interrupts.... It's a case of making sure the IP we integrated into the chip played well within the system. Andrew?

Andrew Goossen: I'll jump in on that one. Like Nick said there's a bunch of engineering that had to be done around the hardware but the software has also been a key aspect in the virtualisation. We had a number of requirements on the software side which go back to the hardware. To answer your question Richard, from the very beginning the virtualisation concept drove an awful lot of our design. We knew from the very beginning that we did want to have this notion of this rich environment that could be running concurrently with the title. It was very important for us based on what we learned with the Xbox 360 that we go and construct this system that would disturb the title - the game - in the least bit possible and so to give as varnished an experience on the game side as possible but also to innovate on either side of that virtual machine boundary.

We can do things like update the operating system on the system side of things while retaining very good compatibility with the portion running on the titles, so we're not breaking back-compat with titles because titles have their own entire operating system that ships with the game. Conversely it also allows us to innovate to a great extent on the title side as well. With the architecture, from SDK to SDK release as an example we can completely rewrite our operating system memory manager for both the CPU and the GPU, which is not something you can do without virtualisation. It drove a number of key areas... Nick talked about the page tables. Some of the new things we have done - the GPU does have two layers of page tables for virtualisation. I think this is actually the first big consumer application of a GPU that's running virtualised. We wanted virtualisation to have that isolation, that performance. But we could not go and impact performance on the title.

We constructed virtualisation in such a way that it doesn't have any overhead cost for graphics other than for interrupts. We've contrived to do everything we can to avoid interrupts... We only do two per frame. We had to make significant changes in the hardware and the software to accomplish this. We have hardware overlays where we give two layers to the title and one layer to the system and the title can render completely asynchronously and have them presented completely asynchronously to what's going on system-side.

System-side it's all integrated with the Windows desktop manager but the title can be updating even if there's a glitch - like the scheduler on the Windows system side going slower... we did an awful lot of work on the virtualisation aspect to drive that and you'll also find that running multiple systems drove a lot of our other systems. We knew we wanted to be 8GB and that drove a lot of the design around our memory system as well.
><><><><><><><><><><><><><><

They keep referring back to significant alterations and the whole chip being designed around virtualization to facilitate a 'rich environment that could be running concurrently with the title [game]'.
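
For what "two layers of page tables" means in practice, here's a toy sketch of nested translation (flat dict "tables" and 4KB pages for simplicity; a generic illustration of the concept, not the GPU's actual mechanism, and real nested paging also translates the guest's own table walks through the second layer):

Code:
PAGE_SHIFT = 12  # 4KB pages

def translate(addr, table):
    """Map an address through a {virtual_frame: physical_frame} dict."""
    frame, offset = addr >> PAGE_SHIFT, addr & ((1 << PAGE_SHIFT) - 1)
    return (table[frame] << PAGE_SHIFT) | offset

def nested_translate(guest_va, guest_table, host_table):
    guest_pa = translate(guest_va, guest_table)  # layer 1: owned by the title's OS
    return translate(guest_pa, host_table)       # layer 2: owned by the hypervisor

guest_table = {0x10: 0x20}  # guest virtual frame 0x10 -> guest physical frame 0x20
host_table = {0x20: 0x99}   # guest physical frame 0x20 -> machine frame 0x99
print(hex(nested_translate(0x10234, guest_table, host_table)))  # 0x99234
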
Half-speed L2 arrays might add several core clock cycles of latency.
Numbers ranging from 188 to 192 don't seem like a big thing to quibble over.
And as we both know, that tiny difference could be down to the slightly different layout and the slightly greater distance between the CPU clusters on PS4 versus the Xbox. Given how sensitive access times are to interconnect length (which is why L1 is so much faster than L2), it's not hard to see it making a slight difference.
 
Those claims are very generic, and say much less than I'd consider appropriate for the conclusion being jumped to.
They also stated that while they made changes to the chip, particularly the on-die fabric and CPU-GPU coherence, they didn't do much to modify the Jaguar component.
Messing with the handling of the page tables and TLBs would get in the way of the normal operation of the Jaguar cores, which AMD has already designed with significant virtualization capabilities natively.

The interview responses did not give details for what would be a very significant departure from the norm for x86 processors.
 
Messing with the handling of the page tables and TLBs would get in the way of the normal operation of the Jaguar cores, which AMD has already designed with significant virtualization capabilities natively.

Well, using SRAM space to hold a (say) 2MB page table shouldn't add much complexity. It doesn't require any change and you don't mess with the TLB - it's just ESRAM usage in place of DDR to hold such a static table (for the GameOS).

Supposing the XB1 GameOS uses a 2MB page table, of course. It seems quite reasonable to me.
As an alternative, it could hold the hypervisor, but 2MB for that seems too much even for MS, honestly.

Do you see a better fit/have another idea?
 
Why would the system even need a 2MB page table? The hypervisor, game and dashboard VMs are most likely linearly mapped with huge pages. If there's some dynamically mapped VM it's for apps, and I suppose performance there isn't relevant enough to bother with this. At the moment I assume that smaller SRAM is used for buffer purposes: Huffman/JFIF/x264/DMA engines, HDMI-In and maybe the scaler, so they don't trash memory bandwidth.
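
A quick back-of-the-envelope for the huge-page case (again assuming standard x86-64 paging, nothing Durango-specific) shows why a linearly mapped 8GB VM needs so little table storage:

Code:
ENTRY_BYTES = 8
ENTRIES_PER_DIR = 512   # PDEs per 4KB page directory
PAGE_2M = 2 << 20
GiB = 1 << 30

huge_pages = (8 * GiB) // PAGE_2M         # 4,096 2MB pages
dirs = -(-huge_pages // ENTRIES_PER_DIR)  # 8 page directories (ceiling division)
print((dirs * ENTRIES_PER_DIR * ENTRY_BYTES) >> 10)  # 32 (KB) -- nowhere near 2MB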
 
Why would the system even need a 2MB page table? The hypervisor, game and dashboard VMs are most likely linearly mapped with huge pages.

Er... because you can't use 4GB pages, I think. And the next smaller unit is 2MB. You need hardware-assisted translation, so you are bound to the page-table support of the processor.
Your suggestion looks good, yet I'd wonder why it's in between the CPU blocks, then.
 
All interesting theories, but the fact of the matter is that there is only 47MB of USABLE memory/cache on the Xbox One SOC. And that ESRAM block between the CPUs in the die shot doesn't have some esoteric undocumented use case. It's simply for yield redundancy.
 
All interesting theories, but the fact of the matter is that there is only 47MB of USABLE memory/cache on the Xbox One SOC. And that ESRAM block between the CPUs in the die shot doesn't have some esoteric undocumented use case. It's simply for yield redundancy.

How are you sure for a fact that it's for redundancy? How would the 2MB section even be useful for the same purpose as the main 32MB, given it's placed so close to the CPU as opposed to next to the GPU?
 
Well, using SRAM space to hold a (say) 2MB page table shouldn't add much complexity. It doesn't require any change and you don't mess with the TLB - it's just ESRAM usage in place of DDR to hold such a static table (for the GameOS).
As I noted before, page table data is just data from the point of view of the cache hierarchy.
What is the method used for the logic governing that SRAM block that allows it to intercept memory traffic that happens to be for changes made to the page tables? It's doing the opposite of memory translation, taking a physical address and trying to figure out what page it needs to go to, then whether that page is being used for page table data.

If the virtual memory subsystem of the cores isn't changing, then the hardware in the CPU section isn't informing the uncore when this happens, and the write-back L2 is going to hold onto those lines until they are evicted.
This reduces the effectiveness of the extra storage, and makes it important that it be coherent.
Looking at how terrible the latencies are for snooping, what does this gain?
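
To make the filtering problem concrete, here's a toy sketch (purely hypothetical bookkeeping, not how any real uncore works) of what the SRAM controller would need just to recognise page-table traffic:

Code:
PAGE_SHIFT = 12

class PageTableFilter:
    def __init__(self):
        self.pt_frames = set()  # physical frames known to currently hold page tables

    def register_table(self, phys_addr):
        """Something (the OS? snooping logic?) must report every table's location."""
        self.pt_frames.add(phys_addr >> PAGE_SHIFT)

    def is_page_table_traffic(self, phys_addr):
        """Reverse lookup per transaction: physical address -> frame -> 'is it a table?'"""
        return (phys_addr >> PAGE_SHIFT) in self.pt_frames

f = PageTableFilter()
f.register_table(0x1F4000)                # say frame 0x1F4 holds a page table
print(f.is_page_table_traffic(0x1F4A38))  # True
print(f.is_page_table_traffic(0x200000))  # False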

Do you see a better fit/have another idea?
I'm not sure.
One thing Durango has that Orbis does not is a dedicated high-bandwidth Kinect feed to the SOC and extra coherent memory pathways and compute for that functionality.
That imposes storage and buffering requirements, and the Jaguar modules would not be good as intermediaries between the incoming data port and the dedicated pipeline because AMD's CPU memory hierarchy isn't set up to have data pushed into it, and the CPUs are not predictable from a latency standpoint.

Another idea that I'm not sure about in terms of practicality is a way of explicitly mapping a very limited amount of memory and a few special pages of memory such that every memory client can use it as a common memory space with enforced synchronization points.
I'm not sure how workable that is without some kind of buffered scheme where reads only come from data in storage prior to the sync point, and writes are only visible in the version after it. The idea there is to allow disparate clients some ability to be coherent at specific points even when the non-CPU clients cannot really support it on their own. At the same time, the coarse synchronization would give the CPU some way of using this memory pool without falling back to uncached reads.
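
A very rough sketch of what such a buffered scheme might look like (pure speculation, not a documented Durango mechanism): every client reads the snapshot published at the last sync point, while writes accumulate in a pending copy that only becomes visible at the next one.

Code:
class CoarseSyncPool:
    def __init__(self, size):
        self.visible = bytearray(size)  # what every client reads this "epoch"
        self.pending = bytearray(size)  # writes accumulated for the next epoch

    def read(self, offset, length):
        return bytes(self.visible[offset:offset + length])

    def write(self, offset, data):
        self.pending[offset:offset + len(data)] = data

    def sync_point(self):
        """Publish pending writes; every client now agrees on the new contents."""
        self.visible[:] = self.pending
        self.pending = bytearray(self.visible)  # unwritten bytes carry forward

pool = CoarseSyncPool(16)
pool.write(0, b"gpu")
print(pool.read(0, 3))  # b'\x00\x00\x00' -- not visible before the sync point
pool.sync_point()
print(pool.read(0, 3))  # b'gpu'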
 