Xbox One (Durango) Technical hardware investigation

The primary distinction for an L3 is that it is behind the L1 and L2, and it is considered part of the coherent memory hierarchy.

How it is banked, how it is pipelined, and even whether it uses SRAM is an implementation choice. There can be multiple ports and heavy banking, and there can be conflicts and penalties for certain access patterns, as with the eDRAM L3 in IBM's latest POWER chips.

L3 for current on-die implementations doesn't have the same capacity, low cost, and PCB interface requirements as external DRAM, so things like the more complex address and command decoding and the single read/write bus with all its turnaround penalties are normally dispensed with. The wires are much finer, and there's no big bus of PCB traces that needs to be driven.

The need to manage allocation and evictions, often simultaneously, leads to frequent simultaneous reads and writes, which encourages at least two ports.
The latencies involved and the number of other caches that feed into it are figured into a pipeline that also has stages devoted to routing accesses and waking up arrays.
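To make that concrete, here is a toy model (Python, with made-up parameters; it is not meant to match any real design): a banked array with one read port and one write port, where accesses to different banks can proceed in the same cycle and same-bank collisions serialize.

```python
# Toy model of a banked SRAM array with one read port and one write port.
# Accesses to different banks proceed in the same cycle; two accesses that
# hit the same bank conflict, and one of them waits a cycle.

NUM_BANKS = 8
LINE_BYTES = 64

def bank_of(addr):
    # Low-order interleaving: consecutive lines map to consecutive banks.
    return (addr // LINE_BYTES) % NUM_BANKS

def cycles_needed(accesses):
    """accesses: list of ('R'|'W', address) pairs; returns cycles to drain them."""
    cycles = 0
    pending = list(accesses)
    while pending:
        cycles += 1
        # Issue at most one read and one write per cycle, to different banks.
        read = next((a for a in pending if a[0] == 'R'), None)
        write = next((a for a in pending if a[0] == 'W' and
                      (read is None or bank_of(a[1]) != bank_of(read[1]))), None)
        for a in (read, write):
            if a is not None:
                pending.remove(a)
    return cycles

# Reads and writes spread over different banks pair up: two accesses per cycle.
mixed = [('R', i * 64) for i in range(8)] + [('W', 512 + i * 64) for i in range(8)]
print(cycles_needed(mixed))      # 8 cycles for 16 accesses

# Everything hammering one bank serializes to one access per cycle.
same_bank = [('R', 0), ('W', 0), ('R', 0), ('W', 0)]
print(cycles_needed(same_bank))  # 4 cycles for 4 accesses
```

That pattern dependence is exactly the kind of penalty mentioned above for heavily banked designs.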
 
I know none of the consoles have an L3 cache. I was wondering how an L3 cache manages reads and writes.
I thought I addressed that in the second part.
Is it just like main memory, where you have a single or bidirectional pipelined bus, or is it something different, like a collection of small switching banks, where a write to one bank could be finishing up while a read from another was being prepared (which in this hypothetical scenario might allow tightly alternating reads/writes to be faster than continuous reads or writes)?
Main memory accesses have been pipelined since SDRAM arrived, right? It doesn't work in quite the same way, but it's somewhat similar to the pipelining of SRAM accesses.
In SDRAM, one can easily overlap the data transmission of a burst with the address transmission for the next access (addresses run over separate pins). And with today's multiple memory channels, the channels can operate independently of each other. That means a Tahiti GPU with its 12 memory channels, or the PS4 with its 8 channels, can freely mix reads and writes between channels (the same is true for CPUs, but there we currently have only two to four channels). But as long as there are, for instance, 256 data pins for the memory interface, there is no way to transmit more than what the interface was designed for, no matter how one interleaves reads and writes.
And a general question: why should that interleaving make things faster? Because one could reuse the address? Even if that were possible, the new data still has to run through the data pins/connections, which is usually the bottleneck. Internally, a given DRAM could be read and written much faster if one had a wider interface connecting to more banks in parallel. That's kind of the idea behind HBM, with its very wide interfaces.
And roughly the same is true for SRAM. It may be heavily banked and all, but ultimately the bandwidth is determined by the width of the interface to it (the sum of the widths of its ports, if it has multiple ones).
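To put numbers on that ceiling, a quick back-of-the-envelope in Python (the bus widths and transfer rates are the public GDDR5 figures for Tahiti and the PS4):

```python
# Peak bandwidth of a DRAM interface is just pins times transfer rate;
# no interleaving of reads and writes can push more bits through the pins.

def peak_gb_per_s(bus_bits, transfer_rate_gtps):
    return bus_bits / 8 * transfer_rate_gtps

print(peak_gb_per_s(384, 5.5))  # Tahiti, 12 x 32-bit channels: 264.0 GB/s
print(peak_gb_per_s(256, 5.5))  # PS4,     8 x 32-bit channels: 176.0 GB/s
```

Interleaving can at best avoid dead cycles (bus turnaround, busy banks); it cannot raise that ceiling.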
And if a normal L3 doesn't behave like this, could the one in the Xbox One?
As said, probably all L3 caches pipeline their accesses, too. If it doesn't have dedicated read and write ports (something the designers are well aware of and don't need to "discover" in the produced chips), one can reach the peak bandwidth with all kinds of accesses, mixed types included.
 
They added custom IP building blocks. The core parts of the CPU and GPU are very likely untouched. The most invasive modification (besides linking two Jaguar CUs to a northbridge providing the SMP/coherency glue, though maybe they reused something from the AMD shelf) is probably the eSRAM and its connection to the memory hierarchy.
I worked on modifications to the core graphics blocks. So there were features implemented because of console customers. Some will live on in PC products, some won't. At least one change was very invasive, some moderately so. That's all I can say.
 
I worked on modifications to the core graphics blocks. So there were features implemented because of console customers. Some will live on in PC products, some won't. At least one change was very invasive, some moderately so. That's all I can say.
What a teaser!
I suppose you are not talking about the extended ACEs with more queues, as this was already documented as a feature for upcoming graphics cards using the "GCN1.1" iteration (as was that volatile flag Mark Cerny talked about as a customization, and the FLAT memory model), and one could argue it is probably not so invasive to the core (okay, the flat memory accesses would count, I guess).
You can send me a PM. I can be as silent as the grave.

Edit:
I completely forgot about the additional high-priority command processor for the PS4. Okay, I agree, there was definitely some customization.
 
The compute front end for one of the GPUs does seem to be replumbed, and there seems to be some indication that some of the low-level prioritization and control settings may be available for tweaking.

VGleaks discussed the addition of JPEG decompression for Durango, and the presence of additional format support and specialized instructions for fetching the output and facilitating conversion to RGB.


I wish for an article like Dave's Xenos piece for the 360, or the amount of technical writing IBM put out when it tried to get Cell adopted for non-proprietary platforms.
The latest arrangement seems to leave disclosure control up to AMD's customers, and they don't have the same commercial interest in spreading the news about the hardware that IBM did.

I'd like to think there could still be in-depth coverage, but I am not certain those days are coming back.
 
I'm hoping for something meaty.

The full-court press Cell got in 2005 dwarfs the Xbox One's presence, and there's nothing at all for Orbis.
:cry:

Different times we live in.
 
The compute front end for one of the GPUs does seem to be replumbed, and there seems to be some indication that some of the low-level prioritization and control settings may be available for tweaking.
As said, this was already documented by AMD for upcoming GPUs. I don't see anything in the frontend extending beyond that "GCN1.1" documentation. Besides the additional graphics command processor for the PS4, of course. That's a real addition.
And the ability for low-level prioritization existed already in the original GCN. Whether devs get access to that is mainly a choice of the API/driver, not of the hardware.
VGleaks discussed the addition of JPEG decompression for Durango, and the presence of additional format support and specialized instructions for fetching the output and facilitating conversion to RGB.
I wouldn't consider the decompression itself to be really a core change. As others have noted, it's some additional block somewhat connected to the memory system in the same way as the usual two DMA engines.
But the additional texture format needs to be supported of course, you are right. I suppose that's a modification to the TMUs which need to be able to fetch that format. But I'm not aware of additional instructions helping the conversion from the luma/chroma representation to RGB. That needs to be done with normal shader instructions.
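For illustration, here is the standard JFIF (full-range BT.601) luma/chroma to RGB conversion written out in Python. Whether Durango's decoder uses exactly these coefficients isn't known, but on the GPU this boils down to a handful of ordinary multiply-add instructions per pixel either way:

```python
# Standard JFIF YCbCr -> RGB conversion, as used for JPEG output.
# On a GPU this is a few ordinary multiply-adds per pixel.

def ycbcr_to_rgb(y, cb, cr):
    """y, cb, cr in 0..255 as decoded from a JPEG; returns 8-bit RGB."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(r), clamp(g), clamp(b)

print(ycbcr_to_rgb(128, 128, 128))  # mid grey -> (128, 128, 128)
```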

Edit:
Just so no one misunderstands me: I'm only trying to pinpoint the extent of the changes (that we know of), not arguing against them. Just to clarify my intention.
 
Which I guess is why things that appear in AMD docs and would seem useful for both consoles end up in only one. Each vendor contributes something AMD can use in its PC line, but not share with that vendor's competitor.
 
But the additional texture format needs to be supported of course, you are right. I suppose that's a modification to the TMUs which need to be able to fetch that format. But I'm not aware of additional instructions helping the conversion from the luma/chroma representation to RGB. That needs to be done with normal shader instructions.
The custom format and tiling settings exist to allow efficient fetching. If it doesn't have a custom instruction for the fetch, it may be using a new value as a resource identifier for the existing read instructions. That would allow the same instruction to be used without exposing any low-level changes besides the bits used to define the new mode.
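A purely hypothetical sketch of that idea in Python (field positions, widths, and format IDs are all invented for illustration; this is not GCN's actual resource descriptor layout): the fetch instruction stays the same, and only the format bits in the descriptor select the new behaviour.

```python
# Hypothetical resource descriptor: a base address plus a format field.
# The "instruction" below never changes; a new format is just a new value
# in previously unused encoding space of the descriptor.

FORMAT_SHIFT, FORMAT_BITS = 20, 6

def make_descriptor(base_addr, fmt):
    return (base_addr & 0xFFFFF) | ((fmt & ((1 << FORMAT_BITS) - 1)) << FORMAT_SHIFT)

FMT_R8G8B8A8     = 0x0A  # a pre-existing format id (invented value)
FMT_PLANAR_YCBCR = 0x2F  # a new id carved out of spare encoding space

def image_load(descriptor, coord):
    # Same fetch path for every format; only the decode step differs.
    fmt = (descriptor >> FORMAT_SHIFT) & ((1 << FORMAT_BITS) - 1)
    return f"fetch at {coord} decoded as format {fmt:#x}"

for fmt in (FMT_R8G8B8A8, FMT_PLANAR_YCBCR):
    print(image_load(make_descriptor(0x1000, fmt), (12, 34)))
```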
 
Which I guess is why things that appear in AMD docs and would seem useful for both consoles end up in only one. Each vendor contributes something AMD can use in its PC line, but not share with that vendor's competitor.
I sense it's no coincidence they announced this new technology called Tiled Resources just before the Xbox One comes out.

I wonder, is the AMD HD 7790 (Bonaire) compatible with that technology? Just curious....

Additionally, I like sleep, and... in fact I need to sleep, but I just wanted to say that it is an absolute pleasure to read your posts people (Gipsel, 3dilettante, and all the others).
 
Partially Resident Textures were introduced with GCN and are exposed in DirectX 11.2 through the Tiled Resources functionality.

There are two tiers of tiled resources support, and AMD's DX11.1-class hardware puts it in Tier 2.
 
I sense it's no coincidence they announced this new technology called Tiled Resources just before the Xbox One comes out.

I wonder, is the AMD HD 7790 (Bonaire) compatible with that technology? Just curious....

Additionally, I like sleep, and... in fact I need to sleep, but I just wanted to say that it is an absolute pleasure to read your posts people (Gipsel, 3dilettante, and all the others).

Funny that they demoed it on Nvidia hardware. I don't believe this is anything new.

Seems to be based on http://www.opengl.org/registry/specs/AMD/sparse_texture.txt
 
PRT is what the GCN hardware supports.

That's all well and good, but GPUs can and do support tons of stuff that the APIs don't expose, at least not initially.
The same thing was exposed through the sparse texture extension for OpenGL, and Microsoft has chosen to expose it through Tiled Resources as part of the DX11.2 expansion of its API.
It may as well, since it has a pretty important hardware platform incoming that gets all that GCN goodness.

There should be different ways of accomplishing the same end result without the specific hardware elements in GCN, although the second tier seems to be aligned well with GCN's particular implementation.
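For anyone following along, a minimal sketch of the PRT idea in Python (tile size and data structures are illustrative only): texture space is split into tiles, only some of which are backed by memory, and a fetch to an unmapped tile reports non-residency instead of faulting, so the shader can fall back.

```python
# Sketch of partially resident textures: a page table of tiles, where a
# fetch to an unmapped tile returns a miss flag rather than crashing.

TILE = 128  # texels per tile edge (128*128*4 bytes = 64 KB per tile)

resident_tiles = {}  # (tile_x, tile_y) -> tile payload

def map_tile(tx, ty, payload):
    resident_tiles[(tx, ty)] = payload

def sample(u, v, fallback):
    tile = (u // TILE, v // TILE)
    if tile in resident_tiles:
        return resident_tiles[tile], True   # resident: use the hi-res data
    return fallback, False                  # not resident: shader falls back

map_tile(0, 0, "hi-res tile (0,0)")
print(sample(10, 10, "low-res mip"))    # ('hi-res tile (0,0)', True)
print(sample(300, 10, "low-res mip"))   # ('low-res mip', False)
```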
 
Did some googling, and here's what I found that could possibly explain the additional bandwidth. Just a layman by the way, so I'll leave it to the technical guys like Gipsel and 3d to confirm whether what I suggest below is possible.

As per VGleaks, the ESRAM has separate channels for reads and writes, with 51.2 GB/s each way for a total of 102.4 GB/s. http://www.vgleaks.com/durango-memory-system-overview/

What if the ESRAM was designed using a dual-port technology called Renaissance 2X, from a company called Memoir Systems, which allows each channel to read and write simultaneously, as was indicated in the DF story? In theory this would allow an almost doubling of bandwidth, though I guess it would be limited by the types of ops that could be worked on at a time in the 32 MB of ESRAM.

I'll provide the links to Memoir's site as well as TSMC's, which indicates that ESRAM at 28nm can be fabbed with single ports (SP) or dual ports (DP). Memoir also lists TSMC as a partner.

http://www.memoir-systems.com/index.php/products/renaissance-2x

http://www.tsmc.com/english/dedicatedFoundry/technology/28nm.htm
 
Did some googling, and here's what I found that could possibly explain the additional bandwidth. Just a layman by the way, so I'll leave it to the technical guys like Gipsel and 3d to confirm whether what I suggest below is possible.

As per VGleaks, the ESRAM has separate channels for reads and writes, with 51.2 GB/s each way for a total of 102.4 GB/s. http://www.vgleaks.com/durango-memory-system-overview/

What if the ESRAM was designed using a dual-port technology called Renaissance 2X, from a company called Memoir Systems, which allows each channel to read and write simultaneously, as was indicated in the DF story? In theory this would allow an almost doubling of bandwidth, though I guess it would be limited by the types of ops that could be worked on at a time in the 32 MB of ESRAM.

I'll provide the links to Memoir's site as well as TSMC's, which indicates that ESRAM at 28nm can be fabbed with single ports (SP) or dual ports (DP). Memoir also lists TSMC as a partner.

http://www.memoir-systems.com/index.php/products/renaissance-2x

http://www.tsmc.com/english/dedicatedFoundry/technology/28nm.htm

The SRAM is either dual-ported or it isn't. You don't discover this during production.
 
Did some googling, and here's what I found that could possibly explain the additional bandwidth. Just a layman by the way, so I'll leave it to the technical guys like Gipsel and 3d to confirm whether what I suggest below is possible.

As per VGleaks, the ESRAM has separate channels for reads and writes, with 51.2 GB/s each way for a total of 102.4 GB/s. http://www.vgleaks.com/durango-memory-system-overview/
You misread it. ;) The ESRAM bus is 102.4 GB/s. The table showing 51.2 GB/s read and write is for moving data from ESRAM to ESRAM, and is there to compare the transfer speeds between the different pools. ESRAM to DRAM is 68.2 GB/s, meaning 68.2 GB/s read from the ESRAM. If you were doing pure reads from ESRAM, it's 102.4 GB/s, and the same for pure writes, so the bus is, as specified in the tech docs, 102.4 GB/s in BW, accounted for by 800 MHz * 128 bytes per clock.
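Worked out in Python for clarity (numbers straight from the figures above):

```python
# The ESRAM figures from the leaked docs. A copy within ESRAM has to share
# the one bus between its reads and its writes, which is where the
# 51.2 + 51.2 split comes from.

clock_hz = 800e6
bytes_per_clock = 128

peak = clock_hz * bytes_per_clock / 1e9
print(peak)       # 102.4 GB/s: all reads, all writes, or any mix

copy_rate = peak / 2
print(copy_rate)  # 51.2 GB/s each way for an ESRAM -> ESRAM copy
```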

There is no simultaneous read/write listed anywhere in the tech docs. The bus isn't bidirectional* and technically cannot exceed its 102.4 GB/s total BW. This rumour flies in the face of the documented architecture and cannot be explained simply.

* I think I'm misusing that term. A bidirectional bus works both ways, so it shares its total BW between reads and writes. The word I want would describe reading and writing simultaneously, and I can't think of the computer-architecture term for it!
 
I can't answer as to why it was only "discovered" now. I only tried to understand what would make what was spoken about in the DF story possible. Perhaps the API for the ESRAM was at first written to use the channels one way only, and it is now being optimized so the two-way functionality is being exposed. The DF story did say MS was pushing to make all functionality available first and then optimize afterwards, so who knows.

Also, given that it has been stated here in the forums that the frame buffer for a 1080p image fits into the 32 MB, there may be instances where the chance to process other ops comes up. That's why I asked Gipsel and 3D to say whether my suggestion is possible.

If my suggestion is not possible, then MS is being creative or has come up with some crazy one-off to get what they suggested to developers. I am just curious to find out whether it is or not.
 