Xbox One (Durango) Technical hardware investigation

Which parts of the document for a discrete SRAM component with two ports show that you can read and write data over the same bus in the same clock, doubling the design's bandwidth?
(edit: discrete is an improper term, a better one would be "isolated")
From what I've skimmed, the peak bandwidth is what you get with two separate ports.

It looks more like it's a dual-ported SRAM with specifically outlined cases for when the inputs for the two ports lead to a conflict. The interface and control logic would be on the other side of all the control and data lines, and for an on-die version I would expect the pipeline logic to be smart enough to avoid the corner cases, especially since there are read-write conflicts that lead to unknown or old data being read back.

edit:
To summarize, I would like some exposition on why this should be considered relevant or what argument it is supporting.

I haven't been able to find anything that shows a memory that can read and write in the same clock on the same bus. Everything with simultaneous reads and writes has two data busses. That idea of reading and writing in the same clock on the same bus is a dead end, as far as I can tell.
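Since the thread keeps circling around what dual porting actually buys you, here is a minimal toy sketch (entirely my own illustration, not the Durango part or any real datasheet) of a two-port array with the kind of same-address corner cases such a document would outline:

```python
# Toy dual-ported SRAM: two independent ports, each with its own address/data
# bus, serviced in the same cycle. The same-address cases below are the kind
# of "corner cases" a datasheet outlines; the exact behaviour chosen here
# (read-during-write returns old data, write-write is undefined) is an
# illustrative assumption, not the real part's spec.

UNDEFINED = object()  # stands in for "unknown data" after a conflict

class DualPortSRAM:
    def __init__(self, words):
        self.mem = [0] * words

    def cycle(self, port_a=None, port_b=None):
        """Each request is (op, addr, data) with op 'R' or 'W'; data is ignored for reads."""
        requests = [port_a, port_b]
        results = [None, None]

        # Reads sample the array before this cycle's writes land (old data).
        for i, req in enumerate(requests):
            if req and req[0] == 'R':
                results[i] = self.mem[req[1]]

        writes = [r for r in requests if r and r[0] == 'W']
        if len(writes) == 2 and writes[0][1] == writes[1][1]:
            self.mem[writes[0][1]] = UNDEFINED   # write-write conflict: contents undefined
        else:
            for _, addr, data in writes:
                self.mem[addr] = data

        return tuple(results)

ram = DualPortSRAM(16)
ram.cycle(('W', 3, 42))                          # port A writes 42
print(ram.cycle(('R', 3, None), ('W', 3, 7)))    # (42, None): read sees old data while B writes
```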

I keep going back to this quote from Digital Foundry and I can't make sense of it.
... Now that close-to-final silicon is available, Microsoft has revised its own figures upwards significantly, telling developers that 192GB/s is now theoretically possible.

Well, according to sources who have been briefed by Microsoft, the original bandwidth claim derives from a pretty basic calculation - 128 bytes per block multiplied by the GPU speed of 800MHz offers up the previous max throughput of 102.4GB/s. It's believed that this calculation remains true for separate read/write operations from and to the ESRAM. However, with near-final production silicon, Microsoft techs have found that the hardware is capable of reading and writing simultaneously. Apparently, there are spare processing cycle "holes" that can be utilised for additional operations. Theoretical peak performance is one thing, but in real-life scenarios it's believed that 133GB/s throughput has been achieved with alpha transparency blending operations (FP16 x4).

The only thing that makes sense for "simultaneous" reads and writes is two busses. It can't be that they enabled DDR that wasn't previously working, because that would double bandwidth for reading or writing, and this says those individual operations remain at 102.4 GB/s. Oh well. I guess there's really nowhere for this conversation to go until someone leaks a clarification.
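For what it's worth, the arithmetic in that quote checks out. A quick back-of-the-envelope version of it (the full-duplex ceiling on the last two lines is my own assumption about what "simultaneous" would imply, not something Microsoft has stated):

```python
# Figures from the Digital Foundry quote above; the 2x "full duplex" ceiling is
# an assumption used only to put the 192 GB/s and 133 GB/s numbers in context.
bytes_per_clock = 128                 # "128 bytes per block"
clock_hz = 800e6                      # 800 MHz

one_way = bytes_per_clock * clock_hz / 1e9
print(one_way)                        # 102.4 GB/s: the original read-or-write figure

both_ways = 2 * one_way               # if reads and writes could fully overlap
print(both_ways)                      # 204.8 GB/s ceiling; the claimed 192 GB/s sits below it
print(133 / both_ways)                # ~0.65 of that ceiling for the FP16x4 blending case
```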
 
http://www.engadget.com/2013/05/21/building-xbox-one-an-inside-look/

Engadget said:
Spillinger joined Microsoft just as the company was beginning work on the first Kinect (then "Project Natal"). He hailed from IBM, where he led the team that created the Xbox 360's CPU. At the time (early '08), he thought he was joining the Xbox hardware team to get started on a next-generation gaming console.

"First I was the design architect in Intel, then a design manager at IBM, and when I joined Microsoft, the view was 'Okay, it's about time -- early '08 -- to start to think about the next gen,'" he says. "It didn't take us five and a half years to get there, because what happened is that the moment sort of turned around and we started development of Kinect. The entire focus was about shipping Kinect, which now, if you in retrospect see, is such a great success."

Does anyone know if Spillinger was working on the Kinect only? Did he work on the CPU/GPU/APU too? I wonder since "He hailed from IBM, where he led the team that created the Xbox 360's CPU."

Also, does anyone know if the 200-person MS silicon design team has other CPU (or GPU) design-related talent that is known in the industry? (I am trying to figure out if there is evidence of CPU, GPU or APU custom design inside that 200-person team. [Curious if we can find any evidence of how stock or how custom the APU is. Not sure how much others will agree or what came from where, but I think the 360 CPU, GPU and eDRAM showed some nice innovation.])
 
I believe in an upclock. They are in a desperate situation, at least from a marketing point of view. Don Mattrick's departure confirms the mess.
Well, in fact, who knows if Sony didn't also upgrade the clocks? These companies do spy on each other... and this time it is even easier to know about the other because they both are using TSMC to make the chips. 28nm and GCN are mature enough to run the CUs at 850MHz instead of 800 without much yield loss.

For yields, practically every chip they make that works at 800MHz will probably also work at 1100MHz. The choice of clock speed is pretty much entirely about power -- how much do they want to supply and cool.
 
http://www.engadget.com/2013/05/21/building-xbox-one-an-inside-look/

Does anyone know if Spillinger was working on the Kinect only? Did he work on the CPU/GPU/APU too? I wonder since "He hailed from IBM, where he led the team that created the Xbox 360's CPU."

Also, does anyone know if the 200-person MS silicon design team has other CPU (or GPU) design-related talent that is known in the industry? (I am trying to figure out if there is evidence of CPU, GPU or APU custom design inside that 200-person team. [Curious if we can find any evidence of how stock or how custom the APU is. Not sure how much others will agree or what came from where, but I think the 360 CPU, GPU and eDRAM showed some nice innovation.])

I've been wondering about the upgraded Kinect. IIRC, the original Kinect had onboard chips to help alleviate the load on the 360's CPU. But to cut down costs, they ended up removing them.

Do we know if they have added them (or another set) back in? Do we have any specs for the new Kinect other than the cameras?
 
The only thing that makes sense for "simultaneous" reads and writes is two busses. It can't be that they enabled DDR that wasn't previously working, because that would double bandwidth for reading or writing, and this says those individual operations remain at 102.4 GB/s. Oh well. I guess there's really nowhere for this conversation to go until someone leaks a clarification.

I agree.
Assuming it's actually a factual rumor and not someone repeating their skewed understanding of what was written.
One thing that did occur to me is that the eSRAM could still be single ported but split into 2 banks, with each bank having a separate bus.
The other option is the memory being dual ported.
If it's the former then you wouldn't see the peak rates very often, but you would see some improvement depending on the distribution of data.
Several early UMA designs did this for their main memory, notably the N64 and M2.
It seems unlikely to me that they wouldn't mention it if the memory were dual ported.
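To put a rough number on "some improvement depending on the distribution of data", here is a crude two-bank simulation (a toy model under my own assumptions, not a claim about the actual design): a read and a write issued together only both complete when they hit different banks.

```python
import random

# Toy model: single-ported banks, one read plus one write issued per cycle.
# A bank conflict forces one of the two accesses to wait an extra cycle.
random.seed(0)
BANKS = 2
cycles = accesses = 0
for _ in range(100_000):
    read_bank  = random.randrange(BANKS)
    write_bank = random.randrange(BANKS)
    cycles += 1 if read_bank != write_bank else 2
    accesses += 2
print(accesses / cycles)   # ~1.33 accesses/cycle for random addresses,
                           # vs 2.0 with no conflicts and 1.0 if single ported
```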
 
Even if it were banked, the theoretical peak would be reported for the scenario without bank conflicts.

One doesn't see AMD or Intel shying away from saying their L1 caches can support two reads a cycle when up until Haswell they could be readily hit with bank conflicts.
For that matter, we don't see Nvidia or AMD GPU bandwidth figures for their on-die storage saying anything but peak for things like the heavily banked LDS or their cache partitions.
 
A few days ago I found this thread and I have to admit: some people here are light-years ahead of me regarding this topic. But what I'm missing (especially regarding this bandwidth increase) is:

Developers (or rather, the games they build) run in a virtualized environment. Is it possible that a different component has gained a performance boost and this "materializes" as a (partial) simultaneous read/write?
 
A few days ago I found this thread and I have to admit: some people here are light-years ahead of me regarding this topic. But what I'm missing (especially regarding this bandwidth increase) is:

Developers (or rather, the games they build) run in a virtualized environment. Is it possible that a different component has gained a performance boost and this "materializes" as a (partial) simultaneous read/write?

You mean better performance due to virtualization?
 
Developers (or rather, the games they build) run in a virtualized environment. Is it possible that a different component has gained a performance boost and this "materializes" as a (partial) simultaneous read/write?

Virtualization allows a shared resource to appear to separate environments as if it were available to each of them alone, except that it may be slower than expected.

It won't make low-level resources appear out of thin air.

There have been instances of improvements, such as virtual machines communicating over a network port they physically share, but this is a lower level than that.
 
75MHz is around 10%, so no, it's not a "small" overclock. And if MS wants to maintain November as the launch date and release the console worldwide before Christmas, I hope mass production has begun....
And Anandtech's analysis virtually eliminates this possibility, since Jaguar's TDP does not scale linearly with clock:
Looking at Kabini, we have a good idea of the dynamic range for Jaguar on TSMC’s 28nm process: 1GHz - 2GHz. Right around 1.6GHz seems to be the sweet spot, as going to 2GHz requires a 66% increase in TDP.
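For anyone wondering how a 25% clock bump turns into a 66% TDP increase: dynamic power scales roughly with C·V²·f, and higher clocks usually require more voltage. A rough sketch (the voltage figure is purely my guess, chosen to show how the numbers could line up):

```python
# Dynamic power ~ C * V^2 * f. The 15% voltage bump is an assumed value chosen
# to show how 1.6 -> 2.0 GHz can plausibly end up near Anandtech's "+66% TDP";
# it is not a published figure for Jaguar.
f_scale = 2.0 / 1.6             # +25% frequency
v_scale = 1.15                  # assumed voltage increase needed to hold that clock
print(f_scale * v_scale ** 2)   # ~1.65, i.e. roughly a 66% increase in dynamic power
```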

Developers (or rather, the games they build) run in a virtualized environment. Is it possible that a different component has gained a performance boost and this "materializes" as a (partial) simultaneous read/write?
Virtualization adds an overhead - a barrier, if you will - between code and the maximum theoretical performance of the hardware; this is part of the trade-off in splitting hardware resources between two virtualized environments. While you can optimise and improve virtualization to reduce the overhead, virtualization won't produce huge leaps in bandwidth as has been confusingly reported.
 
Virtualization allows a shared resource to appear to separate environments as if it were available to each of them alone, except that it may be slower than expected.

It won't make low-level resources appear out of thin air.

There have been instances of improvements, such as virtual machines communicating over a network port they physically share, but this is a lower level than that.

OK, I wrote that unclearly.

The hypervisor is a layer that offers physical memory (absolute addresses) to a virtual instance in the form of virtual memory. This, IIRC, ensures that two running instances cannot access the same bit of physical memory (except for shared resources, as you stated).

The Xbox One is known to have a hypervisor with at least one virtual instance running Windows 8 (RT or whatever). Because this OS runs in parallel with a running game (as shown by the Skype call during a game session at the reveal), I assume the game is also a virtual instance.

With this assumption, I think the game instance accesses only "virtual memory" (with the mentioned performance decrease already in place).

In my last post I asked whether it is possible that a hardware component responsible for the memory layer got a performance boost and that this boost is passed on to the virtualized game instance.

Sorry for the bad English :oops:
 
With this assumption, I think the game instance accesses only "virtual memory" (with the mentioned performance decrease already in place).

Not necessarily; this is a very specific setup, i.e. you always have one game instance and it always has 5GB (if rumors are correct) of physical memory assigned to the virtual instance.
There is no need to virtualize that; the hypervisor just has to be involved when it's mapped.

The overhead for virtualization like this could be incredibly small.
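To illustrate why the overhead for that kind of static split can be so small, here's a toy sketch of the idea: the guest-physical to host-physical mapping for the game partition is effectively a constant offset programmed once, so ordinary memory accesses never trap into the hypervisor. The 5GB figure is the rumour from this thread; the base address is made up.

```python
# Toy second-level mapping for a fixed game partition. In real hardware this
# would live in nested page tables walked by the MMU, not in software; the
# point is only that nothing per-access involves the hypervisor.
GB = 1 << 30
GAME_BASE_HOST = 3 * GB      # hypothetical start of the game's carve-out in host physical memory
GAME_SIZE = 5 * GB           # rumoured game allocation

def guest_to_host(guest_phys):
    """Translation fixed at boot; no hypervisor trap on ordinary accesses."""
    assert guest_phys < GAME_SIZE
    return GAME_BASE_HOST + guest_phys

print(hex(guest_to_host(0x1000)))   # 0xc0001000 with the made-up layout above
```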
 
In my last post I asked whether it is possible that a hardware component responsible for the memory layer got a performance boost and that this boost is passed on to the virtualized game instance.
Short answer: No.

Virtualization is useful for servers (or CI) since it allows you to use your resources better, at the cost of a general slowdown of the processor executing the virtualized code.
In essence, for every instruction you execute, you have additional hardware that monitors which instruction it is, what it is accessing, whether it is a special VM call, etc., and transparently remaps or triggers the hypervisor as needed. Even in today's processors this slows down execution: try running Windows/Linux inside VMware and then on bare metal - you can see the difference.

Virtualization is not made to get the best out of CPU resources for high-performance apps, but rather to let more machines use the capacity left unused by slow, low-performance apps.
 
And Anandtech's analysis virtually eliminates this possibility, since Jaguar's TDP does not scale linearly with clock:

Virtualization adds an overhead - a barrier, if you will - between code and the maximum theoretical performance of the hardware; this is part of the trade-off in splitting hardware resources between two virtualized environments. While you can optimise and improve virtualization to reduce the overhead, virtualization won't produce huge leaps in bandwidth as has been confusingly reported.

10 watts for 4 Jaguar cores, including a boost of 100 MHz for 2 CUs. Supposing 8 watts for the CPU only, going from 1.6 to 2 GHz in Jaguar would be more or less 16 watts... If reliability is the same, I think Sony will go for it. MS may be more interested in a GPU upclock.
 
And Anandtech's analysis virtually eliminates this possibility, since Jaguar's TDP does not scale linearly with clock:

We're still looking at an increase of around 20 to 30 W, so we're not talking about an insurmountable challenge here. We're still far away from the power guzzling of the original Xbox 360 and PS3.
 
Did you just compare it to a discrete GPU? These are the most powerful APUs to date.


Here we go with this logic deficiency thing again. Of course, I compared the GPU to the closest available commercial part. The point, as was addressed above by others, is that there is massive headroom to overclock the GPU. Whether or not that makes sense for the APU is a different story. However, a 10% overclock should introduce minimal heat and power increases and should not impact yield rates materially given the available headroom.
 
Pointing out that APUs are different from GPUs is a very accurate and valid point though. The CPU elements and the GPU elements on an APU do end up having somewhat different desires in terms of process targeting; you have one of three options - target the CPU, target the GPU, or pick somewhere in the middle of the process window as the best compromise for both. In other words, comparing discrete components and assuming those will translate to APUs is not necessarily accurate, because process targeting for the discrete components will be optimal for the discrete component, and you don't know the priorities for any of these APUs.

I could surmise that the width of the window may be a little easier for Jaguar relative to Bulldozer/Piledriver CPU cores, but I assume the same choices still need to be made with the targets on these APUs.
 
Pointing out that APUs are different from GPUs is a very accurate and valid point though. The CPU elements and the GPU elements on an APU do end up having somewhat different desires in terms of process targeting; you have one of three options - target the CPU, target the GPU, or pick somewhere in the middle of the process window as the best compromise for both. In other words, comparing discrete components and assuming those will translate to APUs is not necessarily accurate, because process targeting for the discrete components will be optimal for the discrete component, and you don't know the priorities for any of these APUs.

I could surmise that the width of the window may be a little easier for Jaguar relative to Bulldozer/Piledriver CPU cores, but I assume the same choices still need to be made with the targets on these APUs.

Pretty fair point, but considering how nicely balanced the Xbox One's GPU appears to be with regard to the rest of its design, I suppose it wouldn't be entirely surprising if they found themselves in a convenient position to squeeze out a higher clock from a GPU that is already using less power? Isn't the ESRAM also helping them save on heat and power consumption? They may have designed themselves into a position where they can squeeze out a few extra MHz from the GPU.

I've also wondered for some time whether the APU in either console will be capable of turbo. Just curious, btw; I am by no means attempting to question your knowledge on the subject. :p
 
Lay person here.

I ran across this paper on reconfigurable data cache (RDC) where 6 of 7 authors are associated with the BSC Microsoft Research Centre.

http://www.bscmsrc.eu/sites/default/files/189-armejach-pact11.pdf

It's basically a paper on an L1 cache that can dynamically convert from a general-purpose cache to a hardware-based transactional memory (HTM) implementation.

It's mostly referred to as "RDC" throughout the paper, except for one figure where it's labelled as eSRAM.

I wonder if this has something to do with the eSRAM performance doubling, because in general-purpose mode the RDC is a 64kb cache, but in HTM mode it configures itself into a 32kb cache. When it's in HTM mode it maintains two copies of the data held in the cache. If this paper truly refers to the eSRAM then I'm guessing "e" stands for "extended", as the paper labels the SRAM cells as e-cells because each 6T SRAM cell is paired and connected to another cell via exchange circuits. One "upper" cell and one "lower" cell make up an HTM cell, where reads and writes are handled by five different operations: URead, UWrite, LRead, LWrite and ULWrite.

If the eSRAM was given the ability to perform two writes simultaneously, what are the chances of it being given the ability to perform two simultaneous reads?

If this is really the eSRAM, what's the possibility of the engineers overlooking this functionality because MS was more focused on HTM and didn't initially realize it could allow a bandwidth increase by a factor of less than 2?
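To make the paired-cell idea a bit more concrete, here's how I read the paper's description (this is purely my interpretation of the post above, in toy form, and says nothing about what the Xbox One's ESRAM actually is):

```python
# Sketch of an "e-cell" as described: an upper and a lower 6T cell wired
# together, with five operations acting on one or both halves. Which half is
# "speculative" vs "backup" here is my guess at how HTM would use the pair.

class ECell:
    def __init__(self):
        self.upper = 0   # working copy (e.g. speculative data inside a transaction)
        self.lower = 0   # second copy (e.g. pre-transaction value kept for rollback)

    def uwrite(self, v):  self.upper = v
    def lwrite(self, v):  self.lower = v
    def ulwrite(self, v): self.upper = self.lower = v   # write both halves at once
    def uread(self):      return self.upper
    def lread(self):      return self.lower

cell = ECell()
cell.ulwrite(5)                      # committed value lands in both halves
cell.uwrite(9)                       # transactional update touches only the upper half
print(cell.uread(), cell.lread())    # 9 5: the lower half still holds the old value
```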
 