Xbox One (Durango) Technical hardware investigation

It's also been suggested that the L1 and L2 cache latencies in GCN GPUs are in the hundreds of ns (or was that cycles?), and I'm assuming the ESRAM will be slower still. My understanding is that GPU memory subsystems are not generally optimized for low latency the way CPU memory subsystems are, so I have to wonder what the ESRAM latency advantage will actually be. Is it an order of magnitude compared to off-chip DRAM, or simply a fractional advantage?

I'm not aware of benches for GCN. The numbers for code running on VLIW GPUs were very high.

If you have L2 misses and have to go out to memory, then the DRAM controller, very likely optimized to attain high bandwidth at the expense of some latency, will incur a heavy cost before the request even reaches the DRAM itself. This penalty is likely to be lower for SRAM, as you don't need to care that much about the optimal sequence of opening and closing banks to get high utilization. It won't remove the latency of the GPU's own memory hierarchy, but it will definitely be faster (though I can't put a number on by how much).
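For a rough feel of that bank-management cost, here's a back-of-envelope C sketch with illustrative DDR3-2133-class timings (made-up round numbers, not Durango's actual controller parameters):

[code]
#include <stdio.h>

int main(void) {
    double clk_ns = 1.0 / 1.066;   /* DDR3-2133 command clock, ~1066 MHz */
    double tRP    = 14 * clk_ns;   /* precharge: close the open row      */
    double tRCD   = 14 * clk_ns;   /* activate: open the new row         */
    double tCL    = 14 * clk_ns;   /* CAS: read the column               */

    /* Worst case: the bank has the wrong row open. */
    printf("DRAM row miss: ~%.1f ns before data moves\n", tRP + tRCD + tCL);
    /* Best case: the row is already open.          */
    printf("DRAM row hit:  ~%.1f ns\n", tCL);
    /* An on-die SRAM array has none of this row machinery; its access is
       a few ns, leaving the GPU's own cache/queue path as the main cost. */
    return 0;
}
[/code]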
The Vgleaks article on the PS4's "hUMA" implementation indicates that it takes 300-350 cycles to invalidate volatile lines, with 75 related to bookkeeping and the rest possibly devoted to setup and, most importantly, the completion of all in-flight accesses. That's before any writeback penalty.
The CU main cache path for L2 misses still looks to be long.

Perhaps the ROP caches can bypass that bit? There could be a different latency number for the different memory paths.
That wouldn't help for cases where the CUs need to read from the eSRAM, but perhaps blending and the like could see a more clear latency difference.
It might help if we knew more about the organization and implementation of the eSRAM. Some of the admittedly unverified writing about it hints that there may have been a desire to keep the overall basis of its operation similar to the banking scheme employed by external memory. Hopefully the various programmable timings within that aren't as prohibitive.
 
The Vgleaks article on the PS4's "hUMA" implementation indicates that it takes 300-350 cycles to invalidate volatile lines, with 75 related to bookkeeping and the rest possibly devoted to setup and, most importantly, the completion of all in-flight accesses. That's before any writeback penalty.
The CU main cache path for L2 misses still looks to be long.
Isn't that 300-350 cycles more a measure of the depth of the queues (scaling with the number of accesses in flight) and not so much an (unloaded) access latency by itself? The whole thing usually has to deal with some heavy contention. That's probably part of the reason why it isn't that simple to put just one latency number on it.
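Little's law makes that concrete: keeping a bus saturated needs bandwidth x latency bytes in flight. A quick sketch with ballpark PS4-class numbers (both figures are illustrative guesses):

[code]
#include <stdio.h>

int main(void) {
    double bw_bytes_per_ns = 176.0;  /* 176 GB/s = 176 bytes per ns   */
    double latency_ns      = 400.0;  /* ~320 cycles at 800 MHz, guess */
    double in_flight       = bw_bytes_per_ns * latency_ns;

    printf("bytes needed in flight: ~%.0f KB\n", in_flight / 1024.0);
    printf("as 64-byte lines:       ~%.0f outstanding requests\n",
           in_flight / 64.0);
    return 0;
}
[/code]

With queues deep enough to hold over a thousand outstanding lines, the 300-350 cycle figure reads more like "drain everything in flight" than an unloaded access latency.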
Perhaps the ROP caches can bypass that bit? There could be a different latency number for the different memory paths.
That wouldn't help for cases where the CUs need to read from the eSRAM, but perhaps blending and the like could see a more clear latency difference.
If one does just blending (which is a fire-and-forget operation from the view of the shader) the performance is mostly determined by bandwidth, as evident from fillrate tests. Loading and storing framebuffer tiles instead of individual pixels to and from the ROP caches vastly reduces the number of read-write turnarounds and makes life for the DRAM controller a lot easier (long bursts). Things could change, of course, if you need to read back the just-written render target (worst case: while you are still writing to it).
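Rough numbers for that, using the commonly cited Xbox One figures (16 ROPs at 853 MHz, ~68 GB/s DDR3, ~204 GB/s eSRAM peak; the model is deliberately simplistic):

[code]
#include <stdio.h>

int main(void) {
    double rop_rate = 16 * 0.853e9; /* 16 ROPs @ 853 MHz: ~13.6 Gpix/s   */
    double bytes_pp = 8.0;          /* 32bpp blend: 4 B read + 4 B write */
    double ddr3_bw  = 68e9;         /* main memory, bytes/s              */
    double esram_bw = 204e9;        /* eSRAM peak combined, bytes/s      */

    printf("ROPs can issue: %.1f Gpix/s\n", rop_rate / 1e9);
    printf("DDR3 can feed:  %.1f Gpix/s -> bandwidth-bound\n",
           ddr3_bw / bytes_pp / 1e9);
    printf("eSRAM can feed: %.1f Gpix/s -> ROP-bound\n",
           esram_bw / bytes_pp / 1e9);
    return 0;
}
[/code]

So against main memory the blend rate tracks bandwidth, while against the eSRAM there's enough headroom that the ROPs themselves become the limit.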
 
Isn't that 300-350 cycles more a measure of the depth of the queues (scaling with the number of accesses in flight) and not so much an (unloaded) access latency by itself? The whole thing usually has to deal with some heavy contention. That's probably part of the reason why it isn't that simple to put just one latency number on it.
The leak doesn't state why, although it calls that count a fixed cost. It may be a pessimistic implementation of the invalidate logic, if it cannot tell whether it is in an unloaded case.
The wavefront granularity of vector ops and aggressive coalescing may have expanded the minimum period of time before the pipeline will allow an access to make its way past each stage.
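To illustrate the coalescing side of that, a toy sketch (not AMD's actual merge logic): one 64-lane vector op turns into however many distinct cache lines it touches, so the request count per wavefront can swing from 4 to 64:

[code]
#include <stdio.h>

#define LANES 64   /* one GCN wavefront */
#define LINE  64   /* cache line bytes  */

/* Count distinct cache lines touched by one vector memory op. */
static int distinct_lines(const unsigned addrs[LANES]) {
    unsigned seen[LANES];
    int n = 0;
    for (int i = 0; i < LANES; i++) {
        unsigned line = addrs[i] / LINE;
        int found = 0;
        for (int j = 0; j < n && !found; j++)
            if (seen[j] == line) found = 1;
        if (!found) seen[n++] = line;
    }
    return n;
}

int main(void) {
    unsigned seq[LANES], strided[LANES];
    for (int i = 0; i < LANES; i++) {
        seq[i]     = i * 4u;    /* consecutive dwords: best case */
        strided[i] = i * 256u;  /* one line per lane: worst case */
    }
    printf("sequential dwords -> %d line requests\n", distinct_lines(seq));
    printf("256-byte stride   -> %d line requests\n", distinct_lines(strided));
    return 0;
}
[/code]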

If one does just blending (which is a fire and forget operation from the view of the shader) the performance is mostly determined by bandwidth as evident from fillrate tests.
Vector export instructions have to be granted access to the export bus, which could be where back pressure from the ROPs can lead to CU stalls if, for example, EXP_CNT is set low or some kind of pathological case leads to a build-up that exceeds its max count.
 
Vector export instructions have to be granted access to the export bus, which could be where back pressure from the ROPs can lead to CU stalls if, for example, EXP_CNT is set low or some kind of pathological case leads to a build-up that exceeds its max count.
That's of course possible, but this usually means one is bandwidth-limited anyway. The ROPs should be able to buffer a few exports before they stop accepting new ones (or are they relying completely on the latency hiding of the shader cores?). That EXP_CNT gets decreased doesn't mean the export and the write to memory were carried out; it just means that the values have been read from the registers and placed in the respective queue (so one can reuse the registers for something else). That means backpressure sets in only when the ROPs can't handle the exports fast enough, which is mainly a throughput thing.
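A toy bounded-queue simulation of that (the depth of 8 is a made-up number) shows the same thing: the shader only stalls once the queue fills, and the sustained rate converges to whatever the ROPs can drain, i.e. a throughput limit:

[code]
#include <stdio.h>

int main(void) {
    int depth = 8;          /* assumed ROP-side queue depth (made up) */
    int queued = 0, stalls = 0, drained = 0;

    /* Shader tries to export every cycle; ROPs drain one every 2 cycles. */
    for (int cycle = 0; cycle < 1000; cycle++) {
        if (queued < depth)
            queued++;       /* export accepted, EXP_CNT slot freed */
        else
            stalls++;       /* queue full: the CU has to wait      */
        if (cycle % 2 == 1 && queued > 0) {
            queued--;       /* ROPs retire one export              */
            drained++;
        }
    }
    printf("%d exports drained in 1000 cycles, %d producer stall cycles\n",
           drained, stalls);
    /* Throughput converges to the drain rate: backpressure is mainly a
       throughput effect, as argued above. */
    return 0;
}
[/code]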
 
That's of course possible. But this usually means one is bandwidth limited anyway. The ROPs should be able to buffer a few exports before they stop to accept new ones (or are they completely relying on the latency hiding of the shader cores?).
I'm not sure if the effect has been completely teased out.
For example, Tahiti introduced a crossbar after the ROPs that fed memory, which was responsible for the significant clock- and bandwidth-normalized improvement over Cayman.
Localized contention for channels did pose a problem, although how much of the underutilization was due to localized bandwidth contention versus a possible limit to the number of pending operations the ROPs can buffer isn't clear.
I think that it might be evidence of cases where bandwidth consumption can be uniform globally over long periods of time, but with short-term contention that the ROPs cannot buffer around.
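As a toy illustration of how contention can be localized while aggregate traffic looks uniform (the channel mapping and stride here are invented, not GCN's real hash): with simple bit-sliced interleaving, a power-of-two stride can hammer a single channel:

[code]
#include <stdio.h>

int main(void) {
    int hits[8] = {0};
    /* 1024 accesses at a 2 KB stride; channel = address bits 8..10.
       Both the stride and the mapping are made-up illustration values. */
    for (int i = 0; i < 1024; i++) {
        unsigned addr = (unsigned)i * 2048;
        hits[(addr >> 8) & 7]++;
    }
    for (int c = 0; c < 8; c++)
        printf("channel %d: %4d accesses\n", c, hits[c]);
    /* Everything lands on channel 0, even though the total traffic looks
       perfectly uniform averaged over the whole run. A 256-byte stride
       would instead spread evenly across all eight channels.            */
    return 0;
}
[/code]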

The rate that the eSRAM can send and receive data on a per-controller basis would be higher, assuming each 8 MB block has its own controller. The reduced amount of contention for those controllers and the reduced latency would require less buffering, which apparently the ROPs aren't that good at.
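Speculative arithmetic on that, assuming the leaked organization of four 8 MB blocks, each with its own controller, and the commonly cited 853 MHz clock moving 128 bytes per cycle in each direction:

[code]
#include <stdio.h>

int main(void) {
    double clock   = 853e6;        /* leaked eSRAM clock                  */
    double one_dir = clock * 128;  /* 128 B/cycle each way, aggregate     */
    int    blocks  = 4;            /* assumed 4 x 8 MB independent blocks */

    printf("aggregate: ~%.0f GB/s each way\n", one_dir / 1e9);
    printf("per block: ~%.0f GB/s each way, with only ~1/%d of the\n"
           "clients' request stream contending at each controller\n",
           one_dir / blocks / 1e9, blocks);
    return 0;
}
[/code]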
 
Thanks.

Just signed up to post this, and I have had a beer, but ... I have just got to say thanks for the wicked thread. I have spent at least 14 hours reading this ... Compared to the rest of the internet, you guys are right up there. You know who you are. Cheers.

Onto the subject at hand. I think the Xbox One hardware looks absolutely sweet. Elegant. It's been 25 years since I got to metal on the Amiga, but I would love to get into this bad boy. Those of you getting paid to work on this...nice one.

At the end of the day it's going to come down to the software. If the APIs are on the money... it will fly.

Sorry for the lack of technical detail...
 
The Vgleaks article on the PS4's "hUMA" implementation indicates that it takes 300-350 cycles to invalidate volatile lines, with 75 related to bookkeeping and the rest possibly devoted to setup and, most importantly, the completion of all in-flight accesses. That's before any writeback penalty.
The CU main cache path for L2 misses still looks to be long.

And that's still a lot faster than having to flush the whole cache which would take something like 4K cycles on Xbox One, no?
 
And that's still a lot faster than having to flush the whole cache which would take something like 4K cycles on Xbox One, no?

In many cases, I believe so, but that's not relevant to the point I was discussing, which is whether that figure can be used as a hint about the memory pipeline's contribution to latency, excluding the external DRAM.

That's only the fixed initial cycle cost.
The variable latency component is dependent on how much needs writeback and the speed of the bus used.
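A toy model of those two terms, with guessed parameters (line size, bus width, and the dirty fraction of an assumed 512 KB L2 are all illustrative); tuned this way it lands near the ~4K-cycle full-flush figure mentioned above:

[code]
#include <stdio.h>

/* total = fixed invalidate pass + writeback of dirty lines over the bus */
static double flush_cycles(double fixed, int dirty_lines,
                           int line_bytes, int bus_bytes_per_cycle) {
    return fixed + (double)dirty_lines * line_bytes / bus_bytes_per_cycle;
}

int main(void) {
    /* Volatile invalidate: assume only a few dirty lines need pushing. */
    printf("volatile invalidate: ~%.0f cycles\n",
           flush_cycles(325, 64, 64, 64));
    /* Full flush: assume a 512 KB L2 with half its lines dirty.        */
    printf("full cache flush:    ~%.0f cycles\n",
           flush_cycles(325, (512 * 1024 / 64) / 2, 64, 64));
    return 0;
}
[/code]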
 
I've found some interesting bits about X1 power supply specs.

http://www.reddit.com/r/xboxone/comments/1mku5n/any_know_the_xbox_one_power_supply_specs/

[image: photo of the Xbox One power supply label]

I'm very surprised. I was thinking the X1 would consume around 170W... but 253W at 28nm? Even more than the first X360, with 203W (16.5A at 12V and 1A at 5V), and that was at 90nm.
 
Power supply ratings do not imply actual system power consumption. The Xbox One likely consumes far less than 253 watts in actual operation.
 
Well, it has to power the SoC itself, the RAM, HDD, Blu-ray drive, southbridge, and Kinect.
Insignificant next to the power of the [strike]sauce[/strike] SoC.


Anyways, the actual max DC output should be 12V*17.9A -> 214.8W, plus 5W standby. The 253 number is a bit weird (watt-hours, i.e. joules?). Even stranger that it's next to Spanish (?) text when the rest of the label is in English and Chinese.
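To underline the rating-vs-draw point, a sketch with a purely hypothetical component budget (every per-part wattage is a guess except Kinect's, which matches its old standalone adapter rating of 12V x 1.08A):

[code]
#include <stdio.h>

int main(void) {
    double dc_max = 12.0 * 17.9;   /* 214.8 W label limit */

    /* Hypothetical busy-game budget; every figure below is a guess
       except Kinect's (12V x 1.08A from its old standalone adapter). */
    double soc = 95, ram = 10, bluray = 10, hdd = 8, kinect = 13, misc = 15;
    double load = soc + ram + bluray + hdd + kinect + misc;

    printf("label DC limit: %.1f W\n", dc_max);
    printf("sketched load:  %.0f W (~%.0f%% of the limit)\n",
           load, 100.0 * load / dc_max);
    return 0;
}
[/code]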
 
I've found some interesting bits about X1 power supply specs.

http://www.reddit.com/r/xboxone/comments/1mku5n/any_know_the_xbox_one_power_supply_specs/

[image: photo of the Xbox One power supply label]

I'm very surprised. I was thinking the X1 would consume around 170W... but 253W at 28nm? Even more than the first X360, with 203W (16.5A at 12V and 1A at 5V), and that was at 90nm.

That is the maximum power rating; there's headroom because a power supply's maximum output drops a little with every year of use.

edit: unless you were hinting at... undocumented extra parts? Because in that case you were pretty discreet... :cool:
 
Insignificant next to the power of the [strike]sauce[/strike] SoC.


Anyways, the actual max DC output should be 12V*17.9A -> 214.8W, plus 5W standby. The 253 number is a bit weird (watt-hours, i.e. joules?). Even stranger that it's next to Spanish (?) text when the rest of the label is in English and Chinese.


Yes, the main contributor is the SoC, but we can't dismiss Kinect's draw; until the Slim models, Kinect required an additional power connector.

If the label is legit and not a Chinese clone, I guess it's meant for Mexico: input 100-127V (I think they use 110V at home).
 
Yes, the main contributor is the SoC,
Never seen Star Wars, have you? :(

but we can't dismiss Kinect's draw; until the Slim models, Kinect required an additional power connector.
Yes. Kinect's power adapter even specifies 12V, 1.08A max output, though K1 does have motors.

The sensor block diagram from Hot Chips doesn't seem to give the impression of any high-power-consuming component, though. *shrug*

If the label is legit and not a Chinese clone, I guess it's meant for Mexico: input 100-127V (I think they use 110V at home).
It was just funny to see two lines out of them all in a third language.
 