Xbox One (Durango) Technical hardware investigation

Given that Microsoft was apparently having yield issues, I'm more than a little surprised they are launching in 21 countries this year.

I hope this doesn't lend any credence to the down-clocking rumor.

It would more likely mean there were fewer or no eSRAM troubles to begin with. You have to get pretty convoluted to turn it into a negative somehow.
 
The 512 GB/s Pitcairn L2 cache is 512 kB.

The others are faster since they are replicated more often and/or are wider; I think that is the correlation to look at. They are smaller too, but I think it is replication × width × clock that determines the bandwidth.
On Pitcairn, there are actually:
- 1280 small register files (4 kB in size), each with a bandwidth of 16 bytes per clock (Hornet's number was too low ;));
- 20 LDS arrays of 64 kB (consisting of 32 banks of 2 kB) with a bandwidth of 4 bytes per bank per clock (up to 128 bytes per clock per LDS);
- 20 vector L1-D caches, each delivering up to 64 bytes per clock;
- 6 scalar L1-D caches (working also as constant caches); each client (3 or 4 CUs are linked to one sL1-D) can fetch up to 16 bytes per clock (up to 64 bytes per clock per sL1-D);
- and finally, 8 tiles of 64 kB L2 cache, each tile with a bandwidth of up to 64 bytes per clock.

As you see, there is not a single isolated data structure in a Pitcairn GPU (nor in any GCN GPU) which can be read faster than the 128 bytes per cycle the 32 MB eSRAM of Durango is capable of.
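
As a quick sketch of that replication × width × clock rule, here is the arithmetic for the structures listed above, assuming a 1000 MHz core clock (the HD 7870's; the clock is an assumption, the counts and widths are from the list):

```python
# Aggregate bandwidth = replication x width x clock, using the
# structure counts listed above; the 1000 MHz Pitcairn core clock
# (HD 7870) is an assumption.
CLOCK_HZ = 1000e6

structures = {            # name: (instances, bytes per clock each)
    "register files": (1280, 16),
    "LDS arrays":     (20, 128),
    "vector L1-D":    (20, 64),
    "scalar L1-D":    (6, 64),
    "L2 tiles":       (8, 64),
}

for name, (count, width) in structures.items():
    gbs = count * width * CLOCK_HZ / 1e9
    print(f"{name:14s}: {count:4d} x {width:3d} B/clk = {gbs:8.1f} GB/s")

# The L2 line reproduces the 512 GB/s figure quoted earlier:
# 8 tiles x 64 B/clk x 1 GHz = 512 GB/s.
```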
 
Given that Microsoft was apparently having yield issues, I'm more than a little surprised they are launching in 21 countries this year.

I hope this doesn't lend any credence to the down-clocking rumor.
It more likely does the reverse, because they haven't told devs about any downclock.
 
As you see, there is not a single isolated data structure in a Pitcairn GPU (nor in any GCN GPU) which can be read faster than the 128 bytes per cycle the 32 MB eSRAM of Durango is capable of.

Interesting data you posted.

But what I am missing/not understanding is how you know the "128 bytes per cycle the 32 MB eSRAM of Durango is capable of".

Do you know it for a fact but are not allowed to say how?

Or do you have a reference you can share that I can go read/look at?

I am curious if the 32MB eSRAM is a single structure or if it is many smaller pieces distributed in some way.

I have seen a number of posts which indicate that the writers believe it is one block, but I have seen no source for that data point. Can you share that?
 
Have you compared your calculations to the VGleaks articles with the Durango documentation? While incomplete and not guaranteed to be 100% correct in the end, it has been corroborated as being sourced from official documentation by multiple outlets.
So far, it has been very consistent with official statements.

There are posters on this board with insider connections, which I have factored into my calculus, but the numbers I've commented on have been drawn from public sources, or can be calculated using known properties of the architectures involved.
 
Interesting data you posted.

But what I am missing/not understanding is how you know the "128 bytes per cycle the 32 MB eSRAM of Durango is capable of".

Do you know it for a fact but are not allowed to say how?

Or do you have a reference you can share that I can go read/look at?
The 128 bytes per clock for the eSRAM was given in the documentation vgleaks cited from (which was confirmed to be legit). And in case you missed that, bkilian (who until not too long ago was on a team working on the XBOne [albeit not on the eSRAM]) also stated the 128 bytes per clock bandwidth. That number is as safe as it can be at this point in time.
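
And the two headline numbers circulating in this thread agree with each other, as a one-liner shows:

```python
# 128 bytes per clock at the 800 MHz GPU clock works out to the
# 102.4 GB/s eSRAM bandwidth figure quoted elsewhere in the thread.
bytes_per_clock = 128
clock_hz = 800e6
print(bytes_per_clock * clock_hz / 1e9, "GB/s")  # 102.4 GB/s
```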
 
On Pitcairn, there are actually:
- 1280 small register files (4 kB in size), each with a bandwidth of 16 bytes per clock (Hornet's number was too low ;));
- 20 LDS arrays of 64 kB (consisting of 32 banks of 2 kB) with a bandwidth of 4 bytes per bank per clock (up to 128 bytes per clock per LDS);
- 20 vector L1-D caches, each delivering up to 64 bytes per clock;
- 6 scalar L1-D caches (working also as constant caches); each client (3 or 4 CUs are linked to one sL1-D) can fetch up to 16 bytes per clock (up to 64 bytes per clock per sL1-D);
- and finally, 8 tiles of 64 kB L2 cache, each tile with a bandwidth of up to 64 bytes per clock.

As you see, there is not a single isolated data structure in a Pitcairn GPU (nor in any GCN GPU) which can be read faster than the 128 bytes per cycle the 32 MB eSRAM of Durango is capable of.

Isn't the eSRAM also a composite of banks, and thus the 128 bytes per cycle also a composite figure?
 
The individual register files and caches are far more local to their memory clients. A register's data is going to the adjacent ALUs, and the caches to the adjacent CU memory pipelines.

The L2s have more distance to travel, and they have multiple clients on the other side of the crossbar.

The eSRAM is much larger in comparison, although it is drawn as interfacing to the GPU's memory subsystem, which may take care of servicing potential clients since the data there can go across the whole SOC. Subdivision within the eSRAM may make it a reverse situation to the L2s, where there are many more storage pools trying to wire into a smaller set of readers.
 
To avoid replies to each of the last three posts I will put a couple things here:

1) If you guys have inside information on how many bytes per cycle for Xbox One, then I obviously can't debate you without inside information. You might be right; I have no idea without inside info. If you don't have inside information, then I am not sure how you can make your assertions, or whether you are right. Perhaps you can explain your source of info. How do you know it well enough to post it?
...
People repeatedly making assertions as if they have inside information on the design will not sway me much, as I see no references, nor do I really believe that they have that inside info. I see no references at all. If you want to say "I know this for a fact but can't talk due to NDA", fine.
Yes, what I stated is factually correct, from inside info, although you can see it on vgleaks too. The bandwidth to the ESRAM is 102 GB/s at 800 MHz.
 
Yes, what I stated is factually correct, from inside info, although you can see it on vgleaks too. The bandwidth to the ESRAM is 102 GB/s at 800 MHz.

Well, that must settle it then. I don't really understand (yet), as I was expecting/wishing for something else. I am having a hard time wrapping my head around 5 billion transistors and 32 MB of SRAM and the resulting specs.
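
For what it's worth, the SRAM alone accounts for a big slice of that transistor budget; here is a rough count, assuming standard 6T cells (the cell type and the absence of redundancy/ECC overhead are assumptions):

```python
# Rough transistor count for the 32 MB eSRAM array alone,
# assuming 6T SRAM cells and ignoring redundancy/ECC overhead.
capacity_bits = 32 * 2**20 * 8      # 32 MB in bits
transistors = capacity_bits * 6     # 6 transistors per cell
print(f"{transistors / 1e9:.2f} billion transistors")  # ~1.61
```

So roughly a third of those 5 billion transistors would be the storage cells alone.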

So I will wait and watch to see how the two systems perform and what the gameplay and actual visual quality turn out to be. I am interested in how more interesting worlds (in Skyrim- and Dragon Age-like games) work out if cloud computing is not vaporware, but in general my interests are heavily graphics and science fiction related. I hope to see the system generate nice graphical fidelity.
 
I'm not a game developer so please correct me if I'm wrong.

The Xbox 360 guaranteed peak ROP throughput even with alpha-blending, depth-testing and MSAA. The Xbox One, like any modern GPU, does not have enough bandwidth to guarantee peak ROP throughput even without MSAA, when performing alpha-blending and depth-testing. For instance, when writing to a single 32-bit render-target with alpha-blending and depth-testing, the GPU has to read 8 bytes and write 8 bytes per pixel, which, without caching and compression in the ROPs, would bottleneck pretty much every GPU on the planet, except Xenos in the Xbox 360.
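
A back-of-the-envelope check of that claim, assuming the 16 ROPs at 800 MHz from the leaked Durango specs discussed in this thread (both figures are assumptions here):

```python
# Bandwidth needed to sustain peak ROP rate with alpha-blending and
# depth-testing on a single 32-bit render target, with no ROP
# caching or compression. ROP count and clock are assumptions
# taken from the leaked specs discussed in this thread.
rops = 16
clock_hz = 800e6
peak_pixels = rops * clock_hz              # 12.8 Gpix/s

# read 4 B color + 4 B depth, write 4 B color + 4 B depth
bytes_per_pixel = 8 + 8

needed_gbs = peak_pixels * bytes_per_pixel / 1e9
print(f"needed: {needed_gbs:.1f} GB/s")    # 204.8 GB/s
print("eSRAM : 102.4 GB/s")                # figure from this thread
print("DDR3  :  68.3 GB/s")                # 256-bit DDR3-2133
```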

It is my understanding that this does not matter because:
1) modern ROPs are efficient enough to avoid many round-trips to memory (for instance, by storing coarse depth information in the internal caches and using those values as an early-reject filter);
2) modern rendering techniques do not rely heavily on alpha-blending in the main pass (for instance, I recall from the Killzone presentation that, when filling the g-buffer in a deferred renderer, alpha-blending is disabled);
3) modern rendering techniques make use of multiple render targets, therefore reducing the relative weight of depth writes compared to color writes.
GPUs used to have plenty of bandwidth, but this has not been true for several years. Hence, both hardware and software have moved toward more efficient usage of caches and local memories.

I would not be surprised if some Xbox One titles will render to main memory while using the ESRAM as a programmer-managed cache for textures and compute shader output. After all, similarly performing PC GPUs do not have much more bandwidth than the DDR3 pool in the Xbox One.

I also wonder whether the use of ESRAM instead of EDRAM was just a manufacturing choice or if simulations suggested the improved latency would improve performance significantly enough to justify the additional cost.
 
Xbone delayed in Asia for about a year.

This, in addition to the horrible price in the UK and no subsidised box, pretty much confirms the yield problems that were being touted recently.
 
Xbone delayed in Asia for about a year.

This, in addition to the horrible price in the UK and no subsidised box, pretty much confirms the yield problems that were being touted recently.

Umm, no it doesn't. The sales in Asia would have been low to nonexistent (as you know), so it has no bearing, and they're launching in 21 countries.

How many is the PS4 launching in? Aren't I reading on GAF that the PS4 isn't launching in Japan either? So they're having yield issues too? Or maybe Japan is just that irrelevant now?

The price is for Kinect packed in; haven't we been told a million times that it would raise the price? :rolleyes:
 
I also wonder whether the use of ESRAM instead of EDRAM was just a manufacturing choice or if simulations suggested the improved latency would improve performance significantly enough to justify the additional cost.

I still wonder, if this is the case, why not tout it?

I guess perhaps because in order to tout it, you have to admit 1.2 teraflops in the first place?

But if MS is willing to use cloud as a power crutch, touting "super efficiency" or the like seems like an idea.


Maybe it's all too technical. Even saying "ESRAM" would be so far over most people's heads...
 
I still wonder, if this is the case, why not tout it?

I guess perhaps because in order to tout it, you have to admit 1.2 teraflops in the first place?

I would wonder more why AMD or NV wouldn't jump to ESRAM/EDRAM and pocket the profits that would otherwise go to the GDDR5 producers, by using DDR3 instead.
 
Xbone delayed in Asia for about a year.

This, in addition to the horrible price in the UK and no subsidised box, pretty much confirms the yield problems that were being touted recently.

Umm, no it doesn't. The sales in Asia would have been low to nonexistent (as you know), so it has no bearing, and they're launching in 21 countries.

How many is the PS4 launching in? Aren't I reading on GAF that the PS4 isn't launching in Japan either? So they're having yield issues too? Or maybe Japan is just that irrelevant now?

The price is for Kinect packed in; haven't we been told a million times that it would raise the price? :rolleyes:

For a machine that is supposedly having ESRAM yield problems, whose hardware situation is "behind schedule and a mess", whose tools situation recently went from 'shitty' to 'partly shitty' (thx bkilian for that one), and which has a 900 GFLOP GPU, I thought the XBO showed very well yesterday, particularly in-game with BF4 and with Ryse, at least as good as what was shown elsewhere.
 
I would wonder more why AMD or NV wouldn't jump to ESRAM/EDRAM and pocket the profits that would otherwise go to the GDDR5 producers, by using DDR3 instead.

Well, unless some driver trick is possible, a desktop GPU with DDR3 and an ESRAM scratchpad would achieve abysmal performance on most existing titles and require developer effort for new titles (as well as Direct3D extensions). Also, a high-end GPU with such a large ESRAM scratchpad would be huge to manufacture (think of GK110 + at least 64 MB of ESRAM) and hard to cool. The L4 cache approach of Intel is probably the most sensible, and it seems to perform well enough. Still, it is not suited for high-end GPUs, and it doesn't make sense for AMD and NVIDIA to introduce a non-transparent feature only for low-end and mid-range GPU models. I recall the long-term plan of AMD/NVIDIA is stacked memory, so they are probably focusing on that.
 
2) Many of you seem to be assuming that the 32MB is set up like an L3 cache. I have no inside information and I am not making that assumption. I could be wrong, obviously.

I don't think it is a cache.

I think the ESRAM is just mapped to a region of the physical address space. Allocating ESRAM is thus a question of setting up page table entries to point to this region.

MS *could* remap existing allocations to main memory (making it a software-controlled cache of sorts, with 4 KB pages as cache lines), but I doubt it. It could easily create hard-to-reproduce pathological cases and would require usage heuristics to be gathered on ESRAM use.
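
A minimal sketch of that mapping idea, with a hypothetical base address and allocator names (nothing here is from the actual Durango toolchain):

```python
# Toy model of the eSRAM as a fixed window of physical address
# space, allocated by pointing 4 KB page table entries into it.
# The base address and all names are hypothetical.
PAGE_SIZE  = 4 * 1024
ESRAM_BASE = 0x8000_0000              # hypothetical physical base
ESRAM_SIZE = 32 * 1024 * 1024         # 32 MB -> 8192 pages

free_pages = list(range(ESRAM_SIZE // PAGE_SIZE))
page_table = {}                       # virtual page -> physical addr

def map_esram_page(virt_page: int) -> int:
    """Back one virtual page with a free eSRAM page."""
    phys = ESRAM_BASE + free_pages.pop() * PAGE_SIZE
    page_table[virt_page] = phys
    return phys
```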

Cheers
 