Xbox One (Durango) Technical hardware investigation

They seem to be enhanced in the X1 GPU to work with compressed data and eSRAM. Apparently there are certain hardwired functions that can effectively truncate or totally avoid the path through the pixel pipeline in some instances.
Where is the info coming from that they are enhanced beyond GCN?
 
dafuq? tried editing and it double posted. :/

See next post for better presentation of what I tried posting.
 
Thoughts?
As explained in detail above, ROPs don't write to (eS)RAM directly, but to their integrated caches, which provide a significantly higher bandwidth exactly for this blending stuff (and MSAA). Externally, the accesses have a far coarser granularity than individual pixels.
 
133 GB/s while alpha blending makes some sense if they figured out some way to overlap reads and writes on the eSRAM.

To do an alpha blend you need to read pixel a, read pixel b, and write pixel a+b. That's 3 memory ops (two reads, one write).

Suppose you can overlap the write for a+b with the first read for the next pixel: read pixel a, read pixel b, write pixel a+b / read pixel c, read pixel d, write pixel c+d / read pixel e ...

So you get 133 GB/s vs 100 GB/s while alpha blending. Seems plausible.
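To make that counting concrete, here is a toy model of the overlap idea (my own sketch, not anything from the leak; the 100 GB/s base rate is just the round figure used above, and the overlap rule is pure conjecture):

```python
# Toy schedule: each alpha blend issues two reads and one write, and the
# write of blend i is assumed to share a cycle with the first read of
# blend i+1 (one memory op per cycle otherwise).

def cycles_needed(num_blends, overlap=True):
    """Cycles to issue num_blends * 3 memory ops on the eSRAM."""
    if not overlap:
        return 3 * num_blends      # one op per cycle, nothing shared
    return 2 * num_blends + 1      # write i shares a cycle with read i+1

BASE_BW = 100.0                    # GB/s, the round base figure from above

for n in (4, 1000):
    ops = 3 * n
    cyc = cycles_needed(n)
    print(f"{n} blends: {ops} ops in {cyc} cycles "
          f"-> effective {BASE_BW * ops / cyc:.1f} GB/s")
```

A chain of 4 blends gives the 12-ops-in-9-cycles case (~133 GB/s); note that longer chains in this toy model trend toward 150 GB/s, so the quoted 133 GB/s fits short bursts better than a steady state.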

Hmmmm...let's carry your idea further.

**total guesswork below**

Assume equal bandwidth for read/write ops. You read from A, read from B, then write to C (which is A+B). The subsequent alpha blending operation then starts with a read from D that is simultaneous with the write of C from the previous operation, etc, like this:

[attached image: cycle-by-cycle table of blend reads/writes, with each write overlapped onto the next blend's first read]


...etc. Note that you can get 12 operations in 9 cycles (columns = cycles here). Without this newly discovered ability to read+write simultaneously you can only do one or the other, so you'd only ever be able to get 9 ops in 9 cycles max.

The columns above with the X's in them represent cycles where the eSRAM cannot read+write simultaneously; in all other columns/cycles it can. Evidently (for reasons unknown) the eSRAM can only read+write simultaneously for 7 out of every 8 cycles. If you are doing alpha blending, that nets you a factor of 12/9 ≈ 133% of the base rate, which should give you around ~133 GB/s effective bandwidth for eSRAM nominally sitting at ~100 GB/s (the actual value is 136.5 GB/s for the 102.4 GB/s eSRAM). Depending on rounding, that matches the value for alpha blending as reported in the article.

Other shader ops will vary in how much overlap there is. The max possible overlap would be a shader op set up to have a read+write every single cycle (except for that 1 out of every 8 where you can't, for some reason); i.e. you'd want alternating read/write ops where you simultaneously read one value and write another. This would be a full overlap 7/8 of the time, and for those 7 of every 8 cycles give you twice the ops per cycle you'd get without this revelation.

Newly discovered bandwidth = (102.4 GB/s)(7/8) = 89.6 GB/s
New total eSRAM bandwidth = 89.6 GB/s + 102.4 GB/s = 192 GB/s

That 192 GB/s figure is the max theoretical value, but you'll never get close to it unless you have a shader op set up to alternate reads/writes the entire time, which probably would never make a lot of sense.
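For reference, here is the arithmetic from the guesswork above spelled out, so the figures are easy to check (numbers from the posts, not from any official source):

```python
base = 102.4                      # GB/s, raw eSRAM peak (one op per cycle)

alpha_blend = base * 12 / 9       # 12 ops in 9 cycles, as in the table
print(f"alpha blending: {alpha_blend:.1f} GB/s")    # ~136.5 GB/s

extra = base * 7 / 8              # a second op on 7 of every 8 cycles
print(f"theoretical max: {base + extra:.1f} GB/s")  # 192.0 GB/s
```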

Thoughts?
 
Where is the info coming from that they are enhanced beyond GCN?

I am assuming they are customized more so than just GCN based on the wording of the leaked Durango GPU documentation found on VGLeaks. Then again, they could just be referring to the GCN architecture improvements over AMD's last iteration of GPU architecture.
 
I honestly doubt that either MS or Sony did any "real customizations" to GCN (or Jaguar); rather, they just used what AMD had to offer. ACEs, for example, were always scalable to however many you want, and Sony went with more, etc.
 
Newly discovered bandwidth = (102.4 GB/s)(7/8) = 89.6 GB/s
New total eSRAM bandwidth = 89.6 GB/s + 102.4 GB/s = 192 GB/s

That 192 GB/s figure is the max theoretical value, but you'll never get close to it unless you have a shader op set up to alternate reads/writes the entire time, which probably would never make a lot of sense.

Why are you adding in the 102.4 GB/s again? Is that for the eighth operation (read or write)?
 
I am assuming they are customized more so than just GCN based on the wording of the leaked Durango GPU documentation found on VGLeaks. Then again, they could just be referring to the GCN architecture improvements over AMD's last iteration of GPU architecture.
That's what it reads like to me. "These are the features of our GPU architecture (which happen to be the same as AMD's GCN, but we're talking about Durango devkits here so we aren't going to mention that)." The tech doc would also be the first place a lot of devs who aren't reading up on PC GPU architectures would read about such features, so you'd want to describe them without reference to prior knowledge they may not have. It's worth noting that MS talked about some GCN features using non-standard names, which supports this notion of addressing the devs from a blank slate and using their own terminology to talk about the architecture.

For devs coming to GCN architecture from old consoles, the eSRAM BW would be looked at as the limit available for all operations. It makes sense to highlight the DB and CB components and say, "you also get this new feature which improves performance over what you're used to." Note that a key audience for MS and Sony tech docs are their first parties, who have zero cause to follow and understand PC graphics developments, and will definitely need some of this explained.
 
Assume equal bandwidth for read/write ops. You read from A, read from B, then write to C (which is A+B). The subsequent alpha blending operation then starts with a read from D that is simultaneous with the write of C from the previous operation, etc, like this:
In addition to what I wrote already (the blending operations don't work directly on memory but on data in the ROP caches), a blending operation doesn't work that way either. If you export a color value c with the alpha value a to a pixel which already has the color value c_old (as a simplification I assume an alpha value of 1, i.e. opaque, for it) through the ROPs, then the operation looks like this:

(i) the ROP loads c_old from memory (with the ROP caches, the tile containing this pixel is loaded and decompressed [if color compression is used] into the ROP cache)
(ii) it calculates c_new = c*a + c_old*(1-a)
(iii) the ROP writes c_new to memory (with the ROP caches, the pixel and the whole tile containing it stay in the ROP cache until the tile becomes the least recently used one and gets evicted to memory when the ROP needs to cache new data [or the pipeline gets flushed]; that way, spatial locality gets exploited by neighboring pixels in the same tile or by consecutive nearby triangles; especially when there is overdraw and relatively small triangles, this leads to bandwidth savings because of data reuse [and with MSAA there are more bandwidth savings because of the color compression]).

In any case, each blending operation needs exactly one read and one write, not two reads and a write. The color value c is supplied by the shader, not from memory.
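To illustrate that sequence, here is a minimal sketch of the blend as described, with a toy "ROP cache" standing in for the real tile caches (all names and sizes here are invented for illustration, nothing from actual hardware docs):

```python
framebuffer = [0.0] * 16      # destination colors sitting in (eS)RAM
rop_cache = {}                # pixel index -> color value held on chip

def rop_blend(x, c, a):
    """Blend shader output c (alpha a) into pixel x: one read, one write."""
    if x not in rop_cache:
        rop_cache[x] = framebuffer[x]        # (i) the only external read
    c_old = rop_cache[x]
    rop_cache[x] = c * a + c_old * (1 - a)   # (ii) blend happens on chip

def flush():
    """Eviction / pipeline flush: the only external writes."""
    for x, c in rop_cache.items():
        framebuffer[x] = c                   # (iii) write c_new to memory
    rop_cache.clear()

rop_blend(3, c=1.0, a=0.5)    # overdraw hitting the same pixel reuses the
rop_blend(3, c=0.2, a=0.5)    # cached value: still just one read, one write
flush()
```

Note how the value c never comes from memory, and how overdraw on a cached pixel generates no extra external traffic.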
 
In any case, each blending operation needs exactly one read and one write, not two reads and a write. The color value c is supplied by the shader, not from memory.

Ok, that makes sense. Is there anything that you know of that might explain the alpha blending bandwidth example offered up in the article?

blake, yup! My model probably isn't right though, but figured I'd note it since the figures worked out and ppl here can correct it and tweak it as desired. :)
 
Ok, that makes sense. Is there anything that you know of that might explain the alpha blending bandwidth example offered up in the article?
Either MS did a recount of the bitlines to the eSRAM and discovered that there are more in the silicon than they wished for (AMD did them a favour, without telling them until last week? :rolleyes:) or there is something fishy going on. Judging from sebbi's tests with tiled access of the render target to maximize usage of the ROP caches, it should be easily possible to create a benchmark showing 192GB/s bandwidth for blending operations with the right access pattern. I really hope it doesn't boil down to that as it wouldn't show a higher eSRAM bandwidth at all but merely the standard caching features of the ROPs.
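As a back-of-the-envelope illustration of that worry (my own accounting with made-up numbers, not anything from sebbi's actual tests): blend every pixel twice and compare external traffic with and without a tile cache.

```python
PIXELS   = 1_000_000          # render target size (made up)
OVERDRAW = 2                  # each pixel blended twice
BYTES_PP = 4                  # bytes per pixel
blends = PIXELS * OVERDRAW

naive  = blends * 2 * BYTES_PP    # read + write per blend, no caching
cached = PIXELS * 2 * BYTES_PP    # one tile load + one evict per pixel

print(f"apparent speedup from caching alone: {naive / cached:.1f}x")
```

A benchmark that divides "blend work done" by wall time would report twice the true eSRAM bandwidth here without the interface ever moving a single extra byte.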
 
That's what it reads like to me. "These are the features of our GPU architecture (which happen to be the same as AMD's GCN, but we're talking about Durango devkits here so we aren't going to mention that)." [...]

I see. Makes sense since first party devs develop exclusively in a console environment and are not exposed to the new techniques employed by more modern, more powerful PC hardware.
 
Yep. But more than that, most devs on PC have no need to know about such low-level mechanics either. It's only console devs that look so closely at the way the hardware works (the engine-developing ninja code-monkeys, anyhow).
 
Either MS did a recount of the bitlines to the eSRAM and discovered that there are more in the silicon than they wished for (AMD did them a favour, without telling them until last week? :rolleyes:) or there is something fishy going on. [...]

That would be depressing, and somewhat confusing. I don't think there would be a good reason for this not being noticed very quickly and very early on. While an optimal pattern would be needed for the upper range of the theoretical peak, the hardware should have been automatically doing a lot of this by itself, unless the color caches that have been around since forever were for some reason broken.

The same goes for just discovering that you can send reads and writes to the eSRAM. What kind of architecture does it have if it can't pipeline accesses and receive more requests while it is servicing earlier accesses? A workload that has reads and writes would automatically find some of them being handled together if the hardware isn't blocking on everything.
 
The same goes for just discovering that you can send reads and writes to the eSRAM. [...]
Exactly. That's why it eventually boils down to the width of the interface to the eSRAM. If it is in total 1024 bits wide (assuming it runs at 800 MHz), there is no way to transfer more than 102.4 GB/s. If there is more (like two times 1024 bits unidirectional, equaling ~204.8 GB/s peak bandwidth), someone has designed it that way and it should have been known from the beginning. The whole explanation that "they found some empty cycles" doesn't make sense at all. One only has empty cycles if the interface doesn't run at capacity. And as others have said already, nobody takes bank conflicts (which reduce the usable bandwidth) into account for peak numbers, as they only show up with unfavourable access patterns.
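The interface-width arithmetic being used there, spelled out (the 1024-bit and 2x1024-bit widths are the hypotheticals from the post, not known specs):

```python
def peak_gbps(width_bits, clock_mhz):
    """Peak bandwidth of a port that moves width_bits every cycle."""
    return width_bits / 8 * clock_mhz / 1000   # bytes/cycle * GHz = GB/s

print(peak_gbps(1024, 800))      # 102.4 -> one shared 1024-bit port
print(peak_gbps(2 * 1024, 800))  # 204.8 -> two unidirectional 1024-bit ports
```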
 
How does an L3 cache controller manage reads/writes? Could this behaviour be closer to reading/writing from/to a large on-die L3 cache than to a typical main memory controller?
 
None of the consoles have an L3. And the eSRAM is software managed: the developer has to allocate memory in there and move the data in and out. It's not a hardware-managed cache.
But aside from that, all caches of the last few decades claiming to be any kind of performance part pipeline their accesses, as 3dilettante mentioned. It's not as if SRAM and SRAM controllers were just invented last year; they are usually very well understood. The responsible engineers know the cycle-perfect behaviour by heart before the device tapes out.
 
I honestly doubt that either MS or Sony did any "real customizations" to GCN (or Jaguar); rather, they just used what AMD had to offer. ACEs, for example, were always scalable to however many you want, and Sony went with more, etc.
Not speaking specifically about the DB/CB... there are real customizations to both console chips. Some you've heard about. Some you never will.
 
Not speaking specifically about the DB/CB... there are real customizations to both console chips. Some you've heard about. Some you never will.
They added custom IP building blocks. The core parts of CPU and GPU are very likely untouched. The most invasive modification (besides linking two Jaguar CUs to a northbridge providing the SMP/coherency glue, but maybe they reused something from the AMD shelf) is probably the eSRAM and its connection to the memory hierarchy.
 
None of the consoles have an L3. And the eSRAM is software managed: the developer has to allocate memory in there and move the data in and out. It's not a hardware-managed cache.

I know none of the consoles has an L3 cache. I was wondering how an L3 cache manages reads and writes.

Is it just like main memory, where you have a single or bidirectional pipelined bus, or is it something different, like a collection of small switching banks where a write to one bank could be finishing up while a read from another was being prepared (which in this hypothetical scenario might allow tightly alternating reads/writes to be faster than continuous reads or writes)?

And if a normal L3 doesn't behave like this, could whatever is in the Xbox One?
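For what it's worth, here is a toy model of that switching-banks scenario (entirely hypothetical, invented for illustration; it says nothing about how the actual eSRAM is built):

```python
NUM_BANKS = 8

def bank(addr):
    return addr % NUM_BANKS            # simple address interleaving

def cycles(ops):
    """Greedy in-order schedule for ops = [(kind, addr), ...]: each cycle
    issues the next op, plus the following one if it is the other kind
    (read vs. write) and maps to a different bank."""
    i, total = 0, 0
    while i < len(ops):
        total += 1
        kind, addr = ops[i]
        i += 1
        if i < len(ops):
            kind2, addr2 = ops[i]
            if kind2 != kind and bank(addr2) != bank(addr):
                i += 1                 # pair up: read+write overlap this cycle
    return total

alternating = [("r" if i % 2 == 0 else "w", i) for i in range(16)]
reads_only  = [("r", i) for i in range(16)]
print(cycles(alternating), cycles(reads_only))   # 8 vs. 16 cycles
```

In this toy, tightly alternating reads and writes really do run twice as fast as a pure read (or write) stream, which is exactly the behaviour the question asks about.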
 