Alright then, so I'm not behind the times. A bus that can simultaneously read and write doesn't exist, or at least doesn't exist in the next-generation consoles.
They are saying that it's full duplex in certain cases, with an 88% increase due to efficiency, which brings it down from the theoretical 204.8 GB/s to 192.512 GB/s: 102.4 * 1.88 = 192.512.
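For what it's worth, the article's numbers are at least internally consistent. A quick sanity check in plain Python, using the 128 byte x 800 MHz figure DF quotes further down (the full-duplex ceiling is my own extrapolation, not something MS has stated):

```python
# Sanity check of the quoted ESRAM figures (illustrative arithmetic only).

bus_bytes_per_cycle = 128        # per DF: "128 byte x 800MHz"
clock_hz = 800e6                 # ESRAM clock

one_way_peak = bus_bytes_per_cycle * clock_hz / 1e9   # 102.4 GB/s, pure reads or writes
full_duplex_ceiling = 2 * one_way_peak                # 204.8 GB/s if it read AND wrote every cycle
quoted_peak = one_way_peak * 1.88                     # the "88% higher" figure

print(one_way_peak, full_duplex_ceiling, quoted_peak) # -> 102.4, 204.8, 192.512 (up to float rounding)
```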
See, and I was thinking AMD had the extended ACEs in the works on its own. During the GCN architecture presentation two years ago it was mentioned that the number is scalable. Providing more queues per ACE is a straightforward development in my opinion. And it is actually going to be useful on the desktop and in the HPC area, too. It provides similar functionality to nV's HyperQ feature (up to 32 queues for GK110). I would therefore argue that this feature should have been in "GCN1.1" anyway. If it really needed Sony to push that decision, I would be unpleasantly surprised.

The extended ACEs are one of the custom features I'm referring to. Also, just because something exists in the GCN doc that was public for a bit doesn't mean it wasn't influenced by the console customers.
Yep. But more than that, most devs on PC have no need to know about such low-level mechanics either. It's only console devs that look so closely at the way the hardware works (the engine-developing ninja code-monkeys, anyhow).
Since Cerny mentioned it, I'll comment on the volatile flag. I didn't know about it until he mentioned it, but I looked it up and it really is new and driven by Sony. The extended ACEs are one of the custom features I'm referring to. Also, just because something exists in the GCN doc that was public for a bit doesn't mean it wasn't influenced by the console customers. That's one positive about working with consoles. Their engineers have specific requirements and really bang on the hardware and suggest new things that result in big or small improvements.
Yep. I think I'm confusing things. AFAIK buses are limited to one bit of info in one direction per clock. Dealing with two signals on one wire in opposite directions at the same time can't be easy, especially at these frequencies and component sizes.
Yeah, but 'duplex' isn't really viable. No bus sends and receives data simultaneously in electronics, so the idea that the GPU<>ESRAM bus can do this is extraordinarily unrealistic AFAIK.
The only obvious way to get more info across the same bus at the same clock speed is to do something funky with the clocks and manage to squeeze an extra 'cycle' in there (or rather, send data one way on the down edge and the other way on the up edge). That's not something you'd just come across by chance, though! If it's even possible.
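For scale, the "use both clock edges" idea is a real thing in DRAM signalling (it's the DDR in DDR3), but both transfers go the same way at any given moment. A back-of-the-envelope sketch, assuming the 128-byte interface DF quotes:

```python
# Rough numbers for the "squeeze in an extra transfer per clock" idea above.
# Note: DDR memory really does transfer on both clock edges, but in one direction
# at a time; splitting the edges between read and write is the poster's speculation.

interface_bytes = 128
clock_hz = 800e6

single_rate = interface_bytes * clock_hz * 1 / 1e9   # one transfer per clock: 102.4 GB/s
double_rate = interface_bytes * clock_hz * 2 / 1e9   # one per edge: 204.8 GB/s aggregate
print(single_rate, double_rate)
# Even if one edge carried reads and the other writes, each direction would still
# top out at 102.4 GB/s.
```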
There's a gray area in terms of custom features. Some features wouldn't exist without a customer pushing for them, and others can be influenced via collaboration during development. The compute features would certainly have evolved on their own, without console customer feedback.

See, and I was thinking AMD had the extended ACEs in the works on its own. During the GCN architecture presentation two years ago it was mentioned that the number is scalable. Providing more queues per ACE is a straightforward development in my opinion. And it is actually going to be useful on the desktop and in the HPC area, too. It provides similar functionality to nV's HyperQ feature (up to 32 queues for GK110). I would therefore argue that this feature should have been in "GCN1.1" anyway. If it really needed Sony to push that decision, I would be unpleasantly surprised.
Anyway, it looks like the people responsible for Temash and Kabini were thinking in that direction, too. It was the first available AMD chip to use the updated GCN iteration. They integrated four of the new ACEs into the tiny GPU, which is now capable of handling 32 queues, just like GK110. It appears to me that it matches the PS4's tech level (or doesn't the PS4 support FLAT addressing?).
I'm a bit surprised that the XB1 appears to be using the old ACEs (or are those two new ones providing 16 queues in total? I guess MS would have mentioned it). Did the XB1 stick with GCN 1.0 because of more extensive modifications to it?
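For reference, the queue math being discussed, with the configurations as publicly reported at the time (treat the exact counts as assumptions rather than vendor confirmations):

```python
# Total compute queues = number of ACEs x queues per ACE.

def total_queues(aces: int, queues_per_ace: int) -> int:
    return aces * queues_per_ace

print(total_queues(4, 8))  # Kabini/Temash: 4 new ACEs x 8 queues = 32 (same count as GK110's HyperQ)
print(total_queues(8, 8))  # PS4 as described by Cerny: 8 ACEs x 8 queues = 64
print(total_queues(2, 8))  # the "two new ACEs" XB1 possibility raised above = 16
```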
I'm interested in who gave this info to DF, why they leaked it and why Leadbetter became a stenographer.
Digital Foundry @digitalfoundry 29 Jun
Xbox One ESRAM peak bandwidth is still 102.4GB/s on pure reads or writes, so I assume 128 byte x 800MHz calculation still holds true.
Michael Coleman @MichaelMikado 29 Jun
@digitalfoundry r u saying only 88% of the memory can read/write simultaneously? Please explain how that's technically possible? What part??
@MichaelMikado no that's not what it says, did you read the article?
Ali @phosphor112 29 Jun
@digitalfoundry Where does the 88% come from? The ~133GB/s is the "realistic" BW of the chip, so how do they come up with that 88%?
Digital Foundry @digitalfoundry 23h
@phosphor112 good question, MS didn't go into depth on that one
Xbox One processor is considerably more capable than Microsoft envisaged during pre-production of the console, with data throughput levels up to 88 per cent higher in the final hardware.
Indeed, it seems the deeper this rabbit hole goes, the more nonsensical it becomes. If it is true, there is some very vital piece of info missing; otherwise it's just technobabble (nonsense that uses pretty tech terms).
There's no way the engineers just realized the bandwidth is greater than they designed.
There just isn't nearly enough information in that story to understand what's going on. I can't see why Microsoft would try to deceive developers with creative math. From what Digital Foundry reported, there's no way to figure out those numbers without a huge amount of guesswork. I imagine some of the information also gets confused as it changes hands. There's no way the engineers just realized the bandwidth is greater than they designed.
Probably better to have this as a pleasant surprise to developers now than to have to pull back on bandwidth later, especially given how critical the ESRAM will be to performance tuning. In addition, these theoretical peaks aren't possible in all scenarios, so that's another reason to start by communicating the "known good" specs early on instead of saying something like "well, it's 102 GB/s all the time, but in some cases it *might* be 133 GB/s, and in some even rarer cases it might be able to do 192 GB/s", etc. Especially when the 133 and 192 hadn't been proven out on final hardware yet.
pure reads or writes
Aeoniss's quote says the guy has no insider knowledge. I don't know where Aeoniss is getting that from. But if you go just by technical documentation, you have MS saying, "we've got these great DB and CB blocks that can do funky stuff" and you've got Sony not saying that. The logical conclusion is to believe they are something unique to MS if you don't know any better.

Not sure I agree with this, as he makes a clear comparison to the corresponding hardware in the PS4 (and says that the DB/CBs have the fastest memory in either system). Now he may be altogether ignorant, but I don't believe it's that he's just getting his console-developing feet wet.
There are no details on the blocks in XB1 to compare the GCN RBE to.

Is there any detailed documentation on the equivalent CB/DB silicon in the PS4 to see if there are any differences?
There are no details on the blocks in XB1 to compare the GCN RBE to.

This is what Microsoft wrote about them:
XB1 documentation given to Devs said:

Output
Pixel shading output goes through the DB and CB before being written to the depth/stencil and color render targets. Logically, these buffers represent screenspace arrays, with one value per sample. Physically, implementation of these buffers is much more complex, and involves a number of optimizations in hardware.
Both depth and color are stored in compressed formats. The purpose of compression is to save bandwidth, not memory, and, in fact, compressed render targets actually require slightly more memory than their uncompressed analogues. Compressed render targets provide for certain types of fast-path rendering. A clear operation, for example, is much faster in the presence of compression, because the GPU does not need to explicitly write the clear value to every sample. Similarly, for relatively large triangles, MSAA rendering to a compressed color buffer can run at nearly the same rate as non-MSAA rendering.
For performance reasons, it is important to keep depth and color data compressed as much as possible. Some examples of operations which can destroy compression are:
•Rendering highly tessellated geometry
•Heavy use of alpha-to-mask (sometimes called alpha-to-coverage)
•Writing to depth or stencil from a pixel shader
•Running the pixel shader per-sample (using the SV_SampleIndex semantic)
•Sourcing the depth or color buffer as a texture in-place and then resuming use as a render target
Both the DB and the CB have substantial caches on die, and all depth and color operations are performed locally in the caches. Access to these caches is faster than access to ESRAM. For this reason, the peak GPU pixel rate can be larger than what raw memory throughput would indicate. The caches are not large enough, however, to fit entire render targets. Therefore, rendering that is localized to a particular area of the screen is more efficient than scattered rendering.
Fill
The GPU contains four physical instances of both the CB and the DB. Each is capable of handling one quad per clock cycle for a total throughput of 16 pixels per clock cycle, or 12.8 Gpixel/sec. The CB is optimized for 64-bit-per-pixel types, so there is no local performance advantage in using smaller color formats, although there may still be a substantial bandwidth savings.
Because alpha-blending requires both a read and a write, it potentially consumes twice the bandwidth of opaque rendering, and for some color formats, it also runs at half rate computationally. Likewise, because depth testing involves a read from the depth buffer, and depth update involves a write to the depth buffer, enabling either state can reduce overall performance.
Depth and Stencil
The depth block occurs near the end of the logical rendering pipeline, after the pixel shader. In the GPU implementation, however, the DB and the CB can interact with rendering both before and after pixel shading, and the pipeline supports several types of optimized early decision pathways. Durango implements both hierarchical Z (Hi-Z) and early Z (and the same for stencil). Using careful driver and hardware logic, certain depth and color operations can be moved before the pixel shader, and in some cases, part or all of the cost of shading and rasterization can be avoided.
Depth and stencil are stored and handled separately by the hardware, even though syntactically they are treated as a unit. A read of depth/stencil is really two distinct operations, as is a write to depth/stencil. The driver implements the mixed format DXGI_FORMAT_D24_UNORM_S8_UINT by using two separate allocations: a 32-bit depth surface (with 8 bits of padding per sample) and an 8-bit stencil surface.
Antialiasing
The Durango GPU supports 2x, 4x, and 8x MSAA levels. It also implements a modified type of MSAA known as compressed AA. Compressed AA decouples two notions of sample:
•Coverage sample – One of several screenspace positions generated by rasterization of one pixel
•Surface sample – One of several entries representing a single pixel in a color or depth/stencil surface
Traditionally, coverage samples and surface samples match up one to one. In standard 4xMSAA, for example, a triangle may cover from zero to four samples of any given pixel, and a depth and a color are recorded for each covered sample.
Under compressed AA, there can be more coverage samples than surface samples. In other words, a triangle may still cover several screenspace locations per pixel, but the GPU does not allocate enough render target space to store a unique depth and color for each location. Hardware logic determines how to combine data from multiple coverage samples. In areas of the screen with extensive subpixel detail, this data reduction process is lossy, but the errors are generally unobjectionable. Compressed AA combines most of the quality benefits of high MSAA levels with the relaxed space requirements of lower MSAA levels.
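One practical reason the compressed AA scheme matters in this discussion: render-target footprint scales with surface samples, not coverage samples, and that footprint has to fight for the 32 MB of ESRAM. A rough illustration (resolution, format, and the coverage/surface split are my assumptions; the documentation quoted above doesn't list the supported combinations):

```python
# Color render-target footprint scales with the number of *surface* samples.

width, height = 1920, 1080
bytes_per_color = 4   # 32-bit color, for illustration

def color_target_mib(surface_samples: int) -> float:
    return width * height * surface_samples * bytes_per_color / 2**20

print(round(color_target_mib(4), 1))  # standard 4xMSAA:            ~31.6 MiB
print(round(color_target_mib(2), 1))  # e.g. 4x coverage, 2 stored: ~15.8 MiB
print(round(color_target_mib(1), 1))  # no MSAA:                     ~7.9 MiB
```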
This is what Microsoft wrote about them:
... cool stuff ...
That sounds not very special. The bolded part right before the "Fill" section describes exactly the ROP caches (16 kB color and 4 kB Z per RBE). Sebbbi's test with a discrete GCN card demonstrated exactly what they are claiming there, namely that localized access (he used 128x128 pixel tiling of the render target) gains you efficiency. Using that with 64-bit color blending, you could potentially double the apparent bandwidth available to the ROPs over what the eSRAM provides (and with non-optimal access patterns it converges eventually to the eSRAM bandwidth, but the claimed 133 GB/s could be possible under certain circumstances also without tiling; that they say 192 GB/s is the peak and not 204.8 GB/s could be some inherent inefficiency in the ROPs hindering them from reaching their theoretical peak, or some bandwidth needed for the Z buffer they neglected somehow, or something). As I said, I hope it's not just that, as every GPU from the last few years can do that.
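A toy model of that argument, with the read hit rate as a made-up knob (the 102.4 GB/s figure and the read-plus-write-per-blended-pixel assumption come from the thread; nothing here is measured):

```python
# With 64-bit blending every pixel needs a read and a write.  If the read is
# served from the on-die ROP caches (localized/tiled rendering), only the write
# goes out to eSRAM, so the *apparent* bandwidth the ROPs see can exceed the
# 102.4 GB/s the eSRAM provides in one direction.

esram_bw = 102.4                                        # GB/s, one direction
for read_hit_rate in (0.0, 0.5, 0.9, 1.0):
    esram_ops_per_pixel = 1.0 + (1.0 - read_hit_rate)   # 1 write + the reads that miss
    apparent_bw = esram_bw * 2.0 / esram_ops_per_pixel  # read+write both counted as useful traffic
    print(read_hit_rate, round(apparent_bw, 1))
# 0.0 -> 102.4, 0.5 -> 136.5, 0.9 -> 186.2, 1.0 -> 204.8
```

Under this (very crude) model, the quoted 133 GB/s and 192.512 GB/s figures would correspond to read hit rates of roughly 0.46 and 0.94.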
So "diagnostics" or a "test suite" of sorts could now be reporting this 88% increase in throughput, which could be thought of as an increase in apparent bandwidth performance?
If they would try to deliberately misrepresent the capabilities of the hardware, yes. That's why I hope it's not the explanation.
Technically that could be considered an increase in hardware performance.

Technically, it's the performance you expect from 16 ROPs with their specialized caches when backed up by 102.4 GB/s of bandwidth. Nothing increased at all.
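To put numbers on "the performance you expect from 16 ROPs backed by 102.4 GB/s", here is the arithmetic implied by the Fill figures quoted earlier (my own working, not part of the documentation):

```python
# Fill-rate and bandwidth arithmetic from the quoted "Fill" figures.

cb_db_instances = 4
pixels_per_quad = 4
clock_hz = 800e6

pixels_per_clock = cb_db_instances * pixels_per_quad    # 16 pixels/clock
fill_rate = pixels_per_clock * clock_hz                 # 12.8 Gpixel/s, as documented

bytes_per_pixel = 8                                     # the 64-bit color formats the CB is optimized for
write_only_bw = fill_rate * bytes_per_pixel / 1e9       # 102.4 GB/s -- exactly the one-way eSRAM peak
blending_bw = 2 * write_only_bw                         # read + write per pixel: 204.8 GB/s of ROP traffic

print(fill_rate / 1e9, write_only_bw, blending_bw)      # 12.8, 102.4, 204.8
```

So running the ROPs flat out with 64-bit blending generates more traffic than the eSRAM can supply in one direction, which is where the on-die DB/CB caches come in.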
Depth and Stencil
The depth block occurs near the end of the logical rendering pipeline, after the pixel shader. In the GPU implementation, however, the DB and the CB can interact with rendering both before and after pixel shading, and the pipeline supports several types of optimized early decision pathways. Durango implements both hierarchical Z (Hi-Z) and early Z (and the same for stencil). Using careful driver and hardware logic, certain depth and color operations can be moved before the pixel shader, and in some cases, part or all of the cost of shading and rasterization can be avoided.
Depth and stencil are stored and handled separately by the hardware, even though syntactically they are treated as a unit. A read of depth/stencil is really two distinct operations, as is a write to depth/stencil. The driver implements the mixed format DXGI_FORMAT_D24_UNORM_S8_UINT by using two separate allocations: a 32-bit depth surface (with 8 bits of padding per sample) and an 8-bit stencil surface.
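A quick footprint check of the split depth/stencil allocation described in that quoted passage (1080p, no MSAA; both are my illustrative assumptions):

```python
# D24_UNORM_S8_UINT as two separate allocations, per the quoted documentation.

width, height, samples = 1920, 1080, 1

depth_bytes   = width * height * samples * 4   # 32-bit depth surface (24-bit depth + 8 bits padding)
stencil_bytes = width * height * samples * 1   # separate 8-bit stencil surface
packed_bytes  = width * height * samples * 4   # what a literally packed D24S8 layout would take

print(depth_bytes / 2**20, stencil_bytes / 2**20, (depth_bytes + stencil_bytes) / 2**20)
# ~7.9 MiB + ~2.0 MiB = ~9.9 MiB, versus ~7.9 MiB for a packed layout.
```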
People underestimate those under-spoken DB/CB blocks. They are extremely useful, powerful, and far faster than any memory in either the PS4's or the Xbox One's memory system.
Fast render paths and early-out on depth tests at the vertex level? This isn't something the PS4 GPU can do.
A depth test takes up zero bandwidth when done (correctly) on the Xbox One; that's a soak on the PS4's bandwidth.
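For scale, here is roughly how much external traffic a completely naive depth test would generate if nothing stayed on-die; whether either console actually pays this cost is exactly what the features described in the quoted documentation (Hi-Z, early Z, the DB caches) are meant to avoid, and the PS4 comparison is the poster's claim, not something in that documentation:

```python
# Naive estimate: one 32-bit depth read and one 32-bit depth write per pixel,
# 1080p at 60 fps, with an assumed average overdraw of 2 (made-up number).

width, height, fps = 1920, 1080, 60
bytes_per_test = 4 + 4          # read + write of a 32-bit depth value
overdraw = 2.0

traffic_gb_per_s = width * height * fps * bytes_per_test * overdraw / 1e9
print(round(traffic_gb_per_s, 2))   # ~1.99 GB/s under these assumptions
```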