Alright then, so I'm not behind the times. A bus that can simultaneously read and write doesn't exist, or at least doesn't exist in the next-generation consoles.
They are saying that it's full duplex in certain cases, with an 88% increase due to efficiency, which brings it down from the theoretical 204.8 GB/s to 192.512 GB/s: 102.4 * 1.88 = 192.512.
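For what it's worth, the article's numbers are at least internally consistent. A quick sanity check in plain Python, using the 128 byte x 800 MHz figure DF quotes further down (the full-duplex ceiling is my own extrapolation, not something MS has stated):

```python
# Sanity check of the quoted ESRAM figures (illustrative arithmetic only).

bus_bytes_per_cycle = 128        # per DF: "128 byte x 800MHz"
clock_hz = 800e6                 # ESRAM clock

one_way_peak = bus_bytes_per_cycle * clock_hz / 1e9   # 102.4 GB/s, pure reads or writes
full_duplex_ceiling = 2 * one_way_peak                # 204.8 GB/s if it read AND wrote every cycle
quoted_peak = one_way_peak * 1.88                     # the "88% higher" figure

print(one_way_peak, full_duplex_ceiling, quoted_peak) # -> 102.4, 204.8, 192.512 (up to float rounding)
```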
See, and I was thinking AMD had the extended ACEs in the works on its own. During the GCN architecture presentation two years ago it was mentioned that the number is scalable. Providing more queues per ACE is a straightforward development in my opinion. And it is actually going to be useful on the desktop and in the HPC area, too. It provides similar functionality to nV's HyperQ feature (up to 32 queues for GK110). I would therefore argue that this feature should have been in "GCN1.1" anyway. If it really needed Sony to push that decision, I would be unpleasantly surprised.

The extended ACEs are one of the custom features I'm referring to. Also, just because something exists in the GCN doc that was public for a bit doesn't mean it wasn't influenced by the console customers.
Yep. But more than that, most devs on PC have no need to know about such low-level mechanics either. It's only console devs that look so closely at the way the hardware works (the engine-developing ninja code-monkeys, anyhow).
Since Cerny mentioned it, I'll comment on the volatile flag. I didn't know about it until he mentioned it, but I looked it up and it really is new and driven by Sony. The extended ACEs are one of the custom features I'm referring to. Also, just because something exists in the GCN doc that was public for a bit doesn't mean it wasn't influenced by the console customers. That's one positive about working with consoles. Their engineers have specific requirements and really bang on the hardware and suggest new things that result in big or small improvements.
Yep. I think I'm confusing things. AFAIK buses are limited to one bit of info in one direction per clock. Dealing with two signals on one wire in opposite directions at the same time can't be easy, especially at these frequencies and component sizes.
Yeah, but 'duplex' isn't really viable. No bus sends and receives data simultaneously in electronics, so the idea that the GPU<>ESRAM bus can do this is extraordinarily unrealistic AFAIK.
The only obvious way to get more info across the same bus at the same clock speed is to do something funky with the clocks and manage to squeeze an extra 'cycle' in there (or rather, send data one way on the down edge and the other way on the up edge). That's not something you'd just come across by chance, though! If it's even possible.
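For scale, the "use both clock edges" idea is a real thing in DRAM signalling (it's the DDR in DDR3), but both transfers go the same way at any given moment. A back-of-the-envelope sketch, assuming the 128-byte interface DF quotes:

```python
# Rough numbers for the "squeeze in an extra transfer per clock" idea above.
# Note: DDR memory really does transfer on both clock edges, but in one direction
# at a time; splitting the edges between read and write is the poster's speculation.

interface_bytes = 128
clock_hz = 800e6

single_rate = interface_bytes * clock_hz * 1 / 1e9   # one transfer per clock: 102.4 GB/s
double_rate = interface_bytes * clock_hz * 2 / 1e9   # one per edge: 204.8 GB/s aggregate
print(single_rate, double_rate)
# Even if one edge carried reads and the other writes, each direction would still
# top out at 102.4 GB/s.
```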
There's a gray area in terms of custom features. Some features wouldn't exist without a customer pushing for them, and others can be influenced via collaboration during development. The compute features would certainly have evolved on their own, without console customer feedback.

See, and I was thinking AMD had the extended ACEs in the works on its own. During the GCN architecture presentation two years ago it was mentioned that the number is scalable. Providing more queues per ACE is a straightforward development in my opinion. And it is actually going to be useful on the desktop and in the HPC area, too. It provides similar functionality to nV's HyperQ feature (up to 32 queues for GK110). I would therefore argue that this feature should have been in "GCN1.1" anyway. If it really needed Sony to push that decision, I would be unpleasantly surprised.
Anyway, it looks like the people responsible for Temash and Kabini were thinking in that direction, too. It was the first available AMD chip to use the updated GCN iteration. They integrated four of the new ACEs into the tiny GPU, which is now capable of handling 32 queues, just like GK110. It appears to me that it matches the PS4's tech level (or doesn't the PS4 support FLAT addressing?).
I'm a bit surprised that the XB1 appears to be using the old ACEs (or are those two new ones providing 16 queues in total? I guess MS would have mentioned it). Did the XB1 stick with GCN 1.0 because of more extensive modifications to it?
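For reference, the queue math being discussed, with the configurations as publicly reported at the time (treat the exact counts as assumptions rather than vendor confirmations):

```python
# Total compute queues = number of ACEs x queues per ACE.

def total_queues(aces: int, queues_per_ace: int) -> int:
    return aces * queues_per_ace

print(total_queues(4, 8))  # Kabini/Temash: 4 new ACEs x 8 queues = 32 (same count as GK110's HyperQ)
print(total_queues(8, 8))  # PS4 as described by Cerny: 8 ACEs x 8 queues = 64
print(total_queues(2, 8))  # the "two new ACEs" XB1 possibility raised above = 16
```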
I'm interested in who gave this info to DF, why they leaked it and why Leadbetter became a stenographer.
Digital Foundry @digitalfoundry 29 Jun
Xbox One ESRAM peak bandwidth is still 102.4GB/s on pure reads or writes, so I assume 128 byte x 800MHz calculation still holds true.
Michael Coleman @MichaelMikado 29 Jun
@digitalfoundry r u saying only 88% of the memory can read/write simultaneously? Please explain how that's technically possible? What part??
@MichaelMikado no that's not what it says, did you read the article?
Ali @phosphor112 29 Jun
@digitalfoundry Where does the 88% come from? The ~133GB/s is the "realistic" BW of the chip, so how do they come up with that 88%?
Digital Foundry @digitalfoundry 23h
@phosphor112 good question, MS didn't go into depth on that one
Xbox One processor is considerably more capable than Microsoft envisaged during pre-production of the console, with data throughput levels up to 88 per cent higher in the final hardware.
Indeed, it seems the deeper this rabbit hole goes, the more nonsensical it becomes. If it is true, there is some very vital piece of info missing; otherwise it's just technobabble (nonsense that uses pretty tech terms).
There's no way the engineers just realized the bandwidth is greater than they designed.
There just isn't nearly enough information in that story to understand what's going on. I can't see why Microsoft would try to deceive developers with creative math. From what Digital Foundry reported, there's no way to figure out those numbers without a huge amount of guesswork. I imagine some of the information also gets confused as it changes hands. There's no way the engineers just realized the bandwidth is greater than they designed.
Probably better to have this as a pleasant surprise to developers now than to have to pull back on bandwidth later, especially given how critical the ESRAM will be to performance tuning. In addition, these theoretical peaks aren't possible in all scenarios, so that's another reason to start by communicating the "known good" specs early on instead of saying something like "well, it's 102 GB/s all the time, but in some cases it *might* be 133 GB/s, and in some even rarer cases it might be able to do 192 GB/s", etc. Especially when the 133 and 192 hadn't been proven out on final hardware yet.
pure reads or writes
Aeoniss's quote says the guy has no insider knowledge. I don't know where Aeoniss is getting that from. But if you go just by technical documentation, you have MS saying, "we've got these great DB and CB blocks that can do funky stuff" and you've got Sony not saying that. The logical conclusion is to believe they are something unique to MS if you don't know any better.

Not sure I agree with this, as he makes a clear comparison to the corresponding hardware in the PS4 (and says that the DB/CBs have the fastest memory in either system). Now he may be altogether ignorant, but I don't believe it's that he's just getting his console-developing feet wet.
There are no details on the blocks in XB1 to compare the GCN RBE to.

Is there any detailed documentation on the equivalent CB/DB silicon in the PS4 to see if there are any differences?
There are no details on the blocks in XB1 to compare the GCN RBE to.

This is what Microsoft wrote about them:
XB1 documentation given to Devs said:

Output
Pixel shading output goes through the DB and CB before being written to the depth/stencil and color render targets. Logically, these buffers represent screenspace arrays, with one value per sample. Physically, implementation of these buffers is much more complex, and involves a number of optimizations in hardware.
Both depth and color are stored in compressed formats. The purpose of compression is to save bandwidth, not memory, and, in fact, compressed render targets actually require slightly more memory than their uncompressed analogues. Compressed render targets provide for certain types of fast-path rendering. A clear operation, for example, is much faster in the presence of compression, because the GPU does not need to explicitly write the clear value to every sample. Similarly, for relatively large triangles, MSAA rendering to a compressed color buffer can run at nearly the same rate as non-MSAA rendering.
For performance reasons, it is important to keep depth and color data compressed as much as possible. Some examples of operations which can destroy compression are:
•Rendering highly tessellated geometry
•Heavy use of alpha-to-mask (sometimes called alpha-to-coverage)
•Writing to depth or stencil from a pixel shader
•Running the pixel shader per-sample (using the SV_SampleIndex semantic)
•Sourcing the depth or color buffer as a texture in-place and then resuming use as a render target
Both the DB and the CB have substantial caches on die, and all depth and color operations are performed locally in the caches. Access to these caches is faster than access to ESRAM. For this reason, the peak GPU pixel rate can be larger than what raw memory throughput would indicate. The caches are not large enough, however, to fit entire render targets. Therefore, rendering that is localized to a particular area of the screen is more efficient than scattered rendering.
Fill
The GPU contains four physical instances of both the CB and the DB. Each is capable of handling one quad per clock cycle for a total throughput of 16 pixels per clock cycle, or 12.8 Gpixel/sec. The CB is optimized for 64-bit-per-pixel types, so there is no local performance advantage in using smaller color formats, although there may still be a substantial bandwidth savings.
Because alpha-blending requires both a read and a write, it potentially consumes twice the bandwidth of opaque rendering, and for some color formats, it also runs at half rate computationally. Likewise, because depth testing involves a read from the depth buffer, and depth update involves a write to the depth buffer, enabling either state can reduce overall performance.
Depth and Stencil
The depth block occurs near the end of the logical rendering pipeline, after the pixel shader. In the GPU implementation, however, the DB and the CB can interact with rendering both before and after pixel shading, and the pipeline supports several types of optimized early decision pathways. Durango implements both hierarchical Z (Hi-Z) and early Z (and the same for stencil). Using careful driver and hardware logic, certain depth and color operations can be moved before the pixel shader, and in some cases, part or all of the cost of shading and rasterization can be avoided.
Depth and stencil are stored and handled separately by the hardware, even though syntactically they are treated as a unit. A read of depth/stencil is really two distinct operations, as is a write to depth/stencil. The driver implements the mixed format DXGI_FORMAT_D24_UNORM_S8_UINT by using two separate allocations: a 32-bit depth surface (with 8 bits of padding per sample) and an 8-bit stencil surface.
Antialiasing
The Durango GPU supports 2x, 4x, and 8x MSAA levels. It also implements a modified type of MSAA known as compressed AA. Compressed AA decouples two notions of sample:
•Coverage sample – One of several screenspace positions generated by rasterization of one pixel
•Surface sample – One of several entries representing a single pixel in a color or depth/stencil surface
Traditionally, coverage samples and surface samples match up one to one. In standard 4xMSAA, for example, a triangle may cover from zero to four samples of any given pixel, and a depth and a color are recorded for each covered sample.
Under compressed AA, there can be more coverage samples than surface samples. In other words, a triangle may still cover several screenspace locations per pixel, but the GPU does not allocate enough render target space to store a unique depth and color for each location. Hardware logic determines how to combine data from multiple coverage samples. In areas of the screen with extensive subpixel detail, this data reduction process is lossy, but the errors are generally unobjectionable. Compressed AA combines most of the quality benefits of high MSAA levels with the relaxed space requirements of lower MSAA levels.
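One practical reason the compressed AA scheme matters in this discussion: render-target footprint scales with surface samples, not coverage samples, and that footprint has to fight for the 32 MB of ESRAM. A rough illustration (resolution, format, and the coverage/surface split are my assumptions; the documentation quoted above doesn't list the supported combinations):

```python
# Color render-target footprint scales with the number of *surface* samples.

width, height = 1920, 1080
bytes_per_color = 4   # 32-bit color, for illustration

def color_target_mib(surface_samples: int) -> float:
    return width * height * surface_samples * bytes_per_color / 2**20

print(round(color_target_mib(4), 1))  # standard 4xMSAA:            ~31.6 MiB
print(round(color_target_mib(2), 1))  # e.g. 4x coverage, 2 stored: ~15.8 MiB
print(round(color_target_mib(1), 1))  # no MSAA:                     ~7.9 MiB
```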
This is what Microsoft wrote about them:
... cool stuff ...
That sounds not very special. The bolded part right before the "Fill" section describes exactly the ROP caches (16 kB color and 4 kB Z per RBE). Sebbbi's test with a discrete GCN card demonstrated exactly what they are claiming there, namely that localized access (he used 128x128 pixel tiling of the render target) gains you efficiency. Using that with 64-bit color blending, you could potentially double the apparent bandwidth available to the ROPs over what the eSRAM provides (and with non-optimal access patterns it converges eventually to the eSRAM bandwidth, but the claimed 133 GB/s could be possible under certain circumstances also without tiling; that they say 192 GB/s is the peak and not 204.8 GB/s could be some inherent inefficiency in the ROPs hindering them from reaching their theoretical peak, or some bandwidth needed for the Z buffer they neglected somehow, or something). As I said, I hope it's not just that, as every GPU from the last few years can do that.
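A toy model of that argument, with the read hit rate as a made-up knob (the 102.4 GB/s figure and the read-plus-write-per-blended-pixel assumption come from the thread; nothing here is measured):

```python
# With 64-bit blending every pixel needs a read and a write.  If the read is
# served from the on-die ROP caches (localized/tiled rendering), only the write
# goes out to eSRAM, so the *apparent* bandwidth the ROPs see can exceed the
# 102.4 GB/s the eSRAM provides in one direction.

esram_bw = 102.4                                        # GB/s, one direction
for read_hit_rate in (0.0, 0.5, 0.9, 1.0):
    esram_ops_per_pixel = 1.0 + (1.0 - read_hit_rate)   # 1 write + the reads that miss
    apparent_bw = esram_bw * 2.0 / esram_ops_per_pixel  # read+write both counted as useful traffic
    print(read_hit_rate, round(apparent_bw, 1))
# 0.0 -> 102.4, 0.5 -> 136.5, 0.9 -> 186.2, 1.0 -> 204.8
```

Under this (very crude) model, the quoted 133 GB/s and 192.512 GB/s figures would correspond to read hit rates of roughly 0.46 and 0.94.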
So "diagnostics" or a "test suite" of sorts could now be reporting this 88% increase in throughput, which could be thought of as an increase in apparent bandwidth performance?
If they would try to deliberately misrepresent the capabilities of the hardware, yes. That's why I hope it's not the explanation.
Technically that could be considered an increase in hardware performance.

Technically, it's the performance you expect from 16 ROPs with their specialized caches when backed up by 102.4 GB/s of bandwidth. Nothing increased at all.
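To put numbers on "the performance you expect from 16 ROPs backed by 102.4 GB/s", here is the arithmetic implied by the Fill figures quoted earlier (my own working, not part of the documentation):

```python
# Fill-rate and bandwidth arithmetic from the quoted "Fill" figures.

cb_db_instances = 4
pixels_per_quad = 4
clock_hz = 800e6

pixels_per_clock = cb_db_instances * pixels_per_quad    # 16 pixels/clock
fill_rate = pixels_per_clock * clock_hz                 # 12.8 Gpixel/s, as documented

bytes_per_pixel = 8                                     # the 64-bit color formats the CB is optimized for
write_only_bw = fill_rate * bytes_per_pixel / 1e9       # 102.4 GB/s -- exactly the one-way eSRAM peak
blending_bw = 2 * write_only_bw                         # read + write per pixel: 204.8 GB/s of ROP traffic

print(fill_rate / 1e9, write_only_bw, blending_bw)      # 12.8, 102.4, 204.8
```

So running the ROPs flat out with 64-bit blending generates more traffic than the eSRAM can supply in one direction, which is where the on-die DB/CB caches come in.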
Depth and Stencil
The depth block occurs near the end of the logical rendering pipeline, after the pixel shader. In the GPU implementation, however, the DB and the CB can interact with rendering both before and after pixel shading, and the pipeline supports several types of optimized early decision pathways. Durango implements both hierarchical Z (Hi-Z) and early Z (and the same for stencil). Using careful driver and hardware logic, certain depth and color operations can be moved before the pixel shader, and in some cases, part or all of the cost of shading and rasterization can be avoided.
Depth and stencil are stored and handled separately by the hardware, even though syntactically they are treated as a unit. A read of depth/stencil is really two distinct operations, as is a write to depth/stencil. The driver implements the mixed format DXGI_FORMAT_D24_UNORM_S8_UINT by using two separate allocations: a 32-bit depth surface (with 8 bits of padding per sample) and an 8-bit stencil surface.
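A quick footprint check of the split depth/stencil allocation described in that quoted passage (1080p, no MSAA; both are my illustrative assumptions):

```python
# D24_UNORM_S8_UINT as two separate allocations, per the quoted documentation.

width, height, samples = 1920, 1080, 1

depth_bytes   = width * height * samples * 4   # 32-bit depth surface (24-bit depth + 8 bits padding)
stencil_bytes = width * height * samples * 1   # separate 8-bit stencil surface
packed_bytes  = width * height * samples * 4   # what a literally packed D24S8 layout would take

print(depth_bytes / 2**20, stencil_bytes / 2**20, (depth_bytes + stencil_bytes) / 2**20)
# ~7.9 MiB + ~2.0 MiB = ~9.9 MiB, versus ~7.9 MiB for a packed layout.
```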
People underestimate those under-spoken DB/CB blocks. They are extremely useful, powerful, and far faster than any memory in either the PS4's or the Xbox One's memory system.
Fast render paths and early-out on depth tests at the vertex level? This isn't something the PS4 GPU can do.
A depth test takes up zero bandwidth when done (correctly) on the Xbox One; that's a soak on the PS4's bandwidth.
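For scale, here is roughly how much external traffic a completely naive depth test would generate if nothing stayed on-die; whether either console actually pays this cost is exactly what the features described in the quoted documentation (Hi-Z, early Z, the DB caches) are meant to avoid, and the PS4 comparison is the poster's claim, not something in that documentation:

```python
# Naive estimate: one 32-bit depth read and one 32-bit depth write per pixel,
# 1080p at 60 fps, with an assumed average overdraw of 2 (made-up number).

width, height, fps = 1920, 1080, 60
bytes_per_test = 4 + 4          # read + write of a 32-bit depth value
overdraw = 2.0

traffic_gb_per_s = width * height * fps * bytes_per_test * overdraw / 1e9
print(round(traffic_gb_per_s, 2))   # ~1.99 GB/s under these assumptions
```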