Xbox One (Durango) Technical hardware investigation

That tweet is neither contradicting nor clarifying what was in the article.

Somewhat clarifying: he's reiterating things from the article, as if to say this is the most current info.

In the article he was a bit vague too; here he's flat-out saying the BW is still 102.4 on pure reads or writes, as if it were a fact recently conveyed to him.

But he's not clarifying the rest of the confusing stuff, no.

It's worth noting that astrograd, through his own Microsoft source, tipped me off weeks ago that the ESRAM BW was considerably higher than the vgleaks numbers (at which time I was skeptical). This article doesn't come totally out of the blue to me, so in a sense that's a bit of corroboration of the article. Not that it needed any, I suppose.
 
That tweet is neither contradicting nor clarifying what was in the article.

It at least clarifies that there has been no downclock yet, if the article was not clear enough.

The weird +88% (192 GB/s) theoretical peak is derived from something other than simply combining pure read and write speeds; this seems likely given the 133 GB/s example.
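For what it's worth, the figures floating around the thread can be made to line up with a simple sketch. The "extra transfer on 7 of every 8 cycles" below is purely my assumption chosen to make the arithmetic work; it is not anything confirmed by the article.

```python
# Speculative arithmetic only: the 7-in-8 concurrency factor is an
# assumption used to reconcile the thread's numbers, not a known spec.
PEAK_ONE_WAY = 102.4  # GB/s, pure reads or pure writes (per the article)

# If a second transfer could ride along on 7 of every 8 cycles:
theoretical_peak = PEAK_ONE_WAY * (1 + 7 / 8)
print(theoretical_peak)        # 192.0 GB/s -- the article's headline figure

# ...which is 87.5% above the one-way peak, i.e. the "+88%" people quote:
uplift = theoretical_peak / PEAK_ONE_WAY - 1
print(round(uplift * 100, 1))  # 87.5
```

That would at least explain why the quoted peak is 192 GB/s rather than a clean doubling to 204.8 GB/s.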
 
Indie dev (who doesn't have insider knowledge, just to clear the air) regarding Mr. Leadbetter's recent article on ESRAM bandwidth and how Xbone titles will likely be at a lower resolution compared to other next-gen consoles:

"Uh, no it doesn't; he has no basis for that, for starters.
He's heard that the Xbox One driver software is less mature, and we all know that.
But he likes to downplay the GPU at every turn; he says the ESRAM is capable of this, but probably only really this (133 GB/s).

The thing is, he is likely right, but I can also tell you the same holds true for Sony's theoretical 176 GB/s.
It ain't about to hit that anytime soon, nor does it do it half as efficiently as the Xbox One GPU.

He doesn't even go into the implications of concurrent read/write (which is something I have been banging on about for a wee while now). Two buses are better than one. One alone being able to do it concurrently is even nicer.

People underestimate those under-spoken DB/CB blocks. They are extremely useful, powerful, and far faster than any memory in either the PS4 or Xbox One memory system.


Fast render paths and pre-out on depth tests at a vertex level? This isn't something the PS4 GPU can do.

A depth test takes up zero bandwidth when done (correctly) on the Xbox One; that's a soak on the PS4's bandwidth.

While the GPU on the PS4 is strong in compute, it's likely to be held at arm's length in terms of render speed, even with twice the ROPs."


This isn't meant to be a "versus" discussion on my part, just posting an interesting analysis of the DF article concerning the ESRAM/GPU of the Xbone. Thoughts?
 

So he seems to like the Xbox One's GPU for its ESRAM and DDR3-type memory setup?
 
If the bus was bidirectional and capable of 204.8 GB/s, we'd have heard it. That would have been the BW figure in the tech doc leaks, just as every BW figure is labelled clearly so devs know what they have to work with.

http://www.vgleaks.com/durango-memory-system-overview/

Audio/camera bus = 9 GB/s Read, 9 GB/s Write, clearly labelled
DDR3 = 68 GB/s Read and Write, clearly labelled
ESRAM = 102 GB/s Read and Write, clearly labelled

What you're suggesting is that in reality the ESRAM BW was 102 GB/s Read, 102 GB/s Write, but MS didn't tell anyone this or label their tech documents as such!

The 192 GB/s figure is also not a clear bidirectional BW figure, otherwise there wouldn't be any confusion about what the BW rate is. It wouldn't be reported as "The new bus is 192 GB/s, but for some reason we only get 133 GB/s from it in general use."

I agree that the voodoo of the memory is a mystery, but there's no way the original bus was capable of 204.8 GB/s communication yet devs weren't told. Neither explanation works!

Which is why I am waiting for the inevitable follow up article giving a technical explanation for how this is all happening and describing the limited use case(s) for what seems like esram hackery.
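The arithmetic behind those labelled figures is worth laying out. The 128-byte-per-cycle width and 800 MHz clock below are inferred from the quoted peaks, not confirmed specs:

```python
# Assumed bus parameters: an 800 MHz clock and a 1024-bit (128-byte)
# interface reproduce the leaked 102.4 GB/s figure; treat both as inferred.
CLOCK_HZ = 800e6
BYTES_PER_CYCLE = 128

one_way = CLOCK_HZ * BYTES_PER_CYCLE / 1e9
print(one_way)       # 102.4 GB/s -- the clearly labelled ESRAM figure

# A truly bidirectional bus would simply double that, giving the
# 204.8 GB/s figure that devs were conspicuously never told about:
full_duplex = 2 * one_way
print(full_duplex)   # 204.8 GB/s
```

The fact that the new claim is 192 GB/s rather than this clean 204.8 GB/s doubling is itself a hint that something other than full-duplex operation is going on.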
 
So he seems to like the Xbox One's GPU for its ESRAM and DDR3-type memory setup?

I'm a hardware-savvy guy, but I'm not that hardware savvy. He is clearly speaking highly of the design and the GPU, but I do not have the knowledge to discern the accuracy of his statements one way or another. I felt it was curiously positive compared to much of the analysis of the Xbone GPU I've seen around here on B3D.

Anyway, like I said, I'm not trying to turn this into a "versus!" but this felt like the most appropriate place to post.
 
An interesting post... could someone please explain this?
"People underestimate those under-spoken DB/CB blocks"

What exactly is he referring to, and what are the real-world benefits?
 

It seems that there are restrictions and limited use cases to getting the increased efficiencies from the enhanced Depth Blocks and Color Blocks in the Xbox One GPU, as per the vgleaks documentation:

"For performance reasons, it is important to keep depth and color data compressed as much as possible. Some examples of operations which can destroy compression are:

-Rendering highly tessellated geometry
-Heavy use of alpha-to-mask (sometimes called alpha-to-coverage)
-Writing to depth or stencil from a pixel shader
-Running the pixel shader per-sample (using the SV_SampleIndex semantic)
-Sourcing the depth or color buffer as a texture in-place and then resuming use as a render target"

The need for depth and color data to remain compressed in order to save bandwidth seems paramount in order to utilize them effectively. This is due to the need for shuffling the data in and out of the limited caches found within the DBs and CBs very quickly.

I suggest reading the info at this link: http://www.vgleaks.com/durango-gpu-2/3/
 
DB = depth buffer. CB = colour buffer. Operations on these buffers are performed in cache rather than ESRAM.

I assume these are caches present in all GCN architecture rather than specific to XB1, but you'll have to wait for a more knowledgeable member to clarify.
 
Since the peak BW of 102.4 GB/s hasn't changed, yet somehow has, the two measurements must not be the same.

Seems they are comparing the ESRAM BW to off-chip BW, i.e. it takes 133 GB/s of off-chip [DDR3] BW to match the performance of the on-chip ESRAM for certain tasks. It's really saying the bus is 88% more efficient at certain tasks.


Where is your source?

Is this it?
http://www.psu.com/forums/showthrea...-hugely-underestimated-claim-developers/page3

That's because Leadbetter has no clue; he's completely and utterly clueless. Or he has an agenda. I'm still trying to figure out if he's just thick, or is deliberately downplaying the GPU.

Seems to imply he does have some kind of information that isn't out there.
 
I think this amplification of bandwidth comes from doing framebuffer ops, similar to the way writes to the eDRAM in the 360 could only hit 256 GB/s if you were doing 4xMSAA, meaning that without MSAA the effective bandwidth was 64 GB/s. So in the X1's case, it seems that for typical or general use the effective bandwidth is 102 GB/s, but if you are doing writes with alpha blending you have an effective bandwidth of 133 GB/s, and the 192 GB/s effective bandwidth applies only to a particular mix of read and/or write operations.

Anyway, this is just my opinion based on what is available in the article and the 360 eDRAM's operation.
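The 360 analogy above can be put in numbers. The figures are from the post; the "each pixel expands to N samples inside the fast memory" framing and the `effective_bw` helper are just my way of sketching it:

```python
# Rough sketch of "bandwidth amplification": one pixel sent toward the
# eDRAM expands to N samples written internally, so the effective write
# rate scales with the MSAA sample count. Numbers per the 360 example.
def effective_bw(base_gbps, samples):
    """Effective write bandwidth when each pixel expands to `samples` samples."""
    return base_gbps * samples

EDRAM_BASE = 64.0  # GB/s without MSAA, per the post

print(effective_bw(EDRAM_BASE, 4))  # 256.0 GB/s with 4xMSAA
print(effective_bw(EDRAM_BASE, 1))  # 64.0 GB/s with no MSAA
```

On that reading, the headline number is only reachable for workloads that actually trigger the amplification, which is exactly the post's point about the X1's 192 GB/s.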
This exact functionality has been part of the color and Z caches embedded in the render backends of all AMD GPUs for years. I hope MS didn't just discover that one can do some data reuse within a tile from the very fast memory in there. :rolleyes:
Color and Z backends are the usual names when talking about AMD's ROPs. Each color backend usually contains 16 kB and each Z backend 4 kB of cache. No ROP export writes directly to RAM (so probably also not to the eSRAM in Durango) but to these caches. They have enough bandwidth for all operations the ROPs are capable of; it's never going to be a bottleneck. So blending with a 64-bit color format and 4x MSAA can be done at the speed the ROPs are capable of within the cache (holding multiple tiles [probably 8x8 pixels in size] of the render target), just as with the eDRAM of the XB360. The ROP caches perform the same bandwidth amplification. As long as everything is happening in those caches, bandwidth is not a problem (search for the recent posts of sebbi where he tested exactly that).

In the case of MSAA, these caches also do the compression/decompression of the tiles when loading from or storing to RAM. Only complete tiles are loaded or stored (fewer turnarounds on the memory bus, increasing transfer efficiency). The Z backend also builds the hierarchical Z tree which is used for early Z tests (after the rasterizer, before the pixel shader).

And it is also not forbidden to read the Z buffer during some stage before the rasterizer and drop geometry there, or use some other means. I think it is a recommended technique to do some culling and hidden surface removal as early as possible, especially with tessellation. So I have no clue what that supposed independent dev is talking about when he says this is only possible on the XB1. That's clearly not true (or he is referring to something else).
An interesting post... could someone please explain this?
"People underestimate those under-spoken DB/CB blocks"

What exactly is he referring to, and what are the real-world benefits?
I have no idea what they are supposed to do on top of what they have been doing in all AMD GPUs for years already: increasing transfer efficiency, offering higher local bandwidth to the ROPs, doing the color and Z de-/compression, as well as enabling short-term data reuse in the ROP caches, reducing the needed memory bandwidth slightly (or slightly more, depending on the access pattern, i.e. the rendered geometry and how the fragments arrive at the ROPs).
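As a toy illustration of why such tile caches reduce external bandwidth, here is a schematic model. The 8x8 tile size matches the guess above, but the access pattern and the `external_writes` helper are invented for the example, not any real API:

```python
# Toy model: count external-memory transactions for a stream of pixel
# writes, with and without a tile-granular ROP cache. Purely illustrative.
TILE = 8  # assumed 8x8-pixel tiles, per the guess in the thread

def external_writes(pixels, cached):
    if not cached:
        return len(pixels)  # every ROP export goes straight to memory
    # With a tile cache, all writes landing in one tile coalesce into a
    # single whole-tile store when the tile is flushed.
    tiles = {(x // TILE, y // TILE) for x, y in pixels}
    return len(tiles)

# A 64x64 block of pixels, each touched once:
pixels = [(x, y) for x in range(64) for y in range(64)]
print(external_writes(pixels, cached=False))  # 4096 individual writes
print(external_writes(pixels, cached=True))   # 64 whole-tile stores
```

Overdraw and blending make the gap even larger, since repeated hits to the same tile cost nothing extra externally until the flush.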
 

Seems to imply he does have some kind of information that isn't out there.

I can assure you, he doesn't have any extra information. We speak regularly.

Seems like reddit has also picked up on the alleged downclock math:

http://www.reddit.com/r/Games/comments/1h9ix3/so_did_microsoft_just_spin_a_downclocked_esram/
 
DB = depth buffer. CB = colour buffer. Operations on these buffers are performed in cache rather than ESRAM.

I assume these are caches present in all GCN architecture rather than specific to XB1, but you'll have to wait for a more knowledgeable member to clarify.

Most likely, but the question is whether Microsoft has enhanced them in any way. Either by making them larger or by including more of them or something else.

Regards,
SB
 
Perhaps, but that doesn't quite gel with the supposed indie dev's comment that people underestimate the DB/CB blocks. For all we've been told, they are nothing GCN doesn't already have. So either they are improved without anyone saying as much, in which case people can't be underestimating them because the information we'd base our estimations on is incorrect, or they aren't anything special and the indie dev doesn't know his GPU architectures well enough to appreciate there's nothing special there (as I understand it).

Many indie devs have little hardware knowledge because their domain is software, and few need to poke around with the inner workings of GPUs, so this guy having developer experience doesn't necessarily make him a voice of authority.

That doesn't shed any light on the current mystery of the Magical Multiplying Bandwidth either, which I think is the most pressing concern, unless someone can present evidence that the XB1's GPU has had more customisations than we're aware of.
 
Where do you see a conflict in my math?

Can you point to it?

Cause I don't believe you can.

Your math assumes a factor of 2 that is incorrect. It also ignores a significant number of attempts by dev sources to correct your assumption. If there were a downclock of the GPU, devs would absolutely be notified about it. MS isn't about to tell devs to go ahead and throw more data at the GPU while secretly reducing the chip's ability to chew through that data. The dev source for DF was told by MS how they found these spare holes to do ops in. That suggests MS was offering devs some level of detailed explanation of their new math, which means they would obviously have mentioned any calculation involving the clock speed.

The article lays out language explaining that your factor of 2 is wrong. It doesn't read+write on every single cycle. Once you remove your assumption that it does, the article's explanation and MS's behavior make sense... but your math falls apart completely.
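The disputed arithmetic can be laid out explicitly. Both scenarios below are reconstructions of the argument under assumed bus parameters (128 bytes per cycle, 800 MHz), not confirmed specs:

```python
# Assumed one-way width that matches 102.4 GB/s at 800 MHz.
BYTES_PER_CYCLE = 128

# The "downclock" reading: assume a full factor of 2 (read+write on
# every cycle) and solve for the clock that yields 192 GB/s:
implied_clock_mhz = 192e9 / (2 * BYTES_PER_CYCLE) / 1e6
print(implied_clock_mhz)  # 750.0 MHz -- hence the downclock speculation

# The alternative: keep 800 MHz and assume the second op only fits on
# 7 of every 8 cycles, which also lands exactly on 192 GB/s:
alt_peak = 800e6 * BYTES_PER_CYCLE * (1 + 7 / 8) / 1e9
print(alt_peak)           # 192.0 GB/s with no downclock required
```

The same 192 GB/s figure drops out of both readings, which is why the factor-of-2 assumption alone can't prove a downclock.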
 
Details would be nice, so as to understand what exactly makes this (effective) bandwidth amplification possible. With a 30+ year history of software engineers with low-level access to hardware finding ways to make that hardware work beyond and outside its spec, though, it's hardly unprecedented.
 
Has MS ever confirmed the specific type of ESRAM used? I can't recall, but 6T is what I have been reading here and elsewhere. Could it be they used 8T instead, which allows for two write paths but one for reading? That could allow for the possibilities mentioned in the DF story (write ops could have more bandwidth available than read ops). It could also help to explain the extra transistors from the 5 billion that some keep asking about.

Not an expert, just asking.
 
There's a number of ways within the spec of a memory hierarchy to provide performance optimizations or increase efficiency, and a lot of this is done automatically. A lot of testing in the PC realm runs afoul of optimizations it didn't control for, particularly memory bandwidth and latency benchmarks.

There's a dearth of information on the testing methods used to derive the numbers, and the vague theoretical peak can be arrived at in various ways. The trickle of information that comes out keeps fixating on peripheral numbers and incompletely described special cases instead of giving the stats of the interface itself, such as port count, width, speeds, and the number of reads and writes that can be sent to it.
 
DB = depth buffer. CB = colour buffer. Operations on these buffers are performed in cache rather than ESRAM.

I assume these are caches present in all GCN architecture rather than specific to XB1, but you'll have to wait for a more knowledgeable member to clarify.

They seem to be enhanced in the X1 GPU to work with compressed data and the ESRAM. Apparently there are certain hardwired functions that can effectively truncate, or totally avoid, the path through the pixel pipeline in some instances.
 