Understanding XB1's internal memory bandwidth *spawn

mrcorbo · Sep 13, 2013

pjbliverpool said:
That was fun! So now that I think I understand how this all works, would it be fair to summarise this in laymans terms as follows (and in relation to a single pool of GDDR5 which is ultimately what we're all trying to understand):

The XB1 has 272 GB/s of theoretically useable bandwidth

Compared with a single pool of GDDR5 this has 2 disadvantages in real world utilisation:

It's made up of 2 seperate memory pools and so any direct copy of data between those pools without any sort of transformation to that data by the GPU will waste bandwidth performing an operation that wouldn't have been required in a single memory pool system.

Even with no data copy between memory pools the maximum useful bandwidth of the esram can only be achieved if you perfectly balance read and write operations at all time. Whereas with GDDR5 since the full bandwidth is available at all times in any ratio between read and write, that's not a concern.

So the bottom line is that the GPU in the XB1 will usually only be able to consume a smaller percentage of its total available bandwidth than the GPU of a different system that uses a single unified pool of GDDR5 (or DDR for that matter).

That that would depend on the practical limitations imposed by the specific set of conditions required to unlock that full bandwidth, how much the development tools are able to automatically optimize the bandwidth usage and how much effort developers are willing to put in to their own optimizations towards that end.

Shifty Geezer · Sep 13, 2013

mrcorbo said:
That that would depend on how much the development tools are able to automatically optimize the bandwidth usage and how much effort developers are willing to put in to their own optimizations towards that end.

No-one can answer the questions of how much BW games will actually use on XB1 by looking at the specs given us. Most importantly, and the purpose of this thread, we don't know the conditions affecting the flexible BW to/from eSRAM, and what amount of BW devs actually get, whether they are often getting 206 GB/s, or 103 GB/s, or 133 GB/s, or something in between. Given the reported average of 133 GB/s though, XB1 should be getting far less proportion of its peak 272 GB/s but only because of the oddities of the eSRAM bus and not because of the use of an eSRAM store.

mrcorbo · Sep 13, 2013

Shifty Geezer said:
No-one can answer the questions of how much BW games will actually use on XB1 by looking at the specs given us. Most importantly, and the purpose of this thread, we don't know the conditions affecting the flexible BW to/from eSRAM, and what amount of BW devs actually get, whether they are often getting 206 GB/s, or 103 GB/s, or 133 GB/s, or something in between. Given the reported average of 133 GB/s though, XB1 should be getting far less proportion of its peak 272 GB/s but only because of the oddities of the eSRAM bus and not because of the use of an eSRAM store.

I get your point and changed the statement I made to include that factor. You're mixing two different cases though when you list 133 GB/s, which is eSRAM only, and the 272 GB/s peak which includes main RAM bandwidth as well. It'd be more accurate to say 201 GB/s vs. 272 GB/s, no?

Cyan · Sep 13, 2013

3dilettante said:
If you're copying anything larger than 64 bytes, this isn't what copying would do.
Data would be read into the CUs in a minimum of 64 byte chunks, which is the length of a cache line.
That's multiple RAM and eSRAM transfers per read or write issued.
The CU can in a single instruction start the read process for multiple addresses.
The same goes for writes.
Once a read instruction is complete, the CU can either line up more reads or fire up a write.
It can issue the next read the next issue cycle after a write, even though that write's data hasn't gone all the way through the long pipeline and all the buffers on the way to the eSRAM.

The next bit of data is being read in, even as the first chunk is being written out.
After an initial filling of the pipeline, chunk N is being written as chunk N+1 is coming in.
The offset is likely higher, since the pipelines in question are likely significantly longer than the simplified example.

Code:

R0 R1 R2 R3 ... W0 W1 W2 ...

I must admit I am having a hard time understanding some concepts but it is always great to read posts like yours.

pjbliverpool said:
That was fun! So now that I think I understand how this all works, would it be fair to summarise this in laymans terms as follows (and in relation to a single pool of GDDR5 which is ultimately what we're all trying to understand):

The XB1 has 272 GB/s of theoretically useable bandwidth

Compared with a single pool of GDDR5 this has 2 disadvantages in real world utilisation:

It's made up of 2 seperate memory pools and so any direct copy of data between those pools without any sort of transformation to that data by the GPU will waste bandwidth performing an operation that wouldn't have been required in a single memory pool system.

Even with no data copy between memory pools the maximum useful bandwidth of the esram can only be achieved if you perfectly balance read and write operations at all time. Whereas with GDDR5 since the full bandwidth is available at all times in any ratio between read and write, that's not a concern.

So the bottom line is that the GPU in the XB1 will usually only be able to consume a smaller percentage of its total available bandwidth than the GPU of a different system that uses a single unified pool of GDDR5 (or DDR for that matter).

Isn't that what the DMEs are for then? It may not always work 'cos the bandwidth of the DMEs is smaller than the main memory and the eSRAM's bandwidth, but it can never hurt.

Shifty... I am sure you mentioned a couple of weeks ago that the AF was kind of granted with Tiled Resources --my mistake confusing your words with it being free. I could find your post but I feel a bit impatient at the moment 'cos I still have to have lunch and leave.

bkillian, we know that you could only write data to the EDRAM on the X360 and that you can read and write to the eSRAM on the Xbox One.

I do occasionally wonder... though, (because I agree with 3dilettante and Gipsel on the matter that it is unlikely they found out you could perform certain operations overnight), did you know that the eSRAM could write and read simultaneously before Microsoft officially announced it?

Betanumerical · Sep 13, 2013

Cyan said:
I must admit I am having a hard time understanding some concepts but it is always great to read posts like yours.

Isn't that what the DMEs are for then? It may not always work 'cos the bandwidth of the DMEs is smaller than the main memory and the eSRAM's bandwidth, but it can never hurt.

The DME's use the bandwidth of the DDR3/eSRAM to copy.

pjbliverpool · Sep 13, 2013

Cyan said:
Isn't that what the DMEs are for then? It may not always work 'cos the bandwidth of the DMEs is smaller than the main memory and the eSRAM's bandwidth, but it can never hurt.

As Betanumerical said, regardless of how you move the data you would have to use bandwidth of the source and target memory pools to read and write that data. This is way outside my knowledge now but I'm assuming the DME's bypass the GPU so that no GPU cycles have to be taken up to manage that data transfer.

I assume DME's do a lot more than simply manage data transfers between main memory and system RAM though since PC GPU's have something similar and if that was their only (or even primary) purpose in XB1 then that would imply there is an expectation of fairly significant data copy requirements between the 2 pools without any GPU intervention - which would be bad for bandwidth.

Shifty Geezer · Sep 13, 2013

mrcorbo said:
I get your point and changed the statement I made to include that factor. You're mixing two different cases though when you list 133 GB/s, which is eSRAM only, and the 272 GB/s peak which includes main RAM bandwidth as well. It'd be more accurate to say 201 GB/s vs. 272 GB/s, no?

Your right, but I was talking specifically about the BW to/from eSRAM.

Add DDR3 BW to those options to get total system BW devs may have access to, as you list. 201 GB/s < BW < 272 GB/s

bkilian · Sep 13, 2013

Cyan said:
bkillian, we know that you could only write data to the EDRAM on the X360 and that you can read and write to the eSRAM on the Xbox One.

I do occasionally wonder... though, (because I agree with 3dilettante and Gipsel on the matter that it is unlikely they found out you could perform certain operations overnight), did you know that the eSRAM could write and read simultaneously before Microsoft officially announced it?

No, I didn't. The documentation I had access to did not mention it. I suspect it was found by a hardware tester who kept getting crazy numbers on a bandwidth test.

Gipsel · Sep 13, 2013

bkilian said:
No, I didn't. The documentation I had access to did not mention it. I suspect it was found by a hardware tester who kept getting crazy numbers on a bandwidth test.

So you suspect a miscommunication between MS and AMD in the specification stage resulting in a sloppy specification of the bandwidth?

3dilettante · Sep 13, 2013

I feel that's also something they'd have a good idea about when they decided on the type of interface and the basis for the control logic.

A scenario I can think of is that they didn't promise a theoretical peak above their guaranteed minimum to peripheral teams and documentation because the memory subsystem is one of the last things to be finalized, going by other engineering samples.
The designers and engineers knew there would be some additional bandwidth possible, but because of the number of parameters they needed to tweak or bugs in early steppings, they knew the number could move around pretty easily.
Some of the final analysis would be dependent on what they learned from the production runs and reliability tests, indicating what timings would be acceptable for error rates, long-term reliability, and yield impact.

At some point, they get their numbers or a bug-fixed stepping, then someone benchmarking the hardware further down the food chain sees the end result.

bkilian · Sep 13, 2013

Gipsel said:
So you suspect a miscommunication between MS and AMD in the specification stage resulting in a sloppy specification of the bandwidth?

Or they wanted to do it but weren't sure it could be affordable and so the documentation was written without it. Or they asked for part X, and the chip designer found that they could include part Y for minimal extra cost, or the person writing the doc misunderstood the design, and it wasn't caught until recently... I don't know. I doubt we'll ever know. The documentation I read was written more than a year before we got the first engineering samples, so I'm not at all surprised things changed.

taisui · Sep 13, 2013

pjbliverpool said:
It's made up of 2 seperate memory pools and so any direct copy of data between those pools without any sort of transformation to that data by the GPU will waste bandwidth performing an operation that wouldn't have been required in a single memory pool system.

No true. If you are accessing the data then you are using BW to do it, even if within the same pool of memory, meaning that even on the PS4's single pool, in your own words, you still "waste" bandwidth when you memcpy data, it's not free.

pjbliverpool said:
Even with no data copy between memory pools the maximum useful bandwidth of the esram can only be achieved if you perfectly balance read and write operations at all time. Whereas with GDDR5 since the full bandwidth is available at all times in any ratio between read and write, that's not a concern.

So to follow your reasoning, then the X1 bandwidth has a "guaranteed" 68GBps on the DRAM + 109GBps on the ESRAM, with an additional 109GBps on the ESRAM due to simultaneous r/w when used perfectly.

pjbliverpool said:
So the bottom line is that the GPU in the XB1 will usually only be able to consume a smaller percentage of its total available bandwidth than the GPU of a different system that uses a single unified pool of GDDR5 (or DDR for that matter).

False assumption can't derive meaningful conclusion.
Same kind of data operation will cost same amount of bandwidth on any memory system, single pool or not.

I assume DME's do a lot more than simply manage data transfers between main memory and system RAM though since PC GPU's have something similar and if that was their only (or even primary) purpose in XB1 then that would imply there is an expectation of fairly significant data copy requirements between the 2 pools without any GPU intervention - which would be bad for bandwidth.

This is a fallacy. Just because DME facilitates memcpy which frees up GPU cycles, does not imply heavy memcpy between the pools, nor does it imply that that memcpy is somehow "required" in any sense particularly on the X1, in addition to what the algorithm already needs (i.e. PRT page pool).

pjbliverpool · Sep 13, 2013

Thanks for you responses taisui, appreciated. Would just like to clarify a couple of points though.

taisui said:
No true. If you are accessing the data then you are using BW to do it, even if within the same pool of memory, meaning that even on the PS4's single pool, in your own words, you still "waste" bandwidth when you memcpy data, it's not free.

I understand this, however what I'm saying is that if for some reason you need to copy data from one pool to another without making any changes to it (no idea why or if you would ever need to do this) then that would cost bandwidth on the XB1 whereas on PS4 the move wouldn't be required in the first place (because there's no other memory pool to move it to) and thus that bandwidth is saved.

So to follow your reasoning, then the X1 bandwidth has a "guaranteed" 68GBps on the DRAM + 109GBps on the ESRAM, with an additional 109GBps on the ESRAM due to simultaneous r/w when used perfectly.

Yes that's how I'm understanding it. And obviously none perfect balances of read/write to the esram will achieve somewhere between 109-218 GB/s

False assumption can't derive meaningful conclusion.
Same kind of data operation will cost same amount of bandwidth on any memory system, single pool or not.

I agree that the same op will cost the same bandwidth on both setups but unless I'm misunderstanding the situation (which is entirely possible of course) then with XB1 you have to balance read/write traffic to esram perfectly to achieve peak theoretical bandwidth while with a single GDDR5 pool it doesn't matter what the ratio of read to write is because GDDR5 can accommodate either at maximum rate (obviously not at the same time).

In addition if there are any requirements to move data between the two pools without making any changes to it then this is again wasted bandwidth in comparison to a system were no move was required.

So it seems to me that you *can* achieve the same percentage utilization of the available bandwidth in the XB1 but it would rely on avoiding 100% of data copy's between memory pools that don't require some kind of transformation in the GPU as well as making sure you are balancing read and write flows perfectly to the esram. And on the assumption that developers won't be able to achieve that 100% of the time I'm assuming a lower percentage utilization rate.

This is a fallacy. Just because DME facilitates memcpy which frees up GPU cycles, does not imply heavy memcpy between the pools, nor does it imply that that memcpy is somehow "required" in any sense particularly on the X1, in addition to what the algorithm already needs (i.e. PRT page pool).

Fair enough, I did pretty much say as much in my post though. i.e. my assumption is that the DME's are there for more than just simple copying of unchanged data from 1 pool to another otherwise that would imply such copy's where common (otherwise why have a dedicated unit for them). So we are in agreement here.

taisui · Sep 13, 2013

pjbliverpool said:
I understand this, however what I'm saying is that if for some reason you need to copy data from one pool to another without making any changes to it (no idea why or if you would ever need to do this) then that would cost bandwidth on the XB1 whereas on PS4 the move wouldn't be required in the first place (because there's no other memory pool to move it to) and thus that bandwidth is saved.

You are implying that the same algorithm requires a memcpy on the X1 but somehow magically it's not needed on the PS4, and that instead of just doing it purely within the DRAM (just like the PS4 in your example), the developer would actually choose to make a memcpy to ESRAM, which waste bandwidth and actually make things slower?

I'm pretty sure it's a argumentative fallacy.

pjbliverpool said:
So it seems to me that you *can* achieve the same percentage utilization of the available bandwidth in the XB1 but it would rely on avoiding 100% of data copy's between memory pools that don't require some kind of transformation in the GPU as well as making sure you are balancing read and write flows perfectly to the esram. And on the assumption that developers won't be able to achieve that 100% of the time I'm assuming a lower percentage utilization rate.

Let me get this right, so what you are saying is that, if the ESRAM does not do simultaneous read/write, it actually makes the X1 magically "better" because the utilization "can" be now 100% more easily?

Here's hint, lower percentage utilization rate means nothing, it does not imply one system is better/worse than the other, it just means the utilization is low.

SenjutsuSage · Sep 13, 2013

Something I've wondered from time to time is if perhaps the discussion on this is somehow leaving out a vital detail: The size of the ESRAM.

With the ESRAM only being 32MB, even in cases where data is being copied to or from it, if you're using one of the Move Engines, then 25.6GB/s of bandwidth should be able to rather easily chew through the relatively small amount of data that is likely being moved in a pretty short amount of time. You have to wonder then is it possible that with carefully managed use of the ESRAM, the copies to and from ESRAM may simply be much too small to tie up or consume the system's bandwidth for long enough for it to actually be detrimental to overall performance?

Things might be different if we were perhaps dealing with a much larger block of ESRAM. Then again, with a larger block, I guess there would be less need for as many copies.

There's 4 8MB blocks of ESRAM, and there's also 4 Move Engines. Could it be possible that even though the developer sees those 4 blocks of ESRAM as a single 32MB block, that the move engines are actually able to view each block independently, and address the 32MB of ESRAM not necessarily as a whole, but in smaller 8MB pieces? Would this help with the system's bandwidth consumption?

Gipsel · Sep 13, 2013

SenjutsuSage said:
There's 4 8MB blocks of ESRAM, and there's also 4 Move Engines. Could it be possible that even though the developer sees those 4 blocks of ESRAM as a single 32MB block, that the move engines are actually able to view each block independently, and address the 32MB of ESRAM not necessarily as a whole, but in smaller 8MB pieces? Would this help with the system's bandwidth consumption?

Very likely not. The address space ist most likely interleaved in some way between these 4 blocks so the load is distributed roughly equally (if it doesn't encounter an especially bad stride).
And as the DMEs are not created equal, each of them has access to the full eSRAM. That there are 4 DMEs is just coincidence.

DrJay24 · Sep 13, 2013

SenjutsuSage said:
Something I've wondered from time to time is if perhaps the discussion on this is somehow leaving out a vital detail: The size of the ESRAM.

With the ESRAM only being 32MB, even in cases where data is being copied to or from it, if you're using one of the Move Engines, then 25.6GB/s of bandwidth should be able to rather easily chew through the relatively small amount of data that is likely being moved in a pretty short amount of time. You have to wonder then is it possible that with carefully managed use of the ESRAM, the copies to and from ESRAM may simply be much too small to tie up or consume the system's bandwidth for long enough for it to actually be detrimental to overall performance?

Things might be different if we were perhaps dealing with a much larger block of ESRAM. Then again, with a larger block, I guess there would be less need for as many copies.

There's 4 8MB blocks of ESRAM, and there's also 4 Move Engines. Could it be possible that even though the developer sees those 4 blocks of ESRAM as a single 32MB block, that the move engines are actually able to view each block independently, and address the 32MB of ESRAM not necessarily as a whole, but in smaller 8MB pieces? Would this help with the system's bandwidth consumption?

If the ESRAM is not constantly streaming data in and out then it is not being effectively used. If it is not being used, then the system bandwidth is going to approach 68GB/s. To alleviate the bandwidth of the DDR3, the two pools have to be used in parallel as much as possible.

The Move engines all share the same bandwidth (~25GB/s over all four), so why would you split up the job of moving data?

Scott_Arm · Sep 13, 2013

DrJay24 said:
If the ESRAM is not constantly streaming data in and out then it is not being effectively used. If it is not being used, then the system bandwidth is going to approach 68GB/s. To alleviate the bandwidth of the DDR3, the two pools have to be used in parallel as much as possible.

The Move engines all share the same bandwidth (~25GB/s over all four), so why would you split up the job of moving data?

When you say "constantly streaming data" do you mean copying from DDR3 to ESRAM and from ESRAM to DDR3? What exactly do you mean when you say "constantly"?

pjbliverpool · Sep 13, 2013

taisui said:
You are implying that the same algorithm requires a memcpy on the X1 but somehow magically it's not needed on the PS4, and that instead of just doing it purely within the DRAM (just like the PS4 in your example), the developer would actually choose to make a memcpy to ESRAM, which waste bandwidth and actually make things slower?

I'm pretty sure it's a argumentative fallacy.

The PS4 GDDR5 has 176GB/s so operations that are possible within that memory pool may not be possible within the DDR3 of XB1. So if there is a scenario where you perform some work in the esram due to bandwidth requirements and then need to transfer it to dram without making any changes to it (as I've said already, I don't know if such a scenario exists) then of course that would be a memcpy that would not have been required on the system with a single unified pool of memory.

Let me get this right, so what you are saying is that, if the ESRAM does not do simultaneous read/write, it actually makes the X1 magically "better" because the utilization "can" be now 100% more easily?

Here's hint, lower percentage utilization rate means nothing, it does not imply one system is better/worse than the other, it just means the utilization is low.

You've obviously misunderstood what I'm saying. I'm not saying that simultaneous read/write is a disadvantage. Clearly it's an advantage. However when looking at the peak theoretical bandwidths of both systems it's also important to understand what percentage of that bandwidth can be achieved in the real world. And since the XB1 must rely on a perfect balance of read/write ops to the esram to achieve it's peak bandwidth while the PS4 does not have the same limitation, the PS4 will likely be able to achieve a higher percentage utilization of it's theoretical bandwidth than the XB1.

Percentage utilization alone is obviously not the whole story, no more than peak theoretical bandwidth is. It's the multiple of the two that's we should be focused on. That's why I'm drawing attention to the utilization rate. We already know the peak numbers.

taisui · Sep 13, 2013

Scott_Arm said:
When you say "constantly streaming data" do you mean copying from DDR3 to ESRAM and from ESRAM to DDR3? What exactly do you mean when you say "constantly"?

He means one should try to maximize the low latency and the high bandwidth on the ESRAM and not waste it.

A typical data flow out be:
DRAM(vertices/index/texture)->GPU->ESRAM(intermediary buffer)
DRAM/ESRAM->GPU->ESRAM (multiple pass)
ESRAM->GPU/Move->DRAM(front buffer)

If you are thinking that, in the sense that you are "copying" the data, you are not using it right, like:
DRAM->Move->ESRAM
ESRAM->GPU->ESRAM
ESRAM->MOVE->DRAM

However if you can stage the data that will be used by the GPU to the ESRAM preemptively, then the data copy actually would work in favor of overall performance.

Understanding XB1's internal memory bandwidth *spawn

mrcorbo

Foo Fighter

Shifty Geezer

uber-Troll!

mrcorbo

Foo Fighter

Cyan

orange

Betanumerical

pjbliverpool

B3D Scallywag

Shifty Geezer

uber-Troll!

bkilian

Gipsel

3dilettante

bkilian

taisui

pjbliverpool

B3D Scallywag

taisui

SenjutsuSage

Gipsel

DrJay24

Scott_Arm

pjbliverpool

B3D Scallywag

taisui

Similar threads