Understanding XB1's internal memory bandwidth *spawn

Hey, taisui, 3dilettante (or basically anybody with greater knowledge of memory), could you please help me out?

What if the volatile bits that were talked about for the PS4 were applicable to Durango and its eSRAM?

Would it be okay under that circumstance to say that for alpha blending FP16 x4, every 16 bits of data had a volatile bit associated with it? With a 256-bit x 4 interface, you wouldn't get 256 bits per pool of eSRAM but 240 bits. 240 x 4 would give 960 bits' worth of data accessed per cycle, or 120 bytes x 2 at a DDR, which at 800MHz would work out to 192GB/s. At 853MHz that works out to 204.72GB/s. Or am I wrong in my math or reasoning?
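To make the arithmetic concrete, here's a minimal sketch of the numbers being proposed (the volatile-bit premise itself is disputed in the replies below; this only checks that the figures follow from it):

```python
# Sketch of the arithmetic proposed above: usable bits per pool per cycle,
# times pool count, double-pumped, times clock. Premise disputed below.

def esram_bandwidth_gbs(bits_per_pool, pools, freq_mhz, ddr_factor=2):
    """GB/s from usable bits per pool per cycle, pool count and clock."""
    bytes_per_cycle = bits_per_pool * pools / 8  # 240 x 4 = 960 bits = 120 B
    return bytes_per_cycle * ddr_factor * freq_mhz * 1e6 / 1e9

print(esram_bandwidth_gbs(240, 4, 800))  # 192.0 GB/s
print(esram_bandwidth_gbs(240, 4, 853))  # 204.72 GB/s
```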


If SIMD takes about 4 cycles to complete an operation, would alpha blending work out like this?

60 pixels (960 bits) read in a cycle, with 4 cycles needed to complete the operation on the GPU and an additional cycle for the write to eSRAM. Throughput would be 660 pixels per 16 cycles; extrapolated out to 1.6GHz, would that work out to 132GB/s? That's missing a GB every second, though.
 
The volatile bit is a cache line status flag indicating the value was loaded from coherent memory.
There's a bunch of strikes against using it in this regard.

It wouldn't be included in the data bandwidth of the storage pool, it wouldn't be every 16 bits, and the usage described for it so far wouldn't apply for blending because the ROPs don't deal with coherent memory.
It's also described as a Sony customization, so there's that as well.

edit:

As SRAM is multiple times more costly in die real estate than eDRAM, wouldn't that suggest some really big benefit to going with the former (SRAM)?
There's a benefit for having a large on-die memory pool.
There's a multitude of reasons why SRAM or eDRAM could be chosen or not, and not all of them are related to the code being run on them.
At that size, it could be argued that eDRAM would be superior in many ways, but there would have been concerns from the perspective of manufacturing the device and its ability to be updated to future nodes.

Would the use of eSRAM as opposed to eDRAM suggest a different purpose for the 4x8MB cache, unlike eDRAM, which was used for frame buffering, post-processing and ROP performance improvements (I might be wrong about the ROP improvements)?
The makeup of the storage cell is not really relevant to the bits stored in it.
 
The volatile bit is a cache line status flag indicating the value was loaded from coherent memory.
There's a bunch of strikes against using it in this regard.

It wouldn't be included in the data bandwidth of the storage pool, it wouldn't be every 16 bits, and the usage described for it so far wouldn't apply for blending because the ROPs don't deal with coherent memory.
It's also described as a Sony customization, so there's that as well.

AMD patent.
Abstracting scratch pad memories as distributed arrays
http://www.google.com/patents/US20130212350

Many programming models (for example, in graphics processing units (GPUs), heterogeneous computing systems, or embedded architectures) have to control access to multiple levels of a memory hierarchy. In a memory hierarchy, certain memories are close to where the operations happen (e.g., an arithmetic logic unit), while other memories are located farther away. These different memories have different properties, including latency and coherency. With latency, the farther away the memory is located from where the operation happens, the longer the latency. With coherency, when a memory is located closer to the chip, it may not be able to see some reads and writes in other parts of the chip. This has led to complicated programming situations dealing with addresses in multiple memory spaces.

A distributed array provides an abstraction through a uniform interface in terms of reads and writes, and can guarantee coherency. Accesses to memory are partitioned, such that how the user programs the memory access is how the memory access is compiled down to the machine. The properties that the programmer provides to the memory determine which physical memory it gets mapped to. The programmer does not have to specify (as under the OpenCL model) whether the memory is global, local, or private. The implementation of the distributed array maps to these different memory types because it is optimized to the hardware that is present and to where a work item is dispatched.
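To illustrate the idea in the abstract (this is a hypothetical toy, not the patent's actual implementation): the programmer declares properties, and the backing store is chosen from them rather than from an explicit address space.

```python
# Toy illustration of the distributed-array idea from the patent text
# above: one uniform read/write interface, with the backing store chosen
# from declared properties. Names and mapping rule are made up here.

class DistributedArray:
    def __init__(self, size, latency="low", coherent=False):
        # In real hardware these properties would steer allocation to a
        # scratchpad vs. cached global memory; here we just record a label.
        self.backing = ("scratchpad" if latency == "low" and not coherent
                        else "global")
        self.data = [0] * size

    def read(self, i):  # uniform interface regardless of backing store
        return self.data[i]

    def write(self, i, value):
        self.data[i] = value

a = DistributedArray(256, latency="low", coherent=False)
a.write(0, 42)
print(a.backing, a.read(0))  # scratchpad 42
```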

I was thinking that while eSRAM may not be CPU-coherent, can't it still be I/O coherent? Or basically coherent with portions of the GPU caches?

I was thinking one of the functions of the volatile bit was selective flushing.

Probably wrong in my thinking, so thanks.
 
Hey, taisui, 3dilettante (or basically anybody with greater knowledge of memory), could you please help me out?

What if the volatile bits that were talked about for the PS4 were applicable to Durango and its eSRAM?

Would it be okay under that circumstance to say that for alpha blending FP16 x4, every 16 bits of data had a volatile bit associated with it? With a 256-bit x 4 interface, you wouldn't get 256 bits per pool of eSRAM but 240 bits. 240 x 4 would give 960 bits' worth of data accessed per cycle, or 120 bytes x 2 at a DDR, which at 800MHz would work out to 192GB/s. At 853MHz that works out to 204.72GB/s. Or am I wrong in my math or reasoning?

I don't feel it's likely that theoretical bandwidth formulas are more detailed than ports x interface width x frequency, since it's just theoretical, and it has always been computed this way for the longest time.

Since Penello had already clarified the 204 vs 218 discrepancy, it would seem that it's just as simple as a typo.
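For reference, a minimal sketch of that simple accounting, using the bus widths and clocks from the DF interview quoted later in the thread (the DDR3-2133 transfer rate is an assumption consistent with the 68GB/s figure):

```python
# Plain "bits on the interface times the speed" accounting, the simple
# formula referenced above.

def peak_gbs(bus_bits, mtransfers_per_s):
    return bus_bits / 8 * mtransfers_per_s * 1e6 / 1e9

print(peak_gbs(256, 2133))         # DDR3: ~68.3 GB/s
print(peak_gbs(1024 + 1024, 853))  # ESRAM read + write paths: ~218.4 GB/s
```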
 
I was thinking that while eSRAM may not be CPU-coherent, can't it still be I/O coherent? Or basically coherent with portions of the GPU caches?
There seems to be some kind of link hinted at in the Hot Chips slide, but it doesn't describe what that is.

I was thinking one of the functions of the volatile bit was selective flushing.
That bit is set for cache lines loaded from coherent memory, which the ROPs do not deal with. The primary purpose is for selective flushing, but the first official disclosure of it was Sony saying it was a feature they asked for.
 
That was one of my criticisms of the DF article that leaked it. It's a real-world example, but no context was given as to where it stood in the continuum of workloads that one would probably find running. Is it an example of a good utilization case, a mediocre one, or a bad one? The secret sources, via a non-technical writer, did not say.
What is the optimum mix?
What exactly were they measuring?
Are there cache and buffering effects that needed to be corrected for?

All we get is a number and a host of ambiguities and questions.

To be fair, Richard put out all the info he had from that MS Developer Central post on the eSRAM upgrade, so it's not like he could have given more clarity there.
 
To be fair, Richard put out all the info he had from that MS Developer Central post on the eSRAM upgrade, so it's not like he could have given more clarity there.

My criticism was of the probative value of the information he was given, which is where some of the more technical skepticism from myself and a few other posters came in.
I know internal sources like to give tidbits that say something without saying everything, which leaves room for interpretation on the part of the writer and reader.
If it's an overly positive interpretation that doesn't bear out after people start paying money, they can say "aw shucks" all the way to the bank.

More details have slowly leaked out over time, but that little bit of contextless information was problematic because so many scenarios could produce it: improper testing, cache effects, various design choices, or memory optimizations that are common enough that they're expected as a matter of routine.

My first reaction was that the little bit of information was leaked ambiguously on purpose.
It's not like the console makers are above trumpeting every standard CPU and GPU feature as something new and mind-blowing.
 
My criticism was of the probative value of the information he was given, which is where some of the more technical skepticism from myself and a few other posters came in.
I know internal sources like to give tidbits that say something without saying everything, which leaves room for interpretation on the part of the writer and reader.
If it's an overly positive interpretation that doesn't bear out after people start paying money, they can say "aw shucks" all the way to the bank.

My first reaction was that the little bit of information was leaked ambiguously on purpose.
It's not like the console makers are above trumpeting every standard CPU and GPU feature as something new and mind-blowing.

Well, the thing is that Richard was sent a direct cut and paste of a post MS made on their internal MS Dev Central developer resource, so he was just telling us what MS was announcing to devs in private.

So it wasn't like some internal source was leaking him (and him alone) tidbits that made the XB1 seem better than it is.
 
To be fair, Richard put out all the info he had from that MS Developer Central post on the eSRAM upgrade, so it's not like he could have given more clarity there.

Richard? You mean Albert Penello?

BTW, long time no see. ;)

EDIT: Never mind, brain fart. Richard Leadbetter of DF; move along, it's too early this morning. LOL

Tommy McClain
 
Well, they also ended up with, according to the DF article, a 45% bandwidth advantage over the alternative memory configuration (200GB/s / (172GB/s x 0.8)) in typical use cases. It may have been a case of having their cake and eating it too.

We have no idea how often the 150GB/s figure for the eSRAM happens; I suspect it is not as often as people think.
 
Because the premise is that 80% of peak bandwidth is what is attainable in the real world as the average in most cases (as MS asserts, which makes sense until proven otherwise). The numbers they gave were on such measures between the eSRAM and DDR3 (80% of bandwidth), which they say have been proven with actual real code, not tests.

So Cranky was using that to make comparative points. If you want to leverage the full GDDR5 bandwidth, then make sure to apply the full eSRAM and DDR3 bandwidth as well. Neither party gets an exclusion that the other doesn't.

And please do not bring up the indie dev that said they got the high end of bandwidth on the other console, as no information has been provided as to what exact code was used to achieve that. I am sure any dev can write code to maximize bandwidth, but it doesn't represent a real-world application.

The problem is that the 150GB/s is not achievable all the time. It's a 'sometimes' case just like any other peak bandwidth, but it has further caveats.
 
The problem is that the 150GB/s is not achievable all the time. It's a 'sometimes' case just like any other peak bandwidth, but it has further caveats.

No, the 204GB/s is not achievable all the time. The 150GB/s is the regularly achievable amount. This is just the eSRAM as well.
 
No, the 204GB/s is not achievable all the time. The 150GB/s is the regularly achievable amount. This is just the eSRAM as well.

But didn't they say they measured the 150GB/s in only one case, not all the time? If that's the case, then it's not the same kind of number.
 
Because the premise is that 80% of peak bandwidth is what is attainable in the real world as the average in most cases (as MS asserts, which makes sense until proven otherwise). The numbers they gave were on such measures between the eSRAM and DDR3 (80% of bandwidth), which they say have been proven with actual real code, not tests.

So Cranky was using that to make comparative points. If you want to leverage the full GDDR5 bandwidth, then make sure to apply the full eSRAM and DDR3 bandwidth as well. Neither party gets an exclusion that the other doesn't.

And please do not bring up the indie dev that said they got the high end of bandwidth on the other console, as no information has been provided as to what exact code was used to achieve that. I am sure any dev can write code to maximize bandwidth, but it doesn't represent a real-world application.


That is correct. I was actually being generous there and accounting for less efficiency for the eSRAM, as 200/272 actually equals 73%, which I used for the numerator, rather than the 80% I used for the alternative. Had I used equal coefficients, then the advantage would have been 59%.
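A quick check of that arithmetic, using the 172GB/s figure from the earlier post (note the equal-coefficient case lands at roughly 58%, so the 59% presumably reflects rounding):

```python
# Checking the comparison above: XB1 combined peak is 68 (DDR3) + 204
# (ESRAM) = 272GB/s, with ~200GB/s measured; the alternative config is
# taken at 172GB/s x 0.8, as in the earlier post.

xb1_measured = 200.0
xb1_peak = 68.0 + 204.0          # 272 GB/s
alt_peak = 172.0

print(xb1_measured / xb1_peak)              # ~0.73, the coefficient used
print(xb1_measured / (alt_peak * 0.8) - 1)  # ~0.45, the 45% advantage
print(xb1_peak / alt_peak - 1)              # ~0.58 with equal coefficients
```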
 
But didn't they say they measured the 150GB/s in only one case, not all the time? If that's the case, then it's not the same kind of number.
Not really:
Nick Baker: Over that interface, each lane to ESRAM is 256-bit, making up a total of 1024 bits, and that's in each direction. 1024 bits for write will give you a max of 109GB/s, and then there are separate read paths that again, running at peak, would give you 109GB/s. What is the equivalent bandwidth of the ESRAM if you were doing the same kind of accounting that you do for external memory... With DDR3 you pretty much take the number of bits on the interface, multiply by the speed, and that's how you get 68GB/s. The equivalent on ESRAM would be 218GB/s. However, just like main memory, it's rare to be able to achieve that over long periods of time, so typically an external memory interface you run at 70-80 per cent efficiency.
The same discussion applies to ESRAM as well - the 204GB/s number that was presented at Hot Chips takes known limitations of the logic around the ESRAM into account. You can't sustain writes for absolutely every single cycle. The writes are known to insert a bubble [a dead cycle] occasionally... One out of every eight cycles is a bubble, so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM. And then if you say what can you achieve out of an application - we've measured about 140-150GB/s for ESRAM. That's real code running. That's not some diagnostic or some simulation case or something like that. That is real code that is running at that bandwidth. You can add that to the external memory and say that that probably achieves in similar conditions 50-55GB/s, and add those two together and you're getting in the order of 200GB/s across the main memory and internally.
One thing I should point out is that there are four 8MB lanes. But it's not a contiguous 8MB chunk of memory within each of those lanes. Each lane's 8MB is broken down into eight modules. This should address whether you can really have read and write bandwidth in memory simultaneously. Yes you can - there are actually a lot more individual blocks that comprise the whole ESRAM, so you can talk to those in parallel. Of course, if you're hitting the same area over and over and over again, you don't get to spread out your bandwidth, and that's one of the reasons why in real testing you get 140-150GB/s rather than the peak 204GB/s: it's not just four chunks of 8MB memory. It's a lot more complicated than that, and depending on the access pattern you get to use those simultaneously. That's what lets you do reads and writes simultaneously, and you do get to add the read and write bandwidth, just as you add the read and write bandwidth on to the main memory. That's just one of the misconceptions we wanted to clean up.
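A minimal sketch of the bubble accounting Baker describes, assuming the 853MHz clock and the 1024-bit read and write paths from the quote:

```python
# The bubble accounting from the interview: the read path runs every
# cycle, while the write path loses one cycle in eight at 853MHz.

FREQ_HZ = 853e6
READ_BITS = WRITE_BITS = 1024

raw_peak = (READ_BITS + WRITE_BITS) / 8 * FREQ_HZ / 1e9
bubble_peak = (READ_BITS + WRITE_BITS * 7 / 8) / 8 * FREQ_HZ / 1e9

print(raw_peak)           # ~218.4 GB/s, the "equivalent" accounting
print(bubble_peak)        # ~204.7 GB/s, the Hot Chips figure
print(150 / bubble_peak)  # measured 140-150GB/s is ~68-73% of that peak
```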
 
The problem is that the 150GB/s is not achievable all the time. It's a 'sometimes' case just like any other peak bandwidth, but it has further caveats.

Are you sure? I have heard that the 204GB/s is not achievable all the time and that 150GB/s is a real-world number from real tests with real apps.
 
But didn't they say they measured the 150GB/s in only one case, not all the time? If that's the case, then it's not the same kind of number.

No, they said that it is actual real code, not tests and such. I think you read it backwards.

And then if you say what can you achieve out of an application - we've measured about 140-150GB/s for ESRAM. That's real code running. That's not some diagnostic or some simulation case or something like that. That is real code that is running at that bandwidth.
 
I would like further clarification from someone who knows more to be honest.

This is what DF said previously.

Theoretical peak performance is one thing, but in real-life scenarios it's believed that 133GB/s throughput has been achieved with alpha transparency blending operations (FP16 x4).
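For context, here's how that earlier 133GB/s figure compares with the peaks discussed above, assuming (and this is an assumption) it was measured at the original 800MHz clock:

```python
# The 133GB/s DF figure predates the clock bump to 853MHz; at the
# original 800MHz the bubble-adjusted ESRAM peak works out to 192GB/s.

peak_800 = (1024 + 1024 * 7 / 8) / 8 * 800e6 / 1e9
print(peak_800)        # 192.0 GB/s
print(133 / peak_800)  # ~0.69, i.e. ~70% utilization for FP16 x4 blending
```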
 