esram astrophysics *spin-off*

What could explain its performance being so finicky? The theory of having 8 banks with one in conflict makes a lot of sense (7/8 ≈ 88%), but why is real-world performance so far below even that?
Is that 33% typical for bidirectional ports because the usage pattern isn't very symmetrical in real code?

The quoted peaks and "real-world" performance figures in the DF article are not described very well and the measurement methods are not disclosed.
The quoted scenario from the source is heavy on memory traffic and in theory it was supposed to be an illustration of a good use case, so I'm not sure why it would be this twitchy.

It's curious that the documentation doesn't label the interface as 2x what's in the diagram, because there's nothing wrong with giving peak figures that assume no bank conflicts and a (frequently unrealistic) 1:1 read/write ratio. The absence of such a label is usually consistent with the top numbers applying to a more restricted use case, and the lack of detail on measurement methodology means we can't rule out a range of common errors when benchmarking complex memory pipelines that will prefetch, buffer, and coalesce whatever they can.

The picture from the leaks is an incomplete one, so I'm awaiting something with more detail, preferably more direct than an anonymous source being passed along second- and third-hand with non-technical parties trying to interpret it along the way.
 
I don't know much about memory controllers, so I'm going to ask a stupid question. Would the ROPs be wired directly to both the eSRAM and the path to DDR3? Common sense tells me no, and that it would go through some arbiter/switch which would likely be buffered. Could the 192GB/s simply be referring to the sum of the external bandwidth from the ROPs to said controller, or could certain read/write/modify ops occur directly against the write buffer without traversing the 102GB/s path to the eSRAM?
 
Not sure how you've come to that conclusion.

read + write = 2 ops

That gives you 7(2) + 1(1) = 15 ops per 8 cycles.

2 ops for all 8 would be twice the baseline BW (2(109) = 218 GB/s) and correspond to 16 ops per 8 cycles.

15/16 of 218GB/s = 204GB/s peak.

Since you can't actually do read + write on every cycle, it isn't really appropriate to call it a 204GB/s peak and leave it at that. So MS noted the baseline minimum of 109GB/s as well as the theoretical peak of 204GB/s.
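For anyone following along, the arithmetic above can be checked with a quick sketch (the 7-of-8 read+write pattern is the hypothesis from this thread, not a confirmed spec):

```python
# Peak-bandwidth arithmetic from the post above.
# Assumption (thread hypothesis): 7 of every 8 cycles can do a read AND a
# write (2 ops), while the 8th cycle can do only one op.

baseline = 109                         # GB/s with one op (read OR write) per cycle
double = 2 * baseline                  # 218 GB/s if every cycle did read + write
ops_per_8_cycles = 7 * 2 + 1 * 1       # = 15 ops
peak = double * ops_per_8_cycles / 16  # 15/16 of 218

print(f"{peak:.1f} GB/s")              # ≈ 204.4, matching the quoted 204GB/s peak
```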
 
15/16 of 218GB/s = 204GB/s peak.

I understand the math. However, I think you're making a leap of faith in assuming that they somehow, originally without realizing it, created a part that can truly read and write in a single cycle. IMO, it's more likely that the 204GB/s figure is an effective bandwidth rating for a particular sequence of data accesses where unnecessary ops are skipped thanks to some flag/signaling.

Sorry Shift. Started composing prior to your note. This is done. Thx.
 
Seems to vindicate the DF article (and a certain someone else, hehe). The 204GB/s is the boost from going to 853MHz (it was a 192GB/s peak at 800MHz). But that's only really achievable for... well, basically nothing. This would also seem to confirm that the eSRAM reads and writes on the same cycle for what appears to be 7/8 of the cycles.
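The clock-scaling claim is easy to sanity-check, assuming the peak simply scales with GPU clock (192GB/s being the figure the DF article quoted at 800MHz):

```python
# Scaling the DF article's 192GB/s peak (at 800MHz) to the new 853MHz clock.
# Assumption: peak eSRAM bandwidth scales linearly with GPU clock.

old_peak = 192                   # GB/s at 800 MHz (DF article figure)
new_peak = old_peak * 853 / 800  # same interface, higher clock

print(f"{new_peak:.1f} GB/s")    # ≈ 204.7, consistent with the quoted 204GB/s
```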

That leaves the question of why it has a 7/8 pattern. Access activity still occurs in step with the cycle-bound state transitions, so what would carry over to make the 8th cycle different?
The story seems like it should be more complicated, because the memory subsystem would have little trouble supplying read and write traffic at a higher sustained rate unless other restrictions were in play.

Can it be some form of cycle stealing?
I'm not sure what would be stolen, since it looks like the GPU memory system is the sole user of the interface.
 
Right this seems like the most sensible conclusion to me. Not directly related, but Haswell GT3e's eDRAM works similarly in terms of separate read/write bandwidth.

Can you expand on how that works? I thought there were dedicated bus paths in each direction. Is there a penalty cycle in there that pauses one of the paths after a certain number of ops?
 
As Hot Chips hasn't so far provided any insight into that discussion, we can't add anything beyond what we already have. Can we please leave that conversation to the existing thread and only touch on it here when we get something concrete.

It seems concrete to me. Think of it like a prediction made by a certain hypothesis that was verified at Hot Chips. Sure, other hypotheses could somehow lead to a similar outcome, but none were presented afaik. The mechanism for how it can read/write simultaneously on some of the cycles is still unknown, but the DF article seems very much concretely vindicated here (once adjusted for the clock change).

If nothing else, I think the speculation is worth noting, as it provides some warning about how unlikely it may be to get anywhere close to the peak BW figure in actual games.
 
You used the number from DF to form the hypothesis and the number from today to validate it? That is some circular logic.
 
Microsoft techs have found that the hardware is capable of reading and writing simultaneously. Apparently, there are spare processing cycle "holes" that can be utilised for additional operations. Theoretical peak performance is one thing, but in real-life scenarios it's believed that 133GB/s throughput has been achieved with alpha transparency blending operations (FP16 x4).

Well this is disappointing. So they only cite one example, and that example can only reach 133GB/s?
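As a rough back-of-envelope on that 133GB/s figure: if we assume an FP16x4 target is 8 bytes per pixel and that a blend costs one destination read plus one write per pixel (assumptions on my part, since the measurement method isn't disclosed), the numbers look like this:

```python
# Back-of-envelope for the quoted 133GB/s alpha-blending (FP16 x4) figure.
# Assumption: each blended pixel costs an 8-byte destination read plus an
# 8-byte write; source reads from the texture units are not counted here.

bytes_per_pixel = 8                       # FP16 x4 = 64 bits
traffic_per_pixel = 2 * bytes_per_pixel   # destination read + write
measured = 133e9                          # B/s, figure quoted above

print(f"{measured / traffic_per_pixel / 1e9:.1f} Gpixel/s")  # ≈ 8.3
print(f"{measured / 204e9:.0%} of the 204 GB/s peak")        # ≈ 65%
```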
 
I'm sorry to ask this question because I think it has already been answered, but how can they obtain a BW of 204GB/s without a dual bus? They clearly state 204GB/s in the presentation, not 102GB/s nor 133GB/s.
 
Well this is disappointing. So they only cite one example, and that example can only reach 133GB/s?

Without further clarification, yes. A bit like saying my Camry can do donuts, but really only while on an ice rink with some people pushing one corner. It's not a common situation.
 
I'm sorry to ask this question because I think it has already been answered, but how can they obtain a BW of 204GB/s without a dual bus? They clearly state 204GB/s in the presentation, not 102GB/s nor 133GB/s.
Some renowned tech sites had claimed a while ago that the 200GB/s figure was rather creative thinking on Microsoft's side, because they were allegedly combining the bandwidth of the CPU-GPU connection (30GB/s), the eSRAM bandwidth (102GB/s) and the DDR3 numbers (68GB/s) to obtain 200GB/s.

30 + 102 + 68 = 200GB/s.

But those numbers have changed and don't apply anymore. I am certain that the presentation was mentioning the eSRAM specifically when they talked about 204GB/s of peak bandwidth.

Other than that, it's quite surprising that the console is going to use 8GB of flash memory, which means it can be utilised as a cache, and pop-in will be a goner. The system looks like a very neat platform.

In my opinion, the Xbox One is one hell of a console. Technical beauty.
 
[strike]Focusing on the eSRAM, we have from the slides, 3 pieces of information:

1. 109GB/s min
2. 204GB/s peak
3. 4*256bit read and write.

109GB/s obviously is 800Mhz/15*16 bump resulting in 853.3Mhz and then *128 bits = 109.226 GB/s.

204GB/s, however, is still a mystery.
it does, however, support the 7/8 cycle read+write theory, as 109.226/8*15 = 204.8 GB/s.

The old 109.226 + 68 + 30 internal bandwidth sum that they used at E3 now gives us 207. That number is hard to confuse with 204GB/s, so that shouldn't be the case this time.

However, the third line may shed some light into the mystery perhaps...?

=> 4*256bit read and write

The eSRAM seems to be divided into 4 8MB blocks, each with a 256 bit interface.

However, in our calculations, we (and they) have always used 800(853)Mhz * 128 bits instead of the full interface of one 8MB esram block, which is specified here to be 256bits.

Could there be a possibility that if all the reads/writes occur in a special pattern (i.e. all done on one block instead of touching several blocks, or copying between blocks, or some creative usage) you could then achieve the magical reading and writing on 7/8 of the cycles?

That may also explain how this peak is really unobtainable and realistically it ends up in ~133GB/s for some tests.[/strike]

Edit: Trash that.

Got bits/bytes mixed up.

4 blocks at 256 bits/block = 1024 bits = 128 bytes/cycle. So there's still some sort of fishy thing left unexplained.
 
However, in our calculations, we (and they) have always used 800(853)Mhz * 128 bits instead of the full interface of one 8MB esram block, which is specified here to be 256bits.

You're mixing bits and bytes. Interface has always been assumed to be 4x256bit or 1024 bits wide. 1024 bits / 8 = 128bytes per cycle. 853Mhz * 128 bytes = 109GB/sec.
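That width arithmetic is straightforward to verify (the 16/15 clock factor is the 800MHz to ~853.3MHz bump discussed earlier in the thread):

```python
# Interface-width arithmetic from the post above: four 8MB blocks, each with
# a 256-bit interface, clocked at ~853.3MHz (800MHz * 16/15).

bus_bits = 4 * 256          # 1024 bits total across the four blocks
bus_bytes = bus_bits // 8   # 128 bytes per cycle
clock_hz = 800e6 * 16 / 15  # ≈ 853.3 MHz

bandwidth = bus_bytes * clock_hz / 1e9
print(f"{bandwidth:.1f} GB/s")   # ≈ 109.2, the quoted baseline/minimum figure
```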
 
The 109 min to 204 peak is far from explained. If it were as simple as a 7/8 clock penalty, I think they would say that. I still feel there are far more esoteric requirements to exceed 109GB/sec.
 
You're mixing bits and bytes. Interface has always been assumed to be 4x256bit or 1024 bits wide. 1024 bits / 8 = 128bytes per cycle. 853Mhz * 128 bytes = 109GB/sec.

Ah true, didn't notice the bit/byte problem, then trash that lol


Then it still remains unexplained.
 
Can you expand on how that works? I thought there were dedicated bus paths in each direction. Is there a penalty cycle in there that pauses one of the paths after a certain number of ops?
No I didn't mean it's similar to that level. I just meant separate read and write, therefore stating the "peak" as roughly twice the "min" bandwidth is probably fairly reasonable for graphics, which is what I assume they are doing. The "not perfect double" bit I don't know... could be a number of things I imagine.

Haswell's eDRAM is somewhat more complicated than simple read/write memory paths, but I'm not sure how much has been publicly disclosed. I think Marco (nAo) talked about it a bit at HPG 2013, so maybe there are some slides floating around. That'd be a topic for another thread in any case.
 
Some really interesting tidbits from this excellent article:

http://www.extremetech.com/gaming/1...d-odd-soc-architecture-confirmed-by-Microsoft

The interesting ESRAM cache

First, there’s the fact that while we’ve been calling this a 32MB ESRAM cache, Microsoft is representing it as a series of four 8MB caches. Bandwidth to this cache is apparently 109GB/s “minimum” but up to 204GB/s. The math on this is… odd. It’s not clear if the ESRAM cache is actually a group of 4x8MB caches that can be split into chunks for different purposes or how it’s purposed. The implication is that the cache is a total of 1024 bits wide, running at the GPU’s clock speed of ~850MHz for 109GB/s in uni-directional mode — which would give us the “minimum” talked about. But that has implications for data storage — filling four blocks of 8MB each isn’t the same as addressing a contiguous block of 32MB. This is still unclear.
this should be an interesting micro-architecture in its own right. There are still questions regarding the ESRAM cache — breaking it into four 8MB chunks is interesting, but doesn’t tell us much about how those pieces will be used. If the cache really is 1024 bits wide, and the developers can make suitable use of it, then the Xbox One’s performance might surprise us.
I didn't know the eSRAM had been broken into four 8MB chunks. I wonder what the implications of this are. 8MB is a very small amount of memory for fitting a full 1080p framebuffer, isn't it? :eek:
 