They are clearly doing read + write during some cycles. Based on what else we know, it can't be on all of them. I'm open to alternatives, but the end result, whatever the mechanism, is very obviously in line with the DF article which so many here and elsewhere dismissed out of hand.
I'm afraid I can't speak for other posters' opinions.
There were plenty of people trying to reconcile a number of points from a very imprecisely worded article, written by people whose testing methods were not disclosed.
The insinuation that the people building the memory subsystem could have missed something this significant is in itself a claim of significant import, which the article did nothing to explain.
If there was uncertainty about which latency numbers would be finalized, or bugs that needed a new stepping to fix, or a late firmware decision to change a control register so that reads and writes could issue in the same cycle, those further removed from the manufacturing and design teams wouldn't see it until late in the process.
The software teams that are coding apps or benchmarking the system are likely more numerous, and they sit outside the smaller teams with a more detailed understanding of the hardware.
Their greater number increases the chance that you'd find one in contact with DF, and the reduced exposure to the subsystem details would make them more likely to be surprised.
Unrealistic example of a non-double-pumped storage array that relies on heavy banking to allow a read and a write in the same cycle:
eSRAM is broken into rows in a "horizontal" orientation.
Those horizontal blocks are subdivided into vertical banks.
There are two semi-linked paths in the eSRAM pipeline: one for reads and one for writes.
If all conditions permit, both paths can execute an operation in the same cycle; otherwise only one does.
Parameters:
Interpret Microsoft's slide of 256 read&write as there being 256 bits of bandwidth in each direction.
16 banks, each 256b wide.
Optimized for access patterns that don't frequently overwrite themselves.
Rules for dual issue:
No simultaneous hits to the same bank on the same line (the pipeline can wake up another line for the same bank, but with a penalty that drops bandwidth below 7/8).
Varying latencies for different operation sequences; if a latency isn't met, issue is delayed for that pipe. (The numbers are chosen for simplicity.)
Borrows the theme from the patent, where operations can share setup work--some more than others.
One constraint is that the ideal pattern must allow access to all banks for reads and writes. It may be possible with different parameters to create high-bandwidth patterns that isolate reads and writes to specific banks.
Earliest cycle an access to the same bank can issue, relative to a prior access at cycle n:
Read after read: n+16
Write after read: n+2
Read after write: n+15
Write after write: n+16
Cycle#
Wn = Write to bank n
Rn = Read to bank n
The bank numbers don't matter beyond denoting when conflicts can arise.
Code:
0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2
W0 W1 W2 W3 W4 W5 W6 W7 W8 W9 WA WB WC WD WE WF W0 W1 W2 W3 W4 W5 W6 W7 W8 W9 WA WB WC WD WE WF W0
R2 R3 R4 R5 R6 R7 R8 R9 RA RB RC RD RE RF R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 RA RB RC RD RE RF R0 R1 R2
edit: (fixed some of the following text)
Two cycles out of every 16 single-issue, which comes to 30/32 of the potential operations.
The wait states in question are measured in cycles and could be tweaked with control registers, giving more or less peak bandwidth and more or less sensitivity to non-ideal patterns.
The difficulty for code trying to match this pattern is that any accesses that don't fit the profile will lead to single-issue cycles. If concurrent programs are working separately, each can throw off the pattern, leaving bandwidth bouncing somewhere between the peak and the minimum.
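For anyone who wants to poke at the rules, below is a minimal sketch of the scheduling model above, in Python. The bank count, the minimum gaps, and the in-order write/read streams are taken from the example; the retry behavior, the tie-break when a read and a write hit the same bank, and the decision to ignore the wake-another-line case are my own simplifying assumptions, not a description of any real hardware.
Code:
# Toy model of the banking example above -- not real hardware.
# Banks, minimum gaps, and the two in-order request streams come from the
# post; the retry and tie-break behavior are my own assumptions.

NUM_BANKS = 16

# Minimum cycle gap before a new access to the SAME bank may issue,
# keyed by (previous_op, next_op).
MIN_GAP = {
    ("R", "R"): 16,  # read after read
    ("R", "W"): 2,   # write after read
    ("W", "R"): 15,  # read after write
    ("W", "W"): 16,  # write after write
}

def can_issue(op, bank, cycle, last_access):
    """Check the same-bank gap rules for issuing `op` to `bank` at `cycle`."""
    for prev_op in ("R", "W"):
        prev_cycle = last_access[prev_op][bank]
        if prev_cycle is not None and cycle - prev_cycle < MIN_GAP[(prev_op, op)]:
            return False
    return True

def simulate(cycles=1024):
    last_access = {"R": [None] * NUM_BANKS, "W": [None] * NUM_BANKS}
    next_write, next_read = 0, 2          # streams from the diagram: W0..., R2...
    dual = single = idle = 0
    for cycle in range(cycles):
        w_bank, r_bank = next_write % NUM_BANKS, next_read % NUM_BANKS
        w_ok = can_issue("W", w_bank, cycle, last_access)
        r_ok = can_issue("R", r_bank, cycle, last_access)
        if w_ok and r_ok and w_bank == r_bank:
            r_ok = False                  # no read+write to the same bank in one cycle
        if w_ok:
            last_access["W"][w_bank] = cycle
            next_write += 1
        if r_ok:
            last_access["R"][r_bank] = cycle
            next_read += 1
        issued = int(w_ok) + int(r_ok)
        dual += issued == 2
        single += issued == 1
        idle += issued == 0
    print(f"dual={dual} single={single} idle={idle} "
          f"-> {2 * dual + single}/{2 * cycles} potential ops issued")

simulate()
Under these assumptions the printed rate lands in the same neighbourhood as the 30/32 figure above (the exact number depends on how blocked requests are retried), and perturbing the streams or the gap values shows how quickly it degrades.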
Errr...what length scale are you considering "macroscopic" here? Don't confuse 'macroscopic' with 'quantum effects'. Not at all the same thing.
I agree that it's not the same thing, as the quantum effects are either drowned out or become part of larger electrical or thermal trends. Those trends themselves are a layer or two below the functions in question.
The eSRAM will likely occupy tens to a hundred or so mm². It will have components stretching millimeters in length, and it involves disparate and complex structures with billions of atoms in a thermally and electromagnetically noisy environment where behaviors are accounted for in classical terms.
And there is no leap in the narrative. It's openly speculating how the 'surprise' aspect might have occurred. Nothing more. There's no "narrative" to speak of there.
There is a chain of events and a process for designing and manufacturing a device like this.
The companies, engineers, and design teams involved in this have a very strong interest in characterizing their respective chemical, electrical, and digital product to a very high degree.
Saying that quantum tunneling can cause a double-pumped bus implies a lot about those teams and what happened in that process.
^^^Seems pretty clear to me that they are suggesting that during the final production runs they were seeing things they didn't expect relative to the initial production runs.
I brought up multiple things that can change during the trial silicon phase for the chip: physical manufacturing refinements like process tweaks, bug fixes, low-level setting changes, and potentially base layer changes.
The quoted text said something was observed in the near-final silicon, not what changed to bring it about.
Only 1 of those matches the actual figures (bolded for emphasis). The rest do not.
You said the math of the upclock was too close to be a coincidence. I thought you were referring to how the upclock took the almost-double peak of the 800 MHz design to a final peak equal to 2x of 128 B x 0.8 GHz.
I gave four other combinations where the math works the same, all linearly close clock increments that were frequently used for CPUs back when they were in that speed range.
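For reference, here is the arithmetic as I read it, using the commonly reported 800 MHz base clock and 853 MHz upclock; treat it as illustration, not as a claim about how the official figure was actually derived.
Code:
# The arithmetic behind the "too close to be a coincidence" reading, using
# the commonly reported clocks. Pure multiplication, nothing more.
bytes_per_cycle = 128                      # bytes per direction per cycle
base_clock, final_clock = 0.8e9, 0.853e9   # Hz (800 MHz, and the reported upclock)

one_way_at_800 = bytes_per_cycle * base_clock        # 102.4 GB/s
doubled_at_800 = 2 * one_way_at_800                  # 204.8 GB/s
doubled_at_853 = 2 * bytes_per_cycle * final_clock   # ~218.4 GB/s

# The "almost-double" fraction needed for the upclocked peak to land on
# exactly 2 x 128 B x 0.8 GHz:
print(doubled_at_800 / 1e9, doubled_at_853 / 1e9, round(doubled_at_800 / doubled_at_853, 3))
The fraction comes out around 0.94, and as noted, plenty of nearby clock pairings produce equally tidy-looking numbers.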
If they aren't timed perfectly you can get the nth cycle to be off and incapable of read + write.
The array accesses are synced to the clock signals. They don't start or end without the clock's transition, so what is off?
For instance, there is a finite amount of time required for these elements to be prepped for the upcoming operation, e.g. the time it takes to swap from a read state to a write state. If that is longer than half a pulse, then over n cycles you accumulate a shift, and the states can drift enough that an element sitting in a read state (for instance) sees a falling edge hit it first.
That would cause it to miss its opportunity for that read state to actually perform the corresponding read operation. Instead it would have to wait until the cycle completes and the next rising edge comes around.
Unless the hardware is designed for it, the falling edge isn't critical beyond the fact that it needs to happen for there to be another rising edge.
The prep work should also be synchronous, so it would be interesting to see how this would work--or be considered working.
The more likely physical result of this state of affairs is that the array performs the same transitions when prompted by the clock as every other cycle, but will read invalid data.
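To make that concrete, here is a toy model, entirely my own construction and not the eSRAM: a clocked element whose prep logic needs some settle time. The edges keep firing on schedule regardless; a settle time longer than the period doesn't produce a drifting phase, it produces latched data that isn't valid.
Code:
# Toy synchronous element (my own construction, not the eSRAM).
# The rising edge fires at fixed intervals no matter how slow the prep
# logic in front of the latch is; a timing failure shows up as invalid
# data, not as an accumulating phase shift.
CLOCK_PERIOD_PS = 1250   # ~800 MHz, illustration only
SETTLE_TIME_PS = 1400    # deliberately slower than one period

def run(cycles):
    results = []
    for cycle in range(cycles):
        edge_time = cycle * CLOCK_PERIOD_PS          # edges never slip
        data_ready = SETTLE_TIME_PS <= CLOCK_PERIOD_PS
        results.append((edge_time, "valid" if data_ready else "INVALID"))
    return results

print(run(4))   # edges at 0, 1250, 2500, 3750 ps -- all latching bad data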
This was what made me initially speculate openly about size requirements for quantum effects, as quantum mechanics is what governs the transition times to change states.
The analog behavior of the circuits governs those times. Quantum effects manifest in aggregate as the material and device properties of the components of those circuits.
There are certain phenomena that are manifestations of quantum effects; tunneling, for example, can turn up as a leakage mechanism. Improved SRAM performance comes from having as little of it happen as possible, not more.