esram astrophysics *spin-off*

@astrograd:
I still don't get how the Casimir effect and the restricted spectrum of vacuum fluctuations between two metallic plates have any closer relevance to the timings of a CMOS circuit.

They don't. I never claimed they did.

And nice dodge on the transistor state change. But it is still done in the first place by applying a voltage to the gate.
Do I need to explain what quantum tunneling is again? Yes, it involves a potential barrier. That's what gets tunneled through. And yes it involves the electrons' energy levels, which are adjusted via inducing electric fields in the dielectric to 'trap' the conduction electrons and kill their mobility. Tunneling rates are dependent on system parameters (obviously), including the voltage across the material. I'm astonished that you assert I 'dodged' anything given the explanation I offered. :rolleyes:

You overrate the tunneling in my opinion. The current through the channel is not tunneling if you switch the transistor to create a conductive channel. And a thermal excitation isn't tunneling either.
Tunneling is the physical basis of the Kronig-Penney model from which semiconductor physics arises. Periodic crystal lattice potentials with exponential Bloch functions are what give you the energy bands in the first place. The Bloch functions exhibit tunneling just like any other wave function in a lattice would.

And look at what a planar transistor looks like, and now imagine a node shrink so the consumed area halves (and yes, I know it isn't done exactly this way anymore).
I've already noted the classical equation for capacitance. We can surely agree that it's correct and yes, it's proportional to area and inversely proportional to planar separation. I was simply asking for clarification because you asserted something in a vague fashion that isn't necessarily true if someone interprets your meaning as shrinking one dimension when you meant another.
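For reference, the classical parallel-plate relation in question (just the standard textbook form, stated so we argue about the same thing):

C = ε · A / d

so shrinking the plate area A lowers the capacitance, while shrinking the separation d raises it; which dimension you mean matters.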

And btw., that scaling law can't be off by a factor of 2.
Fair enough. I didn't notice that word there the first time. I was obviously referring to the actual formula. I see now that you did say it was for scaling, so that's fine. You may ignore my nitpick. But don't presume to assert what I am or am not aware of please.

It only gives the proportionality of changes. You can rescale with any constant factor that pleases you. That's the reason you haven't seen an "=" sign in it. (The symbol ∝, as commonly used today in English for proportionality, is usually hard to find on a keyboard; the German convention "~" is much easier. It also designates similarity in geometry, which is actually quite close to meaning proportional [same shape, scaled in size, possibly rotated or flipped], which is why one occasionally finds it with this meaning in English texts too; just to give you the reason for my choice.) As I explicitly mentioned a scaling law, I hoped it would be understandable. You can't use it to directly calculate the power consumption anyway without adding a few other terms (especially with today's small feature sizes; some years ago it was a better approximation). But it still captures the main effect and was just meant to illustrate the importance of the capacitance.
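For reference, the scaling relation presumably under discussion here is the usual dynamic switching power law,

P_dyn ∝ C · V² · f

with switched capacitance C, supply voltage V and switching frequency f; leakage and the other contributions are the extra terms that have to be added on top at today's feature sizes.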
You're talking to someone who researches quantum gravity for a living. You don't need to explain scaling relationships to me. You can afford to be condescending when you actually make a point premised on nuances I hadn't considered.

On another note, maybe TFETs become useful for some low power SRAM (and are relying on band to band tunneling for carrier injection), but these are simply not used in all the prevalent CMOS devices. ;)
That doesn't mean tunneling isn't the physical mechanism for MOSFETs used in CMOS designs. When electrons move through the semiconducting solid they tunnel through the potential barriers centered on the periodic lattice point nuclei. That's how conduction works physically. FETs kill that by lowering the kinetic energies of the electrons by entrapment via Coulomb attraction, but make no mistake: the reason they are 'trapped' and immobilized is that their KE is not high enough to effectively tunnel across lattice potentials. The exact same physical effect is at play there governing the conductance.

I think it doesn't make sense to dig deeper into that stuff as it has not much to do with the original point made. Agreed?
My only point on tunneling in the first place was that as you shrink your system it becomes more important for determining the time it takes for states to change. You seem to prefer arguing about how big a difference this makes, but neither of us have the details to actually do the calculations without knowing how things are doped, what materials are being used for the semiconductors, what voltages are put in place, etc.

I was never claiming that tunneling was the most apparent thing that would affect timings as you shrunk the eSRAM array. I was trying to speak to what could have those effects while eluding engineers, and believe me, quantum mechanical effects almost always elude engineers, even electrical engineers working in large labs at AMD et al. In any case, we can drop that issue if you like.

As someone else noted from their own experience, it seems my conjecture is absolutely possible in real-world manufacturing scenarios; it (obviously) fits what I was told (which was vague but not worthlessly so) and is in line with what MS apparently told devs. I find it hard to believe that MS told devs something was achievable in real world usage (133GB/s) if they didn't test it out first to make sure. The picture you painted of how DF got the info they did just sounds totally implausible, and it doesn't match what little I was told on the subject personally.
 
Exactly. I have nothing technical to add to this discussion but general common sense tells me that if what Astrograd is saying were true then both Microsoft and AMD would have known it was a serious possibility at the design stage. And if there's a serious possibility that you can achieve a game changing double bandwidth then you'll be going after it with everything you've got from the start. You certainly wouldn't be surprised by the result.

We dunno the original designs. For all we know it may have been considered a 'remote possibility' from the outset and good yields and other manufacturing characteristics simply made it a fortunate reality for them. I'll again stress that this argument about design is premised on assuming that MS's setup is full spec DDR. Nobody suggested that. So keep that in mind.
 
All I can say is that I have a huge appreciation for anyone who is highly educated in physics or pure math. It's a path I wish I'd taken when I was in school.

Never too late. You'd be amazed at the learning resources you can find online for these subjects. I taught myself the majority of the physics I know today. Never went to class unless we had to take exams/quizzes. Did that all through my undergrad as well as my grad classes.

You can learn a LOT with online resources, be it wikipedia (which is unreasonably shit on in academia) or Open Courseware stuff from Yale/MIT, or just youtube searches. I always make sure to include links to much of this stuff when I can on my syllabi because it's been so helpful for me.

As an instructor it honestly makes me a little nervous about job security over the next decade or two.
 
Never too late. You'd be amazed at the learning resources you can find online for these subjects. I taught myself the majority of the physics I know today. Never went to class unless we had to take exams/quizzes. Did that all through my undergrad as well as my grad classes.

You can learn a LOT with online resources, be it wikipedia (which is unreasonably shit on in academia) or Open Courseware stuff from Yale/MIT, or just youtube searches. I always make sure to include links to much of this stuff when I can on my syllabi because it's been so helpful for me.

As an instructor it honestly makes me a little nervous about job security over the next decade or two.

I think as STEM explodes over the next decade, even individuals who are self taught at the lowest levels will need live knowledgeable instructors as well as laboratory apparatus to teach them further. Great discussion. I think MS/AMD may have simulated this behavior in software and on test silicon but were probably surprised at the quality and consistency of the yields which allowed them to realistically adopt this capability as a hardware feature.
 
Fascinating discussion but the reality is that MS just painted some stripes on the ESRAM and this is why it's faster.
It's called quantum striping.

Allow me to make a joke because the tension in this thread is palpable.
 
Keep getting reminded of a very noisy picture my 'Secret Source' sent me in May. I cannot personally see anything but, if you look carefully, there does seem to be a highlighted area at the bottom left which may hold some weight here?

Top_Secret.jpg
 
Allow me to make a joke because the tension in this thread is palpable.
I tried an xkcd joke to steer the discussion away from quantum mechanics, but instead it turned this place into the Intellectual Edition of MMA. :oops:
 
I still don't know how this physics discussion relates to the eSRAM bandwidth issue. I would really suggest again to concentrate on this point because the spoilered stuff below isn't really helpful. [strike]As you obviously want some exchange about it, I answer[/strike] (edit: didn't read your post to the end, I admit, but anyway, now it is written). But this is my last post in this direction.
Do I need to explain what quantum tunneling is again? Yes, it involves a potential barrier. That's what gets tunneled through. And yes it involves the electrons' energy levels, which are adjusted via inducing electric fields in the dielectric to 'trap' the conduction electrons and kill their mobility. Tunneling rates are dependent on system parameters (obviously), including the voltage across the material. I'm astonished that you assert I 'dodged' anything given the explanation I offered. :rolleyes:
Because it is still largely irrelevant for a conducting channel. It's probably more important for subthreshold leakage. Or if you have hopping transport because of localized defects as common in organic semiconductors or something.
Tunneling is the physical basis of the Kronig-Penney model from which semiconductor physics arises. Periodic crystal lattice potentials with exponential Bloch functions are what give you the energy bands in the first place. The Bloch functions exhibit tunneling just like any other wave function in a lattice would.
I guess the conduction in metals is also caused by tunneling and not by the quasifree motion of electrons in conduction bands, as the same reasoning about periodic potentials applies there too for the valence electrons and you can draw the same band diagrams. But wait! We don't have just completely filled valence bands, but also conduction bands, and the dispersion relation for carriers in a conduction band basically looks like that of a free particle (if one doesn't look too closely), with an effective mass given by the curvature of the band describing it pretty well as a rough approximation (just relying on some fading glimpse of knowledge from a lecture more than ten years ago). And yes, I know it is the semiclassical interpretation (but that's usually the one people are able to understand).
Or in other words, a Bloch wave is usually pretty delocalized. It doesn't have to tunnel from one site to the next, it simply propagates through the crystal almost as a free particle wave would. A Bloch wave is basically a plane wave modified by a periodic function for the local lattice influence. How much importance do you think that has for transport properties? Is tunneling the right framework to describe the propagation of a plane wave through some lattice? Is it just me or is the usually reduced amplitude of the wavefunction after tunneling through a barrier missing in the Bloch wave solutions?
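For reference, a Bloch state has the standard form

ψ_k(x) = e^{ikx} · u_k(x)

with u_k periodic in the lattice constant; the plane-wave factor is what makes the state extend over the whole crystal instead of decaying at each barrier.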
And iirc the bandgap in undoped silicon is about 1.1 eV. I can only imagine what happens when one has different dopings and is basically able to shift the Fermi level locally in the channel by applying about 1 V to the gate. Maybe it enables one to control whether the conduction band gets flooded with carriers or not, so it basically forms a submicron-sized electrical switch between the source and drain regions? If the channel is depleted of free carriers, then of course electrons may tunnel through the now formed barrier with a length of a handful of nanometers. But this is just something one would like to get around if it were possible. The functionality isn't based on it. And of course there are massive complications to this picture when one gets into the details. But as I said repeatedly, important things first.

To get back to your point, it is only conceivable that they can use two transfers per clock cycle if it was planned that way from the start. Even the example you mentioned planned for that by making it flexible and configurable to use either of the two modes. If it is not designed for that operation it will not work with two transfers per clock cycle.
 
If astrograd would explain a bit more what exactly he means by "double pumping" (I asked for that), or why it may not be "full spec DDR" and how this is supposed to work, we may be able to proceed. His prior assertions regarding the timing of such a setup didn't sound convincing. My impression is he is either talking about a locally doubled clock (roughly what Intel called double pumping in the P4, but where does this clock suddenly come from?) in 15 out of 16 cycles, or two transfers per clock cycle in 15 out of 16 cycles (how does this work if it was not designed for that operation?).
 
Because it is still largely irrelevant for a conducting channel. It's probably more important for subthreshold leakage. Or if you have hopping transport because of localized defects as common in organic semiconductors or something.

The discussion was specifically about state switching times. Not the magnitude of current generated by a certain mechanism. Remember, this is about timings and trying to guess how they may have narrowed beyond what MS/AMD engineers would have expected.

Or in other words, a Bloch wave is usually pretty delocalized.

When you do a Fourier sum of Bloch functions to represent the wave packet of an electron it's not. The modulation coefficient to the exponential is itself an exponential.

It doesn't have to tunnel from one site to the next, it simply propagates through the crystal almost as a free particle wave would. A Bloch wave is basically a plane wave modified by a periodic function for the local lattice influence.

That's only if you pretend the ion cores produce infinitely thin delta function potentials so there is no room for the decay to take place. In real semiconductors that's not the case.

Is tunneling the right framework to describe the propagation of a plane wave through some lattice? Is it just me or is the usually reduced amplitude of the wavefunction after tunneling through a barrier missing in the Bloch wave solutions?

What happens is you gain the left-moving term, since you aren't carrying the function out to spatial infinity because the subsequent ion core hits you first. But again, that's only if your wave function is constructed assuming a Dirac comb. Real lattices don't have those idealized potentials.

I don't really care enough to bother with semiconductor physics further either. I think we both can agree that aspects of the materials that affect the timings are in flux during manufacturing tests, yes? Can we also agree that MS/AMD would have been very conservative in their designs/expectations for a 32MB pool of eSRAM?
 
If astrograd would explain a bit more what exactly he means by "double pumping" (I asked for that), or why it may not be "full spec DDR" and how this is supposed to work, we may be able to proceed. His prior assertions regarding the timing of such a setup didn't sound convincing. My impression is he is either talking about a locally doubled clock (roughly what Intel called double pumping in the P4, but where does this clock suddenly come from?) in 15 out of 16 cycles, or two transfers per clock cycle in 15 out of 16 cycles (how does this work if it was not designed for that operation?).

I simply mean that during a single pulse you can read (rising edge) and write (falling edge). For that to be possible you need to be able to go from a read to a write operation within the width of that pulse, which sounds like what your bolded part there is talking about. Seems to me that requires being able to tell your eSRAM to go from looking for a rising edge to looking for a falling edge. The time it takes to change those instructions may have somehow shrunk enough to fit within half a clock cycle.

Of course, if you raise the clock you shrink that pulse width and doing so too much means you encroach upon the time interval required to change from read to write. If you upclock too much, the pulse width becomes too thin to allow for the read to write transition to happen before that falling edge hits.

This was why I conjectured about it in the first place, based on you (I think it was you) noticing the 16/15 factor. It just seems really coincidental for that to be the exact factor they raised the clock by, given that it would narrow the pulse width precisely to the point where going any further would cost you the ability to read/write on that same pulse.
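To put rough numbers on that (assuming the 16/15 factor refers to going from 800 MHz to roughly 853 MHz; the figures below are only illustrative):

[code]
# Illustrative arithmetic only; assumes 16/15 means an 800 MHz -> ~853 MHz upclock.
f_old = 800e6
f_new = f_old * 16 / 15                  # ~853.3 MHz

half_period_old = 1 / (2 * f_old)        # ~0.625 ns between rising and falling edge
half_period_new = 1 / (2 * f_new)        # ~0.586 ns

print(f_new / 1e6, half_period_old * 1e12, half_period_new * 1e12)
[/code]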
 
So no double read or double write? Strictly read on one edge and write on the other? This would have to be a capability of the memory controller as well. I don't see how the ESRAM and controller could do this if they weren't explicitly designed to operate on both clock edges.
 
Okay, it was not my last post about this.
The discussion was specifically about state switching times. Not the magnitude of current generated by a certain mechanism. Remember, this is about timings and trying to guess how they may have narrowed beyond what MS/AMD engineers would have expected.
These effects influence power consumption and maybe reliability more than anything else (especially when the difference between on and off gets fuzzy at lower operating voltages). For timing at the nominal voltage it's maybe a second-order effect. As I said, largely irrelevant.
When you do a Fourier sum of Bloch functions to represent the wave packet of an electron it's not.
The wavepacket of a conduction band electron is also delocalized. Valence electrons are the more localized ones. There is a reason one draws the conduction band usually above the barriers between the lattice sites.
That's only if you pretend the ion cores produce infinitely thin delta function potentials so there is no room for the decay to take place. In real semiconductors that's not the case.
No. Afaik, the Bloch waves are the general solution of the stationary Schrödinger equation in a periodic potential, irrespective of the exact shape of that periodic potential. Another potential just results in a different Bloch function one has to multiply with the plane wave to get the Bloch wave. Bloch waves of course neglect a few things, but not that.

===============
I don't really care enough to bother with semiconductor physics further either. I think we both can agree that aspects of the materials that affect the timings are in flux during manufacturing tests, yes? Can we also agree that MS/AMD would have been very conservative in their designs/expectations for a 32MB pool of eSRAM?
While we can certainly agree that simulating the timings before getting silicon back is a daunting task, I would wager the fabs usually have pretty good data on SRAM arrays. The early test wafers during process development often contain a lot of it.
Sure, one can be conservative and all, but that still doesn't allow switching the operation mode if it was not designed for it. All that a large timing margin usually allows is to increase the clocks accordingly (as long as power consumption doesn't become an issue).
 
I was told back in May that the eSRAM's bandwidth was much higher from a theoretical pov than what VGleaks had said due to the timings of the eSRAM which weren't expected to turn out as they did. I was also told that the real world bandwidth was over 130GB/s. That's all the info my source had on the cause of the boost. Here is what DF's sources said:

So how could Microsoft's own internal tech teams have underestimated the capabilities of its own hardware by such a wide margin? Well, according to sources who have been briefed by Microsoft, the original bandwidth claim derives from a pretty basic calculation - 128 bytes per block multiplied by the GPU speed of 800MHz offers up the previous max throughput of 102.4GB/s.

It's believed that this calculation remains true for separate read/write operations from and to the ESRAM. However, with near-final production silicon, Microsoft techs have found that the hardware is capable of reading and writing simultaneously. Apparently, there are spare processing cycle "holes" that can be utilised for additional operations. Theoretical peak performance is one thing, but in real-life scenarios it's believed that 133GB/s throughput has been achieved with alpha transparency blending operations (FP16 x4).

The math works out such that if you are capable of reading/writing during the same cycle for 7/8 cycles you get the quoted 192GB/s figure.
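Just to spell out that arithmetic (nothing here beyond what the quote already states):

[code]
# Back-of-the-envelope check of the quoted figures.
bytes_per_cycle = 128        # bytes per block, per the DF quote
clock_hz = 800e6             # 800 MHz

base_bw = bytes_per_cycle * clock_hz   # 102.4 GB/s, the original figure
peak_bw = base_bw * (1 + 7/8)          # read+write on 7 of every 8 cycles -> 192 GB/s

print(base_bw / 1e9, peak_bw / 1e9)    # 102.4 192.0
[/code]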

Since I was told the stated info WAY before DF was, I'm inclined to believe that DF didn't just make it up nor is it at all likely their multiple sources did either (that'd be one helluva coincidence). So here we are.

I submit that the timings are important, as that's what I was told.
 
Neither of us have any idea about the context of this discovery. Let's stop pretending we were in the room when they discovered this.
They would have discovered something that they provisioned for when they designed the device.
The design was likely set for quite some time, at least since last year going by tape-out rumors, and its validation tests and debug checks would have started development alongside the simulated and then fabricated hardware.
Unless the possibility was targeted earlier, this new feature would have been sprung right before mass production without the testing and validation occurring alongside. Instead AMD and Microsoft would have blown months of work on the interface they did design, but not the one they ended up with.

I was told it wasn't something they expected to have on the table, which is similar to what DF was told. We don't really have a justified reason to cast this literally as a totally unforeseen accident that they had no idea was possible.

You design all 8 cycles to double pump if that's your goal, sure. But this wasn't what we got here. If my theory on what happened is right, I don't see anything particularly shocking about it. Just good luck on MS's part. It could have very well been something that only popped up as they were finalizing their production testing.

You are adding color to your description where it doesn't belong. Loaded language trying to paint this as a wholly chaotic and spontaneous flickering of electrons swarming all over the place isn't going to help anyone explain what I was told in May nor what DF was told in June. Violating assumptions isn't always a bad thing.
It is very much a bad thing in this context. The memory pipeline has a series (likely a very large set) of defined state transitions it has to make, and there are conditions set when it must have certain outputs relative to clock signals and active or inactive signal lines.
A single-pumped interface would accept commands from a client during one cycle, and an operation like a read would return a result in a future cycle.

I find it difficult to accept that double-pumping as it is commonly described--and DDR memory in particular--can be just wished into this. There would be stages in the pipeline that were built to have a valid state matching a read being performed, and now instead at the end of the valid time window we have the state belonging to a completely different write operation.
If there is a way for the units to pick up on the falling edge of a clock and to properly arbitrate two accesses concurrently, there is going to be hardware and defined state transitions for them to handle it.


As I said, if they were being conservative, which I submit is a rather reasonable and obvious scenario given the context surrounding the manufacturing/design of the eSRAM, then things turning out better than expected shouldn't be twisted into something dangerously foreboding.
Misjudging performance by almost 100% is not what I consider a reasonable expectation of the competence of the engineers involved.

I'm willing to stomach things ranging from slightly goofy developer documentation, development programmers writing code loops that don't correct for cache and memory coalescing effects, or just bad communication and abuse of terminology. The latter is not too difficult in my mind to accept when dealing with DF or Vgleaks when we are going by the interpretation of non-technical writers, who in some cases may not even be native speakers in the case of Vgleaks.

I consider the claim that the engineers that built it got things this wrong to be an extraordinary one that requires evidence of the same caliber.

What specifically do you mean here? If you are referring to current leakage causing issues then that depends on the type of transistors being used.
My point is that memory arrays are accessed after a defined sequence of signals being asserted and deasserted over one or more clock cycles. Deciding to make one part of the memory subsystem perform two operations in the same clock cycle requires that the control signals that feed into it and the signals it sends down the pipeline can also be changed as quickly. If it was never an intended option, then someone should have started asking why all this redundant, unusable, or unstable hardware was being put into place.

If the memory port transitions faster than the control and arbitration logic, it is changing its outputs and making state transitions in the latter half of the clock cycle based on inputs that are not changing. Why we would expect a valid output from this is something I would want justification for.
If the hardware is able to change the inputs that fast, or it is able to respond to a separate set of signals, then it is something that was put in place at design time.

If you read my other posts you will see what I am suggesting. Namely, that it's possible that during manufacturing/testing they found out their conservative designs for the massive pool of eSRAM turned out to give them more wiggle room than they realized they had. Have ppl been manufacturing 32MB pools of eSRAM for yrs?
eSRAM, no, but a number of designs have done better. Intel has had high-end caches of over 20MB of SRAM on-die since 2005. Off-die, 32MB SRAM caches were used by IBM since 2007.
AMD hasn't had monolithic caches of that size, although it has had cache hierarchies of 16MB on-die at much higher speed with full coherence since 2011--32MB if counting the coherent MCM products.
It's not undiscovered territory as to how it should function, although making it consistently manufacturable at the sort of yields a low-margin part requires is something AMD has not had a chance to prove itself with.

I'm not looking to find something unexpected in the space between 16 and 32 MB. The SRAM arrays in those caches don't interact with each other, and the memory pipelines themselves don't need to behave all that differently beyond a few extra signals to select from a slightly larger set of sub-arrays.



I'm going to ask for clarification on the quantum tunneling theory, not in terms of physics, but in terms of why it should matter at the level of abstraction or the macroscopic level of the interface.
Quantum tunneling is an effect exploited for flash memory, at the level of the individual floating gates and their oxide layers that measure a few nanometers in width.

There are quantum effects that are dealt with or mitigated on a transistor or silicon layer basis when evaluating their performance and reliability, but the fabrication and design engineers thus far have been struggling mightily to keep it out of the protocol and state layers of abstraction, which is where the SDR or DDR mode of operation would be recognized.
What happens in the span of a transistor's gate stack is something where we have perhaps tens of atomic layers, but the logic layers and state transitions govern a macroscopic set of pipeline stages, SRAM arrays, and wires that extend multiple millimeters.

True quantum effects that dominate the actual function (turn on or off) of transistors or logic instead of their performance (the slope of the voltage/current curve) are not expected until below 7 or 5 nanometers, which is also where many observers expect silicon scaling to just throw up its hands.
This leaves current silicon an order of magnitude higher than that point, and the actual units and hardware in Durango another one or two beyond that where quantum tunneling is meant to be a significant factor or scaling killer.
 
Was it double pumping SDR? I would like to keep terms like 'DDR mode' out of the conversation because the technique for operation is double pumping; DDR is the spec classification for silicon designed from the ground up to achieve that efficiently and reliably. It's purely semantics, I know, but I think some ppl here have conflated the two and ignored the distinction, causing them to focus their skepticism on expectations about circuitry designs instead of actual operation (which is the claim made regarding the eSRAM boost).

Interesting. Thanks.

The controller and device were capable of being configured to sample on just the up portion of the clock, just the down portion of the clock, or both by setting a few hardware registers.

Note that for the thing I was working on, this was designed in from the start.

However, it doesn't actually take that many more transistors to make something that can be mode switched, or have adjustable clock dividers, etc, so you can get maximum manufacturing and deployment flexibility.

Also, if you're licensing a macro from someone else for your design, it's possible they already threw in a lot of this stuff for you, and you just end up qualifying what clocks and voltages you can run at, and which features you can turn on and which you have to disable given the parameters of your design and the electrical characteristics of your layout, the actual manufacturing quality of what you get back from the factory, etc.
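Purely as a sketch of what that kind of configurability can look like from software (register names, offsets and bit values below are hypothetical, not taken from any real part):

[code]
# Hypothetical sketch of a mode-select register for edge sampling; nothing here is a real API.
EDGE_RISING  = 0b01   # sample on rising edges only (single data rate)
EDGE_FALLING = 0b10   # sample on falling edges only
EDGE_BOTH    = 0b11   # sample on both edges (double pumped)

CLK_MODE_OFFSET = 0x10    # made-up offset of the clock/edge-mode register

def configure_sampling(write_reg, mode):
    """write_reg(offset, value) is assumed to poke a memory-mapped config register."""
    write_reg(CLK_MODE_OFFSET, mode)

# e.g. configure_sampling(board.write_reg, EDGE_BOTH)
[/code]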
 
Here is a fairly recent patent about SRAM double pumping. There are several other articles listed along with it.

It describes making SRAM with double pumping ability without requiring either much more die space for more ports or higher internal clocks. It seems to make sense to have this functional possibility built in if it doesn't have a large additional cost.


http://www.google.com/patents/US7643330#forward-citations

"In sum, a two-port SRAM design is presented with an associated die area comparable to a one-port SRAM. To achieve area efficiency, the read and write ports are restricted to mutually synchronous operation, which represents the common usage model for many applications. By restricting both ports of the SRAM to synchronous operation, a dual-pump timing model can be introduced, whereby one pre-charge cycle may be eliminated. By eliminating one pre-charge cycle and allocating one read and one write time slot within each clock cycle, the SRAM design can provide the functionality of two access ports that operate in an edge-triggered clocking regime.

While the forgoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software."
 
Some stuff of course needs to toggle faster (or needs to be added in the first place) for this to work. Maybe it's the most basic version, no idea. I think there are multiple patents on dual/multiported SRAM using slightly different multiplexing techniques to access the data on SRAM arrays with fewer ports than are exposed to the outside.
 
SRAM that can be read and written at the same clock is, as mentioned in the patent, frequently done by adding transistors to the cells and adding area so that there are separate ports to handle the operations.

For a design with density rated a top priority, like a GPU, that is prohibitive.
The patent, from my parsing, performs the read and write over the same wires by assigning different phases of the clock to each operation, with the read and write sharing the single precharge operation per clock as opposed to the read getting its own precharge on its own lines and the write getting its own.
There's a control unit and IO which is shown explicitly taking the necessary inputs and changing the select signal at double the reference clock in order to get sensible output from the arrays.

I'm not sure of the full range of tradeoffs, apart from some controller complexity.
One tradeoff is that the clock cycle is now longer.
It must contain a common precharge, a read, the wait period between the read and write, and a write in a single cycle.
A true dual-ported SRAM would have two precharges and the read and write in parallel.
I'm not sure if there's extra time required to make sure that the write has time to stabilize on lines that are not fully precharged thanks to the read that came before. I haven't gone through the legalese to see where this is addressed.
There are some general problems in the smaller nodes with noise margins and variation that might be affected if the prior read can nudge voltages away from the idealized precharge levels (which tuning circuits would be calibrated to), although that can be compensated for by extra time or making the write signal stronger so it can swamp the effect faster.
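As a toy illustration of that timing tradeoff (the numbers are made up; only the structure of the comparison matters):

[code]
# Toy comparison of the cycle budget: dual-pumped single-port vs. true dual-port SRAM.
t_precharge  = 1.0   # arbitrary units
t_read       = 1.0
t_turnaround = 0.3   # settling between the read and the following write
t_write      = 1.0

dual_pumped_cycle    = t_precharge + t_read + t_turnaround + t_write  # everything serialized
true_dual_port_cycle = t_precharge + max(t_read, t_write)             # read and write in parallel

print(dual_pumped_cycle, true_dual_port_cycle)
[/code]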

For a GPU, the relaxed latency requirements and larger amount of register capacity make this more acceptable than they would in a CPU.
Given the owner of this patent, I expect Durango wouldn't be using this particular design. ;)
 
What could explain its performance being finicky? The theory of having 8 banks with one in conflict makes a lot of sense (88%), but why is it so far from that in the real world?
Is that 33% typical for bidirectional ports, because the usage pattern isn't very symmetrical with real code?
 