esram astrophysics *spin-off*

Discussion in 'Console Technology' started by astrograd, Aug 3, 2013.

Thread Status:
Not open for further replies.
  1. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    They don't. I never claimed they did.

    Do I need to explain what quantum tunneling is again? Yes, it involves a potential barrier. That's what gets tunneled through. And yes it involves the electrons' energy levels, which are adjusted via inducing electric fields in the dielectric to 'trap' the conduction electrons and kill their mobility. Tunneling rates are dependent on system parameters (obviously), including the voltage across the material. I'm astonished that you assert I 'dodged' anything given the explanation I offered. :roll:

    Tunneling is the physical basis of the Kronig-Penney model from which semiconductor physics arises. Periodic crystal lattice potentials with exponential Bloch functions are what give you the energy bands in the first place. The Bloch functions exhibit tunneling just like any other wave function in a lattice would.

    I've noted the classical equation already for capacitance. We can surely agree that it's correct and yes, it's proportional to area and inversely proportional to planar separation. I was simply asking for clarification as you asserted something in a vague fashion that isn't necessarily true if someone interprets your meaning as shrinking one dimension when you meant another.
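
    The classical relation referenced here is the parallel-plate formula C = ε₀εᵣA/d. A minimal sketch of the scaling behavior (the dielectric constant and dimensions below are illustrative assumptions, not figures for any real process):

    ```python
    # Parallel-plate capacitance: C = eps0 * eps_r * A / d
    # Illustrative numbers only; not specific to any real process node.
    EPS0 = 8.854e-12  # F/m, vacuum permittivity

    def capacitance(area_m2, separation_m, eps_r=3.9):
        """eps_r = 3.9 is roughly SiO2; an assumed value for illustration."""
        return EPS0 * eps_r * area_m2 / separation_m

    c0 = capacitance(1e-12, 2e-9)    # 1 um^2 plate, 2 nm separation
    c_half_area = capacitance(0.5e-12, 2e-9)  # halve the area
    c_half_gap = capacitance(1e-12, 1e-9)     # halve the separation

    assert abs(c_half_area - 0.5 * c0) < 1e-20  # proportional to area
    assert abs(c_half_gap - 2.0 * c0) < 1e-20   # inverse in separation
    ```

    This is the point about interpretation: shrinking the in-plane dimension and shrinking the plate separation move the capacitance in opposite directions.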

    Fair enough. I didn't notice that word there the first time. I was obviously referring to the actual formula. I see now that you did say it was for scaling, so that's fine. You may ignore my nitpick. But don't presume to assert what I am or am not aware of please.

    You're talking to someone who researches quantum gravity for a living. You don't need to explain scaling relationships to me. You can afford to be condescending when you actually make a point premised on nuances I hadn't considered.

    That doesn't mean tunneling isn't the physical mechanism for MOSFETs used in CMOS designs. When electrons move through the semiconducting solid they tunnel through the potential barriers centered on the periodic lattice point nuclei. That's how conduction works physically. FETs kill that by lowering the kinetic energies of the electrons by entrapment via Coulomb attraction, but make no mistake...the reason they are 'trapped' and immobilized is because their KE is not high enough to effectively tunnel across lattice potentials. The exact same physical effect is at play there governing the conductance.

    My only point on tunneling in the first place was that as you shrink your system it becomes more important for determining the time it takes for states to change. You seem to prefer arguing about how big a difference this makes, but neither of us have the details to actually do the calculations without knowing how things are doped, what materials are being used for the semiconductors, what voltages are put in place, etc.
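
    The exponential sensitivity of tunneling to system size is easy to illustrate with the standard WKB transmission estimate for a rectangular barrier, T ≈ exp(−2d·√(2mΦ)/ħ). The barrier height and widths below are generic textbook values, not anything specific to this chip:

    ```python
    import math

    HBAR = 1.0546e-34   # J*s, reduced Planck constant
    M_E = 9.109e-31     # kg, electron mass
    EV = 1.602e-19      # J per electron-volt

    def wkb_transmission(width_m, barrier_ev=1.0):
        """Rectangular-barrier WKB estimate: T ~ exp(-2*d*sqrt(2*m*Phi)/hbar)."""
        kappa = math.sqrt(2 * M_E * barrier_ev * EV) / HBAR
        return math.exp(-2 * kappa * width_m)

    # Halving the barrier width raises the transmission by orders of magnitude,
    # which is why tunneling goes from negligible to relevant as features shrink.
    t_2nm = wkb_transmission(2e-9)
    t_1nm = wkb_transmission(1e-9)
    assert t_1nm > 1e4 * t_2nm
    ```

    The point is only the functional form: the rate depends exponentially on a length scale, so shrinking dimensions changes its importance far faster than linearly.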

    I was never claiming that tunneling was the most apparent thing that would affect timings as you shrunk the eSRAM array. I was trying to speak to what could have those effects while eluding engineers, and believe me, quantum mechanical effects almost always elude engineers, even electrical engineers working in large labs at AMD et al. In any case, we can drop that issue if you like.

    As someone else noted from their own experience, my conjecture seems absolutely possible in real-world manufacturing scenarios. It (obviously) fits what I was told, which was vague but not worthlessly so, and it is in line with what MS apparently told devs. I find it hard to believe that MS told devs something was achievable in real-world usage (133GB/s) if they didn't test it out first to make sure. The picture you painted of how DF got the info they did just sounds totally implausible, and it doesn't match what little I was told on the subject personally.
     
  2. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    We dunno the original designs. For all we know it may have been considered a 'remote possibility' from the outset and good yields and other manufacturing characteristics simply made it a fortunate reality for them. I'll again stress that this argument about design is premised on assuming that MS's setup is full spec DDR. Nobody suggested that. So keep that in mind.
     
  3. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    Never too late. You'd be amazed at the learning resources you can find online for these subjects. I taught myself the majority of the physics I know today. Never went to class unless we had to take exams/quizzes. Did that all through my undergrad as well as my grad classes.

    You can learn a LOT with online resources, be it wikipedia (which is unreasonably shit on in academia) or Open Courseware stuff from Yale/MIT, or just youtube searches. I always make sure to include links to much of this stuff when I can on my syllabi because it's been so helpful for me.

    As an instructor it honestly makes me a little nervous about job security over the next decade or two.
     
  4. blakjedi

    Veteran

    Joined:
    Nov 20, 2004
    Messages:
    2,985
    Likes Received:
    88
    Location:
    20001
    I think as STEM explodes over the next decade, even individuals who are self taught at the lowest levels will need live knowledgeable instructors as well as laboratory apparatus to teach them further. Great discussion. I think MS/AMD may have simulated this behavior in software and on test silicon but were probably surprised at the quality and consistency of the yields which allowed them to realistically adopt this capability as a hardware feature.
     
  5. Cjail

    Cjail Fool
    Veteran

    Joined:
    Feb 1, 2013
    Messages:
    2,027
    Likes Received:
    210
    Fascinating discussion but the reality is that MS just painted some stripes on the ESRAM and this is why it's faster.
    It's called quantum striping.

    Allow me to make a joke because the tension in this thread is palpable.
     
  6. Popup

    Newcomer

    Joined:
    Oct 29, 2007
    Messages:
    11
    Likes Received:
    0
    Keep getting reminded of a very noisy picture my 'Secret Source' sent me in May. I cannot personally see anything, but if you look carefully, there does seem to be a highlighted area at the bottom left which may hold some weight here?

    [attached image]
     
  7. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    I tried an xkcd joke to steer the discussion away from quantum mechanics, but instead it turned this place into the Intellectual Edition of MMA. :oops:
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I still don't know how this physics discussion relates to the eSRAM bandwidth issue. I would really suggest again to concentrate on this point because the spoilered stuff below isn't really helpful. [strike]As you obviously want some exchange about it, I answer[/strike] (edit: didn't read your post to the end, I admit, but anyway, now it is written). But this is my last post in this direction.
    Because it is still largely irrelevant for a conducting channel. It's probably more important for subthreshold leakage. Or if you have hopping transport because of localized defects as common in organic semiconductors or something.
    I guess the conduction in metals is also caused by tunneling and not the quasifree motion of electrons in conduction bands, as the same reasoning of the periodic potentials applies there too for the valence electrons and you can draw the same band diagrams. But wait! We don't have just completely filled valence bands, but also conduction bands, and the dispersion relation for carriers in a conduction band basically looks like that for a free particle (if one doesn't look too closely) with an effective mass given by the curvature of the band, describing it pretty well for a rough approximation (just relying on some fading glimpse of knowledge from a lecture more than ten years ago). And yes, I know it is the semiclassical interpretation (but that's usually the one people are able to understand).
    Or in other words, a Bloch wave is usually pretty delocalized. It doesn't have to tunnel from one site to the next, it simply propagates through the crystal almost as a free particle wave would. A Bloch wave is basically a plane wave modified by a periodic function for the local lattice influence. How much importance do you think that has for transport properties? Is tunneling the right framework to describe the propagation of a plane wave through some lattice? Is it just me, or is the usually reduced amplitude of the wavefunction after tunneling through a barrier missing in the Bloch wave solutions?
    And iirc the bandgap in undoped silicon is about 1.1 eV. I can only imagine what happens when one has different dopings and one is basically able to shift the Fermi level locally in the channel by applying about 1 V to the gate. Maybe it enables one to control whether the conduction band is getting flooded with carriers or not, so it basically forms a submicron sized electrical switch between source and drain regions? If the channel is depleted of free carriers, then of course electrons may tunnel through the now formed barrier with a length of a handful of nanometers. But this is just something one would like to get around if it were possible. The functionality isn't based on it. And of course there are massive complications to this picture when one gets into the details. But as I said repeatedly, important things first.

    To get back to your point, it is only conceivable they can use two transfers per clock cycle if it was planned like that from the start. Even the example you mentioned planned for that, by making it flexible and configurable to use either of the two modes. If it is not designed for that operation, it will not work with two transfers per clock cycle.
     
    #108 Gipsel, Aug 4, 2013
    Last edited by a moderator: Aug 4, 2013
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    If astrograd would explain a bit more exactly what he means by "double pumping" (I asked for that), or how it may be not "full spec DDR" and how this is supposed to work, we may be able to proceed. His prior assertions regarding the timing of such a setup didn't sound convincing. My impression is he is either talking about a locally doubled clock (roughly what Intel called double pumping in their P4, but where does this clock come from all of a sudden?) in 15 out of 16 cycles, or two transfers per clock cycle in 15 out of 16 cycles (how does this work if it was not designed for that operation?).
     
  10. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    The discussion was specifically about state switching times. Not the magnitude of current generated by a certain mechanism. Remember, this is about timings and trying to guess how they may have narrowed beyond what MS/AMD engineers would have expected.

    When you do a Fourier sum of Bloch functions to represent the wave packet of an electron it's not. The modulation coefficient to the exponential is itself an exponential.

    That's only if you pretend the ion cores produce infinitely thin delta function potentials so there is no room for the decay to take place. In real semiconductors that's not the case.

    What happens is you gain the left-moving term as you aren't carrying the function out to spatial infinity, since the subsequent ion core hits you first. But again, that's only if your wave function is constructed from assuming a Dirac comb. Real lattices don't have those idealized potentials.

    I don't really care enough to bother with semiconductor physics further either. I think we both can agree that aspects of the materials that affect the timings are in flux during manufacturing tests, yes? Can we also agree that MS/AMD would have been very conservative in their designs/expectations for a 32MB pool of eSRAM?
     
  11. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    I simply mean that during a single pulse you can read (rising edge) and write (falling edge). For that to be possible you need to be able to go from a read to a write operation within the width of that pulse, which sounds like what your bolded part there is talking about. Seems to me that requires being able to tell your eSRAM to go from looking for a rising edge to looking for a falling edge. The time it takes to change those instructions may have somehow shrunk enough to fit within half a clock cycle.

    Of course, if you raise the clock you shrink that pulse width and doing so too much means you encroach upon the time interval required to change from read to write. If you upclock too much, the pulse width becomes too thin to allow for the read to write transition to happen before that falling edge hits.

    This was why I conjectured about it in the first place based on you (I think it was you) noticing the 16/15 factor. It just seems really coincidental for that to be the exact factor they raised by given that it would narrow the pulse width precisely to the point where any further and you lose your ability to read/write on that same pulse.
     
  12. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,679
    Likes Received:
    6,748
    So no double read or double write? Strictly read on one edge and write on the other? This would have to be a capability of the memory controller as well. I don't see how the ESRAM and controller could do this if they weren't explicitly designed to operate on both clock edges.
     
  13. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Okay, it was not my last post about this.
    These effects mostly influence power consumption and maybe reliability (especially when the difference between on and off gets fuzzy at lower operating voltages). For timing at the nominal voltage it's maybe a second order effect. As I said, largely irrelevant.
    The wavepacket of a conduction band electron is also delocalized. Valence electrons are the more localized ones. There is a reason one draws the conduction band usually above the barriers between the lattice sites.
    No. Afaik, the Bloch waves are the general solution for the stationary Schrödinger equation in a periodic potential irrespective of the exact shape of that periodic potential. Another potential just results in a different Bloch function one has to multiply to the plane wave equation to get the Bloch wave. Bloch waves of course neglect a few things, but not that.

    ===============
    While we can certainly agree that simulating the timings before getting silicon back is a daunting task, I would wager the fabs usually have pretty good data on SRAM arrays. The early test wafers during process development often contain a lot of it.
    Sure, one can be conservative and all, but it still doesn't allow switching the operation mode if it was not designed for it. All a large timing margin usually allows is to increase the clocks accordingly (as long as power consumption doesn't become an issue).
     
  14. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    I was told back in May that the eSRAM's bandwidth was much higher from a theoretical pov than what VGleaks had said due to the timings of the eSRAM which weren't expected to turn out as they did. I was also told that the real world bandwidth was over 130GB/s. That's all the info my source had on the cause of the boost. Here is what DF's sources said:

    The math works out such that if you are capable of reading/writing during the same cycle for 7/8 cycles you get the quoted 192GB/s figure.
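
    That arithmetic is worth writing out. The 128-byte-per-cycle interface width is an inference from the 102.4 GB/s baseline bandwidth figure that circulated at the time, not something stated in this thread:

    ```python
    # Checking the quoted 192 GB/s figure (assumed interface width: 128 B/cycle,
    # inferred from the 102.4 GB/s one-way baseline at 800 MHz).
    bytes_per_cycle = 128
    clock_hz = 800e6

    one_way = bytes_per_cycle * clock_hz  # 102.4 GB/s, single transfer per cycle
    # Second transfer (read+write) available on 7 of every 8 cycles:
    dual = one_way * (1 + 7 / 8)

    assert one_way == 102.4e9
    assert dual == 192e9  # matches the quoted theoretical peak
    ```

    So 192 GB/s falls out exactly from a 7-in-8 dual-transfer pattern on top of the baseline figure, which is presumably why that ratio keeps coming up.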

    Since I was told the stated info WAY before DF was, I'm inclined to believe that DF didn't just make it up nor is it at all likely their multiple sources did either (that'd be one helluva coincidence). So here we are.

    I submit that timings are something important as that's what I was told.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,548
    Likes Received:
    4,700
    Location:
    Well within 3d
    They would have discovered something that they provisioned for when they designed the device.
    The design was likely set for quite some time, at least since last year going by tape-out rumors, and its validation tests and debug checks would have started development alongside the simulated and then fabricated hardware.
    Unless the possibility was targeted earlier, this new feature would have been sprung right before mass production without the testing and validation occurring alongside. Instead AMD and Microsoft would have blown months of work on the interface they did design, but not the one they ended up with.

    It is very much a bad thing in this context. The memory pipeline has a series (likely a very large set) of defined state transitions it has to make, and there are conditions set when it must have certain outputs relative to clock signals and active or inactive signal lines.
    A single-pumped interface would accept commands from a client during one cycle, and an operation like a read would return a result in a future cycle.

    I find it difficult to accept that double-pumping as it is commonly described--and DDR memory in particular--can be just wished into this. There would be stages in the pipeline that were built to have a valid state matching a read being performed, and now instead at the end of the valid time window we have the state belonging to a completely different write operation.
    If there is a way for the units to pick up on the falling edge of a clock and to properly arbitrate two accesses concurrently, there is going to be hardware and defined state transitions for them to handle it.


    Misjudging performance by almost 100% is not what I consider a reasonable expectation of the competence of the engineers involved.

    I'm willing to stomach things ranging from slightly goofy developer documentation, to development programmers writing code loops that don't correct for cache and memory coalescing effects, to just bad communication and abuse of terminology. The latter is not too difficult in my mind to accept when dealing with DF or Vgleaks, where we are going by the interpretation of non-technical writers, who in the case of Vgleaks may not even be native speakers.

    I consider the claim that the engineers that built it got things this wrong to be an extraordinary one that requires evidence of the same caliber.

    My point is that memory arrays are accessed after a defined sequence of signals being asserted and deasserted over one or more clock cycles. Deciding to make one part of the memory subsystem perform two operations in the same clock cycle requires that the control signals that feed into it and the signals it sends down the pipeline can also be changed as quickly. If it was never an intended option, then someone should have started asking why all this redundant, unusable, or unstable hardware was being put into place.

    If the memory port transitions faster than the control and arbitration logic, it is changing its outputs and making state transitions in the latter half of the clock cycle based on inputs that are not changing. Why we would expect a valid output from this is something I would want justification for.
    If the hardware is able to change the inputs that fast, or it is able to respond to a separate set of signals, then it is something that was put in place at design time.

    eSRAM, no, but a number of designs have done better. Intel has had high-end caches of over 20MB of SRAM on-die since 2005, and larger on-die caches since late last year. Off-die, 32MB SRAM caches were used by IBM since 2007.
    AMD hasn't had monolithic caches of that size, although it has had cache hierarchies of 16MB on-die at much higher speed with full coherence since 2011--32MB if counting the coherent MCM products.
    It's not undiscovered territory as to how it should function, although making it consistently manufacturable at the sort of yields a low-margin part requires is something AMD has not had a chance to prove itself with.

    I'm not looking to find something unexpected in the space between 16 and 32 MB. The SRAM arrays in those caches don't interact with each other, and the memory pipelines themselves don't need to behave all that differently beyond a few extra signals to select from a slightly larger set of sub-arrays.



    I'm going to ask for clarification on the quantum tunneling theory, not in terms of physics, but in terms of why it should matter at the level of abstraction or the macroscopic level of the interface.
    Quantum tunneling is an effect exploited for flash memory, at the level of the individual floating gates and their oxide layers that measure a few nanometers in width.

    There are quantum effects that are dealt with or mitigated on a transistor or silicon layer basis when evaluating their performance and reliability, but the fabrication and design engineers thus far have been struggling mightily to keep it out of the protocol and state layers of abstraction, which is where the SDR or DDR mode of operation would be recognized.
    What happens in the span of a transistor's gate stack is something where we have perhaps tens of atomic layers, but the logic layers and state transitions govern a macroscopic set of pipeline stages, SRAM arrays, and wires that extend multiple millimeters.

    True quantum effects that dominate the actual function (turn on or off) of transistors or logic instead of their performance (the slope of the voltage/current curve) are not expected until below 7 or 5 nanometers, which is also where many observers expect silicon scaling to just throw up its hands.
    This leaves current silicon an order of magnitude higher than that point, and the actual units and hardware in Durango another one or two beyond that where quantum tunneling is meant to be a significant factor or scaling killer.
     
    #115 3dilettante, Aug 5, 2013
    Last edited by a moderator: Aug 5, 2013
  16. aaaaa00

    Regular

    Joined:
    Jul 24, 2002
    Messages:
    790
    Likes Received:
    23
    The controller and device were capable of being configured to sample on just the up portion of the clock, just the down portion of the clock, or both by setting a few hardware registers.

    Note that for the thing I was working on, this was designed in from the start.

    However, it doesn't actually take that many more transistors to make something that can be mode switched, or have adjustable clock dividers, etc, so you can get maximum manufacturing and deployment flexibility.

    Also, if you're licensing a macro from someone else for your design, it's possible they already threw in a lot of this stuff for you, and you just end up qualifying what clocks and voltages you can run at, and which features you can turn on and which you have to disable given the parameters of your design and the electrical characteristics of your layout, the actual manufacturing quality of what you get back from the factory, etc.
     
    #116 aaaaa00, Aug 5, 2013
    Last edited by a moderator: Aug 5, 2013
  17. Renegade_43

    Newcomer

    Joined:
    Jul 23, 2005
    Messages:
    36
    Likes Received:
    10
    Here is a fairly recent patent about SRAM double pumping. There are several other articles listed along with it.

    It describes making SRAM with double pumping ability without requiring either much more die space for more ports or higher internal clocks. It seems to make sense to have this functional possibility built in if it doesn't have a large additional cost.


    http://www.google.com/patents/US7643330#forward-citations

    "In sum, a two-port SRAM design is presented with an associated die area comparable to a one-port SRAM. To achieve area efficiency, the read and write ports are restricted to mutually synchronous operation, which represents the common usage model for many applications. By restricting both ports of the SRAM to synchronous operation, a dual-pump timing model can be introduced, whereby one pre-charge cycle may be eliminated. By eliminating one pre-charge cycle and allocating one read and one write time slot within each clock cycle, the SRAM design can provide the functionality of two access ports that operate in an edge-triggered clocking regime.

    While the forgoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software."
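
    A toy cycle-level model of the time-slot scheme the patent describes may make the idea concrete. This is an illustrative sketch of the concept (shared precharge, then a read slot and a write slot on the same lines within one clock), not the patented circuit itself:

    ```python
    # Toy model of a dual-pumped single-port SRAM array (illustrative only).
    # Each clock cycle is split into: shared precharge | read slot | write slot,
    # so one physical port behaves like separate read and write ports.

    class DualPumpedSRAM:
        def __init__(self, words=8):
            self.data = [0] * words

        def cycle(self, read_addr=None, write_addr=None, write_val=None):
            """One clock cycle: the read slot comes before the write slot,
            so a same-cycle read returns the pre-write value."""
            # (precharge modeled implicitly at the start of the cycle)
            result = self.data[read_addr] if read_addr is not None else None
            if write_addr is not None:
                self.data[write_addr] = write_val
            return result

    ram = DualPumpedSRAM()
    ram.cycle(write_addr=3, write_val=42)  # write-only cycle
    # Read and write in the same cycle; the read sees the old value:
    assert ram.cycle(read_addr=3, write_addr=3, write_val=7) == 42
    assert ram.cycle(read_addr=3) == 7
    ```

    The behavioral point matches the quoted summary: one read and one write time slot per clock gives two-port functionality without two-port cell area.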
     
  18. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Some stuff of course needs to toggle faster (or needs to be added in the first place) for this to work. Maybe it's the most basic version, no idea. I think there are multiple patents on dual/multiported SRAM using slightly different multiplexing techniques to access the data on SRAM arrays with fewer ports than are exposed to the outside.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,548
    Likes Received:
    4,700
    Location:
    Well within 3d
    SRAM that can be read and written at the same clock is, as mentioned in the patent, frequently done by adding transistors to the cells and adding area so that there are separate ports to handle the operations.

    For a design with density rated a top priority, like a GPU, that is prohibitive.
    The patent, from my parsing, performs the read and write over the same wires by assigning different phases of the clock to each operation, with the read and write sharing the single precharge operation per clock as opposed to the read getting its own precharge on its own lines and the write getting its own.
    There's a control unit and IO which is shown explicitly taking the necessary inputs and changing the select signal at double the reference clock in order to get sensible output from the arrays.

    I'm not sure of the full range of tradeoffs, apart from some controller complexity.
    One tradeoff is that the clock cycle is now longer.
    It must contain a common precharge, a read, the wait period between the read and write, and a write in a single cycle.
    A true dual-ported SRAM would have two precharges and the read and write in parallel.
    I'm not sure if there's extra time required to make sure that the write has time to stabilize on lines that are not fully precharged thanks to the read that came before. I haven't gone through the legalese to see where this is addressed.
    There are some general problems in the smaller nodes with noise margins and variation that might be affected if the prior read can nudge voltages away from the idealized precharge levels (which tuning circuits would be calibrated to), although that can be compensated for by extra time or making the write signal stronger so it can swamp the effect faster.

    For a GPU, the relaxed latency requirements and larger amount of register capacity make this more acceptable than they would in a CPU.
    Given the owner of this patent, I expect Durango wouldn't be using this particular design. ;)
     
  20. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    What could explain its performance being so finicky? The theory of having 8 banks with one in conflict makes a lot of sense (88%), but why is it so far from that in the real world?
    Is that 33% typical for bidirectional ports, because the usage pattern isn't very symmetrical with real code?
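
    Putting rough numbers on that question, using figures from earlier in the thread (133 GB/s real-world, the 7-of-8 dual-transfer pattern) and assuming the 16/15 upclock to ~853 MHz on a 128-byte-per-cycle interface:

    ```python
    # Rough efficiency check (all inputs are figures discussed in the thread,
    # plus an assumed 128 B/cycle interface width).
    peak_one_way = 128 * (800e6 * 16 / 15)  # ~109.2 GB/s at ~853 MHz
    peak_dual = peak_one_way * 15 / 8       # ~204.8 GB/s theoretical peak
    real_world = 133e9                      # figure reportedly quoted to devs

    bank_model = 7 / 8                      # one of eight banks in conflict
    observed = real_world / peak_dual       # ~0.65 of theoretical peak

    assert abs(bank_model - 0.875) < 1e-12
    assert 0.6 < observed < 0.7
    ```

    If these assumptions hold, the observed efficiency sits well below the simple one-bank-conflict model, which is exactly the gap the question is about.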
     