They are clearly doing read + write during some cycles. Based on what else we know, it can't be on all of them. I'm open to alternatives, but the end result, whatever the mechanism, is very obviously in line with the DF article which so many here and elsewhere dismissed out of hand.
I'm afraid I can't speak for other posters' opinions.
There were plenty of people trying to reconcile a number of points from a very imprecisely worded article, written by people whose testing methods were not disclosed.
The insinuation that the people building the memory subsystem could have missed something this significant is in itself a claim of significant import, which the article did nothing to explain.
If there was uncertainty about which latency numbers would be finalized, or bugs that needed a new stepping to fix, or a late firmware decision to change a control register so that reads and writes could issue in the same cycle, those further removed from the manufacturing and design teams wouldn't see it until late in the process.
The software teams that are coding apps or benchmarking the system are likely more numerous, and they sit outside the smaller teams with a more detailed understanding of the hardware.
Their greater number increases the chance that you'd find one in contact with DF, and the reduced exposure to the subsystem details would make them more likely to be surprised.
Unrealistic example of a non-double-pumped storage array that relies on heavy banking to allow a read and a write in the same cycle:
eSRAM is broken into rows in a "horizontal" orientation.
Those horizontal blocks are subdivided into vertical banks.
There are two semi-linked paths in the eSRAM pipeline: one for reads and one for writes.
If all conditions permit, both paths can execute an operation in the same cycle; otherwise only one does.
Parameters:
Interpret Microsoft's slide of 256 read&write as there being 256 bits of bandwidth in each direction.
16 banks, each 256b wide.
Optimized for access patterns that don't frequently overwrite themselves.
Rules for dual issue:
No simultaneous hits to the same bank on the same line (the pipeline can wake up another line for the same bank, but with a penalty that drops bandwidth below 7/8).
Varying latencies for different operation sequences; if a latency isn't met, issue is delayed for that pipe. (The numbers are chosen for simplicity.)
Borrows the theme from the patent, where operations can share setup work--some more than others.
One constraint is that the ideal pattern must allow access to all banks for reads and writes. It may be possible with different parameters to create high-bandwidth patterns that isolate reads and writes to specific banks.
Earliest cycle an access to the same bank can issue, relative to a prior access at cycle n:
Read after read: n+16
Write after read: n+2
Read after write: n+15
Write after write: n+16
Cycle#
Wn = Write to bank n
Rn = Read to bank n
The bank numbers don't matter beyond denoting when conflicts can arise.
Code:
0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2
W0 W1 W2 W3 W4 W5 W6 W7 W8 W9 WA WB WC WD WE WF W0 W1 W2 W3 W4 W5 W6 W7 W8 W9 WA WB WC WD WE WF W0
R2 R3 R4 R5 R6 R7 R8 R9 RA RB RC RD RE RF R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 RA RB RC RD RE RF R0 R1 R2
edit: (fixed some of the following text)
Two cycles out of every 16 single-issue, which comes to 30/32 of the potential operations.
The wait states in question are measured in cycles and could be tweaked with control registers, giving more or less peak bandwidth and more or less sensitivity to non-ideal patterns.
The difficulty for code trying to match this pattern is that any accesses that don't fit the profile will lead to single-issue cycles. If concurrent programs are working separately, each can throw off the pattern, leaving bandwidth bouncing somewhere between the peak and the minimum.
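For anyone who wants to poke at the rules, below is a minimal sketch of the scheduling model above, in Python. The bank count, the minimum gaps, and the in-order write/read streams are taken from the example; the retry behavior, the tie-break when a read and a write hit the same bank, and the decision to ignore the wake-another-line case are my own simplifying assumptions, not a description of any real hardware.
Code:
# Toy model of the banking example above -- not real hardware.
# Banks, minimum gaps, and the two in-order request streams come from the
# post; the retry and tie-break behavior are my own assumptions.

NUM_BANKS = 16

# Minimum cycle gap before a new access to the SAME bank may issue,
# keyed by (previous_op, next_op).
MIN_GAP = {
    ("R", "R"): 16,  # read after read
    ("R", "W"): 2,   # write after read
    ("W", "R"): 15,  # read after write
    ("W", "W"): 16,  # write after write
}

def can_issue(op, bank, cycle, last_access):
    """Check the same-bank gap rules for issuing `op` to `bank` at `cycle`."""
    for prev_op in ("R", "W"):
        prev_cycle = last_access[prev_op][bank]
        if prev_cycle is not None and cycle - prev_cycle < MIN_GAP[(prev_op, op)]:
            return False
    return True

def simulate(cycles=1024):
    last_access = {"R": [None] * NUM_BANKS, "W": [None] * NUM_BANKS}
    next_write, next_read = 0, 2          # streams from the diagram: W0..., R2...
    dual = single = idle = 0
    for cycle in range(cycles):
        w_bank, r_bank = next_write % NUM_BANKS, next_read % NUM_BANKS
        w_ok = can_issue("W", w_bank, cycle, last_access)
        r_ok = can_issue("R", r_bank, cycle, last_access)
        if w_ok and r_ok and w_bank == r_bank:
            r_ok = False                  # no read+write to the same bank in one cycle
        if w_ok:
            last_access["W"][w_bank] = cycle
            next_write += 1
        if r_ok:
            last_access["R"][r_bank] = cycle
            next_read += 1
        issued = int(w_ok) + int(r_ok)
        dual += issued == 2
        single += issued == 1
        idle += issued == 0
    print(f"dual={dual} single={single} idle={idle} "
          f"-> {2 * dual + single}/{2 * cycles} potential ops issued")

simulate()
Under these assumptions the printed rate lands in the same neighbourhood as the 30/32 figure above (the exact number depends on how blocked requests are retried), and perturbing the streams or the gap values shows how quickly it degrades.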
Errr...what length scale are you considering "macroscopic" here? Don't confuse 'macroscopic' with 'quantum effects'. Not at all the same thing.
I agree that it's not the same thing, as the quantum effects are either drowned out or become part of larger electrical or thermal trends. Those trends themselves are a layer or two below the functions in question.
The eSRAM will likely occupy tens to a hundred or so mm². It will have components stretching millimeters in length, and it involves disparate and complex structures with billions of atoms in a thermally and electromagnetically noisy environment where behaviors are accounted for in classical terms.
And there is no leap in the narrative. It's openly speculating how the 'surprise' aspect might have occurred. Nothing more. There's no "narrative" to speak of there.
There is a chain of events and a process for designing and manufacturing a device like this.
The companies, engineers, and design teams involved in this have a very strong interest in characterizing their respective chemical, electrical, and digital product to a very high degree.
Saying that quantum tunneling can cause a double-pumped bus implies a lot about those teams and what happened in that process.
^^^Seems pretty clear to me that they are suggesting that during the final production runs they were seeing things they didn't expect relative to the initial production runs.
I brought up multiple things that can change during the trial silicon phase for the chip: physical manufacturing refinements like process tweaks, bug fixes, low-level setting changes, and potentially base layer changes.
The quoted text said something was observed in the near-final silicon, not what changed to bring it about.
Only 1 of those matches the actual figures (bolded for emphasis). The rest do not.
You said the math of the upclock was too close to be a coincidence. I thought you were referring to how the upclock took the almost-double peak of the 800 MHz design to a final peak equal to 2x of 128 B x 0.8 GHz.
I gave four other combinations where the math works the same, all linearly close clock increments that were frequently used for CPUs back when they were in that speed range.
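For reference, here is the arithmetic as I read it, using the commonly reported 800 MHz base clock and 853 MHz upclock; treat it as illustration, not as a claim about how the official figure was actually derived.
Code:
# The arithmetic behind the "too close to be a coincidence" reading, using
# the commonly reported clocks. Pure multiplication, nothing more.
bytes_per_cycle = 128                      # bytes per direction per cycle
base_clock, final_clock = 0.8e9, 0.853e9   # Hz (800 MHz, and the reported upclock)

one_way_at_800 = bytes_per_cycle * base_clock        # 102.4 GB/s
doubled_at_800 = 2 * one_way_at_800                  # 204.8 GB/s
doubled_at_853 = 2 * bytes_per_cycle * final_clock   # ~218.4 GB/s

# The "almost-double" fraction needed for the upclocked peak to land on
# exactly 2 x 128 B x 0.8 GHz:
print(doubled_at_800 / 1e9, doubled_at_853 / 1e9, round(doubled_at_800 / doubled_at_853, 3))
The fraction comes out around 0.94, and as noted, plenty of nearby clock pairings produce equally tidy-looking numbers.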
If they aren't timed perfectly you can get the nth cycle to be off and incapable of read + write.
The array accesses are synced to the clock signals. They don't start or end without the clock's transition, so what is off?
For instance, there is a finite amount of time required for these elements to be prepped for the upcoming operation, e.g. the time it takes to swap from a read state to a write state. If that is longer than half a pulse, then over n cycles you accumulate a shift, and the states can drift enough that an element sitting in a read state (for instance) sees a falling edge hit it first.
That would cause it to miss its opportunity for that read state to actually perform the corresponding read operation. Instead it would have to wait until the cycle completes and the next rising edge comes around.
Unless the hardware is designed for it, the falling edge isn't critical beyond the fact that it needs to happen for there to be another rising edge.
The prep work should also be synchronous, so it would be interesting to see how this would work--or be considered working.
The more likely physical result of this state of affairs is that the array performs the same transitions when prompted by the clock as every other cycle, but will read invalid data.
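To make that concrete, here is a toy model, entirely my own construction and not the eSRAM: a clocked element whose prep logic needs some settle time. The edges keep firing on schedule regardless; a settle time longer than the period doesn't produce a drifting phase, it produces latched data that isn't valid.
Code:
# Toy synchronous element (my own construction, not the eSRAM).
# The rising edge fires at fixed intervals no matter how slow the prep
# logic in front of the latch is; a timing failure shows up as invalid
# data, not as an accumulating phase shift.
CLOCK_PERIOD_PS = 1250   # ~800 MHz, illustration only
SETTLE_TIME_PS = 1400    # deliberately slower than one period

def run(cycles):
    results = []
    for cycle in range(cycles):
        edge_time = cycle * CLOCK_PERIOD_PS          # edges never slip
        data_ready = SETTLE_TIME_PS <= CLOCK_PERIOD_PS
        results.append((edge_time, "valid" if data_ready else "INVALID"))
    return results

print(run(4))   # edges at 0, 1250, 2500, 3750 ps -- all latching bad data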
This was what made me initially speculate openly about size requirements for quantum effects, as quantum mechanics is what governs the transition times to change states.
The analog behavior of the circuits governs those times. Quantum effects manifest in aggregate as the material and device properties of the components of those circuits.
There are certain phenomena that are manifestations of quantum effects; tunneling, for example, can turn up as a leakage mechanism. Improved SRAM performance comes from having as little of it happen as possible, not more.