Which parts of the document for a discrete two-port SRAM component show that you can read and write data over the same bus in the same clock, doubling the design's bandwidth?
(edit: "discrete" is the wrong term; "isolated" would be a better one)
From what I've skimmed, the peak bandwidth is what you get with two separate ports.
It looks more like a dual-ported SRAM with specifically outlined cases where the inputs to the two ports conflict. The interface and control logic would sit on the other side of all the control and data lines, and for an on-die version I would expect the pipeline logic to be smart enough to avoid the corner cases, especially since some read-write conflicts lead to unknown or old data being read back.
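To make the conflict case concrete, here is a toy model of what I mean by a dual-ported SRAM. It is only a sketch under my own assumptions: the class name `DualPortSram`, the `clock` method, and the choice to return the old contents on a same-address read/write are all hypothetical, not taken from the document in question. A real part might return undefined data in that case instead.

```python
class DualPortSram:
    """Toy two-port SRAM: a read port and a write port, each with its own bus."""

    def __init__(self, size):
        self.cells = [0] * size

    def clock(self, read_addr=None, write_addr=None, write_data=None):
        """Advance one clock; the read port and write port may both be active.

        Returns (read_value, conflict). On a same-address read and write,
        this model returns the OLD contents and flags the conflict.
        That behaviour is an assumption; real parts may return unknown data.
        """
        conflict = read_addr is not None and read_addr == write_addr
        # Sample the read port before the write lands, so a conflict sees old data.
        read_value = self.cells[read_addr] if read_addr is not None else None
        if write_addr is not None:
            self.cells[write_addr] = write_data
        return read_value, conflict


sram = DualPortSram(size=8)
sram.clock(write_addr=3, write_data=0xAB)

# Different addresses: read and write complete in the same clock, no conflict.
value, conflict = sram.clock(read_addr=3, write_addr=5, write_data=0xCD)

# Same address on both ports: the read returns stale data and is flagged.
stale, clash = sram.clock(read_addr=5, write_addr=5, write_data=0xEE)
```

The point is that the simultaneous read and write only works because there are two ports with two sets of data lines. Nothing here shares a single bus in a single clock.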
edit:
To summarize, I would like some explanation of why this should be considered relevant, or what argument it supports.
I haven't been able to find anything that shows a memory that can read and write in the same clock on the same bus. Everything with simultaneous reads and writes has two data busses. That idea of reading and writing in the same clock on the same bus is a dead end, as far as I can tell.
I keep going back to this quote from Digital Foundry and I can't make sense of it.
... Now that close-to-final silicon is available, Microsoft has revised its own figures upwards significantly, telling developers that 192GB/s is now theoretically possible.
Well, according to sources who have been briefed by Microsoft, the original bandwidth claim derives from a pretty basic calculation - 128 bytes per block multiplied by the GPU speed of 800MHz offers up the previous max throughput of 102.4GB/s. It's believed that this calculation remains true for separate read/write operations from and to the ESRAM. However, with near-final production silicon, Microsoft techs have found that the hardware is capable of reading and writing simultaneously. Apparently, there are spare processing cycle "holes" that can be utilised for additional operations. Theoretical peak performance is one thing, but in real-life scenarios it's believed that 133GB/s throughput has been achieved with alpha transparency blending operations (FP16 x4).
The only thing that makes sense for "simultaneous" reads and writes is two busses. It can't be that they enabled DDR that wasn't previously working, because that would double bandwidth for reading or writing, and this says those individual operations remain at 102.4 GB/s. Oh well. I guess there's really nowhere for this conversation to go until someone leaks a clarification.
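For reference, the arithmetic in that quote can be checked with a few lines of Python. This is only a sketch: the `ports` parameter is my own stand-in for "two independent data busses" and is not something the quote confirms, and the function name is made up for illustration.

```python
BYTES_PER_CLOCK = 128        # "128 bytes per block" from the quote
CLOCK_HZ = 800_000_000       # 800 MHz GPU clock from the quote


def peak_gb_per_s(bytes_per_clock, clock_hz, ports=1):
    """Peak throughput in GB/s, assuming every port moves a full block each clock."""
    return bytes_per_clock * clock_hz * ports / 1e9


one_way = peak_gb_per_s(BYTES_PER_CLOCK, CLOCK_HZ)            # the quoted 102.4 GB/s
two_bus = peak_gb_per_s(BYTES_PER_CLOCK, CLOCK_HZ, ports=2)   # hypothetical read bus + write bus

# How the revised figures compare with the one-way number.
ratio_claimed = 192 / one_way    # roughly 1.875x
ratio_measured = 133 / one_way   # roughly 1.3x

print(f"one way: {one_way:.1f} GB/s, two busses: {two_bus:.1f} GB/s")
print(f"192 GB/s is {ratio_claimed:.3f}x, 133 GB/s is {ratio_measured:.2f}x")
```

Two fully independent busses would come out at 204.8 GB/s, so the quoted 192 GB/s is a little short of a clean doubling. I don't know what accounts for that gap.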