I made note of this in the Xbox SDK thread, as there was a bit of documentation for vector memory operations that gave a few latency numbers (seemingly in GPU cycles).
A vector L1 miss can take 50+ clock cycles to service, an L2 miss that goes to DRAM can take 200+ cycles, and an L2 miss that goes to the ESRAM can take 75+ cycles.
I am interpreting the cycle counts as being per individual miss event, meaning the L1 miss latency would be additive with whatever secondary miss is encountered in the next level of the hierarchy. This seems more consistent than expecting the ESRAM to be twice as fast as the L1-L2 request latency, which would imply bypassing a significant portion of the GPU memory pipeline when going to ESRAM.
This seems to put a miss to DRAM at 250+ cycles and a miss to ESRAM at 125+ cycles, or roughly half the latency. In a different portion of the documentation, there is an expectation of a texture access realistically taking more than 100 cycles and possibly around 400 if there is a miss, so the "+" on those numbers may be very significant.
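To make that additive reading concrete, here is a quick back-of-envelope sketch in Python using only the minimum figures quoted above (the "+" means these are floors, not typical values):

```python
# Additive reading of the SDK's GPU-side latency figures (all GPU cycles).
# These are documented minimums; real texture accesses apparently run past
# 100 cycles and up toward ~400 on a miss.
L1_MISS  = 50   # vector L1 miss service time
L2_DRAM  = 200  # L2 miss serviced from DRAM
L2_ESRAM = 75   # L2 miss serviced from ESRAM

dram_total  = L1_MISS + L2_DRAM    # 250+ GPU cycles to DRAM
esram_total = L1_MISS + L2_ESRAM   # 125+ GPU cycles to ESRAM

print(f"miss to DRAM : {dram_total}+ cycles")
print(f"miss to ESRAM: {esram_total}+ cycles (~{esram_total / dram_total:.0%} of the DRAM path)")
```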
For a somewhat weak comparison, this other post contains latency values for the CPU memory subsystem:
https://forum.beyond3d.com/threads/...-news-and-rumours.53602/page-208#post-1700821
In short:
Orbis L1 to L2 to DRAM: 3 to 26 (local L2) / 190 (remote) to 220+ cycles.
Durango L1 to L2 to DRAM: 3 to 17 (local) / 120 (remote)* to 140-160 cycles.
*Reviewing the SDK seems to clarify some numbers that may have made my earlier interpretation pessimistic: a remote L2 hit that misses in the remote L1s is 100 cycles, and if the data is in a remote L1 there is an additional hop.
This leaves more cycles between remote L2 servicing and DRAM servicing: with the full miss at 140-160 cycles and the remote check around 100, roughly 40-60 cycles are taken up by the DRAM access itself. (A similar shift for Orbis would give a similar DRAM cycle contribution.)
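A sketch of that Durango-side arithmetic, with all values in CPU cycles; the remote-L2 figure is the ~100 cycles noted above, and the DRAM portion is simply the remainder:

```python
# Rough decomposition of the Durango CPU hierarchy (all CPU cycles).
L1_HIT        = 3            # local L1 hit
L2_LOCAL_HIT  = 17           # local L2 hit
L2_REMOTE_HIT = 100          # line found via the other module's L2
DRAM_MISS     = (140, 160)   # full miss that goes out to DRAM

# What is left after the remote check is attributed to the DRAM access itself.
dram_portion = tuple(total - L2_REMOTE_HIT for total in DRAM_MISS)

print(f"L1 hit            : {L1_HIT} cycles")
print(f"local L2 hit      : {L2_LOCAL_HIT} cycles")
print(f"remote L2 hit     : {L2_REMOTE_HIT} cycles")
print(f"DRAM access itself: ~{dram_portion[0]}-{dram_portion[1]} cycles")
```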
Since these figures are referenced in CPU cycles, and the CPU clock is roughly twice the GPU clock, each of these cycles is about half as long as the cycles used for the GPU latencies.
Keeping with Durango, a CPU miss to DRAM would take ~80 GPU cycles, or roughly the same latency seen for a GPU L2 miss serviced from the ESRAM. Since the ESRAM is not in the CPU's domain, any data the CPU needs from it would take longer to get than going to DRAM.
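A minimal conversion sketch, assuming the ~2:1 CPU:GPU clock ratio above (the 75-cycle figure is the GPU's L2-to-ESRAM service time from the SDK numbers):

```python
# Put the Durango CPU DRAM miss and the GPU's L2-to-ESRAM miss on the same
# time base by assuming the CPU clock runs at ~2x the GPU clock.
CPU_CYCLES_PER_GPU_CYCLE = 2
cpu_dram_miss   = (140, 160)   # CPU miss to DRAM, in CPU cycles
gpu_l2_to_esram = 75           # GPU L2 miss serviced from ESRAM, in GPU cycles

lo, hi = (c / CPU_CYCLES_PER_GPU_CYCLE for c in cpu_dram_miss)
print(f"CPU miss to DRAM   : ~{lo:.0f}-{hi:.0f} GPU cycles")
print(f"GPU L2 miss, ESRAM : {gpu_l2_to_esram}+ GPU cycles")
```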
For the GPU, the ESRAM is markedly faster. The more uniform access of the ESRAM and its on-die connections likely mean that hundreds of cycles bound up in queuing and bus traversal can be trimmed down, although at 125 GPU cycles (250 CPU cycles) the ESRAM path would still be glacial relative to what the CPUs would prefer.
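The same clock-ratio assumption, run the other way, shows why even the faster path looks slow from the CPU's side:

```python
# Express the GPU's full ESRAM path in CPU-length cycles and compare it to
# the CPU's own DRAM miss, again assuming a ~2:1 CPU:GPU clock ratio.
CPU_CYCLES_PER_GPU_CYCLE = 2
gpu_esram_path = 125           # GPU L1 miss + L2 miss to ESRAM, GPU cycles
cpu_dram_miss  = (140, 160)    # Durango CPU miss to DRAM, CPU cycles

print(f"GPU path to ESRAM: {gpu_esram_path * CPU_CYCLES_PER_GPU_CYCLE}+ CPU cycles")
print(f"CPU path to DRAM : {cpu_dram_miss[0]}-{cpu_dram_miss[1]} CPU cycles")
```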
It is not clear what numbers would be seen for the ROP path, which may benefit more readily from the 125 or so cycles saved.