Any idea what the penalty is for switching from a read to a write?
The number varies. In this case, there can be latencies inherent to the device or the bus between the controller and device.
The model in question can have different numbers, and it can have programmable settings that can change the math as well.
http://www.hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf
The write-to-read delay is 5 nanoseconds of internal turnaround time; at 5.5 Gbps, that works out to roughly 7 command clocks.
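As a rough sanity check (assuming the command clock is a quarter of the per-pin data rate, so 1.375 GHz at 5.5 Gbps), the conversion looks something like this:

```python
# Rough sanity check of the write-to-read turnaround quoted above.
# Assumption: 4 data bits per pin per command clock (DDR bus at twice
# the command clock), so 5.5 Gbps implies a 1.375 GHz command clock.
data_rate_gbps = 5.5
command_clock_ghz = data_rate_gbps / 4
tck_ns = 1.0 / command_clock_ghz            # ~0.727 ns per command clock
write_to_read_ns = 5.0                      # internal turnaround from the datasheet
penalty_clocks = write_to_read_ns / tck_ns  # ~6.9, i.e. roughly 7 clocks
print(f"tCK = {tck_ns:.3f} ns, write-to-read penalty = {penalty_clocks:.1f} command clocks")
```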
For a switch from reads to writes:
[CLmrs + (BL/4) + 2 - WLmrs] * tCK
CAS latency (CLmrs) can range from 5 to 20 cycles, burst length (BL) is 8, and write latency (WLmrs) can range from 4 to 7 cycles.
Middle-of-the-road CL numbers put the penalty roughly on par with the write-to-read penalty, although fiddling with the programmable settings can make things better or worse.
Since the data clock runs at twice the command clock and the data bus is DDR, each command clock carries four data transfers, so roughly 7 lost clocks means 28 data transfers that don't happen every time the bus changes direction.
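To make that arithmetic concrete, here's a small sketch plugging one plausible CL/WL pairing from those ranges into the formula; the specific values are illustrative, not the console's actual programmed settings:

```python
# Read-to-write turnaround: [CLmrs + (BL/4) + 2 - WLmrs] command clocks.
def read_to_write_penalty_clocks(cl, bl, wl):
    return cl + bl // 4 + 2 - wl

tck_ns = 1.0 / (5.5 / 4)        # command clock period at 5.5 Gbps (~0.727 ns)
cl, bl, wl = 8, 8, 5            # illustrative settings picked from the stated ranges
clocks = read_to_write_penalty_clocks(cl, bl, wl)   # = 7 with these numbers
print(f"read-to-write penalty = {clocks} command clocks = {clocks * tck_ns:.1f} ns")

# Four data transfers ride on each command clock (2x data clock, DDR bus),
# so ~7 lost command clocks is ~28 lost transfer slots per mode switch.
print(f"lost transfer slots = {clocks * 4}")
```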
Beyond that, the PDF shows a massive number of variables, restrictions, and options that even a merely functional memory controller must manage, never mind what would go into a good one.
So would it be fair to say that you still need a 50/50 split of reads and writes to achieve maximum bandwidth utilization of the XB1 eSRAM, but that in such a use case GDDR5 would likely be less efficient because of the switch penalties you mentioned?
The inferred ideal is 50/50 for the eSRAM, and that does seem like the most straightforward interpretation. The optimal mix of type and locality hasn't actually been disclosed, however.
GDDR5 can do well enough if the mix is heavily in favor of one operation or the other, with the additional caveat that many systems can give an edge to read performance, because automatic functions like hardware prefetching work for reads but can't help with writes.
It also does best with accesses that don't jump around to closed banks. The eSRAM's situation with regard to where accesses hit relative to each other isn't known, although it sounds like it should have an easier time with it.
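To illustrate why the mix matters, here's a toy model (my own simplification, not from any disclosed spec) of how direction switches eat into GDDR5 bus utilization as same-direction runs get shorter. It assumes each BL=8 burst occupies 2 command clocks and each read/write turnaround costs roughly the 7 clocks worked out above:

```python
# Toy GDDR5 utilization model: run_length back-to-back bursts in one
# direction, then a turnaround penalty before the bus changes direction.
def gddr5_utilization(run_length, burst_clocks=2, switch_penalty_clocks=7):
    useful = run_length * burst_clocks
    return useful / (useful + switch_penalty_clocks)

for run_length in (1, 4, 16, 64):
    print(f"{run_length:3d} bursts per direction -> "
          f"{gddr5_utilization(run_length):.0%} of peak bus utilization")
# Longer same-direction runs push GDDR5 toward its peak, which is why a
# heavy skew toward reads or writes suits it, while the eSRAM's inferred
# ideal is a mixed read/write stream.
```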
If that's correct (please blast my theory out of the water if it's not), would you care to venture a high-level estimate of how a typical game workload might best fit either model (I understand there would be a mix at a low level), or could it vary greatly depending on the game engine?
I'm not sure what would pass for typical for modern games, and I haven't run across analysis of that sort for games.
I am intrigued by the prospect of less regular accesses being sped up by the eSRAM. I wouldn't say irregular accesses because it does seem to have a decently coarse granularity, going by the width of its ports and its apparent alignment in bandwidth to the ROPs.
The Edge article hinted at certain workloads that had higher write needs and less ALU and texture work available to compensate.
Going a little far afield, I'm curious if measurements of performance will show that the eSRAM can speed up the setup process for intermediate buffers and render targets. Comparisons between various forms of deferred lighting or shading show that there is overhead for tile setup that is eventually swamped by the total amount of work in larger data sets.
That could open up the possibility of more buffers, made smaller and used in more complicated ways because their setup overheads are lower.
On the other hand, once the setup is complete and the goal is to power through a lot of generated data, the other memory setup could have an edge.
Different phases of the same graphics pipeline could favor one setup over the other, if the programmer is able to delve deep enough to uncover the differences.