So if you are doing all reads or all writes and have a nice stream of addresses you can keep the data bus busy >90-95% of the time. Start mixing writes and reads and the % drops. Use an address stream which isn't as friendly and you drop even further. Make less than optimal arbitration decisions in order to give lower latency to CPU fetches and you drop even more.
Thanks, and great job on the write-up!
Correct as far as I can see, and with better judged presentation and length than I would be capable of.
It was needed to balance your first post, which created an impression of single-cycle access and 100% bus utilization. (For less experienced readers, there's one heck of a lot more to be said - the last two sentences of BobbleHead's post above are easily textbook/PhD thesis material.)
What we haven't touched much yet is - what about code that doesn't just stream large contiguous chunks of data? Also, whether data can always, or even typically, be laid out optimally for streaming from a data structure point of view, never mind hardware issues such as page boundaries.
I'll introduce the problem of the first issue I raised above. With a 128-bit bus using 8-deep bursts, accessing a single word causes 1024 bits to be transferred over the bus. (Let's ignore burst chop.) Now, if all you were interested in was that particular word (say, a 32-bit pointer), then 31/32, or 97%, of your effective read bandwidth was spent transmitting junk data. Which, adding insult to injury, evicts possibly relevant data in your cache hierarchy.
As an introduction to my second issue, consider a simple three-dimensional matrix. If the data is laid out as {x1,x2,...,xn,y1,y2,...,yn,z1,z2,...,zn} you will get drastically different bus utilization depending on which axis you traverse it along, and also on whether you only want a single coordinate value, or the whole (x1,y1,z1) triad. Alternatively, the data could be laid out as {x1,y1,z1, x2,y2,z2,...,xn,yn,zn}, which would provide yet another set of bus utilization numbers, again dependent on just how you traverse the matrix. And if even a simple matrix is tricky, how about more complex data structures? And what if they are not organized linearly, but as, for instance, trees?
I'm sorry that I can't dig deeper into this -- I'm a fairly slow typist, and I'm strapped for time. But in my experience from performance computing, data flow is THE main issue. And as soon as you move away from the very simplistic cases, it gets really messy really quickly. As with multiprocessing, I think it would be good if the people here who aren't active programmers gained an understanding that what we are dealing with are really thorny and complex problems, which as often as not simply don't have optimal solutions.
(For Exophase, since I know you have an interest in mobile computing, take a look
here at the compiler dependence of even such a simple benchmark as STREAM, even using a single compiler (and version). "So depending on gcc optimization options, we get some nice semi-random benchmark numbers." Moral: It's not trivial.)