Matrix multiplication and linear transformations against fixed matrices that stay resident in registers scale with FLOPs, not bandwidth.
Well, yes, but read what you wrote again: "fixed matrices that stay resident in registers". As I implied, I'm not doing graphics programming, so that case just doesn't happen. Basically, we tend to get limited by the number of operands we can read and results we can write per unit time rather than by the FLOPs, partly of course because FLOP rates tend to be very high, and partly because our data sets aren't small.
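To make the two positions concrete, here's a minimal C sketch (hypothetical code, not anything from a real project): a fixed 3x3 matrix is applied to a long stream of vectors. The matrix can indeed sit in registers, so the FLOPs per element are fixed and cheap, but every input vector still has to be read and every result written back, and once the stream is much larger than the caches that operand traffic is what sets the pace.

```c
#include <stddef.h>

/* Apply a fixed 3x3 matrix M to a stream of n 3-vectors.
 * Per vector: 9 mul + 6 add = 15 FLOPs, but also 3 doubles read and
 * 3 doubles written = 48 bytes of memory traffic. */
void transform_stream(const double M[9],
                      const double *restrict in,
                      double *restrict out,
                      size_t n)
{
    /* Load the fixed matrix once; a decent compiler keeps these in registers. */
    const double m00 = M[0], m01 = M[1], m02 = M[2];
    const double m10 = M[3], m11 = M[4], m12 = M[5];
    const double m20 = M[6], m21 = M[7], m22 = M[8];

    for (size_t i = 0; i < n; ++i) {
        const double x = in[3*i + 0], y = in[3*i + 1], z = in[3*i + 2];
        out[3*i + 0] = m00*x + m01*y + m02*z;
        out[3*i + 1] = m10*x + m11*y + m12*z;
        out[3*i + 2] = m20*x + m21*y + m22*z;
    }
}
```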
So asymptotically they're FLOP limited. That's why stuff like Linpack is what everyone turns to when trying to exhaust the theoretical FLOP performance of an architecture.
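The asymptotic argument, roughly: an N x N matrix multiply does about 2N^3 FLOPs over about 3N^2 unique matrix elements, so the FLOPs-per-byte ratio grows linearly with N and eventually outruns any realistic bandwidth. A throwaway sketch of that arithmetic (assuming double precision):

```c
#include <stdio.h>

/* FLOPs-to-bytes ratio of a dense N x N matrix multiply:
 * ~2*N^3 FLOPs over ~3*N^2 doubles of unique data. */
int main(void)
{
    for (int n = 64; n <= 16384; n *= 4) {
        double flops = 2.0 * n * n * (double)n;
        double bytes = 3.0 * n * (double)n * sizeof(double);
        printf("N=%6d  arithmetic intensity = %.1f FLOPs/byte\n",
               n, flops / bytes);
    }
    return 0;
}
```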
Well, yes, but as you say - theoretical.
Actually, the way I've used Linpack is to gradually let the data set grow from very small to very large, in order to get my own first-order real data on a memory subsystem.
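Something in that spirit can be sketched with any dense kernel; the version below (naive triple-loop multiply, POSIX clock_gettime timing, all names hypothetical) sweeps the problem size and reports GFLOP/s, so the drops as the working set overflows each cache level show up directly:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Naive DGEMM-like kernel, C += A*B, just to generate a known FLOP count. */
static void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double a = A[i*n + k];
            for (int j = 0; j < n; ++j)
                C[i*n + j] += a * B[k*n + j];
        }
}

int main(void)
{
    for (int n = 32; n <= 2048; n *= 2) {
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);
        if (!A || !B || !C) return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        matmul(n, A, B, C);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("N=%5d  %.2f GFLOP/s\n", n, 2.0 * n * n * (double)n / secs / 1e9);

        free(A); free(B); free(C);
    }
    return 0;
}
```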
How is latency part of the effective bandwidth calculation? Why can't prefetching ever hide latency completely? Do you mean the latency from L1 cache, which is usually non-blocking?
Prefetching hides latency insofar as you are not up against the actual bandwidth limits of the subsystem (and assuming it works optimally). But it doesn't get faster than continuous data reading, and memory systems are not capable of continuous single-cycle bursting.
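For what it's worth, this is roughly what software prefetching looks like and where it stops helping; a sketch assuming GCC/Clang's __builtin_prefetch (hardware prefetchers do much the same automatically for a simple stride). The prefetch gets loads started early so the core isn't stalled on latency, but every cache line still has to cross the bus, so the loop can never beat the sustained bandwidth.

```c
#include <stddef.h>

/* Sum a large array with an explicit prefetch a fixed distance ahead.
 * Prefetching overlaps memory latency with the summing work, but every
 * cache line still has to be transferred, so throughput tops out at the
 * sustained bandwidth of the memory system. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 64; /* elements ahead to prefetch; a tuning guess */
    double s = 0.0;

    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 0 /* low locality */);
        s += a[i];
    }
    return s;
}
```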
The problem with applying Amdahl's law in this situation is that it assumes everything else can be parallelized. People always figure there's "a bottleneck", but if you can't run two things at the same time, it always helps to make one of them faster, even if you can't make the other any faster.
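Put as arithmetic: if a memory phase and a compute phase have to run back to back rather than overlap, the total time is just their sum, so shaving either term helps. A trivial sketch with made-up numbers:

```c
#include <stdio.h>

/* Two phases that cannot overlap: total time is simply their sum, so
 * speeding up either one reduces the total, even if the other phase is
 * the nominal "bottleneck". */
static double total_time(double t_mem, double t_compute)
{
    return t_mem + t_compute;
}

int main(void)
{
    double t_mem = 8.0, t_compute = 2.0;          /* made-up numbers */
    printf("baseline:           %.1f\n", total_time(t_mem, t_compute));
    printf("2x faster compute:  %.1f\n", total_time(t_mem, t_compute / 2));
    printf("2x faster memory:   %.1f\n", total_time(t_mem / 2, t_compute));
    return 0;
}
```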
I went too far in my bastardization.
Basically, what I wanted to say was that low-latency main memory, even though it doesn't do a lot for pure data streaming, is pretty damn handy to have for pretty much any general-purpose computation, and in a system like the 3DS, where the CPU is clocked so low relative to the main memory response time, you get a situation similar to running everything out of cache.

I'm involved in scientific computing, and for the performance-critical codes we want to run, managing memory access patterns is where it's at. Not having to deal with a complex hierarchy of memory speeds and sizes, and a bunch of threads that love to step on each other's cache feet and fight for the same resources, makes a single pool of fast memory seem wonderful in its simplicity. And not having uncached reads kill performance seems very liberating in terms of algorithm choices.
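As an illustration of what "managing memory access patterns" means in practice, here's a small sketch (hypothetical, POSIX clock_gettime timing): the same summation over the same array, once with unit stride and once hopping a cache line at a time. The FLOP count is identical, but on a cached machine the strided walk is typically several times slower.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum the same array twice: once sequentially, once with a stride of one
 * cache line (8 doubles).  Same FLOPs, very different memory behaviour. */
static double walk(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

static double seconds(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + 1e-9 * t.tv_nsec;
}

int main(void)
{
    size_t n = 1u << 26;                    /* 512 MB of doubles */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; ++i) a[i] = 1.0;

    double t0 = seconds();
    double s1 = walk(a, n, 1);
    double t1 = seconds();
    double s2 = walk(a, n, 8);
    double t2 = seconds();

    printf("sequential: %.3f s   strided: %.3f s   (sums %.0f %.0f)\n",
           t1 - t0, t2 - t1, s1, s2);
    free(a);
    return 0;
}
```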
For PSP2 I think we just have to wait longer; the thing isn't even coming out for nearly a year. We'll probably see some convention slides from Sony and third parties alike that go into more detail, like we have for previous Sony platforms. What we do know is the maximum amount of CPU L1 cache (it'll probably be 32KB I/D), some idea of the L2 cache if they use the standard ARM offerings (they likely will, and I imagine it'll be 1MB or 2MB), and the associativity of the caches.
For the people who will work with the systems, a lot of the data will presumably be available. Will it ever make it out to the general public past a marketing department that wants to manage information flow? That's not a given. The number of "cores" or whatever is the new "MHz", that is, the new figure of merit you try to impress potential customers with. Beyond anything that marketing believes helps sales, someone like me is likely to have no source other than unconfirmed rumors or speculation. Which is aggravating if you're actually curious about something, and I can't help feeling that it generally sells the accomplishments of engineers short to have information about their work confined to a small circle. Nintendo never made even rudimentary Wii specs public, so I doubt we'll see much solid data on the 3DS out of them. A pity.