I'm going to expand a bit on my Cray point above: everything to do with performance today is about memory latency.
There is a certain class of problem where memory accesses are predictable enough that latency can be completely hidden (using caches, or overlapped DMA to local store, or whatever). In these cases GPU-like or SPE-like architectures make a lot of sense, and computational density suddenly becomes important.
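To make that concrete, here's a minimal sketch of the overlapped-fetch idea in C. The async_fetch()/fetch_wait() pair is hypothetical, just standing in for whatever DMA or prefetch machinery the hardware actually provides:

```c
#include <stddef.h>

#define CHUNK 4096

/* Hypothetical stand-ins for an async DMA/prefetch engine:
   async_fetch() starts a transfer into a local buffer,
   fetch_wait() blocks until that transfer has completed. */
void async_fetch(float *local, const float *remote, size_t n, int tag);
void fetch_wait(int tag);
void process(float *data, size_t n);

/* Double-buffered streaming: while the core works on one chunk, the
   next chunk is already in flight, so memory latency hides behind
   useful computation -- but only because the access pattern (a linear
   walk over 'src') is known in advance. Assumes 'total' is a multiple
   of CHUNK, to keep the sketch short. */
void stream_process(const float *src, size_t total)
{
    float buf[2][CHUNK];
    int cur = 0;

    async_fetch(buf[cur], src, CHUNK, cur);           /* prime the pipe */
    for (size_t i = 0; i < total; i += CHUNK) {
        int next = cur ^ 1;
        if (i + CHUNK < total)                        /* start the next transfer early */
            async_fetch(buf[next], src + i + CHUNK, CHUNK, next);
        fetch_wait(cur);                              /* wait only for the current chunk */
        process(buf[cur], CHUNK);                     /* compute while the next one moves */
        cur = next;
    }
}
```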
There are other classes of problem (the vast majority of them) where it is difficult or impossible to predict memory access patterns. This is where all of the R&D in x86 has gone in the past 15 years: better caches, and better ways to hide latency with OOO execution.
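For contrast, here's a toy example of the two cases in plain C. The first loop is the kind of thing a streaming architecture eats for breakfast; the second is the kind of thing only big caches and OOO latency hiding can do much about:

```c
#include <stddef.h>

struct node { struct node *next; float payload; };

/* Predictable: a linear walk. The hardware prefetcher (or an explicit
   DMA scheme) can have the next cache lines on the way long before
   they are needed, so the ALUs rarely stall. */
float sum_array(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unpredictable: each address depends on the previous load, so nothing
   can be fetched ahead of time. Every miss is paid in full -- exactly
   the case that big caches and OOO execution exist to soften. */
float sum_list(const struct node *p)
{
    float s = 0.0f;
    while (p) {
        s += p->payload;
        p = p->next;
    }
    return s;
}
```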
Whether you think Cell is a successful architecture largely revolves around how much of your application you believe can fit into the first paradigm, and at what cost. I would argue it's increasingly little, and the parts that do fit are as often as not equally well or better suited to the GPGPU paradigm.
I think Cell is an interesting architecture, but it's overly mono-focussed.
My point in looking at Cray's XMT above is that a company that until recently made its name on the number of sustained GFLOPs it could provide is now moving towards an architecture clearly designed to hide the difficulties of memory access in large-scale parallel systems, rather than to increase compute density. They're not doing that just to be different; they are trying to figure out how to run their clients' software faster.
Now, games aren't large-scale numeric simulations, and Cray's solution relies on having thousands of running threads to hide the latency, which is a problem in itself.
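A quick back-of-the-envelope sketch of why "thousands of threads" ends up being the natural number (the figures here are illustrative assumptions, not actual XMT specs):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only, not vendor figures. */
    double bandwidth = 80e9;   /* bytes/s of random-access traffic to sustain */
    double latency   = 800e-9; /* seconds of end-to-end memory latency */
    double req_size  = 8.0;    /* bytes per outstanding load */

    /* Little's law: concurrency = throughput * latency. */
    double bytes_in_flight = bandwidth * latency;
    double outstanding     = bytes_in_flight / req_size;

    printf("bytes in flight:   %.0f\n", bytes_in_flight); /* 64000 */
    printf("outstanding loads: %.0f\n", outstanding);     /* 8000  */
    return 0;
}
```

Keeping thousands of loads in the air is far easier with thousands of threads each holding a few outstanding requests than with one fast thread that stalls on every miss.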
But there are many ways to increase performance.
I'd like to add a bit to this.
GFLOPS have always been more about marketing than anything else, at least since the days just after the CDC Cybers that Seymour Cray designed, when I got into computing. Marketing, however, is important.
There are at least three groups involved in HPC. One is into national/institutional status. These apply for, and receive, funding for supercomputing projects. Then there are the folks who are interested in computational science and computer architecture as a discipline in its own right. And then there is the group that uses high-performance computers to work on problems that need to be worked on wherever they may be found: meteorology, chemistry, et cetera. I belong to the last group, and this is the group that is typically referred to when supercomputer projects are to be justified, whether to the public or to politicians.
I came back to B3D because I was curious about what eventually turned out to be the BBE (I don't like to call it Cell; it goes against the original paper). At the time we were wrestling with clusters, and I was very curious what the result would be if IBM were to produce a new CPU architecture aimed at media processing rather than an extension of an ancient design originally aimed at clerical byte manipulation. By and large I liked it, but could never use it, due to a number of factors that made it impractical. If IBM had evangelized it more and given it a believable roadmap that they demonstrated they would follow, maybe we would have looked at it more seriously. I don't know enough about game code to say if it does a good job at what it was eventually tasked with.
(Incidentally, take a look at this building block for building big iron that IBM is showing at Hot Chips.)
Latency and bandwidth both, or put another way, interconnects and data paths, have been at the core of most high-performance computing for a long time, but they don't make for good/easy marketing. GFLOPS does. Still. Note its presence on the IBM slide above.
My eyes, however, are drawn to the memory interface and, more tantalizingly, the chip-to-chip networking, since those concern the areas that have proven critical to most code we've wrestled with: bandwidth and communication.
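As a rough illustration of why I look at the data paths first, here's a bandwidth-bound streaming kernel and the back-of-the-envelope limit it implies. The numbers are assumptions in the right ballpark for a BBE-class part, not vendor figures:

```c
#include <stdio.h>

/* A streaming "triad" kernel: 2 flops per element, 12 bytes moved
   (read b, read c, write a), ignoring write-allocate traffic. */
void triad(float *a, const float *b, const float *c, float s, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

int main(void)
{
    /* Assumed figures: ~25.6 GB/s of memory bandwidth against roughly
       200 GFLOPS of peak single-precision ALU throughput. */
    double bw_bytes_per_s = 25.6e9;
    double flops_per_elem = 2.0;
    double bytes_per_elem = 12.0;

    double sustainable = bw_bytes_per_s / bytes_per_elem * flops_per_elem;
    printf("bandwidth-limited: %.1f GFLOPS\n", sustainable / 1e9); /* ~4.3 */
    return 0;
}
```

For a kernel like this, the peak GFLOPS number never enters into it; the memory interface sets the ceiling.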
This long ramble wants to come to a few conclusions. First, to judge the merits of an architecture, you have to look at its design goals and how well it fulfills them. To say whether the BBE is "better" than its Prescott contemporary, or the Xenos, you'd have to decide what yardstick to measure with. Second, GFLOPS are cheap and easy, and have been for a very long time. The challenge lies in making the ALU power (easily) applicable to as wide a range of algorithms as possible. There will, however, be specific problems that are suited to just about any given configuration. These will be used in marketing. Third, HPC is a complex world, and filling a cabinet with GPUs really only addresses very particular aspects/niches. Fourth, if you want to evaluate an architecture, look at the data paths: what limitations do they impose, and what does that imply for the tasks you are interested in? GFLOPS isn't a very useful metric, generally. Maybe for games it is. I wouldn't know. There are people here with far more familiarity with that application area. But I suspect that even there, that particular figure of merit is just too simplistic.