Since a cache read takes multiple clock cycles, thread switches happen very often anyway. The core could also just round-robin among the threads that are ready, so fairness isn't hard to achieve.
It would be a variant of Niagara's threading model: round-robin unless a long-latency event such as a cache miss occurs.
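To make that concrete, here's a minimal sketch of such an issue policy. This is not Niagara's actual issue logic; the thread count, miss latency, and the exact switch trigger are all my assumptions. One issue slot per cycle goes to the next ready thread in round-robin order, and a thread parked on a cache miss is simply skipped until the miss resolves:

```c
/* Toy model, not real hardware: NTHREADS hardware threads share one issue
 * slot per cycle.  Selection is round-robin over the ready threads; a
 * thread that takes a (simulated) cache miss is skipped until the miss
 * resolves, so the others keep the pipeline busy in the meantime. */
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS     4
#define MISS_LATENCY 8          /* assumed miss latency in cycles */

struct hw_thread {
    bool ready;                 /* false while waiting on a cache miss */
    int  stall_left;            /* cycles until the miss data returns  */
};

static struct hw_thread threads[NTHREADS];

/* Next ready thread after 'last', wrapping; -1 means everyone is stalled. */
static int pick_next(int last)
{
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;
        if (threads[t].ready)
            return t;
    }
    return -1;
}

int main(void)
{
    for (int t = 0; t < NTHREADS; t++)
        threads[t] = (struct hw_thread){ .ready = true };

    int last = NTHREADS - 1;
    for (int cycle = 0; cycle < 20; cycle++) {
        /* Wake up threads whose misses have been serviced. */
        for (int t = 0; t < NTHREADS; t++)
            if (!threads[t].ready && --threads[t].stall_left == 0)
                threads[t].ready = true;

        int t = pick_next(last);
        if (t < 0) {
            printf("cycle %2d: all threads stalled\n", cycle);
            continue;
        }
        last = t;

        /* Pretend thread 1 misses the cache on cycle 5, purely for show. */
        if (t == 1 && cycle == 5) {
            threads[t] = (struct hw_thread){ false, MISS_LATENCY };
            printf("cycle %2d: thread %d issues and misses\n", cycle, t);
        } else {
            printf("cycle %2d: thread %d issues\n", cycle, t);
        }
    }
    return 0;
}
```

The fairness point is just that the pick_next() scan never favors any one thread; the only thing that takes a thread out of the rotation is an outstanding miss.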
Cache reads on x86 are so common that if a switch can occur on every cache access, the thing would devolve into a round-robin processor anyway. The corner case is when somebody crams some code's working set entirely into the register set, in which case a trivial loop could block every other thread without active intervention by the issue hardware; a sketch of that case follows.
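That corner case would look something like this hypothetical loop: once it is hot, it never touches memory, so a core that only switched threads on cache accesses would need some other trigger (a cycle counter, say) to take the issue slot away from it:

```c
/* Hypothetical worst case for a "switch only on memory access" policy:
 * the whole working set (i, n, acc) fits in registers, so nothing here
 * ever reads or writes the cache once the loop is running. */
long spin(long n)
{
    long acc = 1;
    for (long i = 0; i < n; i++)
        acc = acc * 3 + 1;      /* pure register arithmetic */
    return acc;
}
```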
Since a register read is part of the instruction, switching on that would be round-robin.
It only has 8 local registers. There are also three sets of 8 registers for input, output, and global data. But that's really not much different from having 16 registers and a fast L1 cache.
There are 8 local registers per register window, and Niagara's implementation supports 4 windows. That's a total of 32 local registers (I think the windows are implemented by a renaming scheme, by the way) that are software-visible. They are not all immediately accessible, but they are still software-visible.
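As a rough sketch of that renaming guess (the physical layout and the direction of window adjacency here are my assumptions, not Niagara's documented implementation), the windowed architectural registers can be mapped onto one flat physical file per thread like this:

```c
/* Toy mapping of SPARC architectural registers onto a flat physical file,
 * assuming 4 windows: r0-r7 are the shared globals, r8-r15 the outs,
 * r16-r23 the locals and r24-r31 the ins of the current window (cwp).
 * The outs of one window double as the ins of the adjacent window, so
 * 4 windows need 8 + 4*16 = 72 physical registers per thread here. */
#define NWINDOWS 4

static int phys_index(int cwp, int reg)
{
    int next = (cwp + 1) % NWINDOWS;                       /* adjacent window */

    if (reg < 8)  return reg;                              /* globals: shared */
    if (reg < 16) return 8 + next * 16 + (reg - 8);        /* outs            */
    if (reg < 24) return 8 + cwp * 16 + 8 + (reg - 16);    /* locals          */
    return 8 + cwp * 16 + (reg - 24);                      /* ins             */
}
```

The point is only that the 32 locals (8 per window times 4 windows) all exist at once in the physical file, even though just one window's worth is addressable at any moment.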
I'm sure there's some debate about which scheme is more awkward, the REX hack or the rigid windowing system in SPARC.
{EDIT: The following is not an issue if the caches are not write-through. This was covered earlier, but I had forgotten.
Also important: register files aren't kept coherent, while any write to memory must be, so relying on coherent memory for a core's internal register spills is not free. }
I would expect an x86 version of Niagara, everything else being equal (which would never happen, but still), to lag noticeably, especially at high thread counts.
An AMD K8's L1 cache is only 2-way associative, so with 4 threads per core, 8-way associativity would be workable. You also have to realize that with an architecture like this some extra conflict misses are unavoidable, but it's the combined throughput that counts. Like I already said, the L1 cache just holds more useful data. In the future I also expect threads to become somewhat smaller: instead of one big thread doing several tasks, the core works on several threads each doing one task.
The K8's L1 cache is 64 KB and 2-way with a 3-cycle penalty. It is unlikely that an 8-way variant (8-way and 64 KB? 8-way and 256 KB?) is going to keep that latency.
If it climbs to over four cycles, then 4-way threading can no longer hide the cache accesses: with strict round-robin each thread only gets an issue slot every fourth cycle, so a load that takes longer than that leaves its thread stalled when its turn comes back around.
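A quick back-of-the-envelope check of that claim, assuming strict round-robin with one issue slot per cycle (the latency values are just examples, not measured numbers):

```c
/* With 4 threads in strict round-robin, each thread gets an issue slot
 * every 4 cycles, so a load that completes in <= 4 cycles is done before
 * its thread's next turn; anything longer stalls that thread's slot. */
#include <stdio.h>

int main(void)
{
    const int slot_interval = 4;                 /* 4 threads, 1 slot/cycle */
    for (int latency = 3; latency <= 6; latency++) {
        int stall = latency > slot_interval ? latency - slot_interval : 0;
        printf("L1 latency %d cycles -> %d stall cycle(s) per load\n",
               latency, stall);
    }
    return 0;
}
```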
With a steady increase in x86 performance and cheap but high-performance additional cores, that would indeed be an interesting option. But that's a big if. You can't sell a Cell as a Power processor at several times the price, and with fewer SPEs it's no longer an interesting high-throughput architecture (compared to a dual-core Power, for instance)...
Odds are, Cell is at least a process node ahead of that point. IBM could probably redesign a more suitable chip at 65 nm or below, if it had the desire.
POWER6 is the server architecture, and I've read elsewhere that some people argue IBM used the Cell design partnership to test-run some of their ideas for POWER6 with Sony and Toshiba defraying some of the cost.
Since POWER6 is a more throughput-intensive design with streamlined OoO abilities, we may see the compromise variant win out. It may not beat Sun's offering on power consumption or cost at that point, but it's likely POWER6 will be significantly ahead on performance.