Edit: There are plenty of funny numbers to be had with this idea. 80 Cell BBEs would fit on the silicon. Assuming no trouble connecting them up, you'd have 80x the attained ~200 GB/s of SPE data access across the EIB, so 16 TB/s of internal bandwidth, 160 MB of SRAM local store on the SPEs, and another 40 MB for the PPEs' caches.
The assumption that there's no trouble connecting things up is typically the first thing such designs founder on. The 200 GB/s figure is also optimistic: it is the maximum bandwidth in the case where the units on the EIB transfer only to their immediate neighbors. 80 such chips would nominally give 80x that on-chip bandwidth, but their capacity to reduce traffic is confined to each individual Cell.
Trying to have a given SPE or PPE work with a core in another Cell would have to leave the ring bus and, in the absence of a topology change, would introduce a hop count far beyond what Cell was designed for, dropping bandwidth to 25.6 GB/s (while simultaneously throttling a subset of the local ring bus for the duration of the transfer).
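As a sanity check on the numbers being thrown around (assuming 8 SPEs with 256 KB of local store per Cell, the thread's ~200 GB/s peak EIB figure, and the original Cell's 25.6 GB/s XDR interface):

```python
# Back-of-envelope check of the figures in the thread.
# Assumptions: 8 SPEs per Cell, 256 KB LS each, ~200 GB/s peak per EIB,
# 25.6 GB/s external memory bandwidth on the original Cell.
CELLS = 80
EIB_PEAK_GBS = 200
SPES_PER_CELL = 8
SPE_LS_KB = 256
XDR_GBS = 25.6

aggregate_eib_tbs = CELLS * EIB_PEAK_GBS / 1000          # on-chip only
total_ls_mb = CELLS * SPES_PER_CELL * SPE_LS_KB / 1024

print(aggregate_eib_tbs)  # 16.0 TB/s, but only if no traffic crosses Cells
print(total_ls_mb)        # 160.0 MB of SPE local store
print(XDR_GBS)            # any cross-Cell hop collapses to roughly this
```

The 16 TB/s figure only holds while every transfer stays inside its own Cell; the moment data crosses chips, the 25.6 GB/s external link becomes the ceiling.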
Cell was before its time. It hit consumer hardware before highly parallelised, multi-threaded code was prevalent in gaming, and Cell relied on those techniques to shine.
I did see a reference to a DSP architecture that presaged this: the TI TMS320C80 MVP, a single generalist master core paired with four DSPs. It didn't seem to catch on.
As far as multithreading went, such techniques were needed for other hardware anyway in the 8 years after Cell, so reusing them would have made sense generally, without tacking on the complexity of the SPEs, a master-core bottleneck, and the lack of caching.
What do we need to get Cell's efficiency from a more traditional x86 architecture? I was thinking Cell was fast because the LS had ridiculously low (and fixed!) latency. It was literally like old-school DSPs: 256 KB of registers.
I wonder if it would be possible to modify the Zen architecture to have a local store to emulate Cell, maybe repurposing half the L2 to behave like registers or something.
The Zen architecture's FP physical register file is large enough to contain the SPE's register file. However, without an architectural change there's no encoding that can flatly address that many registers. The FPU itself has enough 128-bit operand ports to match what the SPE could do. Instruction latency for math operations would be better on Zen, although I'm not sure the permute path can be so readily supported, particularly since Zen lost the more general permute instructions that belonged to Bulldozer's XOP set.
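The size comparison is easy to check. Assuming the commonly cited figure of a 160-entry, 128-bit FP physical register file for Zen 1 (an assumption on my part), against the SPE's architectural 128 x 128-bit registers:

```python
# Size check for the register-file claim.
# Assumption: Zen 1's FP PRF has 160 entries of 128 bits (commonly cited);
# the SPE architecturally exposes 128 registers of 128 bits.
SPE_REGS = 128
ZEN_FP_PRF_ENTRIES = 160
REG_BITS = 128

spe_rf_bytes = SPE_REGS * REG_BITS // 8        # 2048 bytes
zen_prf_bytes = ZEN_FP_PRF_ENTRIES * REG_BITS // 8

print(spe_rf_bytes, zen_prf_bytes)  # 2048 2560
print(zen_prf_bytes >= spe_rf_bytes)
```

So the capacity is there; the missing piece, as noted above, is an encoding that can name 128 architectural registers at once rather than renaming a much smaller architectural set onto the PRF.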
The L2's latency would be at least twice that of the SPE's LS, which could lead to problems with running out of independent work or context space to compensate. Treating the L2 like an SPE register file would be worse in operand-bandwidth terms, whereas the L2 as it stands would actually beat the LS on bandwidth.
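For context, the programming discipline the LS imposed is what gave it that fixed latency: software explicitly stages tiles in and out via DMA, overlapping the next transfer with the current computation. A minimal sketch of that double-buffering pattern, with a hypothetical `dma_get` standing in for the asynchronous DMA engine (not a real API):

```python
TILE = 4  # elements per "DMA" transfer; stands in for an LS-sized tile

def dma_get(src, offset, n):
    """Stand-in for an asynchronous DMA into local store (hypothetical)."""
    return src[offset:offset + n]

def process(tile):
    """Placeholder compute kernel: double each element."""
    return [x * 2 for x in tile]

def stream(data):
    """Double buffer: fetch tile i+1 while 'computing' on tile i."""
    out = []
    current = dma_get(data, 0, TILE)
    for off in range(TILE, len(data) + TILE, TILE):
        nxt = dma_get(data, off, TILE)  # overlaps compute on real hardware
        out.extend(process(current))
        current = nxt
    return out

print(stream(list(range(8))))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

A cache-backed L2 scratchpad would relieve software of this choreography, but at the cost of the deterministic latency that made the LS feel like a giant register file.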
I would say learning about the drivers behind why Sony left Cell could be a useful history lesson on what factors could go into their next-gen device.
Part of it could have been cost; the other part developer ease. The $399 price point seems wildly successful from a cost perspective, and from what I can see it is likely a larger factor in sales success than developer ease when we look strictly at the hardware.
An interesting point here is that if we assume a Cell 2 was a possibility for this gen, and at one point in time it must have been considered, then Sony effectively traded big performance for price and developer ease.
That is a factor people should consider in these predictions.
There was the other issue that, of the alliance that made Cell (Sony, Toshiba, IBM), only one really maintained or advanced a microelectronics division and technology base at that level. IBM, being an expert in coherent SMP systems and leading-edge interconnects, was not a fan of the SPE architecture or philosophy. It was Toshiba's, and to a middling extent Sony's, old-school DSP expertise, along with their lack of capability in the technologies of the day, that pushed the comparatively unsophisticated SPE and its straightforward LS.
Sony split the difference to create Cell, and its subsequent overreach (it had an entire leading-edge fab built for an architectural revolution that never came) meant this last hurrah of a toolset already showing signs of obsolescence ended in disaster. Toshiba, being the DSP maven, made a Cell derivative without a PPE that nobody went on to care about.
Given what Sony sold off, spun off, or cancelled, it's not clear where it would have gotten the expertise to build a Cell 2. IBM was done with that experiment, and there had been another 8 years of advancement devoted to SMP, or at least cached SIMD, architectures. Sony spun the SPE as an architectural leap, but many of its underpinnings started from a point that was arguably archaic. By the time of the current gen, given where GPUs and CPUs had gone, there was no thriving pool of initiatives revolving around speed-demon, branch-averse architectures with DMA-based memory access.
Even more ironically, the Cell and its derivatives appear to be well suited for RT:
https://www.sci.utah.edu/publications/benthin06/cell.pdf
so... maybe a sophisticated, dedicated Cell in the PS5 to handle the RT?
From that paper, it appears a big benefit came from the SIMD ISA and the generous register set.
They made some implementation decisions to work around pain points in the architecture: running a full ray tracer on each SPE rather than splitting the ray tracer's pipeline stages across SPEs and passing along results, software caching, workarounds for serious branch penalties, and a favorable evaluation of software multithreading for the single-context SPEs.
The architecture they were effectively trying to write onto the hardware was a SIMD architecture (possibly with scalar ISA elements) with different forms of branch handling, better threading, and caches.
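The branch-penalty workarounds mentioned above typically came down to select-based, branchless code: compute both sides of a condition and pick the result with a bitmask, which is what the SPU's `selb` instruction does. A small sketch of the idea (the mask here is built with a Python conditional, where the SPU would use a compare instruction producing an all-ones or all-zeros mask):

```python
def selb(mask, a, b):
    """Bitwise select, like the SPU's selb: take a's bits where mask is 1."""
    return (a & mask) | (b & ~mask)

def branchless_min(x, y, bits=32):
    """min() without a data-dependent branch in the arithmetic itself."""
    full = (1 << bits) - 1
    mask = full if x < y else 0  # on SPU: a compare yields this mask directly
    return selb(mask, x, y)

print(branchless_min(3, 10))  # 3
print(branchless_min(10, 3))  # 3
```

On a machine with no branch predictor and an 18-ish cycle misprediction penalty, paying for both sides of a short conditional and selecting was usually the cheaper option.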