Design your own Cell

OMG Acert, that paragraph complete with bolded letters is like maximum eye roll! :p

nAo's architectural/instruction improvement list aside, my own entry into the mix in this thread would be a pure 'mild' evolution. On 32nm HKMG as I proposed, I'd say two Cells essentially on a single die - with the PPE cores evolved into more robust/functional entities, and the SPEs being of the PowerXCell variety by default (minimum logic increase vs the originals to begin with). Assuming density and thermal gains come in line with the hopes for the new process, I'm pegging this 'Cell Evo' chip at ~150mm^2, with thermals at roughly the present 45nm SOI Cell's... hopefully even below. Clockspeeds I'll leave at 3.2 GHz in consideration of the beefier PPE replacements, memory controller of course updated... and here I think we have a decent chip to serve as the CPU in a console. Yes, in this vision the CPU and GPU remain separate once again, but with the additional SPE power I do envision a smart design having a GPU more specifically tailored to the environment.

Now... with this ~150mm^2 Cell Evo (2 Power, 16 SPE), for IBM's HPC purposes I see the new PPE replacement cores as sufficiently up to the task that blades like the QS series no longer require outboard Opterons, for instance, to coordinate workloads. IBM is freed to pursue more 'pure' Cell-based design options, the unification of the SP and DP Cell variants ensures that all Cells are commodity Cells in terms of internal costs, and to make up for the focus on small die size and low thermals, IBM can of course go for glued-die chips, or just straight-up MCMs, to compete with more monolithic competitors. The good scaling should favor a many-chip environment for Cell quite well in HPC, so I don't view it as having to compete with 500mm^2 chips to offer benefits in certain environments.
 
Remove the PPU, it's too slow; replace it with an OOOE core.
SPUs should be able to execute code that doesn't sit in the local store (yep, they need a proper I$); that would automatically increase the amount of data one can store in the LS, and it would remove the ridiculous issues with debug code not fitting in the LS (which was absurd to begin with).
The per-SPU DMA engine needs to be improved so that it can support async gather/scatter and atomic ops.
Add TMUs, make SIMD vectors 8 or 16 wide, with automatic instruction replay to easily support larger vector widths when necessary.
Update the SPU ISA (it's so limited) and add HW multithreading to better hide the latencies of more complex instructions (a couple of HW threads per SPU would be just fine)

Is it just me, or did you just ask for AMD's Fusion chip? ;) Or even Larrabee fused with a Nehalem core on the same die? :LOL:
 
If a new Cell is to be developed, it will most likely be by IBM, so it has to fit their needs. Basically I can see Larrabee (if it turns out well, which given Intel's workforce is only a matter of time) as the biggest threat to them. They could pass on the graphical side of things and focus on compute only.
Low power consumption is much wanted in the HPC market, but I feel like IBM could afford a higher TDP than Cell's and still be competitive.
I could see Xenon as a good start, not really for the core but for the memory hierarchy. You have registers, L1 and L2, but the L2 actually works as what would be an L3 in most current CPU/x86 designs: it runs at half speed, and it's shared. The interesting part for me is that you can design "grapes", and I feel it would be easier to handle coherency and feed, for example, 8 "grapes" of four cores than 32 cores each with their own local subset of L2 (could be completely wrong tho...).
IBM could:
Design a shorter-pipeline CPU and adapt the clock speed if needed (in regard to power consumption/TDP).
Design it properly; a lot of effort has been put into the SPU, and I'm not sure we can say the same about the PX/PPU.
Design a brand new 256-bit SIMD unit supporting integers and floats (why not base it on the SPU ISA, though a superset of AltiVec sounds better, or at least one able to run in a compatibility mode with degraded performance).
Hyper-threading support for the SIMD units (2 or 4 threads).
Reduce the number of registers to keep the chip from growing too large (say 64 if 2 hardware threads / 32 if 4).
Fix LHS (load-hit-store) stalls.
Have really good prefetching capability.
Have the L2 run at the same clock speed as the chip.
Greater control over cache behaviours.
Provide high bandwidth between the different memory levels: L1, L2, RAM.

I did a mock-up of a 4-core Xenon (in Paint); it wasn't super accurate for sure (... :LOL:), but it was clear at least that the chip could most likely be a bit smaller than Cell was at launch, or about the same.

I'll use the Cell die size for a rough calculation: assuming 60% scaling per node, @ 32nm that's ~50mm² per grape. IBM wants the chip to be on the small side, so they could choose 4 grapes => ~200mm² @ 2GHz.

That's 512 GFLOPS SP and 256 GFLOPS DP, and I think a lower TDP than x86 parts. Both Intel and AMD plan to release OoO quad-core (or more) parts also supporting 8-wide SIMD units running around 3GHz in the near future. IBM's alternative could end up using quite a bit less power, running cooler and being smaller (slightly smaller cores and a lot less cache), and it could use the same or improved I/O as POWER7 for great scalability.
So not much of a huge raw-power advantage, but way greater scalability, lower TDP and power consumption, and with the chip being pretty tiny, possibly a money maker in the HPC market.

Things could get better if they can use eDRAM for the L2 => a much smaller chip. They could then give up some of that advantage and optimize for power/heat instead of density => aim for higher clocks and/or more cores; hitting a TFLOP could be nice, or they could simply have better FLOPS/watt characteristics (my favoured take for the intended market).

They may ship something like 4/8 chips per blade (4 sockets per blade, one or two chips per socket) with a bunch of memory. Not a crazy-looking design in regard to raw numbers (1 or 2 TFLOPS in DP), but a homogeneous design, cooler, pretty cheap to produce, well-known ISA, able to run existing code, etc.
 
I'm willing to bet that there are actually surprisingly few problems that can't be turned into a more efficient streaming version.

Are you willing to spend your development time turning problems into streaming versions, instead of solving new problems? Are you willing to bankroll your programmers doing so? Just betting isn't good enough ;-)
 
MIPS looks very power efficient, but there haven't been MIPS CPUs designed for speed for a while; the one for the PS2/PSP was the last one, I believe.
Except the CPUs from mainland China (the Loongson 2 series), still meant for servers, nettops and netbooks, but pretty serious (64-bit, out-of-order).

I'm awaiting the Loongson 3 CPU, a 10W quad core with facilitated software emulation of x86. That would be a great server CPU.
Not sure if that's meaningful to integrate into a Cell :p

I remember first hearing about the Loongson series a while back. Any products from China containing them? So far the only Loongson product I can foresee coming to the US is in a missile guidance system, if you get my drift :p
 
Remove the PPU, it's too slow; replace it with an OOOE core.
SPUs should be able to execute code that doesn't sit in the local store (yep, they need a proper I$); that would automatically increase the amount of data one can store in the LS, and it would remove the ridiculous issues with debug code not fitting in the LS (which was absurd to begin with).
The per-SPU DMA engine needs to be improved so that it can support async gather/scatter and atomic ops.
Add TMUs, make SIMD vectors 8 or 16 wide, with automatic instruction replay to easily support larger vector widths when necessary.
Update the SPU ISA (it's so limited) and add HW multithreading to better hide the latencies of more complex instructions (a couple of HW threads per SPU would be just fine)

Sounds neat :), could be better than LRB hehe ;).
 
It would remove the ridiculous issues with debug code not fitting in the LS (which was absurd to begin with).

While I agree with most points you make, don't you think this is more a compiler issue? I mean there is non-optimized debug code, and then there is going out of your way to bloat code-size beyond anything reasonable.

Update SPU ISA (it's so limited) and add HW multithreading to better hide more complex instructions latencies (couple of hw threads per SPU would be just fine)

Would be interesting to see what kind of instructions people would want, apart from DIV. ;) I'm somewhat partial toward logical booleans, to get rid of all those ceq instructions. Take some pressure off the EVEN pipeline.

As for instruction latency, I don't see it as a problem in most cases. There are software solutions around that.

The rest.... yeah, pretty much. :)
 
What new instructions do you think that IBM had in mind when they said "Performance per SPE equal or better - significantly better on applications that benefit from new instructions." in the old roadmap?
 
512k for what? 256k is very comfortable.

SPUs are not very sensitive to latency issues, by design.
So a crossbar would only over-complicate things without any gain.
By the nature of current software, inter-SPU transfers are rarely used,
and 99% of the time the SPUs talk to the MC. And while one SPU is working with the MC,
the rest will be blocked.

BTW Intel uses ring buses in their current and future high-performance/scalable designs.
I don't agree; I think Polaris is more relevant to where Intel is heading ;)
 
Are you willing to spend your development times turning problems into streaming versions, instead of solving new problems? Are you willing to bankroll your programmers doing so? Just betting isn't good enough ;-)

Well, in my ideal world I would design an engine completely based on the idea of streaming in the first place. It would probably result in new ideas automatically.
 
Unfortunately developers don't live in ideal worlds; then again, if they did, we might see new games every 4-6 years from a studio.
 