Design your own Cell

Does anyone know (or can someone speculate) how many SPEs would be required to do decent ray tracing? I seem to remember someone hooking up three PS3s and getting it working, so I always thought the next-generation chip with 32 SPEs would allow for ray tracing with some SPEs left over to do their thing.
 
The Rome thing?

I think that was done using 20 or so Cell processors, but if I remember correctly it was entirely scalable.
 
The Rome thing?

I think that was done using 20 or so Cell processors, but if I remember correctly it was entirely scalable.

Do you have a link to the Rome ray tracing demo? @assen IBM was planning a 32-SPE, 4-PPE Cell at whatever node. Is 1 PPE and 64 SPEs that impossible when you go below 45nm?
 
A thread can only get as far as the contributors let it. I could have put this in the technical forums but those guys barely know how to have fun...

'Those guys' are the same folk that are going to respond to this thread, regardless of where it's posted; further I would add a technical thread isn't fun at all unless it's actually technical. ;)

Personally I think that this thread is sort of a bizarre exercise in the face of the other current Cell discussion, but if we want to explore hypotheticals I'll go ahead and say that our would-be chip should be envisioned around the 32nm HKMG process - die size and thermals are up to the individual, but should be backed by their application 'vision' so to speak.
 
'Those guys' are the same folk that are going to respond to this thread, regardless of where it's posted; further I would add a technical thread isn't fun at all unless it's actually technical. ;)

Yeah I was just being a bit more difficult than necessary. I did get a warning for making a sarcastic joke about bushes and rubble over there -_- I figure jokes are allowed in this section at least.

I have seen many people commenting on Cell and what they think would make it better and what they'd leave out, so I thought it would be great to have a place to talk about those things in addition to future Cell processors.
 
On 32nm, 4 Cells, i.e. 4 PPEs and 32 SPEs @ 4 GHz, looks quite realistic, looking at that graph posted by Crossbar. It'll consume about the same power.

It won't match the FLOPS of something like the 58xx series, but what's a realistic power consumption for CPU + GPU in next-gen consoles?
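
A rough back-of-the-envelope for that gap, assuming the usual 8 single-precision FLOPs per SPE per cycle (4-wide FMA) and AMD's quoted 2.72 TFLOPS peak for the HD 5870:

```c
#include <stdio.h>

int main(void)
{
    const double clock_ghz     = 4.0;   /* hypothetical 32nm part */
    const int    spes          = 32;
    const int    flops_per_cyc = 8;     /* 4-wide SIMD fused multiply-add */

    double cell_gflops   = clock_ghz * spes * flops_per_cyc; /* ~1024 GFLOPS */
    double hd5870_gflops = 2720.0;                           /* vendor peak figure */

    printf("hypothetical Cell (SPEs only): %.0f GFLOPS peak\n", cell_gflops);
    printf("Radeon HD 5870               : %.0f GFLOPS peak\n", hd5870_gflops);
    return 0;
}
```

So peak-for-peak it would land at roughly a third of the big 58xx parts, before counting whatever the PPEs contribute.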
 
but what's a realistic power consumption for CPU + GPU in next-gen consoles?

About the same as for the current-gen consoles: air and copper haven't changed their thermo-conducting properties, nor has home furniture expanded to accommodate larger boxes. With the EU going full steam into eco-bonkers mode, I don't think selling consoles that consume more than ~200 W will be possible.
 
512 KB local store
EIB replaced with a crossbar.
512k for what? 256k is very comfortable.

SPUs are not very sensitive to latency issues by design, so a crossbar will only over-complicate things without any gain. By the nature of current software, inter-SPU transfer is rarely used; 99% of the time the SPUs talk to the MC, and while one SPU is working with the MC, the rest will be blocked.

BTW Intel uses ring buses in their current and future high-performance/scalable designs.
 
Not a serious thread, but I'd go for:
speeds exactly what they have now, but 256 kB -> 2-4 MB local store,
4 PPEs + 128-256 SPUs, and remove the GPU!!
 
512k for what? 256k is very comfortable.

SPUs are not very sensitive to latency issues by design, so a crossbar will only over-complicate things without any gain. By the nature of current software, inter-SPU transfer is rarely used; 99% of the time the SPUs talk to the MC, and while one SPU is working with the MC, the rest will be blocked.

BTW Intel uses ring buses in their current and future high-performance/scalable designs.

I asked one of the Cell designers at SuperComputing 06 about increasing the size of the local stores and he said 256K is good enough. What he would change is the PPE.
 
Why, then, does the old roadmap say "More on-chip memory"? I suppose by that it means more local store, not cache.
 
The local store also has to hold code, data, and stack. It's also beneficial to double buffer your data on an SPE so that you are processing and reading in at the same time.
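
A minimal sketch of that double-buffering pattern, assuming the standard spu_mfcio.h MFC intrinsics; stream_in() and process_chunk() are hypothetical names, and the total size is assumed to be a multiple of the chunk size:

```c
#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per DMA transfer */

/* two local-store buffers: work on one while the other is filling */
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, unsigned size);  /* hypothetical kernel */

void stream_in(unsigned long long ea, unsigned long long total)
{
    int cur = 0;

    /* kick off the first transfer, tagged with the buffer index */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned long long off = 0; off < total; off += CHUNK) {
        int next = cur ^ 1;

        /* start fetching the next chunk while this one is processed */
        if (off + CHUNK < total)
            mfc_get(buf[next], ea + off + CHUNK, CHUNK, next, 0, 0);

        /* wait only on the current buffer's tag, then do the work */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        cur = next;
    }
}
```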
 
It might be comfortable for streaming jobs, or for things like your JPEG encoder or rasterizer, but for gameplay simulation it's tiny. Not all problems worth solving can be solved with a few screens of assembly.

I'm willing to bet that there are actually surprisingly few problems that can't be turned into a more efficient streaming version.
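
For what it's worth, a lot of that conversion is just data-layout work; here's a toy sketch (hypothetical struct and field names) of reshaping pointer-chased gameplay objects into flat arrays that can be streamed through a local store in chunks and processed in order:

```c
/* Before: heap-allocated objects chased through pointers -- awkward to DMA. */
struct entity {
    struct entity *next;
    float pos[3], vel[3];
};

/* After: the hot fields pulled out into contiguous arrays, so the data can
 * be pulled into a local store one chunk at a time and processed linearly. */
struct entity_soa {
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    unsigned count;
};

/* the per-frame update becomes a simple streaming kernel */
void integrate(struct entity_soa *e, float dt)
{
    for (unsigned i = 0; i < e->count; ++i) {
        e->pos_x[i] += e->vel_x[i] * dt;
        e->pos_y[i] += e->vel_y[i] * dt;
        e->pos_z[i] += e->vel_z[i] * dt;
    }
}
```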
 
I'm willing to bet that there are actually surprisingly few problems that can't be turned into a more efficient streaming version.

The same thing was said in 2005.

While more and more tasks are being parallelised and streamed, and with the direction of HW it will remain important to do as much of this "heavy lifting" as possible, the question isn't whether in an idealistic world most code could be streamed efficiently; the question is whether this will happen in the next 5-10 years and how adaptable these methods will be to development pipelines.
 
It might be comfortable for streaming jobs, or for things like your JPEG encoder or rasterizer, but for gameplay simulation it's tiny. Not all problems worth solving can be solved with a few screens of assembly.

No, certainly not, but 256 kB can hold many thousands of lines of assembly (SPU instructions are 4 bytes each, so that's up to ~65k instructions), maybe even a few thousand lines of high-level code. 256 kB is plenty if we are talking code.
 
Remove the PPU, it's too slow; replace it with an OOOE core.
SPUs should be able to execute code that doesn't sit in the local store (yep, they need a proper I$); that would automatically increase the amount of data one can store in the LS, and it would remove the ridiculous issues with debug code not fitting in the LS (which was so retarded to begin with).
The per-SPU DMA engine needs to be improved so that it can support async gather/scatter and atomic ops.
Add TMUs, and make SIMD vectors 8 or 16 wide with automatic instruction replay to easily support larger vector widths when necessary.
Update the SPU ISA (it's so limited) and add HW multithreading to better hide the latencies of more complex instructions (a couple of HW threads per SPU would be just fine).
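
For reference, the current MFC already offers a coarse, asynchronous form of gather via DMA lists; a minimal sketch using the standard spu_mfcio.h list interface (the element count, sizes, and the gather() wrapper are made up for illustration):

```c
#include <spu_mfcio.h>

#define N 16

/* Each list element carries a transfer size and the low 32 bits of an
 * effective address; the MFC gathers all of them into one contiguous
 * local-store region with a single command. */
static mfc_list_element_t list[N] __attribute__((aligned(8)));
static char dest[N * 128] __attribute__((aligned(128)));

void gather(const unsigned int eal[N])
{
    for (int i = 0; i < N; ++i) {
        list[i].notify = 0;
        list[i].size   = 128;      /* bytes per element */
        list[i].eal    = eal[i];   /* scattered source addresses (low 32 bits) */
    }

    /* one list command instead of N separate mfc_get()s; the second
     * argument supplies the high 32 bits of the EA (assumed zero here) */
    mfc_getl(dest, 0, list, sizeof(list), 5, 0, 0);
    mfc_write_tag_mask(1 << 5);
    mfc_read_tag_status_all();
}
```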
 
Remove the PPU, it's too slow; replace it with an OOOE core.
SPUs should be able to execute code that doesn't sit in the local store (yep, they need a proper I$); that would automatically increase the amount of data one can store in the LS, and it would remove the ridiculous issues with debug code not fitting in the LS (which was so retarded to begin with).
The per-SPU DMA engine needs to be improved so that it can support async gather/scatter and atomic ops.
Add TMUs, and make SIMD vectors 8 or 16 wide with automatic instruction replay to easily support larger vector widths when necessary.
Update the SPU ISA (it's so limited) and add HW multithreading to better hide the latencies of more complex instructions (a couple of HW threads per SPU would be just fine).

Talk about out of left field -- have you been sipping the Toshiba BE kool-aid again, nAo? Would any IHV be daring and crazy enough to design something like that? ;)

nAo, if you were going to take the die space and thermals MS/Sony have for the current consoles, what would you do with this chip: 1 big chip for CPU/GPU, 1 CPU and 1 "normal" GPU, or two of these?
 