First Cell Benchmarks

MarketingGuy

Newcomer
All good possibilities. And I agree that more letters is better (although I would have chosen something better than SPU for an abbreviation). ;)

But that's the possible. Now synthesize it into a single probable. There is a simple answer here that has to do with the difference in the stack on top of a quad core and the stack on top of a Cell. (And, no, it's not that 9 > 4.)
 

ebola

Newcomer
In the case of the 2 Cell blade versus the 2 Woodcrests, it is a matter of 2 coherent caches for Cell for the usually non-critical PPEs, versus a more complex set of caches that on average looks like something between 8 and 4 caches for the performance-critical cores on the x86 platform.

Ah ok, in this example - the 2-Cell blade - there's still going to be off-chip communication going on,
so it's not necessarily internal bandwidth. It's still having to divide up into multiple independent working sets, with main memory being the final point of sharing.

On the 9 v 4 issue - what's the count of execution units (and functionality per unit)?

"What's the difference from a software point of view between a peer core and a synergistic core?" .

Well the biggest difference from a software point of view is the manual async cache management/"distributed memory" ???

In porting regular code to DMA you're having to encode not just the locality, but also assumptions about object ownership per thread.
It's almost like the concept of 'syntactic salt' at the program design level - a hoop you have to jump through to prove you know what's going on.

You've done this work at design/compile time, so the processor doesn't need to waste silicon/watts on guessing it at run time.
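To make the 'encoding ownership' point concrete, a small SPE job tends to end up looking something like this - a rough sketch only, with a made-up struct and no real workload, using the mfc_* calls from the SDK's spu_mfcio.h as I remember them:

Code:
#include <spu_mfcio.h>

/* A batch of data this SPE thread "owns" for the duration of the job;
   the layout is made up for illustration. */
typedef struct {
    float pos[4][64];
    float vel[4][64];
} batch_t;

static batch_t ls_batch __attribute__((aligned(128)));

void update(unsigned long long ea_batch)
{
    /* Explicit fetch: we state where the data lives and assume nobody
       else touches it while we work on it. */
    mfc_get(&ls_batch, ea_batch, sizeof(ls_batch), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();

    /* ... compute on ls_batch entirely out of local store ... */

    /* Explicit write-back: ownership goes back to main memory. */
    mfc_put(&ls_batch, ea_batch, sizeof(ls_batch), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();
}

The locality and the ownership assumption are both sitting right there in the source - which is what I mean by the 'salt'.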

Isn't it <20M transistors for an SPE, vs. approx 50M for a Xenon core or PPE? (And god knows how many for a Core 2... but I suppose you have to count execution units x clock etc. to make a comparison.) So that's where the benefit of the approach shows up...

I know the Cell also gets a benefit over other processors from its clean SIMD-based instruction set... I've no idea how much that contributes here.

The question some of my colleagues have is "but how much of this can you do with decent cache control instructions?"
- a properly shared L2 should be able to do the job of inter-LS transfers?
- prefetch + (hyperthreading?) could help deal with cache misses (without the expense of OoOE)? (rough sketch of the prefetch-hint route below)
- maybe increase the line size to get the effect of the larger DMA transfers?
- so the difference is the wasted control logic, right?
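For what it's worth, the prefetch route on a conventional core is only ever a hint, not a guarantee - roughly something like this (GCC's __builtin_prefetch; the loop and names are made up for illustration):

Code:
/* On a cached core the nearest thing to an LS fill is a prefetch hint:
   a request, not a guarantee. */
void sum_indirect(const float *data, const int *index, int n, float *out)
{
    float acc = 0.0f;
    int i;
    for (i = 0; i < n; i++) {
        /* hint: start pulling in the element we'll want a few iterations ahead */
        if (i + 8 < n)
            __builtin_prefetch(&data[index[i + 8]], 0, 1);
        acc += data[index[i]];
    }
    *out = acc;
}

Whether the line is actually resident when you get there is still the hardware's business, not yours.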

When I say "I like the clarity" - I like the fact that you appear to be able to take more implementation decisions based on reasoning (one big factor being code coherence - you actually know the size of each module earlier..) rather than measuring random cache effects :) That seems to point to something being very 'right' about the Cell.

What's Larrabee going to do? I had heard that they definitely won't do Cell-style DMA, "but for high throughput they will definitely extend the memory model - you won't be far off if you think of the Cell". Locked cache lines?
 

MarketingGuy

Newcomer
Well the biggest difference from a software point of view is the manual async cache management/"distributed memory" ???


Getting close. Think about management of misses a bit more... again in the context of stacks on peer cores versus a stack on Cell.

There really is a simple "a-ha" here... so simple that even the Marketing Guy understands it. :D
 

ebola

Newcomer
Getting close. Think about management of misses a bit more...

So I was thinking about the alternative to Cell - "massive threading" to hide latencies - a la GPU, or Niagara (I'm almost including OoOE as a type of threading here)...

... all those approaches are going to waste a lot of fast, close-to-ALU, on-chip memory on execution states that are actually idle, waiting to be re-activated when the requested data turns up? (Or be swapping them in & out...)

By 'on-chip memory' I'm including (rename) registers or whatever. (GPUs have a lot of internal register renaming going on, don't they?)


So - is that "the answer" you are getting at? Or is there something else?
OK, this seems to be why I personally find the Cell more exciting than all the alternative devices - the large working set. 256k of fast memory seems very exciting in terms of scope for algorithms.
And I suppose you could state this as 'Cell maximizes utilization of on-chip ALUs AND on-chip memory', which does sound like it's already been mentioned.

With the Cell of course this 'management' is moved into sorting tasks by data. But this works just fine for game engines.

Normal processors have prefetch, but you can't go prefetching entire chunks of a BSP tree or whatever like you can with the MFC.
 

ADEX

Newcomer
Getting close. Think about management of misses a bit more... again in the context of stacks on peer cores versus a stack on Cell.

There really is a simple "a-ha" here... so simple that even the Marketing Guy understands it. :D

I think you're referring to the manual control of data on Cell. On a traditional CPU memory reads are abstracted; you have no idea if data is in cache or in RAM.

On Cell it's all explicit, you know what is in LS and what isn't. Because data transfer is explicit you can manage it so that processing is going on while data is being transferred, this allows you to completely hide RAM latency.
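The classic way that shows up in code is double buffering - kick off the DMA for the next chunk, then compute on the one you already have. A minimal sketch (chunk size and the process() function are made up; mfc_* calls as per the SDK headers):

Code:
#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per transfer; must be a multiple of 16 */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void process(char *data, int n);  /* hypothetical work function */

/* ea = effective address of the input stream, total = bytes (multiple of CHUNK) */
void stream(unsigned long long ea, unsigned long long total)
{
    int cur = 0;
    unsigned long long off;

    /* kick off the first transfer */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (off = CHUNK; off < total; off += CHUNK) {
        int next = cur ^ 1;

        /* start fetching the next chunk while we work on the current one */
        mfc_get(buf[next], ea + off, CHUNK, next, 0, 0);

        /* wait only for the current buffer's tag, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);

        cur = next;
    }

    /* last buffer */
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    process(buf[cur], CHUNK);
}

As long as process() takes longer than the transfer, the SPU never stalls on RAM at all.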

Doing this on a traditional processor is possible, but with nowhere near the same degree of certainty - you can use cache control instructions, but you don't actually know if the data gets to cache in time, or whether another task is going to switch in and flush the cache... On Cell you know what is in the LS. If you had pre-emptive task switching (actually you do, to a degree, now) your app will return with the LS in the same state it was in before the switch - you still know with certainty what is in the LS.

This feeds into the point I made above - once you know things like this you realise how completely inefficient normal CPUs and programming are, it becomes natural to write optimised apps on Cell because doing it any other way is completely pointless.
 

MarketingGuy

Newcomer
OK, OK -- I just wanted to see how people are thinking about this. The answer I was looking for is just this:
The OS runs on each of the peer threads and none of the synergistic threads.

This leads to the observed effect that the variability of the performance is a lot higher on a peer-type processor -- you never know when a computational thread will be interrupted to go serve a page fault for some other process. Sometimes you're lucky; sometimes you're not. Variability.

Of course, it's possible not to run the OS on the peer cores, but neither the chips nor the OSes tend to be architected that way. Instead, the assumption of "peer" carries right through to having a full stack on every thread. (I had an intentional "slip" in my previous post, trying to give that away.;) ) Cell and the Linux support for Cell explicitly make the opposite assumption: while you could run an OS on an SPU thread, you probably don't want to.
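You can see that assumption right in the Linux programming model, by the way: the PPE side creates and runs SPE contexts through libspe2, and everything the kernel does - page faults, interrupts, scheduling - happens on the PPE threads. Very roughly, and from memory (the embedded SPU program name is made up):

Code:
#include <libspe2.h>

/* SPU binary embedded at link time with embedspu; the name is illustrative */
extern spe_program_handle_t worker_spu;

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop;

    spe_program_load(ctx, &worker_spu);

    /* This PPE-side call blocks while the SPE runs; page faults,
       interrupts, etc. are all serviced on the PPE threads. */
    spe_context_run(ctx, &entry, 0, NULL, NULL, &stop);

    spe_context_destroy(ctx);
    return 0;
}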

But, hey, I'm just a Marketing Guy. Maybe the above is just assumed by you all and is only news to me.
 

Arwin

Now Officially a Top 10 Poster
Moderator
Legend
Not something I have thought about before. There'll probably be some people here on the forum who can confirm whether or not that has a real impact on the issue at hand, though - I wouldn't be surprised if it wasn't really applicable, but maybe it is?
 

ebola

Newcomer
Maybe the above is just assumed by you all

Yeah, pretty much. Hehe. Well, it was fun getting all the Cell's strengths listed in one thread..

In the context of console games programming you're most likely to be comparing the PS3/Cell to its rival, the Xbox 360;

For that machine we'd implemented a simple co-operative task manager rather than relying on system threads... and wrapped allocations in various specialized pools so that they're not hitting the page table - avoiding all that horrible variability. All standard stuff for games... so it's very unlikely that the OS is going to impact worker threads there.
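(By 'specialized pools' I mean nothing fancier than a fixed-size free-list allocator along these lines - sizes and names made up, but it shows why every alloc/free is constant-time and never goes near the system heap once the arena is up:)

Code:
#include <stddef.h>

/* Fixed-size block pool: O(1) alloc/free, no system calls, no page-table
   traffic once the arena has been touched. Sizes are arbitrary. */
#define BLOCK_SIZE  256
#define BLOCK_COUNT 1024

typedef struct pool {
    unsigned char arena[BLOCK_COUNT][BLOCK_SIZE] __attribute__((aligned(16)));
    void *free_list;
} pool_t;

void pool_init(pool_t *p)
{
    int i;
    p->free_list = NULL;
    for (i = 0; i < BLOCK_COUNT; i++) {
        /* each free block stores a pointer to the next free block */
        void **slot = (void **)p->arena[i];
        *slot = p->free_list;
        p->free_list = slot;
    }
}

void *pool_alloc(pool_t *p)
{
    void **slot = (void **)p->free_list;
    if (!slot) return NULL;          /* pool exhausted */
    p->free_list = *slot;
    return slot;
}

void pool_free(pool_t *p, void *block)
{
    void **slot = (void **)block;
    *slot = p->free_list;
    p->free_list = slot;
}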

But clock for clock / core for core we are still seeing speedups when porting most code over (in varying states of optimization, very little 'fully tuned' due to time constraints). The explanations I have are [1] the hassle of porting forces you to create more coherence (and you exploit the large local store to avoid writing insane software caching), and [2] the ISA is very efficient, particularly at conditionals (I don't mean branches) & mixing data types... (I think it's ironic that it gets knocked as harder to program - harder to port generic code, sure..)
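On the conditionals point: what I mean is that you write them as compare-and-select rather than branches, e.g. a clamp like this (using spu_cmpgt/spu_sel from spu_intrinsics.h as I remember them; treat the details as from memory):

Code:
#include <spu_intrinsics.h>

/* Clamp four floats to [lo, hi] with no branch: the compare produces a
   per-element mask and spu_sel picks per element. */
vector float clamp4(vector float v, float lo, float hi)
{
    vector float vlo = spu_splats(lo);
    vector float vhi = spu_splats(hi);

    vector unsigned int too_big   = spu_cmpgt(v, vhi);
    v = spu_sel(v, vhi, too_big);       /* where v > hi, take hi */

    vector unsigned int too_small = spu_cmpgt(vlo, v);
    v = spu_sel(v, vlo, too_small);     /* where v < lo, take lo */

    return v;
}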

Actually I suppose "not running the OS on the synergistic threads" could be considered as showing up in the simplification of the load/store pipeline - no TLB or caching logic to contend with.
 