Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

london-boy said:
Ok uhm sorry to sound on-topic, but after about 40 pages of thread pertaining to PS3, has the question been answered?

Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?


:smile:

Well, there will be no definitive answer to that question. There will always be some applications for which Xenon is faster than CELL and some which CELL is faster than Xenon. So people will argue forever. :)

My opinion is that on balance CELL will be significantly higher performance for the type of workloads that Sony was targeting (e.g., physics, GPU assist, video decode/encode, encryption/decryption, AI, etc.) Part of this is due to the high number of execution engines in CELL (i.e., all the FLOPS that everyone talks about). But I also believe that CELL did a much, much better job of attacking the "memory wall" problem. Bandwidth both internally and externally is significantly higher for CELL, and as I pointed out in my previous post, the LS was designed for much higher performance (lower latency & higher bandwidth) than the Xenon L1/L2 architecture (again, for the workloads Sony was targeting).
 
Fafalada said:
SPE LS can read/write 128 bytes/cycle, which puts it on the order of 400GB/s. So no, the L1 caches aren't really faster.

So, um, this gif is wrong? Because there's an arrow there, and it's pretty clearly labelled "51.2 GB/s".

http://www-128.ibm.com/developerworks/power/library/pa-cellperf/figure1.gif


Plus, why would the LS need to read/write 128 bytes/cycle, when at most you could execute a load quadword (128-bits) into the SPU once per cycle, and the EIB bus interface can read or write 16 bytes every other clock?
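As a sanity check on the numbers being thrown around, here is the arithmetic behind them (my own back-of-the-envelope sketch, assuming the 3.2 GHz clock both chips run at, with the port widths taken from the IBM developerWorks article linked above):

```c
/* Back-of-the-envelope check of the bandwidth figures in this thread,
 * assuming a 3.2 GHz clock: 1 byte/cycle at 3.2 GHz = 3.2 GB/s. */
static const double GHZ = 3.2;

double gbps(double bytes_per_cycle) {
    return bytes_per_cycle * GHZ;
}

/* gbps(16.0)       -> 51.2  GB/s: SPU load/store port (one quadword per cycle)
 * gbps(128.0)      -> 409.6 GB/s: full 128-byte LS line access
 * gbps(16.0 / 2.0) -> 25.6  GB/s: EIB port, 16 bytes every other cycle */
```

So the 51.2 GB/s arrow in the diagram and the ~400 GB/s figure are not in conflict; they describe different ports on the same array.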
 
ihamoitc2005 said:
Total LS bandwidth is 8 x (25.6x2) = 409.6 GB/s

If you want to count all the LS's together vs all the L1 caches together, sure, Cell has 400 GB/s vs 300 GB/s for Xenon's L1 caches.

But I was only talking about per core - one xenon core vs one SPE.
 
aaaa0 said:
So, um, this gif is wrong? Because there's an arrow there, and it's pretty clearly labelled "51.2 GB/s".
That's bandwidth to the register banks only, which obviously can only consume 16 bytes/cycle.

Plus, why would the LS need to read/write 128 bytes/cycle
Because it must guarantee predictable memory accesses.
Among other things, like cumulative bandwidth demands, the 128 bytes/cycle read/write serves efficient instruction prefetch (1 fetch cycle every 32 instructions).

Anyway, it's odd that you missed this diagram, which is on the same page as the gif you linked to:
figure3.gif
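The "1 fetch cycle every 32 instructions" figure follows directly from the fixed-width SPU ISA; a trivial check of my own, assuming 4-byte instructions and 128-byte fetches:

```c
/* SPU instructions are a fixed 4 bytes, so a 128-byte LS fetch covers
 * 128 / 4 = 32 of them: one fetch cycle per 32 straight-line
 * instructions, which is the figure quoted above. */
enum { LS_FETCH_BYTES = 128, SPU_INSN_BYTES = 4 };

int insns_per_fetch(void) {
    return LS_FETCH_BYTES / SPU_INSN_BYTES;   /* -> 32 */
}
```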
 
Hold on:

figure3.gif


Looks like the bandwidth to the instruction issue unit is 64 bytes/cycle, while the bandwidth from load/store is only 16 bytes/cycle.

Who cares what the bandwidth is internally between the LS store and its controller? I'd assume the bandwidth between Xenon's L1 cache and the cache control logic is much larger than the actual usable cache bandwidth as well.

What matters is what the SPU can get at when executing code, and what the EIB can get at when it needs to put data into LS, no? And that is 51.2 GB/s to the SPU, and 25.6 GB/s to the EIB.
 
Titanio said:
So I guess that's actually 8 x 400GB/s = 3.2TB/s?
Yep, but the EIB (and XDR) obviously can't sustain that.
That huge LS bandwidth is there to make sure DMA reads/writes and instruction fetches don't significantly stall code running on an SPE.
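The pattern being described (compute on one buffer while DMA fills the other) looks roughly like this toy C sketch of mine. memcpy stands in for the asynchronous MFC DMA, so it shows only the buffer rotation, not the actual compute/transfer overlap, and none of the names are real Cell SDK calls:

```c
#include <string.h>

/* Toy model of the double-buffering pattern the spare LS bandwidth makes
 * cheap: while the SPU computes on one buffer, a DMA transfer fills the
 * other. memcpy is a stand-in for the asynchronous MFC DMA here. */
enum { CHUNK = 16, CHUNKS = 8 };

static int ls_buf[2][CHUNK];                 /* two LS-resident buffers */

static long process(const int *buf) {        /* stand-in for SPU compute */
    long sum = 0;
    for (int i = 0; i < CHUNK; i++) sum += buf[i];
    return sum;
}

long stream_sum(int src[CHUNKS][CHUNK]) {
    long total = 0;
    memcpy(ls_buf[0], src[0], sizeof ls_buf[0]);        /* prime buffer 0 */
    for (int c = 0; c < CHUNKS; c++) {
        int cur = c & 1;
        if (c + 1 < CHUNKS)                             /* start next "DMA" */
            memcpy(ls_buf[cur ^ 1], src[c + 1], sizeof ls_buf[0]);
        total += process(ls_buf[cur]);                  /* work on current */
    }
    return total;
}
```

On real hardware the point is that the 128-byte LS port can absorb the incoming DMA traffic while the 16-byte load/store port keeps feeding the SPU.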
 
aaaa0 said:
Looks like the bandwidth to the instruction issue unit is 64 byte/cycle, the bandwidth from load/store is only 16 byte/cycle.
First you lumped the instruction cache bandwidth in with the D-cache, and now you're trying to argue that instruction read bandwidth is irrelevant in the LS.
Unless IBM engineers are so smart they found a way to store instructions in a vacuum, that bandwidth is right there, and it certainly counts.

As noted twice in this thread now - the LS is single-ported - that's why it was designed with "redundant" bandwidth, to eliminate stalling as much as possible.
 
Fafalada said:
First you lumped the instruction cache bandwidth in with the D-cache, and now you're trying to argue that instruction read bandwidth is irrelevant in the LS.
Unless IBM engineers are so smart they found a way to store instructions in a vacuum, that bandwidth is right there, and it certainly counts.

As noted twice in this thread now - the LS is single-ported - that's why it was designed with "redundant" bandwidth, to eliminate stalling as much as possible.

Ok, I withdraw the point.

If we ignore instruction bandwidth, they're the same.
 
aaaaa00 said:
Ok, I withdraw the point.

If we ignore instruction bandwidth, they're the same.

You're arguing the wrong point anyway.
The trade-off of the local store is that it requires data locality, whereas a multi-way set-associative cache doesn't in the same way.

If you can stream data, the data source is large, and you're doing minimal work on it, the LS will be a win; prefetching should bring them to parity if the work you're doing per fetch is significant. If you're walking a data structure without significant locality (say a tree, or an STL list), then the cache will win.

It's all trade offs.

In an ideal world you'd have both.
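The two access patterns being contrasted, in miniature (my own toy C, not Cell code): the streaming pass has predictable addresses that a DMA engine or prefetcher can service ahead of time, while each address in the list walk depends on the previous load:

```c
#include <stddef.h>

/* Same reduction, two memory behaviours: a flat-array pass (LS/DMA
 * friendly) vs a linked-list walk (a cache copes, an LS suffers). */
struct node { int value; struct node *next; };

long sum_array(const int *a, size_t n) {     /* streamable: addresses known upfront */
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

long sum_list(const struct node *head) {     /* pointer chasing: data-dependent addresses */
    long s = 0;
    for (; head; head = head->next) s += head->value;
    return s;
}
```

Both return the same answer; the difference is entirely in how predictable the address stream is to the memory system.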
 
Titanio said:
So I guess that's actually 8 x 400GB/s = 3.2TB/s?
The 3.2TB/s theoretical bandwidth is there, but to reach it you'd need to basically fetch a new line of instructions on every cycle, i.e. skip 32 instructions ahead each cycle; there's not a lot of code around that does that.

So the 3.2TB/s number is pointless if you ask me.

Like others have said, the peak bandwidth is there to minimize (data-access) contention for the LS.

Cheers
Gubbi
 
Maybe you've discussed this, but how much integer power is needed in games? I mean, would there be situations where the PPE won't be able to handle what's not best done with floating points?
 
weaksauce said:
Maybe you've discussed this, but how much integer power is needed in games? I mean, would there be situations where the PPE won't be able to handle what's not best done with floating points?

It doesn't much matter because the SPUs can handle integer work at the same speed (or faster, if you're working on smaller values).

A more pertinent question would be, how many types of operations absolutely require frequent random access of a large amount of memory and cannot be refactored to be more coherent? (and note once again - this is simply a matter of performance, not capability...)

My own answer to that would be "we don't know exactly, because no-one has really tried yet".

There are loads of algorithms where currently we access memory with wild abandon, simply because it was easiest to write them that way and it's not too bad on a typical CPU (though it's never the most sensible thing to do). However that's not to say that they *have* to be done that way.
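One hypothetical example of such a refactor (my own sketch, plain C, names are illustrative): a gather over random indices touches memory with wild abandon, but sorting the index list first turns the same work into a mostly-sequential sweep. The result is identical; only the access order becomes coherent:

```c
#include <stdlib.h>

/* Gather-sum with an optional "make the accesses coherent" step:
 * sorting the indices reorders the memory accesses, not the result. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

long gather_sum(const int *data, int *idx, size_t n, int make_coherent) {
    if (make_coherent)
        qsort(idx, n, sizeof *idx, cmp_int);   /* sequentialize the sweep */
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[idx[i]];
    return s;
}
```

On a cache this helps hit rates; on an LS it's what makes the data DMA-able in big contiguous chunks at all.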
 

weaksauce said:
Maybe you've discussed this, but how much integer power is needed in games? I mean, would there be situations where the PPE won't be able to handle what's not best done with floating points?

Here is a link to a very funny article where a false image of SPE=DSP is created.
http://www.majornelson.com/2005/05/20/xbox-360-vs-ps3-part-1-of-4/

Because of the silly claim that SPE=DSP, he only includes PPE integer capability in the comparison. This article is a good lesson in advertising.
 