IBM on Cell as an online game server (physics simulation).

Titanio said:
I lately thought of this... it is true. I guess anywhere you can control memory access ahead of time, the local store is going to be a real advantage for the SPEs, given its speed.

That is not obvious. The local stores have a 7-cycle load-to-use latency; the PPE's caches are faster. Tasks that can be properly prefetched into local stores will also fit caches really well, but the contrary is not necessarily true.

Cheers
Gubbi
 
Gubbi said:
That is not obvious. The local stores have a 7-cycle load-to-use latency; the PPE's caches are faster. Tasks that can be properly prefetched into local stores will also fit caches really well, but the contrary is not necessarily true.

Cheers
Gubbi

I must have got it wrong then. I thought L1 cache was comparable, but not L2 :?

edit - I'm trying to Google, but I'm seeing mixed things. A PC Watch (Impress) article seems to refer to a 31-cycle latency for the L2 cache on Waternoose (Xenon). I also see guesses made here that it's more like 11 cycles. I see references to a 6-cycle read latency and 4 cycles for writes on the SPE SRAM.
 
I vaguely remember that DD1 had a longer latency, and Cell as it currently stands has a 4-cycle latency. Actually, the DD2.0 Cell had a 4-cycle latency. Things may have changed (?).
 
That is not obvious. The local stores have a 7-cycle load-to-use latency; the PPE's caches are faster. Tasks that can be properly prefetched into local stores will also fit caches really well, but the contrary is not necessarily true.

6-cycle read latency according to the docs.
The PPE's L2 latency will be higher than that: it's twice the size and has to go through an interface. The cache can hold more data, but you can never be sure whether it's there or not; the cache can get thrashed by other threads, while the local stores can't.

The local stores lose capacity and automation but get higher speed and manual control in return.
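A back-of-the-envelope way to see what that manual control buys is the classic double-buffering pattern: while the SPU computes on one buffer, the DMA engine fills the other. The sketch below is only a toy timing model with made-up cycle counts, not real SPE code:

```python
# Toy timing model of double-buffered local-store DMA (not real SPE code).
# With two buffers, the DMA of chunk i+1 overlaps compute on chunk i, so the
# steady-state cost per chunk is max(dma, compute) instead of dma + compute.

def single_buffered(chunks, dma, compute):
    # Fetch, then compute, one chunk at a time: nothing overlaps.
    return chunks * (dma + compute)

def double_buffered(chunks, dma, compute):
    # The first fetch cannot be hidden; every later fetch overlaps compute.
    return dma + (chunks - 1) * max(dma, compute) + compute

print(single_buffered(100, dma=500, compute=800))  # 130000 cycles
print(double_buffered(100, dma=500, compute=800))  # 80500 cycles
```

The point of the model is the predictability argument made above: this kind of scheduling only works because the local store cannot be thrashed behind the program's back.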
 
The LS has a 6-cycle latency. A previous Cell design had a 64 KB, 4-cycle-latency LS per SPU, but that was before DD1. They said that going from 4 to 6 cycles didn't affect performance much, because the SPUs have 128 x 128-bit general-purpose registers.

This is what an IBM developer said in the IBM Cell web forums.
 
ADEX said:
6-cycle read latency according to the docs.
The PPE's L2 latency will be higher than that: it's twice the size and has to go through an interface. The cache can hold more data, but you can never be sure whether it's there or not; the cache can get thrashed by other threads, while the local stores can't.

The L2 latency is higher, but the average load-to-use latency (L1 + L2) will be lower for the PPE for just about any workload that suits the SPEs (i.e. one with spatial coherence).

Cheers
 
deathkiller said:
The LS has a 6-cycle latency. A previous Cell design had a 64 KB, 4-cycle-latency LS per SPU, but that was before DD1. They said that going from 4 to 6 cycles didn't affect performance much, because the SPUs have 128 x 128-bit general-purpose registers.

This is what an IBM developer said in the IBM Cell web forums.

For Linpack and the like, perhaps. For anything chasing pointers, it's 50% slower.

Cheers
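Gubbi's 50% figure is just the arithmetic of dependent loads: in a pointer chase, each load's address comes from the previous load, so nothing overlaps and total time scales linearly with load-to-use latency. A minimal model:

```python
# Pointer chasing issues strictly dependent loads: the address of node i+1
# is only known once node i's load completes, so latencies add up serially.

def chase_cycles(nodes, load_to_use):
    # Lower bound: one full load-to-use latency per node visited.
    return nodes * load_to_use

old = chase_cycles(1000, 4)  # 4-cycle LS of the earlier design
new = chase_cycles(1000, 6)  # 6-cycle LS of the shipping design
print(new / old)             # 1.5, i.e. "50% slower"
```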
 
Gubbi said:
The L2 latency is higher, but the average load-to-use latency (L1 + L2) will be lower for the PPE for just about any workload that suits the SPEs (i.e. one with spatial coherence).

Cheers
It depends. If the L2 latency is 21 cycles and the L1 data latency is 5 cycles in the PPE, as it is in the XCPU cores, you would need at least a 93% hit rate in the 32 KB L1 data cache to get a better average.
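That 93% figure can be checked with a simple two-level model, treating every L1 miss as a full L2 access and using the cycle counts quoted in the posts above:

```python
# Break-even L1 hit rate for the PPE caches to beat a 6-cycle local store,
# using a simple two-level model: avg = h * L1 + (1 - h) * L2.

L1, L2, LS = 5, 21, 6  # cycles, figures quoted in the thread

def avg_latency(hit_rate):
    return hit_rate * L1 + (1 - hit_rate) * L2

break_even = (L2 - LS) / (L2 - L1)  # solve avg(h) = LS for h
print(break_even)         # 0.9375, i.e. ~93% as claimed
print(avg_latency(0.95))  # 5.8 cycles, slightly better than the LS
```

Solving 5h + 21(1 - h) = 6 gives h = 15/16 = 93.75%, so the break-even point really does sit in the low-to-mid 90s.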
 
Maybe it's just how I've read the replies here, but I feel the overall important point the paper makes has been glossed over somewhat...

quotes such as these ring alarm bells for me:

'However, from a software-engineering perspective, the impact of porting some types of legacy game software is daunting.'

'Ultimately, this will require considerable redesign of the current game code.'

'It is possible to envision applications for which this [fuller optimisation for the SPEs] could be done without great difficulty, but our game was not one of them.'


In an age where game software complexity is already skyrocketing, you have a situation where the architecture practically enforces design styles and methods that make development more difficult...

Yes, Cell is a potentially extremely powerful piece of hardware, but judging by this paper it doesn't feel like a balanced piece of hardware. Writing a 200k-line-of-code game in a couple of years is already spectacularly difficult, let alone a game where a good percentage of that code needs to be tailored specifically for the architecture to achieve acceptable performance.

I can't help thinking about an interview I watched earlier today with Anders Hejlsberg of Turbo Pascal/Delphi/C# fame. His big point was simply that the only way forward is to simplify things for developers.

The performance of the PPE is also disappointing.
It is, after all, the conductor who bows when all is said and done.


*sigh* it's far too late and I'm in a very cynical mood.
 
deathkiller said:
It depends. If the L2 latency is 21 cycles and the L1 data latency is 5 cycles in the PPE, as it is in the XCPU cores, you would need at least a 93% hit rate in the 32 KB L1 data cache to get a better average.

You're right. Just re-read the PPE details: the PPE (like the XCPU) has a 5-cycle load-to-use latency from L1; I was expecting 4 cycles max. No wonder performance is pants.

That said, L1 hit rates are often in the 90s.

Cheers
 
In an age where game software complexity is already skyrocketing, you have a situation where the architecture practically enforces design styles and methods that make development more difficult...

Yes, Cell is a potentially extremely powerful piece of hardware, but judging by this paper it doesn't feel like a balanced piece of hardware. Writing a 200k-line-of-code game in a couple of years is already spectacularly difficult, let alone a game where a good percentage of that code needs to be tailored specifically for the architecture to achieve acceptable performance.

I can't help thinking about an interview I watched earlier today with Anders Hejlsberg of Turbo Pascal/Delphi/C# fame. His big point was simply that the only way forward is to simplify things for developers.

Maximum performance is difficult, and you need to understand the hardware to get it. This is true for *every* processor, not just the consoles; it has always been true.

What's more, it's going to get worse: previously you were guaranteed a doubling of performance every couple of years. The laws of physics have ended that. Now you've got to work if you want speed.

The laws of physics are also the reason Xenon and Cell are designed the way they are; IBM would probably have much rather thrown together a bunch of G5s on a chip, but they couldn't. The dual-core G5s (2.0 GHz) run cooler than Xenon, but their max throughput is only 32 GFlops. Compare that to Xenon, which has over twice that rating.

Fast, easy, cheap. Pick any 2.
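ADEX's throughput numbers check out under the usual peak-FLOPS accounting of cores x clock x flops per cycle, assuming 8 flops/cycle (one 4-wide fused multiply-add per cycle, VMX-style) for both designs; that per-cycle assumption is mine, not from the post:

```python
# Sanity check of the peak-throughput figures in the post above:
# peak GFLOPS = cores * clock (GHz) * flops issued per cycle per core.
# 8 flops/cycle assumes one 4-wide fused multiply-add per cycle (VMX-style).

def peak_gflops(cores, ghz, flops_per_cycle=8):
    return cores * ghz * flops_per_cycle

g5 = peak_gflops(2, 2.0)     # dual-core 2.0 GHz G5 -> 32.0 GFlops
xenon = peak_gflops(3, 3.2)  # three 3.2 GHz Xenon cores -> 76.8 GFlops
print(g5, xenon, xenon / g5)  # 76.8 is indeed "over twice" 32
```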
 
ADEX said:
The laws of physics are also the reason Xenon and Cell are designed the way they are; IBM would probably have much rather thrown together a bunch of G5s on a chip, but they couldn't. The dual-core G5s (2.0 GHz) run cooler than Xenon, but their max throughput is only 32 GFlops. Compare that to Xenon, which has over twice that rating.

Fast, easy, cheap. Pick any 2.

(my bolding)

I think that's probably an oversimplification. The laws of physics don't, for example, dictate that you drop OOOE, chase gigaflops, and/or go for non-symmetrical execution cores. The same laws of physics apply to AMD and Intel too, and they're following a rather different path in their search for speed (and while I know they're designed to have different strengths, let's remember the role that we're interested in here - powering games).

It'd be interesting to see how Cell and Xenon performance compares in tests like this (particularly the more "game-representative" one) to current dual-core Pentium and Athlon 64 processors, as they're closer in transistor count and die size, IIRC.
 
function said:
(my bolding)

I think that's probably an oversimplification. The laws of physics don't, for example, dictate that you drop OOOE, chase gigaflops, and/or go for non-symmetrical execution cores. The same laws of physics apply to AMD and Intel too, and they're following a rather different path in their search for speed (and while I know they're designed to have different strengths, let's remember the role that we're interested in here - powering games).

It'd be interesting to see how Cell and Xenon performance compares in tests like this (particularly the more "game-representative" one) to current dual-core Pentium and Athlon 64 processors, as they're closer in transistor count and die size, IIRC.

With Cell you've in essence got a PPU plus a not-too-shabby CPU combined in one, with excellent power/thermal/cost characteristics. That is better and more cost-effective than shoving both a PPU and a traditional hot CPU into such a machine, and the performance probably exceeds that of the latter combination too (especially when held to the same console cost/thermal/power considerations).

Sony did this last time with the EE. Despite the massive performance degradation caused by the small caches, the lack of use of one of the units (VU0) in practically all the software, and having the other (VU1) virtually always busy with graphics-related calculations thanks to the simplistic GS, pretty impressive things were done - even DTS surround sound on the CPU. It got pretty close to what was achieved with the sort of standard CPU solution a console could only get by using a nonviable hardware budget, one that was cooked up and served some 18 months later and that also went beyond the usual console hardware considerations, judging by its size. Had both GPUs been equal in the features department, there probably wouldn't have been any difference at all to the average user.

Cell is in a much better position, with a full-featured, state-of-the-art GPU, far better memory handling, and I'd guess many improvements over the EE design. It also scales pretty high: if they were willing to do an Xbox-sized machine with exotic thermal solutions, they could probably go over 4 GHz and shatter everything out there, but they don't have to.

As for the cache vs. SPE question: what's the deal with the latest revision's SPEs, 6 cycles or 4? Devs have had these for a while and they should know the history; they would've complained by now if they believed there was any problem there, and certainly with all of these revisions, changes would likely have been made (see the PSP's 32 MB).
Gubbi said:
That is not obvious. The local stores has a 7 cycle load-to-use latency. The PPE's caches are faster. Tasks that can be properly prefetched into local stores will also fit caches really well, the contrary is not necessarily true.

Cheers
Gubbi
Do we know if this applies to the latest revision?
 
You have a whole lotta love for the PS2, which is cool, but I don't automatically share your faith that Cell is better than any alternative. Granted, I don't know anything significant about writing games software or designing CPUs, but it appears that the more people do know, the less perfect it (and Xenon) seems to be.

zidane1strife said:
With Cell you've in essence got a PPU plus a not-too-shabby CPU combined in one, with excellent power/thermal/cost characteristics. That is better and more cost-effective than shoving both a PPU and a traditional hot CPU into such a machine, and the performance probably exceeds that of the latter combination too (especially when held to the same console cost/thermal/power considerations).

I don't agree that the two solutions mentioned are in essence the same thing, or that they would provide the same results. And why does it have to be Cell or PC technology? MS went somewhere in between, and there's no reason to think other balances are not possible.

Sony did this last time with the EE. Despite the massive performance degradation caused by the small caches, the lack of use of one of the units (VU0) in practically all the software, and having the other (VU1) virtually always busy with graphics-related calculations thanks to the simplistic GS, pretty impressive things were done - even DTS surround sound on the CPU. It got pretty close to what was achieved with the sort of standard CPU solution a console could only get by using a nonviable hardware budget, one that was cooked up and served some 18 months later and that also went beyond the usual console hardware considerations, judging by its size.

I think developers have done some impressive things with the PS2 too, but the Xbox is clearly far ahead of it in terms of the quality of the visuals it can produce and the performance of its CPU. It's "pretty close" to the Xbox in the same way that the DC was "pretty close" to the PS2, despite also being nearly 18 months older and having a much smaller development budget and manufacturing costs.

The "nonviability" of the Xbox hardware was largely down to the way MS purchased components, and so isn't a useful way to decide how good the components were at their jobs relative to each other. Clearly, Sony weren't unimpressed with the idea of putting a GeForce in a console. Also, how do you factor in features like a HDD in such a comparison?

Had both GPUs been equal in the features department, there probably wouldn't have been any difference at all to the average user.

But how meaningful is this? How big would this GPU have been and who would have designed it? And what about system memory? It's a bit like people saying that if the DC had come out when the PS2 did, and with the same budget, it would have been more powerful.

Cell is in a much better position, with a full-featured, state-of-the-art GPU, far better memory handling, and I'd guess many improvements over the EE design. It also scales pretty high: if they were willing to do an Xbox-sized machine with exotic thermal solutions, they could probably go over 4 GHz and shatter everything out there, but they don't have to.

How can you tell what processor speeds would be viable (4 GHz+!) given an Xbox-sized solution? Where do you get access to that kind of information? And how do you know that this processor would "shatter everything out there"? And in what kind of synthetic tests or real situations?

If a 2.4 GHz Cell can't substantially outperform a years-old Pentium in this flop-heavy IBM test, how can you even suggest it would "shatter" something like an FX-60 or beyond (and in the role of a console CPU)?

I'm sure there will be some great things done on the PS3 regardless of any of this, I just think this thread raises some interesting questions about what developers are going to get and lose in exchange for all this complexity.
 
Most of what I've said is my opinion, based on gut feeling and memory.

As I said in the past, from what I've seen, Cell is more impressive. Even opening and closing two or three HD videos in quick succession, along with a few simple Explorer windows, can often cause massive slowdown in some of these modern CPUs (>3000+); I just don't see them holding up to Cell. For example, as I've said in many a thread, it's quite easy to bring a seemingly modern classic CPU to its knees with the physics in HL2 using Garry's Mod and a few slightly complex cloned objects interacting with each other. In a similar situation the underclocked Cell sang like a bird, and handled fluid/cloth/projectile physics, 100+ (it looked like) such objects, and advanced EyeToy functionality, probably the graphics-related calculations too.

edited
 
zidane1strife said:
Most of what I've said is my opinion, based on gut feeling and memory.

As I said in the past, from what I've seen, Cell is more impressive. Even opening and closing two or three HD videos in quick succession, along with a few simple Explorer windows, can often cause massive slowdown in some of these modern CPUs (>3000+); I just don't see them holding up to Cell. For example, as I've said in many a thread, it's quite easy to bring a seemingly modern classic CPU to its knees with the physics in HL2 using Garry's Mod and a few slightly complex cloned objects interacting with each other. In a similar situation the underclocked Cell sang like a bird, and handled fluid/cloth/projectile physics, 100+ (it looked like) such objects, and advanced EyeToy functionality, probably the graphics-related calculations too.

edited

I think in most cases opening and closing video windows in rapid succession would be an IO-limited operation, so it wouldn't matter what processor was under the hood. In theory, if there were a dozen disks spinning up, Cell might walk away with the contest, assuming the SPEs can be efficiently dedicated to disk IO, which may be harder given their focus on FP computation and their inability to directly access system memory. Can SPEs even take system interrupts?
 
3dilettante said:
I think in most cases opening and closing video windows in rapid succession would be an IO-limited operation, so it wouldn't matter what processor was under the hood. In theory, if there were a dozen disks spinning up, Cell might walk away with the contest, assuming the SPEs can be efficiently dedicated to disk IO, which may be harder given their focus on FP computation and their inability to directly access system memory. Can SPEs even take system interrupts?

Device extension model
The device extension model is a special type of function offload model. In this model, SPEs provide the function previously provided by a device, or more typically, act as an intelligent front end to an external device. This model uses the on-chip mailboxes, or memory mapped SPE-accessible registers, between the PPE and SPEs as a command/response FIFO. The SPEs have the capability of interacting with devices, since device memory can be mapped by the DMA memory-management unit and the DMA engine supports transfer size granularity down to a single byte. Devices can utilize the signal notification feature of Cell to quickly and efficiently inform the SPE code of the completion of commands.
http://researchweb.watson.ibm.com/jo.../494/kahle.pdf

And yes, SPEs can access memory on their own via DMA.
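As a rough illustration of the device-extension model in the quote above - a command/response FIFO between the PPE and an SPE acting as a device front end - here is a toy sketch using Python queues and a thread as stand-ins for the hardware mailboxes and the SPE; all names are invented, and real SPE code would use the MFC mailbox/signal channels rather than threads:

```python
# Sketch of the device-extension model: the "PPE" pushes commands into a
# mailbox, an "SPE" service loop handles them and posts responses back.
# Python Queues stand in for the hardware mailboxes; names are made up.
from queue import Queue
from threading import Thread

inbound, outbound = Queue(), Queue()  # PPE->SPE and SPE->PPE "mailboxes"

def spe_front_end():
    # SPE acting as an intelligent front end to a device.
    while True:
        cmd = inbound.get()          # blocks, like a mailbox read
        if cmd is None:              # shutdown sentinel
            break
        # ... here the real SPE would DMA device memory and transform data ...
        outbound.put(("done", cmd))  # signal completion to the PPE

worker = Thread(target=spe_front_end)
worker.start()

responses = []
for command in ("read_sector", "decode_frame"):
    inbound.put(command)             # PPE issues a command
    responses.append(outbound.get()) # and waits for the response
inbound.put(None)
worker.join()
print(responses)
```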
 
Crossbar said:
http://researchweb.watson.ibm.com/jo.../494/kahle.pdf

And yes, SPEs can access memory on their own via DMA.

That's good, then. Though I was more worried that their inability to directly access memory the way the PPE does might cause a few stumbling blocks.

I wasn't up on how full-featured the SPE's DMA interface was. If it allows the SPEs to interface with devices, it looks likely that they can get along well with a hard drive. If there are 7 disks, Cell would probably do very well in a range of situations.
 