Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

MrWibble said:
There are loads of algorithms where currently we access memory with wild abandon, simply because it was easiest to write them that way and it's not too bad on a typical CPU (though it's never the most sensible thing to do). However that's not to say that they *have* to be done that way.
One of the goals in the design of Cell is to support "wildly-abandoned gather".

The memory flow controller (the MFC, one per SPE) is like a small CPU that can access LS. In LS it finds lists of memory addresses, how much data is to be fetched, and where to put it (from main memory to LS or vice versa, or from one LS to another LS).

It then works out the best way to perform those tasks and generates DMA tasks for its own private DMA unit.
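The list-DMA mechanism described above can be sketched in plain C. This is a hypothetical simulation, not real SPE code: `dma_list_element` and `dma_gather` are made-up names, `memcpy` stands in for the MFC's DMA engine, and the real MFC list-element encoding (a packed 64-bit word) differs from the struct shown.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of an MFC-style "DMA list": each element names a
 * main-memory region to gather into local store. Field layout here is
 * illustrative, not the real MFC list-element encoding. */
typedef struct {
    const unsigned char *ea;  /* effective address in "main memory" */
    unsigned int size;        /* bytes to transfer */
} dma_list_element;

/* Simulate a list-DMA get: walk the list, copying each scattered region
 * into the local-store buffer back to back (a gather). */
static unsigned int dma_gather(unsigned char *ls, const dma_list_element *list,
                               int n)
{
    unsigned int off = 0;
    for (int i = 0; i < n; i++) {
        memcpy(ls + off, list[i].ea, list[i].size);
        off += list[i].size;
    }
    return off;  /* total bytes gathered */
}
```

A scatter is the same walk in the other direction, copying from LS out to each listed region.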

What's unclear to me is how predictable the latency of gathers and scatters is, and whether it's possible for the developer to program SPEs in a way that's tolerant of the unpredictable latency associated with "wildly abandoned gathers".

That's not to say that a traditional L2 cache is better - it seems to me it's going to cause the thread to grind to a halt too while waiting for the gathers to turn up.

I think the fact that LS isn't restricted to an "n-way" associativity that's normal for caches gives programmers the chance to control when data is loaded and "evicted" from LS in a more fine-grained fashion than caches normally support.

The other side of the coin is the coding overhead in order to maintain LS data - something that's normally performed with significantly more hardware support.
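That coding overhead is typically managed with patterns like double buffering: fetch chunk N+1 while processing chunk N, so transfer latency hides behind compute. A minimal sketch in plain C, where `memcpy` stands in for an asynchronous DMA get (real SPE code would issue the get and only later wait on its tag; names are illustrative):

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4

/* Double-buffered streaming: while "processing" buf[cur], the next chunk
 * is "fetched" (memcpy standing in for an asynchronous DMA get) into
 * buf[1-cur]. In real SPE code the fetch would overlap with compute. */
static long process_stream(const int *src, int n_chunks)
{
    int buf[2][CHUNK];
    long sum = 0;
    int cur = 0;

    memcpy(buf[cur], src, sizeof buf[cur]);          /* prime first buffer */
    for (int c = 0; c < n_chunks; c++) {
        if (c + 1 < n_chunks)                        /* kick off next fetch */
            memcpy(buf[1 - cur], src + (c + 1) * CHUNK, sizeof buf[0]);
        for (int i = 0; i < CHUNK; i++)              /* process current chunk */
            sum += buf[cur][i];
        cur = 1 - cur;                               /* swap buffers */
    }
    return sum;
}
```

The programmer, not an n-way replacement policy, decides exactly when each chunk enters and leaves the buffers.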

http://www-128.ibm.com/developerworks/power/library/pa-celldmas/

Jawed
 
ihamoitc2005 said:
Because of the silly claim that SPE=DSP, he only includes PPE integer capability in the comparison. This article is a good lesson in advertising.

As far as I know all DSP's can do integer work, but like the SPE's on CELL, they excel at either fixed point or floating point math, and thus people make the wrong assumption that they're not very good at integer work. DSP's like the SPE's tend to be simpler processors, and so calling the SPE's DSP's seems fine by me, but I guess IBM felt there were enough differences to call them something else.

Anyone who thinks the SPE's cannot do integer work should study the SPE instruction set at IBM's site. The SPE's have a full integer instruction set.

I think most people have discredited Major Nelson's article as a lot of nonsense, and it's too bad sites like IGN hosted it.
 
ihamoitc2005 said:
Here is a link to a very funny article where the false image of SPE=DSP is created.
http://www.majornelson.com/2005/05/20/xbox-360-vs-ps3-part-1-of-4/

Because of the silly claim that SPE=DSP, he only includes PPE integer capability in the comparison. This article is a good lesson in advertising.

Oh, so there's really no problem then? I was thinking mostly about MS's claim of 80% integers, bla bla.

But would it be possible to put all the operations that really require integers on the PPE, and focus the SPE's on very advanced physics, AI, graphics and so on?

Is it possible to claim that XCPU has too much integer power, if you look at it as a float/integer ratio?
Might also be that Cell has better integer power :devilish: as MrWibble put it, but MS hasn't really denied that they're behind on flops.
 
I was always under the impression that typically the (vast) majority of execution time in a game is spent working with floating point data (whatever about code volume). And that's without speaking about what might be typical in next-gen games with heavy physics - look at what happens to a dual-core P4 when it starts throwing around thousands of rigid bodies, as in AGEIA's presentation demos.
 
Both

Edge said:
As far as I know all DSP's can do integer work, but like the SPE's on CELL, they excel at either fixed point or floating point math, and thus people make the wrong assumption that they're not very good at integer work. DSP's like the SPE's tend to be simpler processors, and so calling the SPE's DSP's seems fine by me, but I guess IBM felt there were enough differences to call them something else.

Anyone who thinks the SPE's cannot do integer work should study the SPE instruction set at IBM's site. The SPE's have a full integer instruction set.

I think most people have discredited Major Nelson's article as a lot of nonsense, and it's too bad sites like IGN hosted it.

I think a DSP is usually good at one or the other, not both, but the SPE is full power for both. Also the SPE has many other features like LS and a DMA unit, so it is like an independent processor.
 
weaksauce said:
Is it possible to claim that XCPU has too much integer power, if you look at it as a float/integer ratio?
Might also be that Cell has better integer power :devilish: as MrWibble put it, but MS hasn't really denied that they're behind on flops.
You're thinking about this all wrong, getting caught up in the hardware people's number games. There can't be too much integer power, or floating point power.

Developers write software to run on given hardware, and this software needs to do various things like the main loop, IO, AI, graphics rendering, etc. The devs use whatever resources are available. On an int-strong processor, they'll use integer algorithms. If the CPU is FP-strong, the devs will look to use FP algorithms. There are many ways to solve a given problem and devs will use whatever they can to do this, working around the limitations that will always exist in one format or other. The only way you can have too much int or FP performance is if the supporting features, such as memory access rate, are insufficient to keep the calculation units running.

You can't sum up a processor and its relative performance for all things versus other processors on a couple of peak metrics and arbitrary percentage figures.

Edit: There are some things that need a certain data type - 3D graphics and physics really need FP, while data fetches need integer memory pointers. But if your CPU is limited in some such respect, you just cut back on those features. We still had games back in the 80s without intense FP performance. In the context of XeCPU and Cell, both have ample ability to crunch float and int numbers.
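As a concrete example of "devs will use integer algorithms on an int-strong CPU": fixed-point arithmetic, the staple of 80s and 90s games written for machines without FPUs. A minimal Q16.16 sketch (names are illustrative, not from any particular engine):

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits, all math done
 * on the integer units - the kind of trick used when FP was slow or absent. */
typedef int32_t q16;

static q16 to_q16(int i)         { return (q16)(i << 16); }

/* Multiply in 64 bits, then shift the extra 16 fractional bits back out. */
static q16 q16_mul(q16 a, q16 b) { return (q16)(((int64_t)a * b) >> 16); }

static int q16_int(q16 a)        { return a >> 16; }
```

Addition and subtraction work unchanged on the raw integers; only multiply and divide need the scaling shift.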
 
Integer

weaksauce said:
Oh, so there's really no problem then? I was thinking mostly about MS's claim of 80% integers, bla bla.

Maybe sometimes a high percentage of code is integer, but the actual processing-time percentage for games is very little integer and very high floating point. But even if someone makes a new kind of game with 99% integer, CELL has more integer speed. For normal games with high float need, CELL has more float speed. It is a much faster processor.

But would it be possible to put all the operations that really require integers on the PPE, and focus the SPE's on very advanced physics, AI, graphics and so on?

Developers can distribute tasks as they prefer. Maybe sometimes it is better to run some integer work on the SPE's, and sometimes better on the PPE.

Is it possible to claim that XCPU has too much integer power, if you look at it as a float/integer ratio?

Always better to have more of everything no?

Might also be that Cell has better integer power :devilish: as MrWibble put it, but MS hasn't really denied that they're behind on flops.

Yes. MS made a mistake in admitting the Xbox 360 CPU has fewer flops. I think, like the integer issue, they should have made up a story that the SPE is not good for flops (MS should say the SPE is only good for MPEG2 or audio decode), then they could say they have 3x flops and 3x integer (but really it is less for both).
 
About the "too much power" thing, I was thinking of it as a ratio between flops and integer: either high integer and low float, or the other way around. And then I was wondering if newer games would run better on a float-focused CPU, because of the physics and AI.
But hey, I don't know much about this, so thanks for answering. :)
 
Titanio said:
I was always under the impression that typically the (vast) majority of execution time in a game is spent working with floating point data (whatever about code volume). And that's without speaking about what might be typical in next-gen games with heavy physics - look at what happens to a dual-core P4 when it starts throwing around thousands of rigid bodies, as in AGEIA's presentation demos.

This has come up before. I believe someone used some AMD code analyzer thing and found that in current games, only about 1% of the execution time is spent on floating point. (Though that may just be the average when not much is happening - when physics engines come into play, floating point may jump to 99% of the execution time, then settle down again afterwards.)
 
Fox5 said:
This has come up before. I believe someone used some AMD code analyzer thing and found that in current games, only about 1% of the execution time is spent on floating point. (Though that may just be the average when not much is happening - when physics engines come into play, floating point may jump to 99% of the execution time, then settle down again afterwards.)

Of course as a snobbish console programmer I could also argue that PC programmers are just sloppy and write inefficient code that takes ages to do anything useful :)
 
Fox5 said:
This has come up before. I believe someone used some AMD code analyzer thing and found that in current games

That's a pretty broad statement.. in all current PC games? In a sampling tested, perhaps. That's still very surprising, though that's without accounting for the issues MrWibble raises (and others - for example the avoidance of FP usage for performance reasons even where it otherwise might be desirable) ;) Can you give me a link to more info, anyway?

As I said though, irrespective of the state of current games, I'd suggest that going forward, floating point-intensive tasks would be the dominant presence. There's a reason, for example, why things like PhysX chips are appearing in the PC space now. You'd also have to ask yourself why the likes of Sony and, yes, MS focussed their chips so much on FP capability if it would typically be for the sake of 1% of execution time. They could just as easily have optimised for integer, for example, but they didn't. I'd also refer to comments by the likes of Tim Sweeney, suggesting that typically the things an SPE wouldn't be very useful for only take a small fraction of execution time anyway - and given the SPE's focus on FP, that might be a relevant comment here.
 
ihamoitc2005 said:
I think a DSP is usually good at one or the other, not both, but the SPE is full power for both. Also the SPE has many other features like LS and a DMA unit, so it is like an independent processor.

Maybe you should do some checking on DSP's. As far as I know, all DSP's can do integer work, and are very good at either fixed point or floating point math. You're right that some DSP's do not have floating point units.

Many DSP's also have local stores using SRAM, and also have DMA engines.

The main difference the SPE's have over typical DSP's is high clock rate, as no DSP that I know of runs anywhere near 3.2 GHz, or has an LS as big as 256 KB. Texas Instruments' highest-clocked DSP runs at around 1 GHz. One of the goals for CELL's SPE's was high clock rate.
 
DSP

Edge said:
Maybe you should do some checking on DSP's. As far as I know, all DSP's can do integer work, and are very good at either fixed point or floating point math. You're right that some DSP's do not have floating point units.

Many DSP's also have local stores using SRAM, and also have DMA engines.

The main difference the SPE's have over typical DSP's is high clock rate, as no DSP that I know of runs anywhere near 3.2 GHz, or has an LS as big as 256 KB. Texas Instruments' highest-clocked DSP runs at around 1 GHz. One of the goals for CELL's SPE's was high clock rate.


Thank you, I will read up on this.
 
Titanio said:
That's a pretty broad statement.. in all current PC games? In a sampling tested, perhaps. That's still very surprising, though that's without accounting for the issues MrWibble raises (and others - for example the avoidance of FP usage for performance reasons even where it otherwise might be desirable) ;) Can you give me a link to more info, anyway?

As I said though, irrespective of the state of current games, I'd suggest that going forward, floating point-intensive tasks would be the dominant presence. There's a reason, for example, why things like PhysX chips are appearing in the PC space now. You'd also have to ask yourself why the likes of Sony and, yes, MS focussed their chips so much on FP capability if it would typically be for the sake of 1% of execution time. They could just as easily have optimised for integer, for example, but they didn't. I'd also refer to comments by the likes of Tim Sweeney, suggesting that typically the things an SPE wouldn't be very useful for only take a small fraction of execution time anyway - and given the SPE's focus on FP, that might be a relevant comment here.

Not all games were tested, but Half-Life 2 (one of the more physics-heavy games) was, along with one or two others. Don't remember the link, but it was a post on this board.
Maybe current games aren't floating point heavy because the current hardware isn't good at it. (Besides physics, what else needs floating point? Graphics, but the GPUs handle that.)
 
Fox5 said:
This has come up before. I believe someone used some AMD code analyzer thing and found that in current games, only about 1% of the execution time is spent on floating point. (Though that may just be the average when not much is happening - when physics engines come into play, floating point may jump to 99% of the execution time, then settle down again afterwards.)

That was probably me. I profiled HL2, Doom 3 and Far Cry. All were in the 15-25% range of FP ops (retired x87 / retired x86). HL2 the least, D3 and FC about equal (shadow volume extrusion in D3 vs more physics in FC making it a toss-up).

All used almost exclusively x87 code (Far Cry was the game that used the most SIMD in some of its DLLs).

Anyway

Running Ageia's PhysX SDK on my X2 4400 with the most FP heavy demo I could find in the SDK (samplemeshmaterials with multiple instances of the big pyramid stack of boxes) and profiling it, I got:

FPU retired ops of retired x86 ops total: 51.65%, surprisingly >99% are x87, not SIMD (3DNow or SSE).

Number of instructions causing data cache accesses: 68.74% (quite a lot, x87 register pressure shining through ??)

The three FPU pipelines had this breakdown:

x87 add: 17.26% of total, 30.02% of total FPU ops
x87 mul: 17.10% of total, 29.75% of total FPU ops
x87 store: 23.15% of total, 40.27% of total FPU ops

Margin of error <1%.

Interestingly, only 60% of the 51.65% FPU ops do arithmetic (31% of total). Remember these are x87 ops; with SIMD it should be possible to have fewer FP instructions (even with more ops).
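The instructions-vs-ops distinction can be illustrated directly: a packed SSE add retires as one instruction but performs four FP ops, where scalar code needs four separate adds. A sketch assuming an x86 target with SSE available:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Scalar version: four separate add instructions for four FP ops. */
static void add4_scalar(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* SIMD version: one packed-add instruction still performs four FP ops,
 * so the FP *instruction* count drops while the op count stays the same. */
static void add4_sse(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

This is why a SIMD-optimized engine would show a smaller fraction of FP instructions in a retired-instruction profile even while doing the same (or more) FP work.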

The demo was completely contained in caches:

D$ misses: 1.39% !!! (ie 98.61% hitrate in the 64KB 2-way assoc D$)
L2 hits: 1.37% !!! (ie. only 0.02% going to main memory)

Of all instructions 8.90% were branches.

This is in line with previous findings IMO. If the physics engine were optimized for SIMD I'd expect the fraction of FP instructions to decrease (even as ops are increasing), making the fraction of other instructions higher (branches in particular are nasty).

My guess is that physics engines are not likely to be bound by FPU capacity but rather by something else (branches and loads/stores).

This does not mean that SPUs are useless for physics, far from it. And they still have crazy vector/linear algebra performance which will be very useful (expect spectacular cloth, water and whatnot simulations).

Cheers
Gubbi
 
Gubbi said:
FPU retired ops of retired x86 ops total: 51.65%, surprisingly >99% are x87, not SIMD (3DNow or SSE).
Now that is interesting. Then again, considering they are using the SDK to promote their chips, it isn't so surprising.

Number of instructions causing data cache accesses: 68.74% (quite a lot, x87 register pressure shining through ??)
Well, when your architecture is primarily a memory-memory architecture, you'd expect a lot of memory accesses. To the point that some refer to the L1 cache on an x86 as being the "big register file" and the actual registers as being the accumulators (which they basically are). If you profiled on a non-x86 architecture, it is quite likely that you would see a significantly lower % of FPU ops vs non-FPU ops because of all the additional load/store traffic.

The demo was completely contained in caches:

D$ misses: 1.39% !!! (ie 98.61% hitrate in the 64KB 2-way assoc D$)
L2 hits: 1.37% !!! (ie. only 0.02% going to main memory)

Of all instructions 8.90% were branches.

As much as people like to malign caches, for HPTC-type code they can work pretty well. In the cases where they don't, you are either memory bandwidth limited or memory latency limited (e.g. streams and linked lists). In the case of stream-like workloads, there is little you can do but increase the number of outstanding memory accesses and the actual realized bandwidth of the memory system. For the linked-list type cases, the only option is to either decrease the memory latency or have a very, very good prefetcher that trades off usable bandwidth for latency.
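The stream vs linked-list distinction comes down to address dependence. A sketch (illustrative names) of the two access patterns: in the pointer-chasing loop each load address depends on the previous load completing, so misses serialize, while the streaming loop's addresses are independent and trivially prefetchable.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int value; struct node *next; } node;

/* Pointer chasing: the address of the next load is unknown until the
 * current load completes, so cache misses form a serial latency chain. */
static int chase_sum(const node *n)
{
    int sum = 0;
    while (n) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}

/* Streaming: every address is computable up front, so hardware (or a
 * prefetcher) can keep many memory accesses in flight at once. */
static int stream_sum(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both loops do the same arithmetic; only the memory-level parallelism available to the hardware differs.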

Aaron Spink
speaking for myself inc.
 
aaronspink said:
Well, when your architecture is primarily a memory-memory architecture, you'd expect a lot of memory accesses. To the point that some refer to the L1 cache on an x86 as being the "big register file" and the actual registers as being the accumulators (which they basically are). If you profiled on a non-x86 architecture, it is quite likely that you would see a significantly lower % of FPU ops vs non-FPU ops because of all the additional load/store traffic.

That's what I figured as well. The adds and muls must be of the [reg],st type, the large amount of stores seem to indicate register spills too.

Cheers
 
Titanio said:
Very interesting Gubbi, thanks very much. Is it able to tell you execution time?

Well, no, the demo runs indefinitely; I used 40-60 second traces. While not wholly scientific, the instruction ratios seem to be consistent across multiple runs (did a whole bunch; CodeAnalyst only allows triggering on 4 different events at a time).

Cheers
 