aaronspink
Veteran
DeanoC said:
1) Random access - you need to localise or predict data you need to read OR write (writing is even more interesting than reading...).
Christ on toast! I hadn't even thought of that one yet. In the normal processor world we do a lot of things that effectively defer writes to cache/memory. In essence, with a modern processor, you should rarely if ever encounter a stall caused by a store. This is primarily because we keep track of them and handle all the nasty coherence and memory-order-model issues. In the DMA/LS model you can't really do this, as you need the data space that you will be storing into to already be loaded in the LS. You can get away without doing that, but only if you can guarantee that you're overwriting the whole thing, which is unfortunately pretty hard to arrange in practice.
DeanoC said:
2) Synchronisation - It's fairly slow and has a minimum lockable node size (the atomic unit has a large quantum, so you can't lock data smaller than that).

Is this documented anywhere? How big is the quantum? I can see how that would present problems for a lot of known parallelized algorithms. Even when synchronization is currently rare, you generally split the locking structure into small enough quanta that the probability of an actual conflict is minimal.
DeanoC said:
3) Efficiency - Keeping the algorithm fast.

That damn Amdahl guy again.
Aaron spink
speaking for myself inc.