Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

mckmas8808 said:
Sooo Gubbi basically what does that mean for the PS3 games using lots of physics?

Nothing at all. It just means you can't correlate peak FLOPS directly to physics performance.

Cheers
 
aaronspink said:
As much as people like to malign caches, for HPTC-type code they can work pretty well. In the cases where they don't, you are either memory-bandwidth limited or memory-latency limited (e.g. streams and linked lists). In the case of stream-like workloads, there is little you can do but increase the number of outstanding memory accesses and the actual realized bandwidth of the memory system. For the linked-list type cases, the only option is to either decrease the memory latency or have a very, very good prefetcher that trades off usable bandwidth for latency.
The SPE addresses those cases with the LS. The latency is low, and a programmer may be able to design a good app-specific prefetcher.

That said, will a game programmer throw sloppy, abstract objects like linked lists at the SPE? If the exotic configuration of Cell forces developers to think about optimization in early stages, it's not that bad.
 
one said:
That said, will a game programmer throw sloppy, abstract objects like linked lists at the SPE? If the exotic configuration of Cell forces developers to think about optimization in early stages, it's not that bad.

If you want physics on your SPUs, you'll need a space decomposition structure (for collisions), that's likely to be an oct-tree or a K-D tree. A tree regardless.

Trees are made up of nodes, linked together by pointers. If the tree can be contained in the LS, fine, if not.... :(

Cheers
 
Gubbi said:
If you want physics on your SPUs, you'll need a space decomposition structure (for collisions), that's likely to be an oct-tree or a K-D tree. A tree regardless.

You can make these structures very compact. My memory's a little hazy, but IIRC in a K-D tree you can represent a node in 32 bits (4 bytes), with leaf contents stored elsewhere. With compression and the structure I'm thinking of here, you could simultaneously store about 4,000 nodes and 38,000 vertices in 128KB of data. For leaf contents you'd be preloading/prefetching; there are easy ways to do this, and then perhaps smarter ways. Indeed, given hardware cache control or none at all, it may be beneficial to implement a software cache for something like this regardless. I'm not sure how large a typical tree is, though, I'll admit. There's possibly also the potential for splitting a tree and handing different branches to different SPUs for searches and the like?
 
AI

Comment by Hofstee on CELL AI:

http://www-128.ibm.com/developerwor...93069&message=13750568&cat=46&q=tree#13750568

With respect to AI, I don't have the answers, but I have good hope. I think AI is usually not bound by computation but by memory access penalties (on high-frequency processors). I can imagine tree search algorithms for the SPEs that absolutely rock, by getting a lot of memory accesses in flight concurrently. There may be a patent out there by M. Necker and myself that describes some of this for the case of routing table accesses ( also a kind of tree search ).

There's much to learn about CELL on IBM's developerWorks forum site.
 
Gubbi said:
If you want physics on your SPUs, you'll need a space decomposition structure (for collisions), that's likely to be an oct-tree or a K-D tree. A tree regardless.

Trees are made up of nodes, linked together by pointers. If the tree can be contained in the LS, fine, if not.... :(
According to DemoCoder in this thread (your previous benchmark test is in it too)
http://www.beyond3d.com/forum/showthread.php?t=21328&page=3
some trees are streamable. So I think it depends.
 
one said:
The SPE addresses those cases with the LS. The latency is low, and a programmer may be able to design a good app-specific prefetcher.

No, it doesn't. The SPE will choke just as much as most other architectures on STREAM. When you are memory bandwidth limited, you are memory bandwidth limited; the LS doesn't change that. And the LS won't help out in linked-list cases. In fact, if you are really chasing through, say, a 16 MB LL structure, the LS will probably have a negative impact.

The LS only provides an advantage IF and ONLY IF you can reasonably fit the data set size within the LS. In those cases the LS can provide an advantage, but if you have a data structure that will fubar an equal-sized cache, it will also fubar the LS. The LS is effectively a software-managed cache; there is plenty of research out there on the strengths and weaknesses of such designs.

That said, will a game programmer throw sloppy, abstract objects like linked lists at the SPE? If the exotic configuration of Cell forces developers to think about optimization in early stages, it's not that bad.

You think LLs happen because people want them to happen? Yeah, ideally every data structure would be small, have perfect spatial locality and excellent temporal locality, but back in reality, sometimes data structures just aren't what you want them to be. There are always significant tradeoffs, for instance with sparse matrix structures.

So what could end up happening is some nice algorithms that you can get great performance from, just so long as you don't have too many objects.

Aaron Spink
speaking for myself inc.
 
Titanio said:
You can make these structures very compact. My memory's a little hazy, but IIRC in a K-D tree you can represent a node in 32 bits (4 bytes), with leaf contents stored elsewhere. With compression and the structure I'm thinking of here, you could simultaneously store about 4,000 nodes and 38,000 vertices in 128KB of data. For leaf contents you'd be preloading/prefetching; there are easy ways to do this, and then perhaps smarter ways. Indeed, given hardware cache control or none at all, it may be beneficial to implement a software cache for something like this regardless. I'm not sure how large a typical tree is, though, I'll admit. There's possibly also the potential for splitting a tree and handing different branches to different SPUs for searches and the like?

And this is where parallelism starts to get interesting.
You can devise data structures that will let you split collision sets amongst SPEs, but at some cost to maintaining and referencing those data sets; you may even speculatively do redundant work. The point is nothing is free: for most tasks, by increasing the parallelism you invariably reduce the efficiency of a single thread.
 
one said:
According to DemoCoder in this thread (your previous benchmark test is in it too)
http://www.beyond3d.com/forum/showthread.php?t=21328&page=3
some trees are streamable. So I think it depends.

They are not.

Yes, you can use high fan-out trees to trade bandwidth for latency, but it very quickly becomes uneconomical. It's already done, though; it's not uncommon to pack trees so that individual subtrees fit into cache lines to exploit locality (and avoid cache-line alignment issues).

If you compare a subtree in a binary tree (the simplest) to a B-tree, you have log2(n) levels in the binary tree and hence only need to visit log2(n) nodes. Doing a linear search in the B-tree node will on average require n/2 compares (well, not really if you pack the individual elements in the BT node as a regular tree), but more significantly you'll load up n elements, n - log2(n) more than the binary tree; this is bad for large n. The only hope of recouping some of that is if you can do lots of searches that will reuse the otherwise wasted bandwidth/nodes (which, incidentally, there's a good chance of happening).

This means that the sweet spot for the size of high fan-out tree nodes is normally rather small.

Cheers
 
ERP said:
And this is where parallelism starts to get interesting.
You can devise data structures that will let you split collision sets amongst SPEs, but at some cost to maintaining and referencing those data sets; you may even speculatively do redundant work. The point is nothing is free: for most tasks, by increasing the parallelism you invariably reduce the efficiency of a single thread.

That sounds interesting.

So how do you partition this on the various SPEs? Do you maximize spatial locality? Or do you distribute data so that the SPEs are equally loaded?

Cheers
Gubbi
 
aaronspink said:
The LS only provides an advantage IF and ONLY IF you can reasonably fit the data set size within the LS.

Totally false, as IBM's landscape demo shows while dealing with a matrix 128 MB in size. The matrix could not come close to fitting in LS memory, yet the demo had no problems handling it.

Note the SPEs were also handling 200 MB of textures, so in total over 300 MB of data.
 
Edge said:
Totally false, as IBM's landscape demo shows while dealing with a matrix 128 MB in size. The matrix could not come close to fitting in LS memory, yet the demo had no problems handling it.

Note the SPEs were also handling 200 MB of textures, so in total over 300 MB of data.

The 128 MB matrix isn't the working data set size. At any given time quantum the program is only dealing with a fraction of the matrix as its working set. Just like you don't load all of a 4k x 4k texture, you only load the portion that you actually need. For the terrain render, they have a significant amount of temporal locality.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
The 128 MB matrix isn't the working data set size. At any given time quantum the program is only dealing with a fraction of the matrix as its working set. Just like you don't load all of a 4k x 4k texture, you only load the portion that you actually need. For the terrain render, they have a significant amount of temporal locality.

That's right, a solution is found by breaking the data into chunks. Not every problem will have this as a solution, but then again you do have that PPE to work with also.
 
Blocks

Edge said:
That's right, a solution is found by breaking the data into chunks. Not every problem will have this as a solution, but then again you do have that PPE to work with also.

Yes. This is precisely the method.

http://www.research.ibm.com/cell/whitepapers/TRE.pdf

Where the plane intersects the view screen are the locations of the accumulation buffer to be modified [figure 1]. Where the plane intersects the height map are the locations of the input height data needed to process the vertical cut of samples [figure 2]. The PPE communicates the range of work blocks each SPE is responsible for via base address, count, and stride information once per frame. The SPE then uses this information to compute the address of each work block, which is then read into local store via a DMA read and processed. This decouples the PPE from the vertical-cut-by-vertical-cut processing, allowing it to prep the next frame in parallel.

The SPEs place the output samples in an accumulation buffer using a gather DMA read and a scatter DMA write. Each SPE is responsible for four regions of the screen, and the vertical cuts are processed in round-robin fashion, one vertical cut per region, and left to right within each region, so no synchronization is needed on the output, as no two SPEs will ever attempt to modify the same locations in the accumulation buffer, even with two vertical cuts in flight (double buffering) per SPE. The SPEs are therefore free to run at their own pace, processing each vertical cut without any data synchronization on the input or output. None of the SPEs' input or output data is touched by the PPE, thereby protecting the PPE's cache hierarchy from transient streaming data.
 
ERP said:
And this is where parallelism starts to get interesting.
You can devise data structures that will let you split collision sets amongst SPEs, but at some cost to maintaining and referencing those data sets; you may even speculatively do redundant work. The point is nothing is free: for most tasks, by increasing the parallelism you invariably reduce the efficiency of a single thread.

I was thinking that if you split the tree across a number of SPEs and gave different branches to different SPEs, there would be some redundant queries where a single SPE covering the whole tree would have stopped. But it seems the proportion would be small.

Gubbi said:
That sounds interesting.

So how do you partition this on the various SPEs? Do you maximize spatial locality? Or do you distribute data so that the SPEs are equally loaded?

If you can fit a representation of your entire tree on one SPE, you could simply split your query objects amongst multiple SPEs. I guess the other option would be to hand different branches to different SPEs - but some branches may be more interesting than others, and thus some SPEs might be busier than others if your tree isn't "balanced" from that perspective. Another option might be to give each branch to every SPE and split the query objects amongst them. Or to give each branch not to every SPE, but multiple SPEs, and split the query objects appropriately.

To refine what I was saying earlier about kd-tree representation..I suppose for the leaf nodes, you'd only need to store enough data to kickstart the SPE once it got down to that level - enough vertices to sustain it until more could be streamed in. The starting memory address, the size of the data, and enough to keep it busy till you stream in more. Ditto, you might want to keep multiple query objects in LS - enough to keep it going till more could be streamed in.

Anyone have any idea how typically big a tree is? Or any guesses as to how big a tree might get? Number of nodes, number of leaf nodes?
 
Gubbi said:
So how do you partition this on the various SPEs? Do you maximize spatial locality? Or do you distribute data so that the SPEs are equally loaded?
If I look at our collision model at the moment, I think I'd want to run the first couple of box tests on the PPE and then offload the relevant subtrees to an SPE. We just need to reorganize the data so subtrees smaller than "somesize" are stored linearly in memory.
And we already have primitive tests that follow the box checks in streamable form (it's just going to be simpler now, since I can offload the entire sequence; the VU couldn't do that completely on its own).

Obviously that's not a generic solution, just happens to work with this particular workload.
 
Edge said:
That's right, a solution is found by breaking the data into chunks. Not every problem will have this as a solution, but then again you do have that PPE to work with also.
Of course they picked a trivially chunkable algorithm... graphics is known to be embarrassingly parallel.
It's a lot harder for real algorithms...
You have to juggle:
1) Random access - you need to localise or predict the data you need to read OR write (writing is even more interesting than reading...).
2) Synchronisation - it's fairly slow and has a minimum lockable size (the atomic unit has a large quantum, so you can't lock data smaller than that).
3) Efficiency - keeping the algorithm fast.
4) Control overhead - you need something to control it. Using the PPE can be interesting (remember it's having to run an entire game as well; any control system that requires fast response is hard (though not impossible)). Using SPEs can be even more interesting (a simple example is SPU-controlled co-operative multi-tasking, which works well until one task requires some PPE code to be run as a prolog and epilog to the task).

It's fun but hard... IMHO things like shared tree structures will only start to be used in 2nd and 3rd generation games.
 
Titanio said:
Anyone have any idea how typically big a tree is? Or any guesses as to how big a tree might get? Number of nodes, number of leaf nodes?
It's just another complication... we already modify our efficient tree structures for the platform we are running on (see Faf's last post: things like cache-line size, VU-optimised workloads, etc.). So it's not a new problem, just a different set of issues to be solved.

What is new is that more work than just collisions and graphics wants to be optimised, so you have an even wider set of algorithms to look at.
 
DeanoC said:
It's just another complication... we already modify our efficient tree structures for the platform we are running on (see Faf's last post: things like cache-line size, VU-optimised workloads, etc.). So it's not a new problem, just a different set of issues to be solved.

I guess it's more of an educational query, for me. I can read a book and see what's doable from that end, for example, but without experience of typical requirements I'm kind of left guessing as to how effective a certain approach might be ;) That's the only reason I wonder..

On your previous comment about first gen games, for example, likely not using "parallelised" data structures - is this not something middleware takes care of anyway? Or do you maintain your own collision detection for a particular reason? Actually, I guess there's no guarantee the likes of AGEIA or Havok will have the best performance approach right off the bat anyway, for first gen titles, so..
 