mckmas8808 said: Sooo Gubbi, basically what does that mean for the PS3 games using lots of physics?
Nothing at all. It just means you can't correlate peak flops directly to physics performance.
Cheers
aaronspink said: As much as people like to malign caches, for HPTC-type code they can work pretty well. In the cases where they don't, you are either memory-bandwidth limited or memory-latency limited (e.g. streams and linked lists). For stream-like workloads, there is little you can do but increase the number of outstanding memory accesses and the actual realized bandwidth of the memory system. For the linked-list cases, the only options are to decrease the memory latency or to have a very, very good prefetcher that trades off usable bandwidth for latency.
one said: The SPE addresses those cases with the LS. The latency is low, and a programmer may be able to design a good app-specific prefetcher. That said, will a game programmer throw sloppy, abstract objects like linked lists at the SPE? If the exotic configuration of Cell forces developers to think about optimization in the early stages, that's not so bad.
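Neither poster gives code, but as a rough illustration of the sort of app-specific prefetcher being talked about, here is a sketch for the plain cached-CPU case, assuming GCC's __builtin_prefetch. The extra "skip" pointer each node carries to a node several hops ahead is my own invention for the example; it spends memory and bandwidth to hide latency, exactly the trade-off described above.
Code:
/* Hypothetical list node carrying an extra "skip" pointer to the node
 * PREFETCH_DIST hops ahead, filled in once when the list is built. */
#define PREFETCH_DIST 8

struct node {
    struct node *next;
    struct node *skip;      /* node PREFETCH_DIST links ahead, or NULL */
    float        payload[4];
};

float walk_list(struct node *n)
{
    float sum = 0.0f;
    while (n) {
        if (n->skip)
            __builtin_prefetch(n->skip, 0, 0);  /* start that miss early */
        sum += n->payload[0];                   /* work on the current node */
        n = n->next;
    }
    return sum;
}
Each node is prefetched PREFETCH_DIST iterations before it is touched, so the pointer-chasing latency overlaps with useful work instead of stalling it.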
Gubbi said: If you want physics on your SPUs, you'll need a space decomposition structure (for collisions); that's likely to be an oct-tree or a K-D tree. A tree regardless.
With respect to AI, I don't have the answers, but I have good hope. I think AI is usually not bound by computation but by memory access penalties (on high-frequency processors). I can imagine tree search algorithms for the SPEs that absolutely rock, by getting a lot of memory accesses in flight concurrently. There may be a patent out there by M. Necker and myself that describes some of this for the case of routing table accesses (also a kind of tree search).
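For what it's worth, a very rough sketch of the "lots of memory accesses in flight" idea on an SPE might look like the following, assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The node layout, the 16-byte alignment assumption, and the batch of eight queries are placeholders of mine, not anything from the posts: the point is simply that independent searches can each have their own DMA tag, so all their node fetches overlap.
Code:
#include <stdint.h>
#include <spu_mfcio.h>

#define NQUERY 8                       /* independent searches in flight */

/* 16-byte node as laid out in main memory (assumed 16-byte aligned). */
struct tree_node {
    float    split;
    uint32_t axis;                     /* 0..2, or 3 for a leaf */
    uint64_t child_ea;                 /* EA of the two children, stored as a pair */
} __attribute__((aligned(16)));

static struct tree_node slot[NQUERY] __attribute__((aligned(16)));

void descend(uint64_t root_ea, const float *points /* NQUERY * 3 floats */)
{
    uint64_t ea[NQUERY];
    int q, done = 0;

    for (q = 0; q < NQUERY; q++)
        ea[q] = root_ea;

    while (!done) {
        /* Kick off all NQUERY node fetches before waiting on any of them,
         * so the memory latencies overlap instead of being paid one by one. */
        for (q = 0; q < NQUERY; q++)
            mfc_get(&slot[q], ea[q], sizeof(struct tree_node), q, 0, 0);
        mfc_write_tag_mask((1 << NQUERY) - 1);
        mfc_read_tag_status_all();     /* wait for the whole batch */

        done = 1;
        for (q = 0; q < NQUERY; q++) {
            if (slot[q].axis == 3)     /* this query already reached a leaf */
                continue;
            ea[q] = slot[q].child_ea +
                    (points[3 * q + slot[q].axis] > slot[q].split) *
                    sizeof(struct tree_node);
            done = 0;
        }
    }
}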
Gubbi said: Trees are made up of nodes, linked together by pointers. If the tree can be contained in the LS, fine; if not....
Titanio said: You can make these structures very compact. My memory's a little hazy, but IIRC in a K-D tree you can represent a node with 32 bits (4 bytes); leaf contents are stored elsewhere. With compression and the structure I'm thinking of here, you could simultaneously store about 4,000 nodes and 38,000 vertices in 128 KB of data. For leaf contents you'd be preloading/prefetching; there are easy ways to do this, and then perhaps smarter ways. Indeed, given hardware cache control or none at all, it may be beneficial to implement a software cache for something like this regardless. I'm not sure how typically large a tree is, though, I'll admit. There's possibly also the potential for splitting a tree and handing different branches to different SPUs for searches and the like?
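Titanio doesn't spell out the packing, but one illustrative way to get a K-D node into 32 bits (my own guess at a layout, not necessarily the one he has in mind) is to spend 2 bits on the axis/leaf flag, quantize the split position, and make the left child implicit so only the right child needs an index:
Code:
#include <stdint.h>

/* One possible 32-bit K-D node:
 *   bits  0..1   axis (0,1,2), or 3 = leaf
 *   bits  2..17  16-bit split position, quantized within the cell bounds
 *                (interior), or offset into a leaf-contents table (leaf)
 *   bits 18..31  14-bit index of the right child; the left child is stored
 *                at node_index + 1, so it needs no bits at all. */
typedef uint32_t kd_node;

static inline kd_node make_interior(unsigned axis, unsigned split_q16,
                                    unsigned right_child)
{
    return (kd_node)(axis | (split_q16 << 2) | (right_child << 18));
}

static inline kd_node make_leaf(unsigned contents_offset)
{
    return (kd_node)(3u | (contents_offset << 2));
}

static inline unsigned node_axis(kd_node n)        { return n & 3u; }
static inline int      node_is_leaf(kd_node n)     { return (n & 3u) == 3u; }
static inline unsigned node_split_q16(kd_node n)   { return (n >> 2) & 0xFFFFu; }
static inline unsigned node_right_child(kd_node n) { return n >> 18; }
static inline unsigned node_leaf_offset(kd_node n) { return n >> 2; }
A 14-bit child index caps a block at ~16K nodes, which is in the same ballpark as the ~4,000-node figure above, and the quantized split costs precision; a real implementation might well choose a different split of the bits or a wider node.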
one said: According to DemoCoder in this thread (your previous benchmark test is in it too),
http://www.beyond3d.com/forum/showthread.php?t=21328&page=3
some trees are streamable. So I think it depends.
ERP said: And this is where parallelism starts to get interesting.
You can devise data structures that will let you split collision sets amongst SPEs, but at some cost to maintaining and referencing those data sets; you may even speculatively do redundant work. The point is that nothing is free: for most tasks, increasing the parallelism invariably reduces the efficiency of a single thread.
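As a toy illustration of that trade-off (the one-axis slab split and all the names are mine, not ERP's), one could bucket AABBs into one x-axis slab per SPE and simply duplicate any box that straddles a slab boundary, accepting redundant pair tests near the boundaries in exchange for fully independent per-SPE work:
Code:
#include <stddef.h>

#define NUM_SPE 6

struct aabb { float min[3], max[3]; };

/* Assign each AABB to every x-axis slab it overlaps.  Boxes that straddle
 * a slab boundary land in more than one bucket, so pair tests near the
 * boundary are done twice: redundant work traded for independence. */
size_t bucket_aabbs(const struct aabb *boxes, size_t n,
                    float world_min_x, float world_max_x,
                    size_t *bucket       /* NUM_SPE buckets of up to n ids */,
                    size_t *bucket_count /* NUM_SPE counts, pre-zeroed */)
{
    float  slab = (world_max_x - world_min_x) / NUM_SPE;
    size_t i, duplicated = 0;
    int    first, last, s;

    for (i = 0; i < n; i++) {
        first = (int)((boxes[i].min[0] - world_min_x) / slab);
        last  = (int)((boxes[i].max[0] - world_min_x) / slab);
        if (first < 0)        first = 0;
        if (last >= NUM_SPE)  last  = NUM_SPE - 1;
        for (s = first; s <= last; s++)
            bucket[(size_t)s * n + bucket_count[s]++] = i;
        duplicated += (size_t)(last - first);   /* extra copies made */
    }
    return duplicated;   /* how much redundant work we signed up for */
}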
aaronspink said: The LS only provides an advantage IF, and ONLY IF, you can reasonably fit the data set within the LS.
Edge said: Totally false, as IBM's landscape demo shows while dealing with a matrix 128 MB in size. That could not come close to fitting in LS memory, yet the demo had no problem handling it.
Note the SPEs were also handling 200 MB of textures, so in total over 300 MB of data.
aaronspink said: The 128 MB matrix isn't the working data set size. At any given time quantum the program is only dealing with a fraction of the matrix as its working set, just as you don't load all of a 4k x 4k texture, only the portion you actually need. For the terrain renderer, they have a significant amount of temporal locality.
Edge said: That's right, a solution is found by breaking the data into chunks. Not every problem will have this as a solution, but then again you do have the PPE to work with as well.
Where the plane intersects the view screen are the locations of the accumulation buffer to be modified [figure 1]. Where the plane intersects the height map are the locations of the input height data needed to process the vertical cut of samples [figure 2]. The PPE communicates the range of work blocks each SPE is responsible for via base address, count, and stride information once per frame. The SPE then uses this information to compute the address of each work block, which is read into local store via a DMA read and processed. This decouples the PPE from the vertical-cut-by-vertical-cut processing, allowing it to prep the next frame in parallel.
The SPEs place the output samples in an accumulation buffer using a gather DMA read and a scatter DMA write. Each SPE is responsible for four regions of the screen, and the vertical cuts are processed in a round-robin fashion, one vertical cut per region and left to right within each region, so no synchronization is needed on the output: no two SPEs will ever attempt to modify the same locations in the accumulation buffer, even with two vertical cuts in flight (double buffering) per SPE. The SPEs are therefore free to run at their own pace, processing each vertical cut without any data synchronization on the input or output. None of the SPEs' input or output data is touched by the PPE, thereby protecting the PPE's cache hierarchy from transient streaming data.
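The excerpt doesn't include code, but the "two vertical cuts in flight (double buffering)" idea is the standard SPE streaming pattern. A generic skeleton, assuming the Cell SDK's mfc_get/mfc_write_tag_mask/mfc_read_tag_status_all intrinsics and an arbitrary 4 KB block size of my choosing, might look like:
Code:
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096                     /* bytes per block; a multiple of 16 */

static uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

/* Stream `nblocks` consecutive blocks of a large main-memory array through
 * the 256 KB local store: while block i is being processed out of one
 * buffer, the MFC is already filling the other buffer with block i+1. */
void stream(uint64_t src_ea, uint32_t nblocks,
            void (*process)(uint8_t *block, uint32_t bytes))
{
    uint32_t i;
    int cur = 0, nxt;

    mfc_get(buf[cur], src_ea, CHUNK, cur, 0, 0);        /* prime block 0 */

    for (i = 0; i < nblocks; i++) {
        nxt = cur ^ 1;
        if (i + 1 < nblocks)                            /* start next fetch */
            mfc_get(buf[nxt], src_ea + (uint64_t)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);                   /* wait for current */
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);                       /* overlaps the DMA */
        cur = nxt;
    }
}
Output would typically be double-buffered the same way with mfc_put; it is omitted here to keep the sketch short.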
ERP said: And this is where parallelism starts to get interesting. You can devise data structures that will let you split collision sets amongst SPEs, but at some cost to maintaining and referencing those data sets; you may even speculatively do redundant work. The point is that nothing is free: for most tasks, increasing the parallelism invariably reduces the efficiency of a single thread.
Gubbi said: That sounds interesting. So how do you partition this on the various SPEs? Do you maximize spatial locality? Or do you distribute data so that the SPEs are equally loaded?
If I look at our collision model at the moment, I think I'd want to run the first couple of box tests on the PPE and then offload the relevant subtrees to the SPEs. We'd just need to reorganize the data so that subtrees smaller than "somesize" are stored linearly in memory.
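One way to do that "subtrees smaller than 'somesize' stored linearly" reorganization (the node layouts and names below are placeholders, not the poster's actual engine code): flatten each small subtree depth-first into a contiguous block with block-relative child indices, so the whole subtree can be pulled into local store with a single DMA and then walked without any further pointer chasing.
Code:
#include <stdint.h>

/* Pointer-based source tree as it exists on the PPE side. */
struct src_node {
    float            split;
    int              axis;            /* -1 for a leaf */
    struct src_node *left, *right;
};

/* Flattened node: children are block-relative indices, so the block is
 * position-independent and can live anywhere in local store after a DMA. */
struct flat_node {
    float   split;
    int32_t axis;
    int32_t left, right;              /* index into the same block, -1 = none */
};

/* Depth-first copy of one subtree into a pre-allocated contiguous block.
 * Returns the index of the node just written, or -1 for an empty child. */
static int32_t flatten(const struct src_node *n,
                       struct flat_node *block, int32_t *used)
{
    int32_t idx;
    if (!n)
        return -1;
    idx = (*used)++;
    block[idx].split = n->split;
    block[idx].axis  = n->axis;
    block[idx].left  = flatten(n->left,  block, used);
    block[idx].right = flatten(n->right, block, used);
    return idx;
}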
Edge said: That's right, a solution is found by breaking the data into chunks. Not every problem will have this as a solution, but then again you do have the PPE to work with as well.
Of course they picked a trivially chunkable algorithm... graphics is known to be embarrassingly parallel.
Titanio said: Anyone have any idea how typically big a tree is? Or any guesses as to how big a tree might get? Number of nodes, number of leaf nodes?
DeanoC said: It's just another complication... we already modify our efficient tree structures for the platform we're running on (see Faf's last post: things like cache-line size, VU-optimised workloads, etc.). So it's not a new problem, just a different set of issues to be solved.
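A trivial example of the kind of platform-specific massaging DeanoC is describing: on Cell the PPE's cache line (and a convenient DMA granule) is 128 bytes, so one might pack a group of nodes to exactly that size and alignment. The specific 16-byte node and group-of-eight layout here are just an illustration.
Code:
#include <stdint.h>

/* Eight 16-byte nodes packed into one 128-byte unit, so a single cache
 * line fill on the PPE (or one small DMA on an SPE) brings in a whole
 * group of siblings/near relatives instead of one node at a time. */
struct packed_node {
    float    split;
    uint32_t axis_and_flags;
    uint32_t first_child;   /* index of this node's child group */
    uint32_t pad;
};

struct node_group {
    struct packed_node n[8];
} __attribute__((aligned(128)));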