[maven]
20-Mar-2007, 07:02
I'm planning on creating an experimental CELL BBE optimised version of my wavelet image compression code, and (depending on the resulting code) merging the results back into mainline. So I spent yesterday evening installing Yellow Dog Linux 5.0 on my PS3 (which wasn't exactly trivial with the 2 separately obtained boot-loaders, but neither was it terribly complicated). The distro itself feels very prematurely released (plenty of crashes and incorrectly preconfigured stuff), but no software updates (that I could find) are available... :(
So then, for the code to run decently on SPUs I'll have to a) vectorise the code and b) parallelise it.
One of the main choices where my intuition completely fails is how to decide whether you want a single SPU streaming over a bunch of data (suitably double-buffered etc.) or whether to split your work up in SPU-sized chunks and farming them out as SPUs are available. In the first case, you only use a subset of the available computational resources (depending on other jobs running on SPUs) but you overlap your IO and computation. In the second case, you use as many SPUs as are available (or your data-set partitions into) but the SPUs will be idle waiting for IO to complete (because the job each SPU is dedicated to is changing all the time, i.e. after one local store has been processed).
Here are a few candidates, with a few remarks on what is done in each routine.
recurse_hilbert - (Recursive) Hilbert Curve computation used for creating a reorder table, should map well to 2xSIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/reorder.c
reorder / unreorder (Scatter / Gather) - using said table to permute a (large) 2D array into a 1D array. Needs to fetch spatially close (in some sense) data as described by the reorder table and stream it out in a different order. Also needs to stream in the reorder table as needed. No SIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/reorder.c
rice_encode_plane - Run-length bit-plane coding with variable coefficients. Should be able to do each bitplane in parallel if the k-coefficient is reset at the start of plane. No SIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/coding.c
wavelet_transform - The (inverse) wavelet transform itself. Walks over data with increasing stride, which a some point can be larger than LS. Could probably do SIMD but I don't see how yet.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/transform.c
Any clever ideas from the guys used to CELL?
So then, for the code to run decently on SPUs I'll have to a) vectorise the code and b) parallelise it.
One of the main choices where my intuition completely fails is how to decide whether you want a single SPU streaming over a bunch of data (suitably double-buffered etc.) or whether to split your work up in SPU-sized chunks and farming them out as SPUs are available. In the first case, you only use a subset of the available computational resources (depending on other jobs running on SPUs) but you overlap your IO and computation. In the second case, you use as many SPUs as are available (or your data-set partitions into) but the SPUs will be idle waiting for IO to complete (because the job each SPU is dedicated to is changing all the time, i.e. after one local store has been processed).
Here are a few candidates, with a few remarks on what is done in each routine.
recurse_hilbert - (Recursive) Hilbert Curve computation used for creating a reorder table, should map well to 2xSIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/reorder.c
reorder / unreorder (Scatter / Gather) - using said table to permute a (large) 2D array into a 1D array. Needs to fetch spatially close (in some sense) data as described by the reorder table and stream it out in a different order. Also needs to stream in the reorder table as needed. No SIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/reorder.c
rice_encode_plane - Run-length bit-plane coding with variable coefficients. Should be able to do each bitplane in parallel if the k-coefficient is reset at the start of plane. No SIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/coding.c
wavelet_transform - The (inverse) wavelet transform itself. Walks over data with increasing stride, which a some point can be larger than LS. Could probably do SIMD but I don't see how yet.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/transform.c
Any clever ideas from the guys used to CELL?