PDA

View Full Version : Optimizing for CELL - Wavelet Image Compression


[maven]
20-Mar-2007, 07:02
I'm planning on creating an experimental CELL BBE optimised version of my wavelet image compression code, and (depending on the resulting code) merging the results back into mainline. So I spent yesterday evening installing Yellow Dog Linux 5.0 on my PS3 (which wasn't exactly trivial with the 2 separately obtained boot-loaders, but neither was it terribly complicated). The distro itself feels very prematurely released (plenty of crashes and incorrectly preconfigured stuff), but no software updates (that I could find) are available... :(

So then, for the code to run decently on SPUs I'll have to a) vectorise the code and b) parallelise it.

One of the main choices where my intuition completely fails is how to decide whether you want a single SPU streaming over a bunch of data (suitably double-buffered etc.) or whether to split your work up in SPU-sized chunks and farming them out as SPUs are available. In the first case, you only use a subset of the available computational resources (depending on other jobs running on SPUs) but you overlap your IO and computation. In the second case, you use as many SPUs as are available (or your data-set partitions into) but the SPUs will be idle waiting for IO to complete (because the job each SPU is dedicated to is changing all the time, i.e. after one local store has been processed).

Here are a few candidates, with a few remarks on what is done in each routine.

recurse_hilbert - (Recursive) Hilbert Curve computation used for creating a reorder table, should map well to 2xSIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/reorder.c

reorder / unreorder (Scatter / Gather) - using said table to permute a (large) 2D array into a 1D array. Needs to fetch spatially close (in some sense) data as described by the reorder table and stream it out in a different order. Also needs to stream in the reorder table as needed. No SIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/reorder.c

rice_encode_plane - Run-length bit-plane coding with variable coefficients. Should be able to do each bitplane in parallel if the k-coefficient is reset at the start of plane. No SIMD.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/coding.c

wavelet_transform - The (inverse) wavelet transform itself. Walks over data with increasing stride, which a some point can be larger than LS. Could probably do SIMD but I don't see how yet.
http://www.maven.de/code/repos/?repo=wavelet&view=files&file=/transform.c

Any clever ideas from the guys used to CELL?

Andrew Lauritzen
21-Mar-2007, 00:04
You certainly want to hide DMA latency by issuing the DMA, and then working on other data. However that isn't fundamentally incompatible with SPU scheduling - you just have to do it cleverly, always making sure that an SPU is assignment more work in time to cover the latency of transferring the data. Using as close to data-parallelism as possible can make the work times very synchronous and predictable.

MfA
25-Mar-2007, 03:13
Can't you just change the coding a tad ... it's actively hostile towards parallelization at the moment. Tiling the transform coefficients and coding them separately would help a lot ... like precincts and codeblocks in JPEG2000.

The transform is trivial to SIMD in the vertical direction obviously, in the horizontal things get more difficult. Instruction latencies can get nasty though with lifting, you could probably get speedup by working on multiple sets of rows/columns at a time ... you have plenty of registers for those kind of tricks, and no cache to worry about trashing.

To parallelize the wavelet transform there is something called the Local Wavelet Transform. Basically you just take overlapping tiles from the image and perform the wavelet transform on those tiles. You will only need communication after the first few steps of the transform, at which point almost all the work has already been done.

[maven]
25-Mar-2007, 10:34
Can't you just change the coding a tad ... it's actively hostile towards parallelization at the moment. Tiling the transform coefficients and coding them separately would help a lot ... like precincts and codeblocks in JPEG2000.

Oh, I can change the code as much as I want... ;) And I have thought about subdividing the image as you say, but not yet come to a conclusion on that. If I wanted that sort of data parallelism, I could just decide to en-/decode N channels at the same time. But you lose the embedding property, where truncating the file yields the "best" representation for the left-over bitrate.

Thanks for taking a look and the pointers / advice; the advantage with precincts and codeblocks is that you can also do huge out-of-core images, which I also find interesting... I will also check out the local wavelet transform.

MfA
25-Mar-2007, 18:44
;955024']Oh, I can change the code as much as I want... ;) And I have thought about subdividing the image as you say, but not yet come to a conclusion on that. If I wanted that sort of data parallelism, I could just decide to en-/decode N channels at the same time. But you lose the embedding property, where truncating the file yields the "best" representation for the left-over bitrate.
Well you could just split the lower part of the the coefficient "pyramid" up, throw each SPE a part (happens automatically with the local wavelet transform) divide the bitplanes into blocks as you do now and after all is said and done do the R/D optimized sequencing and final coding of the blocks on the PPE. Still partly parallelized.