nutball said:
Well it depends how far you want to scale, but sorts and FFTs are two which spring immediately to mind. Recursive bisection algorithms, basically, need to be re-thought.
Well, sorts are pretty easy to parallelize, for one. You'd just split your list into a number of smaller lists and sort each one separately. The recombination would then be of order N*M operations (N = total number of objects sorted, M = number of lists you split into), which should typically be cheaper than sorting one long list, so while the recombination won't be parallelizable, the bulk of the processing should be.
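To make that concrete, here's a minimal sketch of the split/sort/merge idea in Python. The chunk count, data, and use of multiprocessing are all illustrative choices rather than anything prescribed above, and the merge uses a heap (about N*log(M) comparisons) instead of the naive N*M scan used in the operation count below.

```python
# Minimal sketch of the split/sort/merge scheme described above.
# The chunk count, data, and use of multiprocessing are illustrative choices.
import heapq
import random
from multiprocessing import Pool

def parallel_sort(data, m=4):
    """Split into m chunks, sort each chunk in its own process, then merge."""
    n = len(data)
    chunks = [data[i * n // m:(i + 1) * n // m] for i in range(m)]
    with Pool(processes=m) as pool:
        sorted_chunks = pool.map(sorted, chunks)   # the parallelizable part
    # Serial recombination; heapq.merge is a heap-based k-way merge.
    return list(heapq.merge(*sorted_chunks))

if __name__ == "__main__":
    data = [random.random() for _ in range(1_000_000)]
    assert parallel_sort(data) == sorted(data)
```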
And the total number of operations shouldn't increase by a large amount. For example, if you have an efficient sorting algorithm that typically finishes in order N*ln(N) operations, each block can now be sorted in (N/M)*ln(N/M) operations, so sorting all M blocks takes N*ln(N/M) operations in total. Then you add in the recombination: N*ln(N/M) + N*M = N*ln(N) + N*(M - ln(M)). If we take some sample numbers, say N = 1e6 and M = 4, the total number of executed operations increases by roughly 20%. Completion time with four processors would then be about 22% for the parallel sorts plus about 29% for the serial merge, i.e. roughly half that of single-processor execution, which is still close to a 2x speedup. Of course, if you go to too many processors the serial N*M merge starts to dominate and this actually ends up slower, but then you'd have to parallelize at a higher level.
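A quick back-of-the-envelope check of those figures, using the same N*ln(N) sort cost and the naive N*M merge cost from the formula above:

```python
# Operation-count check for N = 1e6, M = 4, using an N*ln(N) sort cost and
# the naive N*M recombination cost from the formula above.
from math import log

N, M = 1_000_000, 4
single_proc = N * log(N)              # one big sort on one processor
per_block   = (N // M) * log(N / M)   # one block, i.e. the per-processor work
all_blocks  = N * log(N / M)          # all M blocks together
merge       = N * M                   # serial recombination
print(f"extra operations:   {(all_blocks + merge) / single_proc - 1:.0%}")  # ~19%
print(f"wall time vs 1 CPU: {(per_block + merge) / single_proc:.0%}")       # ~51%
```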
Now, as for FFTs, well, I don't know enough about the algorithm to be sure, but from what I understand it is based on splitting the Fourier transform into two half-size transforms (over the even- and odd-indexed samples) that are then recombined, so I don't see why you couldn't get at least a level of parallelism of two to four.
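That's essentially how the radix-2 Cooley-Tukey FFT works: the N-point transform is built from two independent N/2-point transforms over the even- and odd-indexed samples, which could in principle run on separate cores. A toy sketch, using numpy for the half-size transforms and leaving the actual threading out:

```python
# Toy sketch of the radix-2 (decimation-in-time) FFT split: two independent
# half-size transforms that could be computed on separate cores, then combined
# with twiddle factors.  numpy's FFT is used for the half-size transforms.
import numpy as np

def split_fft(x):
    n = len(x)                       # assumed even here
    even = np.fft.fft(x[0::2])       # independent half-size transform #1
    odd  = np.fft.fft(x[1::2])       # independent half-size transform #2
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.rand(1024)
assert np.allclose(split_fft(x), np.fft.fft(x))
```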
Two threads will be fine, but if you want to talk quad-core + SMT (which is what ... 18-24 months away?) then it becomes much tougher to get those to scale for sensibly sized computational payloads.
Yes, but in any sensible situation you'd probably be applying the above algorithms many times. For most situations, FFTs and sorts are so fast on modern computers that they really don't become performance-intensive until you have to do a whole lot of 'em. So once you get a parallelism of 2-4 in the base algorithm, expanding that to really improve efficiency is relatively easy, as long as you're doing more than one of them at a time.
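In other words, once the individual transforms (or sorts) are independent of one another, it's usually easier to parallelize across them than inside any one of them. A sketch of that outer-level approach, with a made-up batch of signals:

```python
# Sketch of parallelizing at the higher level: farm out whole FFTs to worker
# processes rather than splitting a single transform.  The batch size and
# signal length are arbitrary.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def one_fft(signal):
    return np.fft.fft(signal)

def many_ffts(signals, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_fft, signals, chunksize=64))

if __name__ == "__main__":
    signals = [np.random.rand(4096) for _ in range(1000)]
    spectra = many_ffts(signals)
    print(len(spectra), spectra[0].shape)
```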
Any problem that requires a large number of iterations on a small set of data will also have problems scaling (where the synchronisation cost outweighs the compute payload per thread per iteration).
Sure, but in most cases I've run into where this happens, you can again parallelize just by running multiple instances. For example, one piece of code I'm currently making use of is a Markov Chain Monte Carlo sampler that only takes up about 1 kB of memory, and I have to run the program for a couple of hours to get decent results. But this would be trivial to parallelize, because if I wanted to, I could just run two (or a hundred) parallel chains and combine the results at the end. I'd have to be a bit careful about the statistics, but provided I could be sure each chain was completely uncorrelated with the others, this wouldn't be a problem at all.
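The actual chain isn't shown here, so as a stand-in, here's a toy random-walk Metropolis sampler targeting a standard normal. The target density, proposal width, and chain lengths are all invented; the point is just the "independent chains with different seeds, pooled at the end" pattern:

```python
# Toy illustration of running several independent MCMC chains in parallel.
# The target density, proposal width, and chain lengths are made up; the point
# is that each chain gets its own process and seed, and the samples are pooled
# at the end.
import numpy as np
from multiprocessing import Pool

def log_target(v):
    return -0.5 * v * v                      # unnormalised log N(0, 1)

def run_chain(args):
    seed, n_steps = args
    rng = np.random.default_rng(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        proposal = x + rng.normal()          # random-walk proposal
        if np.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal                     # Metropolis accept
        samples.append(x)
    return samples

if __name__ == "__main__":
    n_chains, n_steps = 8, 100_000
    with Pool(n_chains) as pool:
        chains = pool.map(run_chain, [(s, n_steps) for s in range(n_chains)])
    pooled = np.concatenate(chains)          # combine the results at the end
    print(pooled.mean(), pooled.std())       # should come out near 0 and 1
```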