System for string handling

Hi,

I'm looking to build a system (hardware and software) to handle strings fast, without doing much math at all. Most of the tasks will be based on searching and statistics:

Searching: does this chunk of data include this particular string?
Where in this chunk of data does this string occur? (These slightly different functions are needed for different tasks.) Strings to search for may contain wildcards, i.e. the option of flagging a positive match regardless of certain bits: either a known number of bits ([sequence] [14 bits not to check] [more sequence]) or an unknown number, and perhaps some more complex rules -- but most of the matches will be literal. (A rough sketch of this appears after the list below.)

Statistics: is there a correlation between the occurrence of string a and the occurrence of string b...
immediately before it
immediately after it
within a certain distance before/after
in a certain location in my input data chunk (e.g. 17th to 24th bytes)
anywhere within the same chunk of data
...etc
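
To make the wildcard matching and the simplest co-occurrence check concrete, here's a rough byte-level sketch (illustrative only -- bit-level don't-cares within a byte would need explicit masks, and all names below are made up):

```python
# Rough illustration only: byte-granularity wildcards via regex;
# bit-level don't-cares within a byte would need explicit masking instead.
import re

def wildcard_pattern(prefix: bytes, skip: int, suffix: bytes):
    """[prefix] [skip arbitrary bytes] [suffix] -> compiled regex."""
    return re.compile(re.escape(prefix) + b"." * skip + re.escape(suffix),
                      re.DOTALL)

def cooccur_within(chunk: bytes, a: bytes, b: bytes, distance: int) -> bool:
    """Does b occur within `distance` bytes after some occurrence of a?"""
    pos = chunk.find(a)
    while pos != -1:
        window = chunk[pos + len(a): pos + len(a) + distance + len(b)]
        if b in window:
            return True
        pos = chunk.find(a, pos + 1)
    return False

chunk = b"\x00\x10GATTACA\x99\x42CCGGTA\x00"
pat = wildcard_pattern(b"GATTACA", 2, b"CCGGTA")        # two don't-care bytes
print(bool(pat.search(chunk)))                          # True
print(cooccur_within(chunk, b"GATTACA", b"CCGGTA", 4))  # True
```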

My aim is to implement all of this in an utterly astoundingly immensely parallel way: search an input binary data chunk (say 5k) for 500 million different test strings -- needless to say, the vast majority of them won't appear. In general, I want high processing throughput, but I don't have any mission-critical or time-critical applications going on. I can take the odd failure, crash, or error in my stride.


I don't know whether this is something that GPGPU would be good for. I can imagine being able to load the input data to a graphics card over PCIe and then take advantage of massive GPU memory bandwidth to scan it for a vast matrix of test strings stored in the GPU's own memory, thus doing some good filtering and not having to worry about the relatively low bandwidth between GPU and main memory. I've seen a couple of papers on GPU for SQL queries or searching network data streams for threats, but I know this application goes against the classic GPGPU dictum that arithmetic intensity should be high, so I don't know how much real world memory bandwidth I can expect to see. I also realize that a mid-range GPU with high bandwidth but low computational power might fit the bill.

Alternatively I could do the whole thing on CPUs, in which case I still don't know what system architecture is best for such a task. AMD or Intel? A large number of less powerful units, or a small number of powerful Core i7s or the like? (The $1000 one is probably out of my price range, though!) A consideration is how much RAM I can stack onto it, as I'm guessing I get more memory for my dollar with cheaper CPUs and motherboards, even if memory per node is lower. I imagine I'll be using a cluster on a gigabit ethernet backbone, but I'm considering a heterogeneous cluster -- for instance, if GPUs can give me a big speed-up on the filtering, I might not need to have one on each node. In any case, I get the feeling that mass-market solutions are much better value than high-performance niche hardware (server equipment, Cell processors, etc.).

In total, I don't want the cost of the system to go much over US$10,000, so it's a question of how much processing I can get for that money -- anything from maybe 5 big motherboards with Core i7 processors and multiple graphics cards, through to a hundred little Intel Atom systems linked together (which sounds very cool to me, but is almost certainly not a serious option).

If I start developing now it'll still take a good few months or a year to get the software done, so if needs be I can probably wait for Fermi or Larrabee if they might present options for a major enhancement. As I have never programmed a GPU I'll need to learn that from scratch, but that's not a problem; I like learning, have lots of time on my hands, and tend to pick things up fast. A final thought: I live in a country where electricity is expensive, so a power-efficient system would be a big plus.

Huge thanks to anyone who's bothered to read this, particularly if you can give me any orientation.
 
After thinking about this for a whole 10 seconds:
for finding the qualifying strings in the first place I'd probably use SQL,
then write some code to do the rest.
 
My aim is to implement all of this in an utterly astoundingly immensely parallel way: search an input binary data chunk (say 5k) for 500 million different test strings -- needless to say, the vast majority of them won't appear. In general, I want high processing throughput, but I don't have any mission-critical or time-critical applications going on. I can take the odd failure, crash, or error in my stride.

Matching a 5 kbyte string against 500 million test strings?? As far as I can tell, the algorithm of choice for something like that would be the Rabin-Karp algorithm or some derivative thereof. In that case, the limiting factor would be the construction of a vast hashtable containing the 500M test strings; once the hashtable is fully formed, matching the 5 kbyte string against it should be quite fast (thousands of CPU cycles, not billions).
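
To sketch the idea (illustrative Python only, assuming for simplicity that all test strings share one length m; in reality you'd keep one table per distinct pattern length and use a stronger hash):

```python
# Minimal Rabin-Karp sketch: hash all test strings once, then roll a window
# of length m over the chunk and look each window hash up in the table.
B = 257            # rolling-hash base
M = (1 << 61) - 1  # large prime modulus

def rk_hash(s: bytes) -> int:
    h = 0
    for byte in s:
        h = (h * B + byte) % M
    return h

def build_table(patterns):
    """Map hash -> list of patterns with that hash (handles collisions)."""
    table = {}
    for p in patterns:
        table.setdefault(rk_hash(p), []).append(p)
    return table

def search(chunk: bytes, table, m: int):
    """Yield (offset, pattern) for every table pattern of length m found in chunk."""
    if len(chunk) < m:
        return
    top = pow(B, m - 1, M)           # weight of the byte leaving the window
    h = rk_hash(chunk[:m])
    for i in range(len(chunk) - m + 1):
        for p in table.get(h, ()):   # verify to rule out hash collisions
            if chunk[i:i + m] == p:
                yield i, p
        if i + m < len(chunk):
            h = ((h - chunk[i] * top) * B + chunk[i + m]) % M
```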

The construction of such a hashtable should be parallelizable with some hassle; however, it's likely to be bounded by memory bandwidth/transaction rate rather than compute power. If you construct the table on CPUs, I suspect that memory traffic would kill scalability beyond 1-2 sockets or so.

If you go the GPU route, you will probably need multiple GPUs just to get enough onboard memory to hold the hashtable (with each GPU ultimately only holding a part of the table). In such a scenario, you'd distribute each incoming string to every GPU, have each GPU test against its part of the hashtable, and then send the results back to the CPU to combine the partial results. This may or may not be a good idea; the main concern is I/O overhead.
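
Not GPU code, but a plain-Python sketch of that partitioning scheme -- each "GPU" modeled as a shard holding its slice of the table, with every candidate window broadcast to all shards and the hits merged on the host:

```python
# Sketch only: hash-partition the pattern set across shards ("GPUs"),
# broadcast each candidate window to all shards, merge the hits on the host.
class Shard:
    def __init__(self, patterns):
        self.patterns = set(patterns)      # this shard's slice of the big table

    def lookup(self, window: bytes) -> bool:
        return window in self.patterns

def make_shards(patterns, n_shards: int):
    buckets = [[] for _ in range(n_shards)]
    for p in patterns:
        buckets[hash(p) % n_shards].append(p)   # each shard holds ~1/n of the table
    return [Shard(b) for b in buckets]

def search(chunk: bytes, shards, m: int):
    hits = []
    for i in range(len(chunk) - m + 1):
        window = chunk[i:i + m]
        # In the GPU scenario the broadcast + per-shard test happens in parallel;
        # here we just loop over the shards.
        for shard in shards:
            if shard.lookup(window):
                hits.append((i, window))
                break                       # at most one shard can hold this pattern
    return hits
```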
 
Great answer, that's helped a lot. The Rabin-Karp algorithm looks fantastic for this purpose, and I hadn't come across it before.
A couple of clarifications in terms of the task. First (as I think Arjan understood, but best to make it explicit), my test strings are going to be smaller than the 5 kbyte string I'm testing them against, and might occur at any point within it. A typical test string (the ones of which there are millions) might be between 10 and a hundred bytes. Second, as long as we're not talking months of computer time, having it take a long time to build the hash table isn't a problem, provided I can (1) run search strings through it fast afterwards, and (2) modify my hash tables (add or remove test strings) without having to rebuild from scratch. The parallelization should be pretty trivial if we think of it as five sets of 100 million test strings each (or fifty of 10 million, or whatever), each with its own hash table and running on its own node. Message passing and tight linkage of the cluster is no big deal, as there isn't much information to intercommunicate.
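
Something like this is what I'm imagining (a hypothetical sketch only -- the per-length grouping is borrowed from the Rabin-Karp note above, and all names are made up):

```python
# Hypothetical sketch: test strings grouped by length, so each length gets its
# own scan of the chunk (a rolling hash would make each pass cheap), and
# add/remove is a simple set operation rather than a table rebuild.
from collections import defaultdict

class PatternIndex:
    def __init__(self):
        self.by_len = defaultdict(set)       # length -> set of test strings

    def add(self, pattern: bytes):
        self.by_len[len(pattern)].add(pattern)

    def remove(self, pattern: bytes):
        self.by_len[len(pattern)].discard(pattern)

    def search(self, chunk: bytes):
        for m, patterns in self.by_len.items():
            for i in range(len(chunk) - m + 1):
                window = chunk[i:i + m]
                if window in patterns:       # Python's set hashes the bytes for us
                    yield i, window
```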

In terms of the hardware, I know a lot of what I'm doing is going to happen on CPUs, but I get the feeling that we can use the memory bandwidth of the GPU to speed up certain steps. The memory I/O is the big deal in all this, as the computational tasks on each piece of data are pretty light. I notice that some of the mid-range GPUs on offer have lower power consumption and respectable memory bandwidth of 50-70GB/s (my key needs), but are cheaper as they have fewer cores and so fewer gigaflops (not so important to me).

It's a bioinformatics thing, by the way....
 
The simplest, cheapest, fastest and most scalable way to put the hardware together would most likely be a cluster of PS3s.

That's also probably the most capable hardware to do something like that short of a supercomputer.

You could use hashing, but you could also simply use a combination of a substr() function and regular expressions (tweaked to give you the numerical data you want), for the most bang for your buck. Create a small program that can do both, load it into each SPE core, and use the Power cores to distribute and gather.
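
(Not actual Cell/SPE code -- just a generic Python stand-in for that distribute-and-gather pattern: a coordinator splits the pattern set, the workers each scan the chunk against their share, and the results are gathered back.)

```python
# Plain-Python stand-in for the scatter/gather idea: a coordinator splits the
# pattern set, workers each scan the chunk against their share, results merged.
from multiprocessing import Pool

def scan_share(args):
    chunk, patterns = args
    hits = []
    for p in patterns:
        pos = chunk.find(p)
        while pos != -1:
            hits.append((pos, p))
            pos = chunk.find(p, pos + 1)
    return hits

def scatter_gather(chunk, patterns, workers=6):
    shares = [patterns[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        results = pool.map(scan_share, [(chunk, s) for s in shares])
    return [hit for share_hits in results for hit in share_hits]

if __name__ == "__main__":   # guard needed for multiprocessing on some platforms
    chunk = b"abracadabra" * 100
    print(scatter_gather(chunk, [b"cad", b"bra", b"zzz"], workers=3)[:3])
```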

And, as Davros said: throw all the results into a database and use that to do the analysis.
 