Consider the 3.3 gigabytes of a human genome (3.3 billion base pairs, at one byte per base) as equivalent to 3.3 gigabytes of files on the mass-storage device of some computer system of unknown design. Obtaining the sequence is equivalent to obtaining an image of the contents of that mass-storage device. Understanding the sequence is equivalent to reverse engineering that unknown computer system (both the hardware and the 3.3 gigabytes of software) all the way back to a full set of design and maintenance specifications.
Obtaining the sequence is further complicated by the fact that the mass-storage device is of unknown design and cannot simply be read. At best, experimental methods can be used to obtain tiny fragments of sequence from it. Because these experimental methods are expensive ($5–10 per byte) and error-prone, new techniques are constantly being developed and tested. Meanwhile, a database must be designed to hold the fragments that are obtained, along with a full description of the procedures used to generate them. Because the experimental procedures change rapidly and often radically, this is equivalent to designing and maintaining a database for an enterprise whose operating procedures and general business rules change weekly, perhaps daily, with each change requiring modifications to the database schema.
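To make the provenance problem concrete, a minimal sketch (in Python; all names and fields here are hypothetical illustrations, not a proposed standard) might tie every stored fragment to the exact version of the procedure that produced it, so the schema can evolve without orphaning older data:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Protocol:
    """One version of one experimental procedure, recorded in full."""
    name: str         # e.g. a particular sequencing chemistry
    version: int      # bumped whenever the lab procedure changes
    description: str  # the procedure as actually run, in full

@dataclass
class Fragment:
    """A tiny piece of sequence, inseparable from its provenance."""
    bases: str            # the raw sequence read in this experiment
    protocol: Protocol    # exactly how this fragment was obtained
    error_rate: float     # estimated per-base error for this method
    obtained: date        # when the experiment was performed

# A procedural change then means adding a new Protocol version,
# not rewriting the records produced under the old one.
```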
As the sequence fragments accumulate, efforts will be made to synthesize them into larger images of contiguous regions of the mass-storage device, and these synthesized composites must also be represented in the database. If multiple inferences are consistent with the present data, all of the consistent possibilities must be represented in the database to serve as the basis for further reasoning when more data become available. As even larger regions are synthesized, the entire cascade of premises, procedures, and logical dependencies must be stored, so that a “logical roll-back” can occur if some new observation renders an earlier premise doubtful.

Since all of the data are obtained experimentally, each observation and deduction will have some error term associated with it. The deductive procedures must use these errors to assign probabilities to the various deductive outcomes obtained and stored in the database. Thousands of researchers from independent laboratories will be using the system and contributing data. Each will have a different notion of proper error definition and of the rules by which errors should be combined to provide reliability estimates for the composite sequences. Therefore, the database must be capable of supporting probabilistic views and of returning multiple answers to any query, each with its own associated conditional probability. The involvement of many independent researchers, each employing slightly different experimental procedures and concepts, will also result in extensive nomenclatural synonymy and homonymy. The database must take these nomenclatural difficulties into account and should be able to present users with views consistent with their own preferred usage.
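The “logical roll-back” requirement is essentially a truth-maintenance problem, and a toy sketch (Python again; the names and the naive independence assumption are mine, purely for illustration) shows its shape: every deduction records the premises it rests on, carries a probability combined from their error terms, and is invalidated transitively when a premise is retracted:

```python
from dataclasses import dataclass, field

@dataclass
class Assertion:
    claim: str
    probability: float                          # derived from error terms
    premises: list["Assertion"] = field(default_factory=list)
    valid: bool = True

def derive(claim: str, premises: list[Assertion]) -> Assertion:
    """Combine premise probabilities into a probability for the conclusion.
    (Naively assumes independence, so probabilities simply multiply; real
    error models would be supplied by each contributing laboratory.)"""
    p = 1.0
    for premise in premises:
        p *= premise.probability
    return Assertion(claim, p, premises)

def retract(doubtful: Assertion, database: list[Assertion]) -> None:
    """Roll back a premise: invalidate it and every deduction built on it."""
    doubtful.valid = False
    for other in database:
        if other.valid and doubtful in other.premises:
            retract(other, database)

# A new observation casting doubt on one raw reading rolls back the
# whole cascade of conclusions that depended on it:
reading = Assertion("fragment F-17 reads ATTGC", probability=0.99)
contig = derive("F-17 overlaps F-18 to form contig C-4", [reading])
retract(reading, [reading, contig])
assert not contig.valid
```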
Reverse engineering the sequence is complicated by the fact that the resulting image of the mass-storage device will not be a file-by-file copy, but rather a streaming dump of the bytes in the order they occupy the device, and the files are known to be fragmented. In addition, some regions of the device are known to contain erased files and other garbage. Once the garbage has been recognized and discarded and the fragmented files reassembled, the reverse engineering of the code must be undertaken with only a partial, and sometimes incorrect, understanding of the CPU on which the code runs. In fact, deducing the structure and function of the CPU is part of the project, since some of the 3.3 gigabytes are known to be the binary specifications for the computer-assisted-manufacturing process that fabricates the CPU. In addition, one must also consider that the huge database contains the result of literally millions of maintenance revisions performed by the worst possible set of kludge-using, spaghetti-coding, opportunistic hackers, who delight in clever tricks like writing self-modifying code and relying upon undocumented system quirks.
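The reassembly step mentioned above can likewise be suggested in miniature. The following sketch (hypothetical, and far simpler than any real assembler) greedily merges fragments at their longest exact overlap; it assumes clean, error-free reads, whereas the real task must also tolerate errors, repeats, and the garbage regions described above:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(fragments: list[str]) -> list[str]:
    """Repeatedly merge the pair of fragments with the best overlap.
    Returns the contigs that remain when nothing overlaps any longer."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps remain; return what we have
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

# assemble(["ATTAGAC", "GACCTG", "CTGAA"]) yields ["ATTAGACCTGAA"]
```

Even this toy makes the difficulty visible: with erased-file garbage mixed in, with fragments that recur elsewhere on the device, and with error-prone reads, the merge criterion becomes probabilistic, which is precisely why the database must carry error terms and alternative reconstructions rather than a single answer.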