DNA As a Data Source

To a programming language, DNA is simply a string:

char(3*10^9) human_genome;

The full genomic information for a human consists of 3 billion characters and is easily held in memory by even the most inefficient home-brewed language. However, determining the exact order of these 3 billion bases requires a significant effort spanning chemistry, bioinformatics, laboratory procedures, and a lot of spinning disks.
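To make the "simply a string" view concrete, here is a minimal sketch of the storage arithmetic. It assumes a strict four-letter alphabet (A, C, G, T) with no ambiguity codes such as N, and the names `pack`, `unpack`, and `GENOME_LENGTH` are hypothetical, chosen only for illustration. Under those assumptions each base needs only 2 bits, so four bases fit in one byte.

```python
# Illustrative only: assumes a strict A/C/G/T alphabet (no N or other ambiguity codes).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence: str) -> bytes:
    """Pack a DNA string into 2 bits per base, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        packed[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed byte string."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )


# Back-of-the-envelope sizes for a genome of 3 billion bases.
GENOME_LENGTH = 3 * 10**9
bytes_as_chars = GENOME_LENGTH            # one byte per character
bytes_packed = (GENOME_LENGTH + 3) // 4   # two bits per base

print(f"one byte per base: {bytes_as_chars / 10**9:.2f} GB")
print(f"two bits per base: {bytes_packed / 10**9:.2f} GB")
```

Either representation fits comfortably in the memory of a modern machine, which is the point of the passage above: holding the sequence is the easy part, while producing it is not.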

The Human Genome Project aimed, for the first time, to sequence every one of these characters. A number of large, high-throughput institutes from around the world put academic competition aside and set about a task that would last 13 years and consume billions of dollars. Their aim was to produce a robust, accurate map of the human genome, available to all, for free. The consortium of scientists from the UK, the United States, and Japan succeeded, with the first draft human genome appearing in the scientific literature in February 2001. The genome, without any additional annotations or associated data, weighed in at 10 gigabits, a reasonably large size in an era without iPods or USB thumb drives. However, the overall weight of this data was much greater, because storage requirements multiplied as the data was replicated across the globe. Scientists proceeded to analyze the data, scouring it for genetic markers and disease indicators, and comparing it to other available genomes from mice, yeast, and pathogens. These 10 gigabits have ...
