Developing Bioinformatics Computer Skills

By Cynthia Gibas, Per Jambeck



Table of Contents

Chapter 1: Biology in the Computer Age
From the interaction of species and populations, to the function of tissues and cells within an individual organism, biology is defined as the study of living things. In the course of that study, biologists collect and interpret data. Now, at the beginning of the 21st century, we use sophisticated laboratory technology that allows us to collect data faster than we can interpret it. We have vast volumes of DNA sequence data at our fingertips. But how do we figure out which parts of that DNA control the various chemical processes of life? We know the function and structure of some proteins, but how do we determine the function of new proteins? And how do we predict what a protein will look like, based on knowledge of its sequence? We understand the relatively simple code that translates DNA into protein. But how do we find meaningful new words in the code and add them to the DNA-protein dictionary?
Bioinformatics is the science of using information to understand biology; it's the tool we can use to help us answer these questions and many others like them. Unfortunately, with all the hype about mapping the human genome, bioinformatics has achieved buzzword status; the term is being used in a number of ways, depending on who is using it. Strictly speaking, bioinformatics is a subset of the larger field of computational biology, the application of quantitative analytical techniques in modeling biological systems. In this book, we stray from bioinformatics into computational biology and back again. The distinctions between the two aren't important for our purpose here, which is to cover a range of tools and techniques we believe are critical for molecular biologists who want to understand and apply the basic computational tools that are available today.
The field of bioinformatics relies heavily on work by experts in statistical methods and pattern recognition. Researchers come to bioinformatics from many fields, including mathematics, computer science, and linguistics. Unfortunately, biology is a science of the specific as well as the general. Bioinformatics is full of pitfalls for those who look for patterns and make predictions without a complete understanding of where biological data comes from and what it means. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics makes it possible to do exciting things such as compare DNA sequences and generate results that are potentially significant. "Potentially significant" is perhaps the most important phrase. These new tools also give you the opportunity to overinterpret data and assign meaning where none really exists. We can't overstate the importance of understanding the limitations of these tools. But once you gain that understanding and become an intelligent consumer of bioinformatics methods, the speed at which your research progresses can be truly amazing.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
How Is Computing Changing Biology?
An organism's hereditary and functional information is stored as DNA, RNA, and proteins, all of which are linear chains composed of smaller molecules. These macromolecules are assembled from a fixed alphabet of well-understood chemicals: DNA is made up of four deoxyribonucleotides (adenine, thymine, cytosine, and guanine), RNA is made up of four ribonucleotides (adenine, uracil, cytosine, and guanine), and proteins are made from the 20 amino acids. Because these macromolecules are linear chains of defined components, they can be represented as sequences of symbols. These sequences can then be compared to find similarities that suggest the molecules are related by form or function.
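To make the idea of sequences as strings of symbols concrete, here is a minimal Python sketch. It assumes the conventional one-letter codes (A, C, G, T, U) and compares two sequences of equal length position by position. The function names and the example sequences are our own illustrations, and real comparison tools also handle insertions and deletions, which this sketch does not.

```python
# Alphabets named in the text, written with the conventional one-letter codes.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def is_valid(sequence, alphabet):
    """Return True if every symbol in the sequence belongs to the alphabet."""
    return set(sequence) <= alphabet


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Illustrative sequences only; they do not come from any real gene.
    first, second = "GATTACA", "GACTATA"
    print(is_valid(first, DNA_ALPHABET))
    print(round(percent_identity(first, second), 1))
```

A high percentage of matching positions is only a hint that two molecules may be related; judging whether a match is meaningful takes the statistical care discussed earlier in this chapter.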
Sequence comparison is possibly the most useful computational tool to emerge for molecular biologists. The World Wide Web has made it possible for a single public database of genome sequence data to provide services through a uniform interface to a worldwide community of users. With a commonly used computer program called BLAST, a molecular biologist can compare an uncharacterized DNA sequence to the entire publicly held collection of DNA sequences. In the next section, we present an example of how sequence comparison using the BLAST program can help you gain insight into a real disease.
Fruit flies (Drosophila melanogaster) are a popular model system for the study of development of animals from embryo to adult. Fruit flies have a gene called eyeless, which, if it's "knocked out" (i.e., eliminated from the genome using molecular biology methods), results in fruit flies with no eyes. It's obvious that the eyeless gene plays a role in eye development.
Researchers have identified a human gene responsible for a condition called aniridia. In humans who are missing this gene (or in whom the gene has mutated just enough for its protein product to stop functioning properly), the eyes develop without irises.
Isn't Bioinformatics Just About Building Databases?
Much of what we currently think of as part of bioinformatics—sequence comparison, sequence database searching, sequence analysis—is more complicated than just designing and populating databases. Bioinformaticians (or computational biologists) go beyond just capturing, managing, and presenting data, drawing inspiration from a wide variety of quantitative fields, including statistics, physics, computer science, and engineering. Figure 1-2 shows how quantitative science intersects with biology at every level, from analysis of sequence data and protein structure, to metabolic modeling, to quantitative analysis of populations and ecology.
Figure 1-2: How technology intersects with biology
Bioinformatics is first and foremost a component of the biological sciences. The main goal of bioinformatics isn't developing the most elegant algorithms or the most arcane analyses; the goal is finding out how living things work. Like the molecular biology methods that greatly expanded what biologists were capable of studying, bioinformatics is a tool and not an end in itself. Bioinformaticians are the tool-builders, and it's critical that they understand biological problems as well as computational solutions in order to produce useful tools.
Research in bioinformatics and computational biology can encompass anything from abstraction of the properties of a biological system into a mathematical or physical model, to implementation of new algorithms for data analysis, to the development of databases and web tools to access them.
Biology as a science of the specific means that biologists need to remember a lot of details as well as general principles. Biologists have been dealing with problems of information management since the 17th century.
The roots of the concept of evolution lie in the work of early biologists who catalogued and compared species of living things. The cataloguing of species was the preoccupation of biologists for nearly three centuries, beginning with animals and plants and continuing with microscopic life upon the invention of the compound microscope. New forms of life and fossils of previously unknown, extinct life forms are still being discovered even today.
What Does Informatics Mean to Biologists?
The science of informatics is concerned with the representation, organization, manipulation, distribution, maintenance, and use of information, particularly in digital form. There is more than one interpretation of what bioinformatics—the intersection of informatics and biology—actually means, and it's quite possible to go out and apply for a job doing bioinformatics and find that the expectations of the job are entirely different than you thought.
The functional aspect of bioinformatics is the representation, storage, and distribution of data. Intelligent design of data formats and databases, creation of tools to query those databases, and development of user interfaces that bring together different tools to allow the user to ask complex questions about the data are all aspects of the development of bioinformatics infrastructure.
Developing analytical tools to discover knowledge in data is the second, and more scientific, aspect of bioinformatics. There are many levels at which we use biological information, whether we are comparing sequences to develop a hypothesis about the function of a newly discovered gene, breaking down known 3D protein structures into bits to find patterns that can help predict how the protein folds, or modeling how proteins and metabolites in a cell work together to make the cell function. The ultimate goal of analytical bioinformaticians is to develop predictive methods that allow scientists to model the function and phenotype of an organism based only on its genome sequence. This is a grand goal, and one that will be approached only in small steps, by many scientists working together.
What Challenges Does Biology Offer Computer Scientists?
The goal of biology, in the era of the genome projects, is to develop a quantitative understanding of how living things are built from the genome that encodes them.
Cracking the genome code is complex. At the very simplest level, we still have difficulty identifying unknown genes by computer analysis of genomic sequence. We still have not managed to predict or model how a chain of amino acids folds into the specific structure of a functional protein.
Beyond the single-molecule level, the challenges are immense. The sheer amount of data in GenBank is now growing at an exponential rate, and as datatypes beyond DNA, RNA, and protein sequence begin to undergo the same kind of explosion, simply managing, accessing, and presenting this data to users in an intelligible form is a critical task. Human-computer interaction specialists need to work closely with academic and clinical researchers in the biological sciences to manage such staggering amounts of data.
Biological data is very complex and interlinked. A spot on a DNA array, for instance, is connected not only to immediate information about its intensity, but to layers of information about genomic location, DNA sequence, structure, function, and more. Creating information systems that allow biologists to seamlessly follow these links without getting lost in a sea of information is also a huge opportunity for computer scientists.
Finally, each gene in the genome isn't an independent entity. Multiple genes interact to form biochemical pathways, which in turn feed into other pathways. Biochemistry is influenced by the external environment, by interaction with pathogens, and by other stimuli. Putting genomic and biochemical data together into quantitative and predictive models of biochemistry and physiology will be the work of a generation of computational biologists. Computer scientists, mathematicians, and statisticians will be a vital part of this effort.
What Skills Should a Bioinformatician Have?
There's a wide range of topics that are useful if you're interested in pursuing bioinformatics, and it's not possible to learn them all. However, in our conversations with scientists working at companies such as Celera Genomics and Eli Lilly, we've picked up on the following "core requirements" for bioinformaticians:
  • You should have a fairly deep background in some aspect of molecular biology. It can be biochemistry, molecular biology, molecular biophysics, or even molecular modeling, but without a core of knowledge of molecular biology you will, as one person told us, "run into brick walls too often."
  • You must absolutely understand the central dogma of molecular biology. Understanding how and why DNA sequence is transcribed into RNA and translated into protein is vital. (In Chapter 2, we define the central dogma, as well as review the processes of transcription and translation.)
  • You should have substantial experience with at least one or two major molecular biology software packages, either for sequence analysis or molecular modeling. The experience of learning one of these packages makes it much easier to learn to use other software quickly.
  • You should be comfortable working in a command-line computing environment. Working in Linux or Unix will provide this experience.
  • You should have experience with programming in a computer language such as C/C++, as well as in a scripting language such as Perl or Python.
There are a variety of other advanced skill sets that can add value to this background: molecular evolution and systematics; physical chemistry—kinetics, thermodynamics and statistical mechanics; statistics and probabilistic methods; database design and implementation; algorithm development; molecular biology laboratory methods; and others.
Why Should Biologists Use Computers?
Computers are powerful devices for understanding any system that can be described in a mathematical way. As our understanding of biological processes has grown and deepened, it isn't surprising, then, that the disciplines of computational biology and, more recently, bioinformatics, have evolved from the intersection of classical biology, mathematics, and computer science.
Biochemistry is often an anecdotal science. If you notice a disease or trait of interest, the imperative to understand it may drive the progress of research in that direction. Based on their interest in a particular biochemical process, biochemists have determined the sequence or structure or analyzed the expression characteristics of a single gene product at a time. Often this leads to a detailed understanding of one biochemical pathway or even one protein. How a pathway or protein interacts with other biological components can easily remain a mystery, due to lack of hands to do the work, or even because the need to do a particular experiment isn't communicated to other scientists effectively.
The Internet has changed how scientists share data and made it possible for one central warehouse of information to serve an entire research community. But more importantly, experimental technologies are rapidly advancing to the point at which it's possible to imagine systematically collecting all the data of a particular type in a central "factory" and then distributing it to researchers to be interpreted.
In the 1990s, the biology community embarked on an unprecedented project: sequencing all the DNA in the human genome. Even though a first draft of the human genome sequence has been completed, automated sequencers are still running around the clock, determining the entire sequences of genomes from various life forms that are commonly used for biological research. And we're still fine-tuning the data we've gathered about the human genome over the last 10 years. Immense strings of data, in which the locations of only a relatively few important genes are known, have been and still are being generated. Using image-processing techniques, maps of entire genomes can now be generated much more quickly than they could with chemical mapping techniques, but even with this technology, complete and detailed mapping of the genomic data that is now being produced may take years.
How Can I Configure a PC to Do Bioinformatics Research?
Up to now you've probably gotten by using word-processing software and other canned programs that run under user-friendly operating systems such as Windows or Mac OS. In order to make the most of bioinformatics, you need to learn Unix, the classic operating system of powerful computers known as servers and workstations. Most scientific software is developed on Unix machines, and serious researchers will want access to programs that can be run only under Unix. Unix comes in a number of flavors, the two most popular being BSD and SunOS. Recently, however, a third choice has entered the marketplace: Linux. Linux is an open source Unix operating system. In Chapter 3, Chapter 4, and Chapter 5, we discuss how to set up a workstation for bioinformatics running under Linux. We cover the operating system and how it works: how files are organized, how programs are run, how processes are managed, and most importantly, what to type at the command prompt to get the computer to do what you want.
Setting up your computer with a Linux operating system allows you to take advantage of cutting-edge scientific-research tools developed for Unix systems. As it has grown popular in the mass market, Linux has retained the power of Unix systems for developing, compiling, and running programs, networking, and managing jobs started by multiple users, while also providing the standard trimmings of a desktop PC, including word processors, graphics programs, and even visual programming tools. This book operates on the assumption that you're willing to learn how to work on a Unix system and that you'll be working on a machine that has Linux or another flavor of Unix installed. For many of the specific bioinformatics tools we discuss, Unix is the most practical choice.
On the other hand, Unix isn't necessarily the most practical choice for office productivity in a predominantly Mac or PC environment. The selection of available word processing and desktop publishing software and peripheral devices for Linux is improving as the popularity of the operating system increases. However, it can't (yet) go head-to-head with the consumer operating systems in these areas. Linux is no more difficult to maintain than a normal PC operating system, once you know how, but the skills needed and the problems you'll encounter will be new at first.
What Information and Software Are Available?
In Chapter 6, we cover information literacy. Only a few years ago, biologists had to know how to do literature searches using printed indexes that led them to references in the appropriate technical journals. Modern biologists search web-based databases for the same information and have access to dozens of other information types as well. Knowing how to navigate these resources is a vital skill for every biologist, computational or not.
We then introduce the basic tools you'll need to locate databases, computer programs, and other resources on the Web, to transfer these resources to your computer, and to make them work once you get them there. In Chapter 7 through Chapter 11 we turn to particular types of scientific questions and the tools you will need to answer them. In some cases, there are computer programs that are becoming the standard for solving a particular type of problem (e.g., BLAST and FASTA for amino acid and nucleic acid sequence alignment). In other areas, where the method for solving a problem is still an open research question, there may be a number of competing tools, or there may be no tool that completely solves the problem.
Handling large volumes of complex data requires a systematic and automated approach. If you're searching a database for matches to one query, a web form will do the trick. But what if you want to search for matches to 10,000 queries, and then sort through the information you get back to find relationships in the results? You certainly don't want to type 10,000 queries into a web form, and you probably don't want your results to come back formatted to look nice on a web page. Shared public web servers are often slow, and using them to process large batches of data is impractical. Chapter 12 contains examples of how to use Perl as a driver to make your favorite program process large volumes of data using your own computer.
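The batch-processing idea can be sketched in a few lines. The example below is written in Python rather than the Perl used in Chapter 12, and it is only an illustration under stated assumptions: the queries are in FASTA format, and the gc_fraction function is a hypothetical stand-in for whatever analysis program you would really run on each query. The results come back as tab-delimited text, which is easy to sort and filter with other programs.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def gc_fraction(sequence):
    """Stand-in analysis (hypothetical): fraction of G and C symbols."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def run_batch(handle):
    """Apply the analysis to every query; return tab-delimited result lines."""
    return [
        f"{name}\t{len(seq)}\t{gc_fraction(seq):.2f}"
        for name, seq in read_fasta(handle)
    ]


if __name__ == "__main__":
    # Two made-up queries; a real run would open a file holding thousands.
    queries = io.StringIO(">query1 example\nGGCC\nAATT\n>query2\nATAT\n")
    for row in run_batch(queries):
        print(row)
```

The same loop works whether the file holds two queries or 10,000, which is the point of driving an analysis from a script instead of a web form.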
Can I Learn a Programming Language Without Classes?
Anyone who has experience with designing and carrying out an experiment to answer a question has the basic skills needed to program a computer. A laboratory experiment begins with a question, which evolves into a testable hypothesis, that is, a statement that can be tested for truth based on the results of an experiment or experiments. The processes developed to test the hypotheses are analogous to computer programs. The essence of an experiment is: if you take system X, and do something to it, what happens? The experiment that is done must be designed to have results that can be clearly interpreted. Computer programs must also be carefully designed so that the values that are passed from one part of a program to the next can be clearly interpreted. The human programmer must set up unambiguous instructions to the computer and must think through, in advance, what different types of results mean and what the computer should do with them. A large part of practical computer programming is the ability to think critically, to design a process to answer a question, and to understand what is required to answer the question unambiguously.
Even if you have these skills, learning a computer language isn't a trivial undertaking, but it has been made a lot easier in recent years by the development of the Perl language. Perl, referred to by its creator as "the duct tape of the Internet, and of everything else," began its evolution as a scripting language optimized for data processing. It continues to evolve into a full-featured programming language, and it's practical to use Perl to develop prototypes for virtually any kind of computer program. Perl is a very flexible language; you can learn just enough to write a simple script to solve a one-off problem, and after you've done that once or twice, you have a core of knowledge to build on. The key to learning Perl is to use it and to use it right away. Just as no amount of reading the textbook can make you speak Spanish fluently, no amount of reading O'Reilly's Learning Perl is going to be as helpful as getting out there and trying to "speak" it. In Chapter 12, we provide example Perl code for parsing common biological datatypes, driving and processing output from programs written in other languages, and even a couple of Perl implementations that solve common computational biology problems. We hope these examples inspire you to try a little programming of your own.
How Can I Use Web Information?
Chapter 6 also introduces the public databases where biological data is archived to be shared by researchers worldwide.
While you can quickly find a single protein structure file or DNA sequence file by filling in a web form and searching a public database, it's likely that eventually you will want to work with more than one piece of data. You may even be collecting and archiving your own data; you may want to make a new type of data available to a broader research community. To do these things efficiently, you need to store data on your own computer. If you want to process your stored data using a computer program, you need to structure your data. Understanding the difference between structured and unstructured data and designing a data format that suits your data storage and access needs is the key to making your data useful and accessible.
There are many ways to organize data. While most biological data is still stored in flat file databases, this type of database becomes inefficient when the quantity of data being stored becomes extremely large. Chapter 13 covers the basic database concepts you need to talk to database experts and to build your own databases. We discuss the differences between flat file and relational databases, introduce the best public-domain tools for managing databases, and show you how to use them to store and access your data.
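As a rough illustration of the difference between a flat file and a relational database, the Python sketch below loads a tiny pipe-delimited flat file into an in-memory SQLite table and then asks a question of it. Everything specific here is an assumption made for the example: the records are invented toy data, the field layout and table name are our own, and SQLite is simply a convenient database that ships with Python, not necessarily one of the tools covered in Chapter 13.

```python
import sqlite3

# Invented toy records: identifier|name|organism|sequence, one per line.
FLAT_FILE = """\
P1|kinase A|human|MKTAYIAK
P2|kinase B|mouse|MKSAYLAK
P3|protease C|human|MSTNPKPQ
"""


def load_records(text):
    """Parse one pipe-delimited record per line into tuples of fields."""
    return [tuple(line.split("|")) for line in text.splitlines() if line.strip()]


def build_database(records):
    """Load the records into an in-memory relational table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE protein ("
        "id TEXT PRIMARY KEY, name TEXT, organism TEXT, sequence TEXT)"
    )
    connection.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", records)
    connection.commit()
    return connection


def names_for_organism(connection, organism):
    """Return protein names for one organism, ordered by identifier."""
    rows = connection.execute(
        "SELECT name FROM protein WHERE organism = ? ORDER BY id", (organism,)
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    database = build_database(load_records(FLAT_FILE))
    print(names_for_organism(database, "human"))
```

With the flat file, answering the same question means reading and splitting every line each time; with the table, the database does the searching, and the query stays the same as the data grows.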
How Do I Understand Sequence Alignment Data?
It's hard to make sense of your data, or make a point, without visualization tools. The extraction of cross sections or subsets of complex multivariate data sets is often required to make sense of biological data. Storing your data in structured databases, which are discussed in Chapter 13, creates the infrastructure for analysis of complex data.
Once you've stored data in an accessible, flexible format, the next step is to extract what is important to you and visualize it. Whether you need to make a histogram of your data or display a molecular structure in three dimensions and watch it move in real time, there are visualization tools that can do what you want. Chapter 14 covers data-analysis and data-visualization tools, from generic plotting packages to domain-specific programs for marking up biological sequence alignments, displaying molecular structures, creating phylogenetic trees, and a host of other purposes.
How Do I Write a Program to Align Two Biological Sequences?
An important component of any kind of computational science is knowing when you need to write a program yourself and when you can use code someone else has written. The efficient programmer is a lazy programmer; she never wastes effort writing a program if someone else has already made a perfectly good program available. If you are looking to do something fairly routine, such as aligning two protein sequences, you can be sure that someone else has already written the program you need and that by searching you can probably even find some source code to look at. Similarly, many mathematical and statistical problems can be solved using standard code that is freely available in code libraries. Perl programmers make code that simplifies standard operations available in modules; there are many freely available modules that manage web-related processes, and there are projects underway to create standard modules for handling biological-sequence data.
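To give a feel for what such an alignment program does underneath, here is a minimal Python sketch of global alignment scoring by dynamic programming, the approach usually credited to Needleman and Wunsch. It is for illustration only and is exactly the kind of code you should normally not write yourself: the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, it returns only the score rather than the alignment, and established tools use substitution matrices and more realistic gap penalties.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences (linear gap penalty)."""
    # previous[j] is the best score for aligning the prefix of seq_a handled
    # so far against the first j symbols of seq_b. Row 0 is all gaps.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]  # first i symbols of seq_a against nothing
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap        # gap in seq_b
            left = current[j - 1] + gap   # gap in seq_a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Illustrative sequences only.
    print(global_alignment_score("ACGT", "ACT"))
```

Keeping only two rows of the scoring table saves memory but discards the information needed to recover the alignment itself; a full implementation keeps the whole table and traces back through it.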
How Do I Predict Protein Structure from Sequence?
There are some questions we can't answer for you, and that's one of them; in fact, it's one of the biggest open research questions in computational biology. What we can and do give you are the tools to find information about such problems and others who are working on them, and even, with the proper inspiration, to develop approaches to answering them yourself. Bioinformatics, like any other science, doesn't always provide quick and easy answers to problems.
What Questions Can Bioinformatics Answer?
The questions that drive (and fund) bioinformatics research are the same questions humans have been working away at in applied biology for the last few hundred years. How can we cure disease? How can we prevent infection? How can we produce enough food to feed all of humanity? Companies in the business of developing drugs, agricultural chemicals, hybrid plants, plastics and other petroleum derivatives, and biological approaches to environmental remediation, among others, are developing bioinformatics divisions and looking to bioinformatics to provide new targets and to help replace scarce natural resources.
The existence of genome projects implies our intention to use the data they generate. The implicit goals of modern molecular biology are, simply stated, to read the entire genomes of living things, to identify every gene, to match each gene with the protein it encodes, and to determine the structure and function of each protein. Detailed knowledge of gene sequence, protein structure and function, and gene expression patterns is expected to give us the ability to understand how life works at the highest possible resolution. Implicit in this is the ability to manipulate living things with precision and accuracy.
Chapter 2: Computational Approaches to Biological Questions
There is a standard range of techniques that are taught in bioinformatics courses. Currently, most of the important techniques are based on one key principle: that sequence and structural homology (or similarity) between molecules can be used to infer structural and functional similarity. In this chapter, we'll give you an overview of the standard computer techniques available to biologists; later in the book, we'll discuss how specific software packages implement these techniques and how you should use them.
Molecular Biology's Central Dogma
Before we go any further, it's essential that you understand some basics of cell and molecular biology. If you're already familiar with DNA and protein structure, genes, and the processes of transcription and translation, feel free to skip ahead to the next section.
The central dogma of molecular biology states that:
DNA acts as a template to replicate itself, DNA is also transcribed into RNA, and RNA is translated into protein.
As you can see, the central dogma sums up the function of the genome in terms of information. Genetic information is conserved and passed on to progeny through the process of replication. Genetic information is also used by the individual organism through the processes of transcription and translation. There are many layers of function, at the structural, biochemical, and cellular levels, built on top of genomic information. But in the end, all of life's functions come back to the information content of the genome.
Put another way, genomic DNA contains the master plan for a living thing. Without DNA, organisms wouldn't be able to replicate themselves. The raw "one-dimensional" sequence of DNA, however, doesn't actually do anything biochemically; it's only information, a blueprint if you will, that's read by the cell's protein synthesizing machinery. DNA sequences are the punch cards; cells are the computers.
DNA is a linear polymer made up of individual chemical units called nucleotides or bases. The four nucleotides that make up the DNA sequences of living things (on Earth, at least) are adenine, guanine, cytosine, and thymine—designated A, G, C, and T, respectively. The order of the nucleotides in the linear DNA sequence contains the instructions that build an organism. Those instructions are read in processes called replication, transcription, and translation.
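To make the flow of information concrete, here is a minimal Python sketch that treats a DNA sequence as a plain string over the letters A, G, C, and T, transcribes it to RNA, and translates it to protein. The function names are our own. The sketch also assumes details not covered above: that the input is the coding strand, read from its first base, and that three-letter codons are looked up in the standard genetic code. Treat it as an illustration rather than a production tool.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position;
# "*" marks a stop codon. (Assumption: the standard code applies.)
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the RNA copy of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # raises KeyError on letters other than A, C, G, T
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
print(translate("ATGGCCTAA"))   # MA
```

The one-letter codes in the output (M for methionine, A for alanine) are the conventional abbreviations for amino acids.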
The unusual structure of DNA molecules gives DNA special properties. These properties allow the information stored in DNA to be preserved and passed from one cell to another, and thus from parents to their offspring. Two molecules of DNA form a double-helical structure, twining around each other in a regular pattern along their full length, which can be millions of nucleotides. The halves of the double helix are held together by bonds between the nucleotides on each strand. The nucleotides also bond in particular ways: A can pair only with T, and G can pair only with C. Each of these pairs is referred to as a base pair.
What Biologists Model
Now that we've completed our ultra-short course in cell biology, let's look at how to apply it to problems in molecular biology. One of the most important exercises in biology and bioinformatics is modeling. A model is an abstract way of describing a complicated system. Turning something as complex (and confusing) as a chromosome, or the cycle of cell division, into a simplified representation that captures all the features you are trying to study can be extremely difficult. A model helps us see the larger picture. One feature of a good model is that it makes systems that are otherwise difficult to study easier to analyze using quantitative approaches. Bioinformatics tools rely on our ability to extract relevant parameters from a biological system (be it a single molecule or something as complicated as a cell), describe them quantitatively, and then develop computational methods that use those parameters to compute the properties of a system or predict its behavior.
To help you understand what a model is and what kind of analysis a good model makes possible, let's look at three examples on which bioinformatics methods are based.
In reality, DNA and proteins are complicated 3D molecules, composed of thousands or even millions of atoms bonded together. However, DNA and proteins are both polymers , chains of repeating chemical units (monomers ) with a common backbone holding them together. Each chemical unit in the polymer has two subsets of atoms: a subset of atoms that doesn't vary from monomer to monomer and that makes up the backbone of the polymer, and a subset of atoms that does vary from monomer to monomer.
In DNA, four nucleotide monomers (A, T, C, and G) are commonly used to build the polymer chain. In proteins, 20 amino acid monomers are used. In a DNA chain, the four nucleotides can occur in any order, and the order in which they occur determines what the DNA does. In a protein, amino acids can occur in any order, and their order determines the protein's fold and function.
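This polymer view is what lets us handle a molecule as a string of letters. The short Python sketch below shows one way to represent the two alphabets and to extract a simple quantitative parameter from a sequence. The function names are our own, the protein alphabet uses the conventional one-letter amino acid codes, and the GC fraction is only one example of a parameter you might compute; none of this is prescribed by the text above.

```python
from collections import Counter

# One letter per monomer: 4 nucleotides for DNA, 20 amino acids for proteins.
DNA_ALPHABET = frozenset("ACGT")
PROTEIN_ALPHABET = frozenset("ACDEFGHIKLMNPQRSTVWY")


def composition(sequence: str, alphabet: frozenset) -> Counter:
    """Count each monomer in a sequence, rejecting letters outside the alphabet."""
    sequence = sequence.upper()
    unknown = set(sequence) - alphabet
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return Counter(sequence)


def gc_fraction(dna: str) -> float:
    """Fraction of positions in a DNA sequence that are G or C."""
    counts = composition(dna, DNA_ALPHABET)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


print(gc_fraction("ATGC"))                             # 0.5
print(composition("MKTAYIAK", PROTEIN_ALPHABET)["K"])  # 2
```

Keeping the alphabet separate from the counting code means the same function works for either kind of polymer.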
Why Biologists Model
We've mentioned more than once that theoretical modeling provides testable hypotheses, not definitive answers. It sometimes isn't so easy to maintain this distinction, especially with pairwise sequence comparison, which seems to provide such ready answers. Even identification of genes based on sequence similarity ultimately needs to be validated experimentally. It's not sufficient to say that an unknown DNA sequence is similar to the sequence of a gene that has been subject to detailed characterization, so therefore it must have an identical function. The two sequences could be distantly related but have evolved to have different functions. However, it's altogether reasonable to use sequence similarity as the starting point for verification; if sequence homology suggests that an unknown gene is similar to citrate synthases, your first experimental approach might be to test the unknown gene product for citrate synthase activity.
One of the main benefits of using computational tools in biology is that it becomes easier to preselect targets for experimentation in molecular biology and biochemistry. Using everything from sequence profiling methods to geometric and physicochemical analysis of protein structures, researchers can focus narrowly on the parts of a sequence or structure that appear to have some functional significance. Only a decade ago, this focusing might have been done using "shotgun" approaches to site-directed mutagenesis, in which random single-residue mutants of a protein were created and characterized in order to select possible targets. Functional genomics and metabolic reconstruction efforts are beginning to provide biochemists with a framework for narrowing their research focuses as well.
For the researcher focused on developing bioinformatics methods, the discovery of general rules and properties in data is by far the most interesting category of problems that can be addressed using a computer. It's also a diverse category and one we can't give you many rules for. Researchers have found interesting and useful properties in everything from sequence patterns to the separation of atoms in molecular structures and have applied these findings to produce such tools as genefinders, secondary structure prediction tools, profile methods, and homology modeling tools.
Computational Methods Covered in This Book
Molecular biology research is a fast-growing area. The amount and type of data that can be gathered is exploding, and the trend of storing this data in public databases is spilling over from genome sequence to all sorts of other biological datatypes. The information landscape for biologists is changing so rapidly that anything we say in this book is likely to be somewhat behind the times before it even hits the shelves.
Yet, since the inception of the Human Genome Project, a core set of computational approaches has emerged for dealing with the types of data that are currently shared in public databases—DNA, protein sequence, and protein structure. Although databases containing results from new high-throughput molecular biology methods have not yet grown to the extent the sequence databases have, standard methods for analyzing these data have begun to emerge.
While not exhaustive, the following list gives you an overview of the computational methods we address in this book:
Using public databases and data formats
The first key skill for biologists is to learn to use online search tools to find information. Literature searching is no longer a matter of looking up references in a printed index. You can find links to most of the scientific publications you need online. There are central databases that collect reference information so you can search dozens of journals at once. You can even set up "agents" that notify you when new articles are published in an area of interest. Searching the public molecular-biology databases requires the same skills as searching for literature references: you need to know how to construct a query statement that will pluck the particular needle you're looking for out of the database haystack. Tools for searching biochemical literature and sequence databases are introduced in Chapter 6.
Sequence alignment and sequence searching
A Computational Biology Experiment
Computer-based research projects and computational analysis of experimental data must follow the same principles as any other scientific study. Your results must clearly answer the question you set out to test, and they must be reproducible by someone else using the same input data and following the same process.
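One practical way to support reproducibility is to record exactly which input data and settings produced a result. The Python sketch below is our own illustration, not a method described in this book: it stores a checksum of the input alongside the program name and parameters. The function name, the example sequence, the tool name, and the parameter are all hypothetical.

```python
import hashlib
import json


def run_record(input_data: bytes, program: str, parameters: dict) -> str:
    """Return a JSON description of one computational run."""
    record = {
        # The checksum changes if even one byte of the input changes.
        "input_sha256": hashlib.sha256(input_data).hexdigest(),
        "program": program,
        "parameters": parameters,
    }
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical input and settings, invented for this example.
fasta = b">protein_X\nMKTAYIAKQR\n"
print(run_record(fasta, "my_alignment_tool 1.0", {"gap_penalty": 10}))
```

Saving a record like this next to each result file lets someone else confirm they are starting from the same data and the same settings.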
If you're already doing research in experimental biology, you probably have a pretty good understanding of the scientific method. Although your data, your method, and your results are all encoded in computer files rather than sitting on your laboratory bench, the process of designing a computational "experiment" is the same one you are used to.
Although it's easy in these days of automation to simply submit a query to a search engine and use the results without thinking too much about it, you need to understand your method and analyze your results thoroughly in the same way you would when applying a laboratory protocol. Sometimes that's easier said than done. So let's take a walk through the steps involved in defining an experiment in computational biology.
A scientific experiment always begins with a question. A question can be as broad as "what is the catalytic mechanism of protein X?" It's not always possible to answer a complex question about how something works with one experiment. The question needs to be broken down into parts, each of which can be formulated as a hypothesis.
A hypothesis is a statement that is testable by experiment. In the course of solving a problem, you will probably formulate a number of testable statements, some of them trivial and some more complex. For instance, as a first approach to answering the question, "What is the catalytic mechanism of protein X?", you might come up with a preliminary hypothesis such as: "There are amino acids in protein X that are conserved in other proteins that do the same thing as protein X." You can test this hypothesis by using a computer program to align the sequences of as many protein X-type proteins as you can find, and look for amino acids that are identical among all or most of the sequences. Subsequently you'd move to another hypothesis such as: "Some of these conserved amino acids in the protein X family have something to do with the catalytic mechanism." This more complex hypothesis can then be broken down into a number of smaller ones, each of them testable (perhaps by a laboratory experiment, or perhaps by another computational procedure).
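The first hypothesis above can be tested with very little code once the sequences have been aligned. The Python sketch below assumes the alignment has already been produced by some other program and is supplied as equal-length strings, with "-" marking gaps. The function name, the `min_fraction` threshold, and the toy alignment are our own inventions for illustration.

```python
from collections import Counter


def conserved_positions(alignment: list[str], min_fraction: float = 1.0) -> list[tuple[int, str]]:
    """Return (column index, residue) for columns where one residue
    appears in at least min_fraction of the sequences."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    hits = []
    # zip(*alignment) walks the alignment column by column.
    for col, residues in enumerate(zip(*alignment)):
        counts = Counter(r for r in residues if r != "-")
        if not counts:
            continue  # column contains only gaps
        residue, n = counts.most_common(1)[0]
        if n / len(alignment) >= min_fraction:
            hits.append((col, residue))
    return hits


# Toy alignment invented for this example; "-" marks a gap.
alignment = [
    "MKHLS-GD",
    "MRHLSAGD",
    "MKHISAGE",
]
print(conserved_positions(alignment))
# [(0, 'M'), (2, 'H'), (4, 'S'), (6, 'G')]
```

Lowering `min_fraction` relaxes the test from "identical among all" to "identical among most" of the sequences. Column indices are counted from zero.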
Chapter 3: Setting Up Your Workstation
In this chapter, we discuss how to set up a workstation running the Linux operating system. Linux is a free, open source version of Unix that makes it possible to turn an ordinary PC into a powerful workstation. By configuring your system with Linux and other open source software, you can have access to a lot of powerful computational biology and bioinformatics tools at a low cost.
In writing this chapter, we encountered a bit of a paradox—in order to get around in Unix you need to have your computer set up, but in order to set up your computer you need to know a few things about Unix. If you don't have much experience with Unix, we strongly suggest that you look through Chapter 4 and Chapter 5 before you set up a Linux workstation of your own. If you're already familiar with the ins and outs of Unix, feel free to skip ahead to Chapter 6.
Working on a Unix System
You are probably accustomed to working with personal computers; you may be familiar with windows interfaces, word processors, and even some data-analysis packages. But if you want to use computers as a serious component in your research, you need to work on computer systems that run under Unix or related multiuser operating systems.
Computer hardware without an operating system is like a dead animal. It isn't going to react, it isn't going to function; it's just going to sit there and look at you with glassy eyes until it rots (or rusts). The operating system breathes life into the inert body of your computer. It handles the low-level processes that make hardware work together and provides an environment in which you can run and develop programs. The most important function of the operating system is that it allows you convenient access to your files and programs.
So if the operating system is something you're not supposed to notice, why worry about which one you're using? Why use Unix?
Unix is a powerful operating system for multiuser computer systems. It has been in existence for over 25 years, and during that time has been used primarily in industry and academia, where networked systems and multiuser high-performance computer systems are required. Unix is optimized for tasks that are only fairly recent additions to personal-computer operating systems, or which are still not even available in some PC operating systems: networking with other computers, initiating multiple asynchronous tasks, retaining unique information about the work environments of multiple users, and protecting the information stored by individual users from other users of the system. Unix is the operating system of the World Wide Web; the software that powers the Web was invented in Unix, and many if not most web servers run on Unix servers.
Because Unix has been used extensively in universities, where much software for scientific data analysis is developed, you will find a lot of good-quality, interesting scientific software written for Unix systems. Computational biology and bioinformatics researchers are especially likely to have developed software for Unix, since until the mid-1990s, the only workstations able to visualize protein structure data in real time were Silicon Graphics and Sun Unix workstations.
Setting Up a Linux Workstation
If you are already using an existing Unix/Linux system, feel free to skip this section and go directly to the next.
If you are used to working with Macintosh or PC operating systems, the simplest way to set up a Linux workstation or server is to go out and buy a PC that comes with Linux preinstalled. VA Linux, for example, offers a variety of Intel Pentium-based workstations and servers preconfigured with your choice of several of the most popular Linux distributions.
If you're looking for a complete, self-contained bioinformatics system, Iobion Systems (http://www.iobion.com) is developing Iobion, a ground-breaking bioinformatics network server appliance developed using open source technologies. Iobion is an Intel-based hardware system that comes preinstalled with Linux, Apache web server, a PostgreSQL relational database, the R statistical language, and a comprehensive suite of bioinformatics tools and databases. The system serves these scientific applications to web clients on a local intranet or over the Internet. The applications include tools for microarray data analysis complete with a microarray database, sequence analysis and annotation tools, local copies of the public sequence databases, a peer-to-peer networking tool for sharing biological data, and advanced biological lab tools. Iobion promotes and adheres to open standards in bioinformatics.
If you already have a PC, your next choice is to buy a prepackaged version of Linux, such as those offered by Red Hat, Debian, or SuSE. These prepackaged distributions have several advantages: they have an easy-to-use graphical interface for installing Linux; all the software they include is bundled as package-manager archives (for Red Hat, the Red Hat Package Manager, or RPM) or in similar easily extracted formats; and they often contain a large number of "extras" that are easier to install from the distribution disk using a package manager than by hand.
That said, let's assume you've gone out and bought something like the current version of Red Hat. You'll be asked if you want to do a workstation installation, a server installation, or a custom installation. What do these choices mean?
How to Get Software Working
You've gone out and done the research and found a bioinformatics software package you want to install on your own computer. Now what do you do?
When you look for Unix software on the Web, you will find that it's distributed in a number of different formats. Each type of software distribution requires a different type of handling. Some are very simple to install, almost like installing software on a Mac or PC. On the other hand, some software is distributed in a rudimentary form that requires your active intervention to get it running. In order to get this software working, you may have to compile it by hand or even modify the directions that are sent to the compiler so that the program will work on your system. Compiling is the process of converting software from its human-readable form, source code, to a machine-readable executable form. A compiler is the program that performs this conversion.
Software that's difficult to install isn't necessarily bad software. It may be high-quality software from a research group that doesn't have the resources to produce an easy-to-use installation kit. While this is becoming less common, it's still common enough that you will need to know some things about compiling software.
Software is often distributed as a tar archive; tar is short for "tape archive." We discuss tar and other file-compression options in more detail in Chapter 5. These archives are one of the most common ways to distribute Unix software on the Internet. tar allows you to download one file that contains the complete image of the developer's working software installation and unpack it right back into the correct subdirectories. If tar is used with the p option, file permissions can even be preserved. This ensures that, if the developer has done a competent job of packing all the required files in the tar archive, you can compile the software relatively easily.
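To see what "unpacking right back into the correct subdirectories" means, the sketch below packs a tiny directory tree into an archive and restores it elsewhere. It uses Python's standard `tarfile` module rather than the tar command itself, so it's an analogy to what tar does and not the procedure you'd normally follow at the shell prompt. The project name and file contents are hypothetical, and the sketch doesn't address file permissions.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_and_unpack(workdir: Path) -> list[str]:
    """Pack a small source tree into a tar archive, then unpack it elsewhere."""
    # Hypothetical project layout, invented for this example.
    src = workdir / "mytool"
    (src / "src").mkdir(parents=True)
    (src / "README").write_text("build instructions\n")
    (src / "src" / "main.c").write_text("int main(void) { return 0; }\n")

    # Create a gzip-compressed archive containing the whole tree.
    archive = workdir / "mytool.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="mytool")

    # Unpack into a fresh directory; subdirectories are recreated.
    dest = workdir / "unpacked"
    dest.mkdir()
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)

    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    restored = pack_and_unpack(Path(tmp))

print(restored)  # ['mytool/README', 'mytool/src/main.c']
```

Only unpack archives from sources you trust, since an archive can contain paths that write outside the directory you intended.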
What Software Is Needed?
New computational biology software is always popping up, but through a couple of decades of collective experience, a consensus set of tools and methods has emerged. Many scientists are familiar with standard commercial packages for sequence analysis, such as GCG, and for protein structure analysis, such as Quanta or Insight. For beginners, these packages provide an integrated interface to a variety of tools.
Commercial software packages for sequence analysis integrate a number of functions, including mapping and fragment assembly, database searching, gene discovery, pairwise and multiple sequence analysis, motif identification, and evolutionary analysis. One caveat is that these software packages can be prohibitively expensive. It can be difficult, especially for educational institutions and research groups on a limited budget, to purchase commercial software and pay the annual costs for license maintenance (which can be in the many thousands of dollars).
A related cost issue is that many commercial software packages, especially those for macromolecular structure analysis, don't yet run on consumer PCs. These packages were originally developed for high-end workstations when these workstations were the only computers with sufficient graphics capability to display protein structures. Although these days most home computers have high-powered graphics cards, the makers of commercial molecular modeling software have been slow to keep up.
While commercial computational biology software packages can be excellent and easy to use, they often seem to lag at least a couple of years behind cutting-edge method development. The company that produces a commercial software package usually commits to only one method for each type of tool, buys it at a particular phase in its development cycle, focuses on turning it into a commercially viable product, and may not incorporate developments in the method into their package in a timely fashion, or at all.
On the other hand, while academic software is usually on the cutting edge, it can be poorly written and hard to install. Documentation (beyond the published paper that describes the software) may be nonexistent. Graphical user interfaces in academic software packages are often rudimentary, which can be aggravating for the beginning user.