BLAST

by Ian Korf, Mark Yandell, Joseph Bedell

July 2003

Intermediate to advanced

368 pages

13h 44m

English

O'Reilly Media, Inc.

Table of Contents
Foreword
Preface
Audience for This Book
Structure of This Book
A Little Math, a Little Perl
Conventions Used in This Book
URLs Referenced in This Book
Comments and Questions
Acknowledgments
Ian
Mark
Joey
Part I
Hello BLAST
What Is BLAST?
Using NCBI-BLAST
Choosing the BLAST Program
Entering the Query Sequence
Choosing the Database to Search
Choosing the Parameters of the Search
Choosing the Format
Submitting the Search
Viewing the Results
Alternate Output Formats
Alternate Alignment Views
The Next Step
Further Reading
Part II
Biological Sequences
The Central Dogma of Molecular Biology
DNA
RNA
Protein
The Genetic Code
Evolution
Mutation
Natural Selection
Genetic Drift
The Neutral Theory of Evolution
Molecular Clocks
Homology, Phylogeny, and Trees
The Tree of Life
Genomes and Genes
Prokaryotic Genes
Eukaryotic Genes
Transcripts
Repeats
Pseudogenes
Biological Sequences and Similarity
Further Reading
Sequence Alignment
Global Alignment: Needleman-Wunsch
Initialization
Fill
Trace-Back
Local Alignment: Smith-Waterman
Dynamic Programming
Algorithmic Complexity
Global Versus Local
Variations
Gap Modifications
Reduced Memory
Aligning Transcripts to Genomic Sequence
Final Thoughts
Further Reading
Sequence Similarity
Introduction to Information Theory
Amino Acid Similarity
Scoring Matrices
PAM and BLOSUM Matrices
Target Frequencies, lambda, and H
Lambda
Relative Entropy
Match-Mismatch Scoring
Sequence Similarity
Karlin-Altschul Statistics
Gapped Alignments
Length Correction
Sum Statistics and Sum Scores
Converting a Sum Score to a Sum Probability
Probability Versus Expectation
Further Reading
Part III
BLAST
The Five BLAST Programs
The BLAST Algorithm
Seeding
Implementation details
Extension
Implementation details
Evaluation
Implementation details
Further Reading
Anatomy of a BLAST Report
Basic Structure
Alignments
BLASTP
BLASTN
BLASTX
TBLASTN
TBLASTX
Alignment Groups
A BLAST Statistics Tutorial
Basic BLAST Statistics
Actual Versus Effective Lengths
The Raw Score and Bit Score
The Expect of an HSP
The WU-BLAST P-Value
Sum Statistics
An Expect(n) Means That Sum Statistics Were Applied
Sum Statistics Are Pair-Wise in Their Focus
The Sum Score
Effective Length of a BLASTX Query
Calculating a Sum Score
Calculating the Pair-Wise Sum P-Value
Correcting for Multiple Tests
Correcting for Database Size
Frame- and Size-Corrected Expects
Using Statistics to Understand BLAST Results
Where Did My Oligo Go?
Karlin-Altschul Statistics as a Tool for Further Investigation
What It All Means
20 Tips to Improve Your BLAST Searches
8.1 Don’t Use the Default Parameters
8.2 Treat BLAST Searches as Scientific Experiments
8.3 Perform Controls, Especially in the Twilight Zone
8.4 View BLAST Reports Graphically
8.5 Use the Karlin-Altschul Equation to Design Experiments
8.6 When Troubleshooting, Read the Footer First
8.7 Know When to Use Complexity Filters
8.8 Mask Repeats in Genomic DNA
8.9 Segment Large Genomic Sequences
8.10 Be Skeptical of Hypothetical Proteins
8.11 Expect Contaminants in EST Databases
8.12 Use Caution When Searching Raw Sequencing Reads
8.13 Look for Stop Codons and Frame-Shifts to Find Pseudo-Genes
8.14 Consider Using Ungapped Alignment for BLASTX, TBLASTN, and TBLASTX
8.15 Look for Gaps in Coverage as a Sign of Missed Exons
8.16 Parse BLAST Reports with Bioperl
8.17 Perform Pilot Experiments
8.18 Examine Statistical Outliers
8.19 Use links and topcomboN to Make Sense of Alignment Groups
8.20 How to Lie with BLAST Statistics
BLAST Protocols
BLASTN Protocols
Mapping Oligos to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Mapping Nonspliced DNA to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Mapping a cDNA/EST to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Cross-Species Sequence Exploration
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Annotating Genomic DNA with ESTs
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Transcript Clustering and Extension
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Clustering with blastclust
Approach
EST clustering
Shotgun sequences
Expected results
Vector Clipping
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Repeat Masking
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Contaminant Detection
BLASTP Protocols
The Standard BLASTP Search
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Fast, Insensitive Search
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Slow, Sensitive Search
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
BLASTX Protocols
Gene Finding in Genomic DNA
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Annotating ESTs (and Shotgun Sequence)
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Super-Fast BLASTX
Approach
NCBI-BLAST parameters
WU-BLAST parameters
WU-BLAST 1.4 parameters
Expected results
Optimizations and variations
TBLASTN Protocols
Mapping a Protein to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Mining ESTs (and Shotgun DNA) for Protein Similarities
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
TBLASTX Protocols
Preventing Stop Codons
Finding Undocumented Genes in Genomic DNA
Approach
NCBI-BLAST
WU-BLAST
Expected results
Optimizations and variations
Transcript-Transcript TBLASTX
Approach
NCBI-BLAST
WU-BLAST
Expected results
Optimizations and variations
Part IV
Installation and Command-Line Tutorial
NCBI-BLAST Installation
Unix Installation
Files and directories
The .ncbirc file
Setting the PATH and BLASTDB environment variables
Windows Installation
The ncbi.ini file
Setting the PATH environment variable
Macintosh OS X Installation
Macintosh OS 9 Installation
WU-BLAST Installation
Expanding the tarball
Files and Directories
Executables
Environment Variables
Setting Resource Limits with /etc/sysblast
Command-Line Tutorial
NCBI-BLAST
formatdb
blastn
megablast
blastp
blastx
tblastn
tblastx
bl2seq
fastacmd
PSI-BLAST
PHI-BLAST
Environment variables and .ncbirc
WU-BLAST
xdformat
blastn
blastp
blastx
tblastn
tblastx
xdget
nrdb and patdb
Environment variables
Editing Scoring Matrices
BLAST Databases
FASTA Files
NCBI Identifier Format
Compound identifiers
Concatenated definition lines
Descriptions
BLAST Databases
Large Databases
Large NCBI databases
Large WU-BLAST databases
Virtual Databases
Alias Databases
Removing Redundancy
Standard BLAST Databases
Custom BLAST Databases
Sequence Databases
International Nucleotide Sequence Database
Database Growth
Flat Files
ACCESSION, LOCUS, VERSION, and GI
DEFINITION, KEYWORDS, and SOURCE
FEATURES
Other Common Databases
Sequence Database Management Strategies
Queries, Indexes, and Reports
Local Database Considerations
Retrieving FASTA Files by Accession
Flat File Indexing
Commercial Sequence Management Software
Tools on the Internet
Hardware and Software Optimizations
The Persistence of Memory
BLAST Pipelines and Caching
CPUs and Computer Architecture
Multiprocessor Computers
Operating Systems and Compilers
Compute Clusters
Remote Versus Local Databases
Remote databases
Local databases
Distributed Resource Management
Software Tricks
Multiplexing/Query Packing
Query Chopping
Database Splitting
Serial BLAST Searching
Optimized NCBI-BLAST
Apple/Genentech BLAST
Paracel-BLAST and BlastMachine
TimeLogic Tera-BLAST
Part V
NCBI-BLAST Reference
Usage Statements
Command-Line Syntax
blastall Parameters
-a [integer]
-A [integer]
-b [integer]
-B [integer]
-d [database]
-D [1..23]
-e [real number]
-E [integer]
-f [integer]
-F [T/F], -F [string]
-g [T/F]
-G [integer]
-i [input file]
-I [T/F]
-J [T/F]
-K [integer]
-l [file]
-L [string]
-m [0..11]
-M [matrix file]
-n [T/F]
-o [output file]
-p [program name]
-P [0/1]
-q [negative integer]
-Q [1..23]
-r [integer]
-R [checkpoint file]
-S [1..3]
-t [integer]
-T [T/F]
-v [integer]
-w [integer]
-W [integer]
-X [integer]
-y [integer]
-Y [real number]
-z [real number]
-Z [integer]
formatdb Parameters
-B [file]
-F [file]
-i [file]
-l [file]
-L [file]
-n [string]
-o [T/F]
-p [T/F]
-s [T/F]
-t [string]
-v [integer]
-V [T/F]
fastacmd Parameters
-a [T/F]
-c [T/F]
-d [string]
-D [T/F]
-i [file]
-I
-l [integer]
-L [integer],[integer]
-o [file]
-p [T/F/G]
-P [integer]
-s [string]
-S [1..2]
-t [T/F]
-T [T/F]
megablast Parameters
-a [integer]
-A [integer]
-b [integer]
-d [string]
-D [0..3]
-e [real number]
-E [integer]
-f [T/F]
-F [T/F] [string]
-G [integer]
-H [integer]
-i [file]
-I [T/F]
-l [file]
-L [string]
-m [0..11]
-M [integer]
-n [T/F]
-N [0,1,2]
-o [file]
-p [real number]
-P [integer]
-q [negative integer]
-Q [file]
-r [integer]
-R [T/F]
-s [integer]
-S [0..3]
-t [16,18,21]
-T [T/F]
-U [T/F]
-v [integer]
-W [integer]
-X [integer]
-y [integer]
-z [real number]
-Z [integer]
bl2seq Parameters
-a [file]
-A [T/F]
-d [real number]
-D [0/1]
-e [real number]
-E [integer]
-F [T/F] [string]
-g [T/F]
-G [integer]
-i [file]
-I [integer],[integer]
-j [file]
-J [integer],[integer]
-m [T/F]
-M [string]
-o [file]
-p [string]
-q [negative integer]
-r [integer]
-S [1..3]
-t [integer]
-T [T/F]
-U [T/F]
-W [integer]
-X [integer]
-Y [real number]
blastpgp Parameters (PSI-BLAST and PHI-BLAST)
PSI-BLAST
PHI-BLAST
-a [integer]
-A [integer]
-b [integer]
-B [file]
-c [integer]
-C [file]
-d [string]
-e [real]
-E [integer]
-f [integer]
-F [string]
-g [T/F]
-G [integer]
-h [real number]
-H [integer]
-i [file]
-I [T/F]
-j [integer]
-J [T/F]
-k [file]
-K [integer]
-l [string]
-L [integer]
-m [0..9]
-M [string]
-N [real number]
-o [file]
-O [file]
-p [string]
-Q [file]
-R [file]
-s [T/F]
-S [integer]
-t [T/F]
-T [T/F]
-U [T/F]
-v [integer]
-W [1..3]
-X [integer]
-y [real number]
-Y [real number]
-z [real number]
-Z [integer]
blastclust Parameters
-a [integer]
-b [T/F]
-c [file]
-C [T/F]
-d [file]
-e [T/F]
-i [file]
-l [file]
-L [real number]
-p [T/F]
-r [file]
-s [file]
-v [file]
-W [integer]
WU-BLAST Reference
Usage Statements
Command-Line Syntax
WU-BLAST Parameters
altscore=[string]
B=[integer]
bottomcpus=[integer]
dbrecmax=[integer]
dbrecmin=[integer]
E=[number]
E2=[number]
echofilter
errors
filter=[string]
gapE2=[number]
gapH=[number]
gapK=[number]
gapL=[number]
gapS2=[integer]
gapsepqmax=[int]
gapsepsmax=[int]
gapX
gi
golf=[number]
golmax=[integer]
gspmax=[integer]
H=[number]
hspmax=[integer]
hitdist=[integer]
hspsepqmax=[int]
hspsepsmax=[int]
K=[number]
kap
L=[number]
lcfilter
lcmask
links
M=[integer]
maskextra=[integer]
matrix=[file]
N=[integer]
nogap
nonnegok
nosegs
notes
novalidctxok
nwlen=[integer]
nwstart=[integer]
o=[file]
olf=[number]
olmax=[integer]
postsw
Q=[integer]
qoffset=[integer]
qrecmax=[integer]
Qrecmin=[integer]
R=[integer]
restest
S=[integer]
mS2=[integer]
seqtest
span, span1, span2
T=[integer]
top
topcomboN=[integer]
V=[integer]
warnings
wink=[integer]
wordmask=[method]
W=[integer]
X=[integer]
Y=[number]
Z=[number]
xdformat Parameters
-A [0..2]
-a [database]
-c [character]
-D [integer]
-d [string]
-e [file]
-G
-i
-K [integer]
-k
-L [number]
-l [number]
-M [number]
-O [4..8]
-P [integer]
-q [0..3]
-r
-T [string]
-v
-X
xdget Parameters
-A [n, 0]
-a [integer]
-b [integer]
-d
-D [integer]
-e [file]
-F
-f
-G
-o [file]
-N [0, n]
-P [integer]
-r
-T [string]
-t
Part VI
NCBI Display Formats
Brief Descriptions
Detailed Descriptions and Examples
Option 0: Pairwise Alignments
Query-Anchored Alignments
Option 1: Query-Anchored Showing Identities
Option 2: Query-Anchored, No Identities
Option 3: Flat Query-Anchored Showing Identities
Option 4: Flat Query-Anchored, No Identities
Option 5: Query-Anchored, No Identities, and Blunt Ends
Option 6: Flat Query-Anchored, No Identities, and Blunt Ends
Option 7: XML
Option 8: Tabular, Without Comment Lines
Option 9: Tabular, with Comment Lines
Option 10: ASN.1 Text Format
Option 11: ASN.1 Binary Format
Nucleotide Scoring Schemes
NCBI-BLAST Scoring Schemes
NCBI-BLAST Matrices and Gap Costs
blast-imager.pl
blast2table.pl
Glossary
Index

Content preview from BLAST

Chapter 4: Sequence Similarity

Sequence Similarity

Sequence similarity is a simple extension of amino acid or nucleotide similarity. To determine it, sum up the individual pair-wise scores in an alignment. For example, the raw score of the following BLAST alignment under the BLOSUM62 matrix is 72. Converting 72 to a normalized score is as simple as multiplying by lambda. (Note that for BLAST statistical calculations, the normalized score is λS − ln K.)

Query: 885 QCPVCHKKYSNALVLQQHIRLHTGE 909
           +C VC K ++    L++H RLHTGE
Sbjct: 267 ECDVCSKSFTTKYFLKKHKRLHTGE 291
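
As a rough sketch of the arithmetic just described, the Python below sums BLOSUM62 pair scores over the columns of this alignment and then applies λS − ln K. Only the BLOSUM62 entries needed for these 25 columns are included. The values λ = 0.3176 and K = 0.134 are assumptions (figures commonly quoted for ungapped BLOSUM62 scoring, not given in this excerpt), and the function and constant names are illustrative only.

```python
import math

# Subset of BLOSUM62 covering only the residue pairs in this alignment.
BLOSUM62_SUBSET = {
    ("Q", "E"): 2, ("C", "C"): 9, ("P", "D"): -1, ("V", "V"): 4,
    ("H", "S"): -1, ("K", "K"): 5, ("K", "S"): 0, ("Y", "F"): 3,
    ("S", "T"): 1, ("N", "T"): 0, ("A", "K"): -1, ("L", "Y"): -1,
    ("V", "F"): -1, ("L", "L"): 4, ("Q", "K"): 1, ("H", "H"): 8,
    ("I", "K"): -3, ("R", "R"): 5, ("T", "T"): 5, ("G", "G"): 6,
    ("E", "E"): 5,
}


def pair_score(a, b):
    """Look up one pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def raw_score(query, sbjct):
    """Sum the independent pair-wise scores of an ungapped alignment."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(q, s) for q, s in zip(query, sbjct))


def normalized_score(raw, lam, k):
    """Normalized score in nats: lambda * S - ln K."""
    return lam * raw - math.log(k)


QUERY = "QCPVCHKKYSNALVLQQHIRLHTGE"
SBJCT = "ECDVCSKSFTTKYFLKKHKRLHTGE"

# Assumed Karlin-Altschul parameters (ungapped BLOSUM62); not from the text.
ASSUMED_LAMBDA = 0.3176
ASSUMED_K = 0.134

score = raw_score(QUERY, SBJCT)
nats = normalized_score(score, ASSUMED_LAMBDA, ASSUMED_K)
bits = nats / math.log(2)  # dividing by ln 2 converts nats to bits

print(score)  # 72, the raw score quoted in the text
print(round(nats, 2), round(bits, 2))
```

Because each column is scored independently, the total does not depend on the order in which columns are visited, which is the property the next paragraph discusses.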

Recall from Chapter 3 that the score of each pair of letters is considered independently from the rest of the alignment. This is the same idea. There is a convenient synergy between alignment algorithms and alignment scores. However, when treating the letters independently of one another, you lose contextual information. Can you assume that the probability of A followed by G is the same as the probability of G followed by A? In a natural language such as English, you know that this doesn’t make sense. In English, Q is always followed by U. If you treat these letters independently, you lose this restriction. The context rules for biological sequences aren’t as strict as for English, but there are tendencies. For example, low

ISBN: 0596002998
