BLAST

by Ian Korf, Mark Yandell, Joseph Bedell

July 2003

Intermediate to advanced

368 pages

13h 44m

English

O'Reilly Media, Inc.

Table of Contents
Foreword
Preface
Audience for This Book
Structure of This Book
A Little Math, a Little Perl
Conventions Used in This Book
URLs Referenced in This Book
Comments and Questions
Acknowledgments
Ian
Mark
Joey
Part I
Hello BLAST
What Is BLAST?
Using NCBI-BLAST
Choosing the BLAST Program
Entering the Query Sequence
Choosing the Database to Search
Choosing the Parameters of the Search
Choosing the Format
Submitting the Search
Viewing the Results
Alternate Output Formats
Alternate Alignment Views
The Next Step
Further Reading
Part II
Biological Sequences
The Central Dogma of Molecular Biology
DNA
RNA
Protein
The Genetic Code
Evolution
Mutation
Natural Selection
Genetic Drift
The Neutral Theory of Evolution
Molecular Clocks
Homology, Phylogeny, and Trees
The Tree of Life
Genomes and Genes
Prokaryotic Genes
Eukaryotic Genes
Transcripts
Repeats
Pseudogenes
Biological Sequences and Similarity
Further Reading
Sequence Alignment
Global Alignment: Needleman-Wunsch
Initialization
Fill
Trace-Back
Local Alignment: Smith-Waterman
Dynamic Programming
Algorithmic Complexity
Global Versus Local
Variations
Gap Modifications
Reduced Memory
Aligning Transcripts to Genomic Sequence
Final Thoughts
Further Reading
Sequence Similarity
Introduction to Information Theory
Amino Acid Similarity
Scoring Matrices
PAM and BLOSUM Matrices
Target Frequencies, lambda, and H
Lambda
Relative Entropy
Match-Mismatch Scoring
Sequence Similarity
Karlin-Altschul Statistics
Gapped Alignments
Length Correction
Sum Statistics and Sum Scores
Converting a Sum Score to a Sum Probability
Probability Versus Expectation
Further Reading
Part III
BLAST
The Five BLAST Programs
The BLAST Algorithm
Seeding
Implementation details
Extension
Implementation details
Evaluation
Implementation details
Further Reading
Anatomy of a BLAST Report
Basic Structure
Alignments
BLASTP
BLASTN
BLASTX
TBLASTN
TBLASTX
Alignment Groups
A BLAST Statistics Tutorial
Basic BLAST Statistics
Actual Versus Effective Lengths
The Raw Score and Bit Score
The Expect of an HSP
The WU-BLAST P-Value
Sum Statistics
An Expect(n) Means That Sum Statistics Were Applied
Sum Statistics Are Pair-Wise in Their Focus
The Sum Score
Effective Length of a BLASTX Query
Calculating a Sum Score
Calculating the Pair-Wise Sum P-Value
Correcting for Multiple Tests
Correcting for Database Size
Frame- and Size-Corrected Expects
Using Statistics to Understand BLAST Results
Where Did My Oligo Go?
Karlin-Altschul Statistics as a Tool for Further Investigation
What It All Means
20 Tips to Improve Your BLAST Searches
8.1 Don’t Use the Default Parameters
8.2 Treat BLAST Searches as Scientific Experiments
8.3 Perform Controls, Especially in the Twilight Zone
8.4 View BLAST Reports Graphically
8.5 Use the Karlin-Altschul Equation to Design Experiments
8.6 When Troubleshooting, Read the Footer First
8.7 Know When to Use Complexity Filters
8.8 Mask Repeats in Genomic DNA
8.9 Segment Large Genomic Sequences
8.10 Be Skeptical of Hypothetical Proteins
8.11 Expect Contaminants in EST Databases
8.12 Use Caution When Searching Raw Sequencing Reads
8.13 Look for Stop Codons and Frame-Shifts to Find Pseudo-Genes
8.14 Consider Using Ungapped Alignment for BLASTX, TBLASTN, and TBLASTX
8.15 Look for Gaps in Coverage as a Sign of Missed Exons
8.16 Parse BLAST Reports with Bioperl
8.17 Perform Pilot Experiments
8.18 Examine Statistical Outliers
8.19 Use links and topcomboN to Make Sense of Alignment Groups
8.20 How to Lie with BLAST Statistics
BLAST Protocols
BLASTN Protocols
Mapping Oligos to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Mapping Nonspliced DNA to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Mapping a cDNA/EST to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Cross-Species Sequence Exploration
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Annotating Genomic DNA with ESTs
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Transcript Clustering and Extension
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Clustering with blastclust
Approach
EST clustering
Shotgun sequences
Expected results
Vector Clipping
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Repeat Masking
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Contaminant Detection
BLASTP Protocols
The Standard BLASTP Search
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Fast, Insensitive Search
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Slow, Sensitive Search
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
BLASTX Protocols
Gene Finding in Genomic DNA
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Annotating ESTs (and Shotgun Sequence)
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Super-Fast BLASTX
Approach
NCBI-BLAST parameters
WU-BLAST parameters
WU-BLAST 1.4 parameters
Expected results
Optimizations and variations
TBLASTN Protocols
Mapping a Protein to a Genome
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
Mining ESTs (and Shotgun DNA) for Protein Similarities
Approach
NCBI-BLAST parameters
WU-BLAST parameters
Expected results
Optimizations and variations
TBLASTX Protocols
Preventing Stop Codons
Finding Undocumented Genes in Genomic DNA
Approach
NCBI-BLAST
WU-BLAST
Expected results
Optimizations and variations
Transcript-Transcript TBLASTX
Approach
NCBI-BLAST
WU-BLAST
Expected results
Optimizations and variations
Part IV
Installation and Command-Line Tutorial
NCBI-BLAST Installation
Unix Installation
Files and directories
The .ncbirc file
Setting the PATH and BLASTDB environment variables
Windows Installation
The ncbi.ini file
Setting the PATH environment variable
Macintosh OS X Installation
Macintosh OS 9 Installation
WU-BLAST Installation
Expanding the tarball
Files and Directories
Executables
Environment Variables
Setting Resource Limits with /etc/sysblast
Command-Line Tutorial
NCBI-BLAST
formatdb
blastn
megablast
blastp
blastx
tblastn
tblastx
bl2seq
fastacmd
PSI-BLAST
PHI-BLAST
Environment variables and .ncbirc
WU-BLAST
xdformat
blastn
blastp
blastx
tblastn
tblastx
xdget
nrdb and patdb
Environment variables
Editing Scoring Matrices
BLAST Databases
FASTA Files
NCBI Identifier Format
Compound identifiers
Concatenated definition lines
Descriptions
BLAST Databases
Large Databases
Large NCBI databases
Large WU-BLAST databases
Virtual Databases
Alias Databases
Removing Redundancy
Standard BLAST Databases
Custom BLAST Databases
Sequence Databases
International Nucleotide Sequence Database
Database Growth
Flat Files
ACCESSION, LOCUS, VERSION, and GI
DEFINITION, KEYWORDS, and SOURCE
FEATURES
Other Common Databases
Sequence Database Management Strategies
Queries, Indexes, and Reports
Local Database Considerations
Retrieving FASTA Files by Accession
Flat File Indexing
Commercial Sequence Management Software
Tools on the Internet
Hardware and Software Optimizations
The Persistence of Memory
BLAST Pipelines and Caching
CPUs and Computer Architecture
Multiprocessor Computers
Operating Systems and Compilers
Compute Clusters
Remote Versus Local Databases
Remote databases
Local databases
Distributed Resource Management
Software Tricks
Multiplexing/Query Packing
Query Chopping
Database Splitting
Serial BLAST Searching
Optimized NCBI-BLAST
Apple/Genentech BLAST
Paracel-BLAST and BlastMachine
TimeLogic Tera-BLAST
Part V
NCBI-BLAST Reference
Usage Statements
Command-Line Syntax
blastall Parameters
-a [integer]
-A [integer]
-b [integer]
-B [integer]
-d [database]
-D [1..23]
-e [real number]
-E [integer]
-f [integer]
-F [T/F], -F [string]
-g [T/F]
-G [integer]
-i [input file]
-I [T/F]
-J [T/F]
-K [integer]
-l [file]
-L [string]
-m [0..11]
-M [matrix file]
-n [T/F]
-o [output file]
-p [program name]
-P [0/1]
-q [negative integer]
-Q [1..23]
-r [integer]
-R [checkpoint file]
-S [1..3]
-t [integer]
-T [T/F]
-v [integer]
-w [integer]
-W [integer]
-X [integer]
-y [integer]
-Y [real number]
-z [real number]
-Z [integer]
formatdb Parameters
-B [file]
-F [file]
-i [file]
-l [file]
-L [file]
-n [string]
-o [T/F]
-p [T/F]
-s [T/F]
-t [string]
-v [integer]
-V [T/F]
fastacmd Parameters
-a [T/F]
-c [T/F]
-d [string]
-D [T/F]
-i [file]
-I
-l [integer]
-L [integer],[integer]
-o [file]
-p [T/F/G]
-P [integer]
-s [string]
-S [1..2]
-t [T/F]
-T [T/F]
megablast Parameters
-a [integer]
-A [integer]
-b [integer]
-d [string]
-D [0..3]
-e [real number]
-E [integer]
-f [T/F]
-F [T/F] [string]
-G [integer]
-H [integer]
-i [file]
-I [T/F]
-l [file]
-L [string]
-m [0..11]
-M [integer]
-n [T/F]
-N [0,1,2]
-o [file]
-p [real number]
-P [integer]
-q [negative integer]
-Q [file]
-r [integer]
-R [T/F]
-s [integer]
-S [0..3]
-t [16,18,21]
-T [T/F]
-U [T/F]
-v [integer]
-W [integer]
-X [integer]
-y [integer]
-z [real number]
-Z [integer]
bl2seq Parameters
-a [file]
-A [T/F]
-d [real number]
-D [0/1]
-e [real number]
-E [integer]
-F [T/F] [string]
-g [T/F]
-G [integer]
-i [file]
-I [integer],[integer]
-j [file]
-J [integer],[integer]
-m [T/F]
-M [string]
-o [file]
-p [string]
-q [negative integer]
-r [integer]
-S [1..3]
-t [integer]
-T [T/F]
-U [T/F]
-W [integer]
-X [integer]
-Y [real number]
blastpgp Parameters (PSI-BLAST and PHI-BLAST)
PSI-BLAST
PHI-BLAST
-a [integer]
-A [integer]
-b [integer]
-B [file]
-c [integer]
-C [file]
-d [string]
-e [real]
-E [integer]
-f [integer]
-F [string]
-g [T/F]
-G [integer]
-h [real number]
-H [integer]
-i [file]
-I [T/F]
-j [integer]
-J [T/F]
-k [file]
-K [integer]
-l [string]
-L [integer]
-m [0..9]
-M [string]
-N [real number]
-o [file]
-O [file]
-p [string]
-Q [file]
-R [file]
-s [T/F]
-S [integer]
-t [T/F]
-T [T/F]
-U [T/F]
-v [integer]
-W [1..3]
-X [integer]
-y [real number]
-Y [real number]
-z [real number]
-Z [integer]
blastclust Parameters
-a [integer]
-b [T/F]
-c [file]
-C [T/F]
-d [file]
-e [T/F]
-i [file]
-l [file]
-L [real number]
-p [T/F]
-r [file]
-s [file]
-v [file]
-W [integer]
WU-BLAST Reference
Usage Statements
Command-Line Syntax
WU-BLAST Parameters
altscore=[string]
B=[integer]
bottom
cpus=[integer]
dbrecmax=[integer]
dbrecmin=[integer]
E=[number]
E2=[number]
echofilter
errors
filter=[string]
gapE2=[number]
gapH=[number]
gapK=[number]
gapL=[number]
gapS2=[integer]
gapsepqmax=[int]
gapsepsmax=[int]
gapX
gi
golf=[number]
golmax=[integer]
gspmax=[integer]
H=[number]
hspmax=[integer]
hitdist=[integer]
hspsepqmax=[int]
hspsepsmax=[int]
K=[number]
kap
L=[number]
lcfilter
lcmask
links
M=[integer]
maskextra=[integer]
matrix=[file]
N=[integer]
nogap
nonnegok
nosegs
notes
novalidctxok
nwlen=[integer]
nwstart=[integer]
o=[file]
olf=[number]
olmax=[integer]
postsw
Q=[integer]
qoffset=[integer]
qrecmax=[integer]
Qrecmin=[integer]
R=[integer]
restest
S=[integer]
mS2=[integer]
seqtest
span, span1, span2
T=[integer]
top
topcomboN=[integer]
V=[integer]
warnings
wink=[integer]
wordmask=[method]
W=[integer]
X=[integer]
Y=[number]
Z=[number]
xdformat Parameters
-A [0..2]
-a [database]
-c [character]
-D [integer]
-d [string]
-e [file]
-G
-i
-K [integer]
-k
-L [number]
-l [number]
-M [number]
-O [4..8]
-P [integer]
-q [0..3]
-r
-T [string]
-v
-X
xdget Parameters
-A [n, 0]
-a [integer]
-b [integer]
-d
-D [integer]
-e [file]
-F
-f
-G
-o [file]
-N [0, n]
-P [integer]
-r
-T [string]
-t
Part VI
NCBI Display Formats
Brief Descriptions
Detailed Descriptions and Examples
Option 0: Pairwise Alignments
Query-Anchored Alignments
Option 1: Query-Anchored Showing Identities
Option 2: Query-Anchored, No Identities
Option 3: Flat Query-Anchored Showing Identities
Option 4: Flat Query-Anchored, No Identities
Option 5: Query-Anchored, No Identities, and Blunt Ends
Option 6: Flat Query-Anchored, No Identities, and Blunt Ends
Option 7: XML
Option 8: Tabular, Without Comment Lines
Option 9: Tabular, with Comment Lines
Option 10: ASN.1 Text Format
Option 11: ASN.1 Binary Format
Nucleotide Scoring Schemes
NCBI-BLAST Scoring Schemes
NCBI-BLAST Matrices and Gap Costs
blast-imager.pl
blast2table.pl
Glossary
Index

Content preview from BLAST

Sum Statistics and Sum Scores

Note that the expected HSP (high-scoring pair) length depends on the search space (m*n) and on the relative entropy of the scoring scheme, H, so it varies from search to search.

To take edge effects into account when calculating an Expect, the expected HSP length is subtracted from the actual length of the query, m, and from the actual number of residues in the database, n, to give their effective lengths, usually denoted by m´ and n´, respectively (see Equations 4-12 and 4-13).
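
The subtraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's Equations 4-12 and 4-13, which are not reproduced in this preview: it assumes the standard Karlin-Altschul form for the expected HSP length, ln(k*m*n)/H, and the function names and the numeric values of k, H, m, and n are hypothetical choices made for the example.

```python
import math

def expected_hsp_length(k, h, m, n):
    """Expected HSP length, assuming the standard form ln(k*m*n) / H."""
    return math.log(k * m * n) / h

def effective_lengths(k, h, m, n):
    """Subtract the expected HSP length from m and n.

    No lower bound is applied here, so the results can be negative.
    """
    length = expected_hsp_length(k, h, m, n)
    return m - length, n - length

# Illustrative values only (assumptions, not taken from the text):
# k and H resemble values often reported for gapped protein searches.
K, H = 0.041, 0.14
m_eff, n_eff = effective_lengths(K, H, m=300, n=1_000_000_000)
print(f"expected HSP length: {expected_hsp_length(K, H, 300, 1_000_000_000):.1f}")
print(f"effective query length m': {m_eff:.1f}")
```

With these assumed inputs the expected HSP length comes out at roughly 166 residues, so more than half of a 300-residue query is discounted.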

In a large search space, the expected HSP length may be greater than the length of the query, resulting in a negative effective length, m´. In practice, if the effective length is less than 1/k, it is set to 1/k, as doing so cancels the contribution of the short sequence to the Expect; setting m´ to 1/k in the Karlin-Altschul equation, $E = k\,m'\,n'\,e^{-\lambda S}$, for example, gives $E = n' e^{-\lambda S}$, a formulation independent of m´.
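
The effect of the 1/k floor can be checked numerically. The sketch below assumes the standard Karlin-Altschul Expect, E = k * m´ * n´ * exp(-lambda * S), which is not printed in this preview; the helper names and all numeric values are hypothetical and chosen only to illustrate the cancellation.

```python
import math

def clamp_effective_length(length, k):
    """Floor an effective length at 1/k, as described in the text."""
    return max(length, 1.0 / k)

def expect(k, lam, m_eff, n_eff, raw_score):
    """Karlin-Altschul Expect: E = k * m' * n' * exp(-lambda * S)."""
    return k * m_eff * n_eff * math.exp(-lam * raw_score)

# Illustrative values only (assumptions, not taken from the text).
K, LAMBDA = 0.041, 0.267
n_eff = 1.0e9

# A negative effective query length is replaced by 1/k ...
m_eff = clamp_effective_length(-20.0, K)

# ... so k * m' equals 1 and the Expect no longer depends on m'.
e_clamped = expect(K, LAMBDA, m_eff, n_eff, raw_score=100)
e_reduced = n_eff * math.exp(-LAMBDA * 100)
print(math.isclose(e_clamped, e_reduced, rel_tol=1e-9))
```

The comparison uses `math.isclose` rather than `==` because k * (1/k) is not guaranteed to be exactly 1 in floating point.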

Unfortunately, effective lengths of less than 1/k aren’t uncommon today. Because the expected HSP length is $\ln(kmn)/H$, the large size of many sequence databases can result in large expected HSP lengths. In fact, it’s not uncommon to see expected HSP lengths approaching 200 when searching some of the larger protein databases. Keep in mind that the average protein is ~300 amino acids long; thus, for many searches, m´ is being set to 1/k routinely. Recent work by S.F. Altschul

Publisher Resources

ISBN: 0596002998
