Mastering Python for Bioinformatics

by Ken Youens-Clark

May 2021

Intermediate to advanced

454 pages

10h 42m

English

O'Reilly Media, Inc.

Who Should Read This? | Programming Style: Why I Avoid OOP and Exceptions | Structure | Test-Driven Development | Using the Command Line and Installing Python | Getting the Code and Tests | Installing Modules | Installing the new.py Program | Why Did I Write This Book? | Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
Getting Started | Creating the Program Using new.py | Using argparse | Tools for Finding Errors in the Code | Introducing Named Tuples | Adding Types to Named Tuples | Representing the Arguments with a NamedTuple | Reading Input from the Command Line or a File | Testing Your Program | Running the Program to Test the Output | Solution 1: Iterating and Counting the Characters in a String | Counting the Nucleotides | Writing and Verifying a Solution | Additional Solutions | Solution 2: Creating a count() Function and Adding a Unit Test | Solution 3: Using str.count() | Solution 4: Using a Dictionary to Count All the Characters | Solution 5: Counting Only the Desired Bases | Solution 6: Using collections.defaultdict() | Solution 7: Using collections.Counter() | Going Further | Review
Getting Started | Defining the Program’s Parameters | Defining an Optional Parameter | Defining One or More Required Positional Parameters | Using nargs to Define the Number of Arguments | Using argparse.FileType() to Validate File Arguments | Defining the Args Class | Outlining the Program Using Pseudocode | Iterating the Input Files | Creating the Output Filenames | Opening the Output Files | Writing the Output Sequences | Printing the Status Report | Using the Test Suite | Solutions | Solution 1: Using str.replace() | Solution 2: Using re.sub() | Benchmarking | Going Further | Review
Getting Started | Iterating Over a Reversed String | Creating a Decision Tree | Refactoring | Solutions | Solution 1: Using a for Loop and Decision Tree | Solution 2: Using a Dictionary Lookup | Solution 3: Using a List Comprehension | Solution 4: Using str.translate() | Solution 5: Using Bio.Seq | Review
Getting Started | An Imperative Approach | Solutions | Solution 1: An Imperative Solution Using a List as a Stack | Solution 2: Creating a Generator Function | Solution 3: Using Recursion and Memoization | Benchmarking the Solutions | Testing the Good, the Bad, and the Ugly | Running the Test Suite on All the Solutions | Going Further | Review
Getting Started | Get Parsing FASTA Using Biopython | Iterating the Sequences Using a for Loop | Solutions | Solution 1: Using a List | Solution 2: Type Annotations and Unit Tests | Solution 3: Keeping a Running Max Variable | Solution 4: Using a List Comprehension with a Guard | Solution 5: Using the filter() Function | Solution 6: Using the map() Function and Summing Booleans | Solution 7: Using Regular Expressions to Find Patterns | Solution 8: A More Complex find_gc() Function | Benchmarking | Going Further | Review
Getting Started | Iterating the Characters of Two Strings | Solutions | Solution 1: Iterating and Counting | Solution 2: Creating a Unit Test | Solution 3: Using the zip() Function | Solution 4: Using the zip_longest() Function | Solution 5: Using a List Comprehension | Solution 6: Using the filter() Function | Solution 7: Using the map() Function with zip_longest() | Solution 8: Using the starmap() and operator.ne() Functions | Going Further | Review
Getting Started | K-mers and Codons | Translating Codons | Solutions | Solution 1: Using a for Loop | Solution 2: Adding Unit Tests | Solution 3: Another Function and a List Comprehension | Solution 4: Functional Programming with the map(), partial(), and takewhile() Functions | Solution 5: Using Bio.Seq.translate() | Benchmarking | Going Further | Review
Getting Started | Finding Subsequences | Solutions | Solution 1: Using the str.find() Method | Solution 2: Using the str.index() Method | Solution 3: A Purely Functional Approach | Solution 4: Using K-mers | Solution 5: Finding Overlapping Patterns Using Regular Expressions | Benchmarking | Going Further | Review

Getting Started | Managing Runtime Messages with STDOUT, STDERR, and Logging | Finding Overlaps | Grouping Sequences by the Overlap | Solutions | Solution 1: Using Set Intersections to Find Overlaps | Solution 2: Using a Graph to Find All Paths | Going Further | Review
Getting Started | Finding the Shortest Sequence in a FASTA File | Extracting K-mers from a Sequence | Solutions | Solution 1: Counting Frequencies of K-mers | Solution 2: Speeding Things Up with a Binary Search | Going Further | Review
Getting Started | Downloading Sequences Files on the Command Line | Downloading Sequences Files with Python | Writing a Regular Expression to Find the Motif | Solutions | Solution 1: Using a Regular Expression | Solution 2: Writing a Manual Solution | Going Further | Review
Getting Started | Creating the Product of Lists | Avoiding Overflow with Modular Multiplication | Solutions | Solution 1: Using a Dictionary for the RNA Codon Table | Solution 2: Turn the Beat Around | Solution 3: Encoding the Minimal Information | Going Further | Review
Getting Started | Finding All Subsequences Using K-mers | Finding All Reverse Complements | Putting It All Together | Solutions | Solution 1: Using the zip() and enumerate() Functions | Solution 2: Using the operator.eq() Function | Solution 3: Writing a revp() Function | Testing the Program | Going Further | Review
Getting Started | Translating Proteins Inside Each Frame | Finding the ORFs in a Protein Sequence | Solutions | Solution 1: Using the str.index() Function | Solution 2: Using the str.partition() Function | Solution 3: Using a Regular Expression | Going Further | Review
Using Seqmagick to Analyze Sequence Files | Checking Files Using MD5 Hashes | Getting Started | Formatting Text Tables Using tabulate() | Solutions | Solution 1: Formatting with tabulate() | Solution 2: Formatting with rich | Going Further | Review
Finding Lines in a File Using grep | The Structure of a FASTQ Record | Getting Started | Guessing the File Format | Solution | Going Further | Review
Understanding Markov Chains | Getting Started | Understanding Random Seeds | Reading the Training Files | Generating the Sequences | Structuring the Program | Solution | Going Further | Review
Getting Started | Reviewing the Program Parameters | Defining the Parameters | Nondeterministic Sampling | Structuring the Program | Solutions | Solution 1: Reading Regular Files | Solution 2: Reading a Large Number of Compressed Files | Going Further | Review
Introduction to BLAST | Using csvkit and csvchk | Getting Started | Defining the Arguments | Parsing Delimited Text Files Using the csv Module | Parsing Delimited Text Files Using the pandas Module | Solutions | Solution 1: Manually Joining the Tables Using Dictionaries | Solution 2: Writing the Output File with csv.DictWriter() | Solution 3: Reading and Writing Files Using pandas | Solution 4: Joining Files Using pandas | Going Further | Review
Makefiles Are Recipes | Running a Specific Target | Running with No Target | Makefiles Create DAGs | Using make to Compile a C Program | Using make for a Shortcut | Defining Variables | Writing a Workflow | Other Workflow Managers | Further Reading

Content preview from Mastering Python for Bioinformatics

Chapter 6. Finding the Hamming Distance: Counting Point Mutations

The Hamming distance, named after the same Richard Hamming mentioned in the Preface, is the number of single-character substitutions required to change one string into another of the same length. It’s one metric for gauging sequence similarity. I have shown a couple of other metrics for this, starting in Chapter 1 with tetranucleotide frequency and continuing in Chapter 5 with GC content. While the latter can be practically informative, as coding regions tend to be GC-rich, tetranucleotide frequency falls pretty short of being useful. For example, the sequences AAACCCGGGTTT and CGACGATATGTC are wildly different yet produce the same base frequencies:

$ ./dna.py AAACCCGGGTTT
3 3 3 3
$ ./dna.py CGACGATATGTC
3 3 3 3

Taken alone, tetranucleotide frequency makes these sequences seem identical, but they would clearly produce entirely different protein sequences and so would be functionally unlike. Figure 6-1 depicts an alignment of the two sequences, indicating that only 3 of the 12 bases are shared, meaning they are only 25% similar.
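
The two claims above can be checked with a minimal sketch. The helper name base_counts() is hypothetical, and collections.Counter is just one convenient way to tally characters; this is not necessarily how dna.py itself is written.

```python
from collections import Counter


def base_counts(dna: str) -> tuple:
    """Return the counts of A, C, G, and T, in that order."""
    counts = Counter(dna)
    # A Counter returns 0 for any base that never appears
    return tuple(counts[base] for base in "ACGT")


seq1, seq2 = "AAACCCGGGTTT", "CGACGATATGTC"
print(base_counts(seq1))  # (3, 3, 3, 3)
print(base_counts(seq2))  # (3, 3, 3, 3)

# Count the aligned positions at which the two sequences share a base
matches = sum(a == b for a, b in zip(seq1, seq2))
print(matches, matches / len(seq1))  # 3 0.25
```

The base counts are identical even though only 3 aligned positions agree, which is why base frequency alone is a weak measure of similarity.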

Another way to express this is to say that 9 of the 12 bases need to be changed to turn one of the sequences into the other. This is the Hamming distance, and it’s somewhat equivalent in bioinformatics to single-nucleotide polymorphisms ...
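
As a rough illustration of the idea, the following sketch counts the mismatched positions between two sequences. The function name hamming() is hypothetical here, and treating each extra base in a longer sequence as one additional difference is an assumption; strictly speaking, the Hamming distance is defined only for strings of equal length.

```python
def hamming(seq1: str, seq2: str) -> int:
    """Count the positions at which two sequences differ."""
    # Assumption: each extra base in the longer sequence counts as one difference
    length_diff = abs(len(seq1) - len(seq2))
    # zip() stops at the end of the shorter sequence
    return length_diff + sum(a != b for a, b in zip(seq1, seq2))


print(hamming("AAACCCGGGTTT", "CGACGATATGTC"))  # 9
```

Summing booleans works because Python treats True as 1 and False as 0, so the generator expression adds one for each mismatched pair.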

ISBN: 9781098100872

Figure 6-1. An alignment of two sequences with vertical bars showing matching bases
