Mastering Python for Bioinformatics

by Ken Youens-Clark

May 2021

Intermediate to advanced

454 pages

10h 42m

English

O'Reilly Media, Inc.

Chapter 14. Finding Open Reading Frames

The ORF challenge is the last Rosalind problem I’ll tackle in this book. The goal is to find all the possible open reading frames (ORFs) in a sequence of DNA. An ORF is a region of nucleotides between the start codon and the stop codon. The solution will consider both the forward and reverse complement as well as frameshifts. Although there are existing tools such as TransDecoder to find coding regions, writing a bespoke solution brings together many skills from previous chapters, including reading a FASTA file, creating the reverse complement of a sequence, using string slices, finding k-mers, using multiple for loops/iterations, translating DNA, and using regular expressions.
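
To make the "forward and reverse complement as well as frameshifts" idea concrete, here is a minimal sketch of how the six reading frames of a DNA string might be enumerated. The function names `revcomp()` and `frames()` are hypothetical and are not taken from the book's solutions; the sketch also assumes uppercase input containing only A, C, G, and T.

```python
from typing import List


def revcomp(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    # Complement each base, then reverse the whole string.
    return dna.translate(str.maketrans('ACGT', 'TGCA'))[::-1]


def frames(dna: str) -> List[str]:
    """Return six reading frames: offsets 0, 1, 2 on each strand."""
    return [
        strand[offset:]
        for strand in (dna, revcomp(dna))
        for offset in range(3)
    ]


if __name__ == '__main__':
    for frame in frames('AAATTC'):
        print(frame)
```

Each frame would then be translated to protein and searched for ORFs, which is the part the solutions in this chapter address.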

You will learn:

How to truncate a sequence to a length evenly divisible by a codon size
How to use the str.find() and str.partition() functions
How to document a regular expression using code formatting, comments, and Python’s implicit string concatenation
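
The three techniques in this list can be sketched briefly. This is an illustration under stated assumptions, not the book's code: the names `truncate()`, `first_orf()`, and `ORF_RE` are hypothetical, and the protein strings assume that a stop codon is rendered as `*` and that an ORF begins at `M` (methionine).

```python
import re


def truncate(seq: str, k: int = 3) -> str:
    """Trim seq so that its length is evenly divisible by k."""
    # Drop the 0, 1, or 2 trailing bases that cannot form a full codon.
    end = len(seq) - (len(seq) % k)
    return seq[:end]


def first_orf(protein: str) -> str:
    """Return the first M...stop region (without the stop), or ''."""
    start = protein.find('M')  # str.find() returns -1 when absent
    if start == -1:
        return ''
    # str.partition() splits into (before, separator, after);
    # the separator is '' if no stop was found.
    orf, stop, _ = protein[start:].partition('*')
    return orf if stop else ''


# Adjacent string literals are concatenated by Python, so each
# piece of the pattern can carry its own comment.
ORF_RE = re.compile(
    '('      # start a capture group
    'M'      # a start codon (methionine)
    '[^*]*'  # zero or more residues that are not a stop
    ')'      # end the capture group
    '[*]'    # a literal stop codon
)

if __name__ == '__main__':
    print(truncate('ATGAAAT'))
    print(first_orf('XXMAPR*GG'))
    print(ORF_RE.search('XXMAPR*GG').group(1))
```

Note that this simple pattern finds only non-overlapping matches; ORFs that start inside another ORF would need a different approach.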

Getting Started

The code, tests, and solutions for this challenge are located in the 14_orf directory. Start by copying the first solution to the program orf.py:

$ cd 14_orf/
$ cp solution1_iterate_set.py orf.py

If you request the usage, you’ll see the program takes a single positional argument of a FASTA-formatted file of sequences:

$ ./orf.py -h
usage: orf.py [-h] FILE

Open Reading Frames

positional arguments:
  FILE        Input FASTA file

optional arguments:
  -h, --help  show this help message and exit
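
A usage statement like the one above could be produced by an argparse definition along the following lines. This is a sketch, not the book's orf.py: `build_parser()` is a hypothetical name, and `prog` is fixed to `orf.py` only so that the usage line does not depend on how the script is invoked. Newer Python versions label the last section `options:` rather than `optional arguments:`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one positional FASTA file argument."""
    parser = argparse.ArgumentParser(
        prog='orf.py',
        description='Open Reading Frames')

    # FileType('rt') makes argparse open the file for reading text
    # and report a usage error if the file cannot be opened.
    parser.add_argument('file',
                        metavar='FILE',
                        type=argparse.FileType('rt'),
                        help='Input FASTA file')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    print(f'Reading "{args.file.name}"')
```

Because `argparse.FileType` validates the argument, the program body receives an open file handle and does not need its own check that the file exists.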

The first ...