Mastering Python for Bioinformatics

by Ken Youens-Clark

May 2021

Intermediate to advanced

454 pages

10h 42m

English

O'Reilly Media, Inc.

Who Should Read This? | Programming Style: Why I Avoid OOP and Exceptions | Structure | Test-Driven Development | Using the Command Line and Installing Python | Getting the Code and Tests | Installing Modules | Installing the new.py Program | Why Did I Write This Book? | Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
Getting Started | Creating the Program Using new.py | Using argparse | Tools for Finding Errors in the Code | Introducing Named Tuples | Adding Types to Named Tuples | Representing the Arguments with a NamedTuple | Reading Input from the Command Line or a File | Testing Your Program | Running the Program to Test the Output | Solution 1: Iterating and Counting the Characters in a String | Counting the Nucleotides | Writing and Verifying a Solution | Additional Solutions | Solution 2: Creating a count() Function and Adding a Unit Test | Solution 3: Using str.count() | Solution 4: Using a Dictionary to Count All the Characters | Solution 5: Counting Only the Desired Bases | Solution 6: Using collections.defaultdict() | Solution 7: Using collections.Counter() | Going Further | Review
Getting Started | Defining the Program’s Parameters | Defining an Optional Parameter | Defining One or More Required Positional Parameters | Using nargs to Define the Number of Arguments | Using argparse.FileType() to Validate File Arguments | Defining the Args Class | Outlining the Program Using Pseudocode | Iterating the Input Files | Creating the Output Filenames | Opening the Output Files | Writing the Output Sequences | Printing the Status Report | Using the Test Suite | Solutions | Solution 1: Using str.replace() | Solution 2: Using re.sub() | Benchmarking | Going Further | Review
Getting Started | Iterating Over a Reversed String | Creating a Decision Tree | Refactoring | Solutions | Solution 1: Using a for Loop and Decision Tree | Solution 2: Using a Dictionary Lookup | Solution 3: Using a List Comprehension | Solution 4: Using str.translate() | Solution 5: Using Bio.Seq | Review
Getting Started | An Imperative Approach | Solutions | Solution 1: An Imperative Solution Using a List as a Stack | Solution 2: Creating a Generator Function | Solution 3: Using Recursion and Memoization | Benchmarking the Solutions | Testing the Good, the Bad, and the Ugly | Running the Test Suite on All the Solutions | Going Further | Review
Getting Started | Get Parsing FASTA Using Biopython | Iterating the Sequences Using a for Loop | Solutions | Solution 1: Using a List | Solution 2: Type Annotations and Unit Tests | Solution 3: Keeping a Running Max Variable | Solution 4: Using a List Comprehension with a Guard | Solution 5: Using the filter() Function | Solution 6: Using the map() Function and Summing Booleans | Solution 7: Using Regular Expressions to Find Patterns | Solution 8: A More Complex find_gc() Function | Benchmarking | Going Further | Review
Getting Started | Iterating the Characters of Two Strings | Solutions | Solution 1: Iterating and Counting | Solution 2: Creating a Unit Test | Solution 3: Using the zip() Function | Solution 4: Using the zip_longest() Function | Solution 5: Using a List Comprehension | Solution 6: Using the filter() Function | Solution 7: Using the map() Function with zip_longest() | Solution 8: Using the starmap() and operator.ne() Functions | Going Further | Review
Getting Started | K-mers and Codons | Translating Codons | Solutions | Solution 1: Using a for Loop | Solution 2: Adding Unit Tests | Solution 3: Another Function and a List Comprehension | Solution 4: Functional Programming with the map(), partial(), and takewhile() Functions | Solution 5: Using Bio.Seq.translate() | Benchmarking | Going Further | Review
Getting Started | Finding Subsequences | Solutions | Solution 1: Using the str.find() Method | Solution 2: Using the str.index() Method | Solution 3: A Purely Functional Approach | Solution 4: Using K-mers | Solution 5: Finding Overlapping Patterns Using Regular Expressions | Benchmarking | Going Further | Review

Getting Started | Managing Runtime Messages with STDOUT, STDERR, and Logging | Finding Overlaps | Grouping Sequences by the Overlap | Solutions | Solution 1: Using Set Intersections to Find Overlaps | Solution 2: Using a Graph to Find All Paths | Going Further | Review
Getting Started | Finding the Shortest Sequence in a FASTA File | Extracting K-mers from a Sequence | Solutions | Solution 1: Counting Frequencies of K-mers | Solution 2: Speeding Things Up with a Binary Search | Going Further | Review
Getting Started | Downloading Sequences Files on the Command Line | Downloading Sequences Files with Python | Writing a Regular Expression to Find the Motif | Solutions | Solution 1: Using a Regular Expression | Solution 2: Writing a Manual Solution | Going Further | Review
Getting Started | Creating the Product of Lists | Avoiding Overflow with Modular Multiplication | Solutions | Solution 1: Using a Dictionary for the RNA Codon Table | Solution 2: Turn the Beat Around | Solution 3: Encoding the Minimal Information | Going Further | Review
Getting Started | Finding All Subsequences Using K-mers | Finding All Reverse Complements | Putting It All Together | Solutions | Solution 1: Using the zip() and enumerate() Functions | Solution 2: Using the operator.eq() Function | Solution 3: Writing a revp() Function | Testing the Program | Going Further | Review
Getting Started | Translating Proteins Inside Each Frame | Finding the ORFs in a Protein Sequence | Solutions | Solution 1: Using the str.index() Function | Solution 2: Using the str.partition() Function | Solution 3: Using a Regular Expression | Going Further | Review
Using Seqmagick to Analyze Sequence Files | Checking Files Using MD5 Hashes | Getting Started | Formatting Text Tables Using tabulate() | Solutions | Solution 1: Formatting with tabulate() | Solution 2: Formatting with rich | Going Further | Review
Finding Lines in a File Using grep | The Structure of a FASTQ Record | Getting Started | Guessing the File Format | Solution | Going Further | Review
Understanding Markov Chains | Getting Started | Understanding Random Seeds | Reading the Training Files | Generating the Sequences | Structuring the Program | Solution | Going Further | Review
Getting Started | Reviewing the Program Parameters | Defining the Parameters | Nondeterministic Sampling | Structuring the Program | Solutions | Solution 1: Reading Regular Files | Solution 2: Reading a Large Number of Compressed Files | Going Further | Review
Introduction to BLAST | Using csvkit and csvchk | Getting Started | Defining the Arguments | Parsing Delimited Text Files Using the csv Module | Parsing Delimited Text Files Using the pandas Module | Solutions | Solution 1: Manually Joining the Tables Using Dictionaries | Solution 2: Writing the Output File with csv.DictWriter() | Solution 3: Reading and Writing Files Using pandas | Solution 4: Joining Files Using pandas | Going Further | Review
Makefiles Are Recipes | Running a Specific Target | Running with No Target | Makefiles Create DAGs | Using make to Compile a C Program | Using make for a Shortcut | Defining Variables | Writing a Workflow | Other Workflow Managers | Further Reading

Content preview from Mastering Python for Bioinformatics

Chapter 15. Seqmagique: Creating and Formatting Reports

Often in bioinformatics projects, you’ll find yourself staring at a directory full of sequence files, probably in FASTA or FASTQ format. A good first step is to get an idea of the distribution of sequences in the files, such as how many are in each file and the average, minimum, and maximum lengths of the sequences. You need to know if any files are corrupted (maybe they didn’t transfer completely from your sequencing center) or if any samples have far fewer reads, perhaps indicating a bad sequencing run that needs to be redone. In this chapter, I’ll introduce some techniques for checking your sequence files using hashes and the Seqmagick tool. Then I’ll write a small utility to mimic part of Seqmagick to illustrate how to create formatted text tables. This program serves as a template for any program that needs to process all the records in a given set of files and produce a table of summary statistics.
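
As a rough illustration of the kind of summary described above, the following sketch counts the sequences in a FASTA filehandle and reports their minimum, maximum, and average lengths using only the Python standard library. The function names `read_fasta_lengths` and `summarize`, the column names, and the sample records are hypothetical and chosen for illustration; this is not the chapter's own solution. An in-memory `io.StringIO` object stands in for a real file.

```python
import io
from statistics import mean
from typing import Dict, List, TextIO


def read_fasta_lengths(fh: TextIO) -> List[int]:
    """Return the length of each sequence found in a FASTA filehandle."""
    lengths: List[int] = []
    current = None  # running length of the record being read, if any
    for line in fh:
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record and opens a new one
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(name: str, lengths: List[int]) -> Dict[str, object]:
    """Build one row of summary statistics for a file (hypothetical layout)."""
    if not lengths:
        return {"name": name, "min_len": 0, "max_len": 0,
                "avg_len": 0.0, "num_seqs": 0}
    return {
        "name": name,
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": round(mean(lengths), 2),
        "num_seqs": len(lengths),
    }


# Hypothetical in-memory "file" standing in for a real FASTA file
sample = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGGGTTTAA\n")
row = summarize("sample.fa", read_fasta_lengths(sample))

# Print a simple tab-separated table: one header line, one row per file
header = ["name", "min_len", "max_len", "avg_len", "num_seqs"]
print("\t".join(header))
print("\t".join(str(row[col]) for col in header))
```

Because the parsing function accepts any filehandle, the same code can be exercised with in-memory text in a test and with real files in practice.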

You will learn:

How to install the seqmagick tool
How to use MD5 hashes
How to use choices in argparse to constrain arguments
How to use the numpy module
How to mock a filehandle
How to use the tabulate and rich modules to format output tables
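
Two of the items above can be sketched with the standard library alone. The snippet below computes an MD5 digest of a file in chunks, which is one way to check that a transferred file matches its source, and then uses the `choices` argument in `argparse` to restrict an option to a fixed set of values. The function name `md5_of_file`, the `--tablefmt` option, and its allowed values are assumptions made for illustration and are not taken from the chapter.

```python
import argparse
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one option constrained by `choices`."""
    parser = argparse.ArgumentParser(description="Hypothetical report tool")
    parser.add_argument(
        "-t",
        "--tablefmt",  # hypothetical option name
        choices=["plain", "simple", "grid"],  # hypothetical allowed values
        default="plain",
        help="Output table format",
    )
    return parser


# Write a tiny temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">seq1\nACGT\n")
    tmp_path = tmp.name

file_digest = md5_of_file(tmp_path)
os.remove(tmp_path)
print(file_digest)

# argparse rejects any value outside `choices` and exits with an error
args = build_parser().parse_args(["--tablefmt", "grid"])
print(args.tablefmt)
```

Comparing the digest computed on each side of a transfer is enough to detect a truncated or altered file; any value for the option that is not listed in `choices` causes `argparse` to print a usage message and exit.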

Using Seqmagick to Analyze Sequence Files

seqmagick is a useful command-line utility for handling sequence files. This should have been installed along with the other Python modules if you followed the setup instructions in the Preface. If not, you can install ...