Mastering Python for Bioinformatics

Name: Mastering Python for Bioinformatics
Author: Ken Youens-Clark
ISBN: 9781098100889

May 2021

Intermediate to advanced

454 pages

10h 42m

English

O'Reilly Media, Inc.

Preface
Who Should Read This?; Programming Style: Why I Avoid OOP and Exceptions; Structure; Test-Driven Development; Using the Command Line and Installing Python; Getting the Code and Tests; Installing Modules; Installing the new.py Program; Why Did I Write This Book?; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. The Rosalind.info Challenges
1. Tetranucleotide Frequency: Counting Things
Getting Started; Creating the Program Using new.py; Using argparse; Tools for Finding Errors in the Code; Introducing Named Tuples; Adding Types to Named Tuples; Representing the Arguments with a NamedTuple; Reading Input from the Command Line or a File; Testing Your Program; Running the Program to Test the Output; Solution 1: Iterating and Counting the Characters in a String; Counting the Nucleotides; Writing and Verifying a Solution; Additional Solutions; Solution 2: Creating a count() Function and Adding a Unit Test; Solution 3: Using str.count(); Solution 4: Using a Dictionary to Count All the Characters; Solution 5: Counting Only the Desired Bases; Solution 6: Using collections.defaultdict(); Solution 7: Using collections.Counter(); Going Further; Review
2. Transcribing DNA into mRNA: Mutating Strings, Reading and Writing Files
Getting Started; Defining the Program’s Parameters; Defining an Optional Parameter; Defining One or More Required Positional Parameters; Using nargs to Define the Number of Arguments; Using argparse.FileType() to Validate File Arguments; Defining the Args Class; Outlining the Program Using Pseudocode; Iterating the Input Files; Creating the Output Filenames; Opening the Output Files; Writing the Output Sequences; Printing the Status Report; Using the Test Suite; Solutions; Solution 1: Using str.replace(); Solution 2: Using re.sub(); Benchmarking; Going Further; Review
3. Reverse Complement of DNA: String Manipulation
Getting Started; Iterating Over a Reversed String; Creating a Decision Tree; Refactoring; Solutions; Solution 1: Using a for Loop and Decision Tree; Solution 2: Using a Dictionary Lookup; Solution 3: Using a List Comprehension; Solution 4: Using str.translate(); Solution 5: Using Bio.Seq; Review
4. Creating the Fibonacci Sequence: Writing, Testing, and Benchmarking Algorithms
Getting Started; An Imperative Approach; Solutions; Solution 1: An Imperative Solution Using a List as a Stack; Solution 2: Creating a Generator Function; Solution 3: Using Recursion and Memoization; Benchmarking the Solutions; Testing the Good, the Bad, and the Ugly; Running the Test Suite on All the Solutions; Going Further; Review
5. Computing GC Content: Parsing FASTA and Analyzing Sequences
Getting Started; Get Parsing FASTA Using Biopython; Iterating the Sequences Using a for Loop; Solutions; Solution 1: Using a List; Solution 2: Type Annotations and Unit Tests; Solution 3: Keeping a Running Max Variable; Solution 4: Using a List Comprehension with a Guard; Solution 5: Using the filter() Function; Solution 6: Using the map() Function and Summing Booleans; Solution 7: Using Regular Expressions to Find Patterns; Solution 8: A More Complex find_gc() Function; Benchmarking; Going Further; Review
6. Finding the Hamming Distance: Counting Point Mutations
Getting Started; Iterating the Characters of Two Strings; Solutions; Solution 1: Iterating and Counting; Solution 2: Creating a Unit Test; Solution 3: Using the zip() Function; Solution 4: Using the zip_longest() Function; Solution 5: Using a List Comprehension; Solution 6: Using the filter() Function; Solution 7: Using the map() Function with zip_longest(); Solution 8: Using the starmap() and operator.ne() Functions; Going Further; Review
7. Translating mRNA into Protein: More Functional Programming
Getting Started; K-mers and Codons; Translating Codons; Solutions; Solution 1: Using a for Loop; Solution 2: Adding Unit Tests; Solution 3: Another Function and a List Comprehension; Solution 4: Functional Programming with the map(), partial(), and takewhile() Functions; Solution 5: Using Bio.Seq.translate(); Benchmarking; Going Further; Review
8. Find a Motif in DNA: Exploring Sequence Similarity
Getting Started; Finding Subsequences; Solutions; Solution 1: Using the str.find() Method; Solution 2: Using the str.index() Method; Solution 3: A Purely Functional Approach; Solution 4: Using K-mers; Solution 5: Finding Overlapping Patterns Using Regular Expressions; Benchmarking; Going Further; Review

9. Overlap Graphs: Sequence Assembly Using Shared K-mers
Getting Started; Managing Runtime Messages with STDOUT, STDERR, and Logging; Finding Overlaps; Grouping Sequences by the Overlap; Solutions; Solution 1: Using Set Intersections to Find Overlaps; Solution 2: Using a Graph to Find All Paths; Going Further; Review
10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search
Getting Started; Finding the Shortest Sequence in a FASTA File; Extracting K-mers from a Sequence; Solutions; Solution 1: Counting Frequencies of K-mers; Solution 2: Speeding Things Up with a Binary Search; Going Further; Review
11. Finding a Protein Motif: Fetching Data and Using Regular Expressions
Getting Started; Downloading Sequences Files on the Command Line; Downloading Sequences Files with Python; Writing a Regular Expression to Find the Motif; Solutions; Solution 1: Using a Regular Expression; Solution 2: Writing a Manual Solution; Going Further; Review
12. Inferring mRNA from Protein: Products and Reductions of Lists
Getting Started; Creating the Product of Lists; Avoiding Overflow with Modular Multiplication; Solutions; Solution 1: Using a Dictionary for the RNA Codon Table; Solution 2: Turn the Beat Around; Solution 3: Encoding the Minimal Information; Going Further; Review
13. Location Restriction Sites: Using, Testing, and Sharing Code
Getting Started; Finding All Subsequences Using K-mers; Finding All Reverse Complements; Putting It All Together; Solutions; Solution 1: Using the zip() and enumerate() Functions; Solution 2: Using the operator.eq() Function; Solution 3: Writing a revp() Function; Testing the Program; Going Further; Review
14. Finding Open Reading Frames
Getting Started; Translating Proteins Inside Each Frame; Finding the ORFs in a Protein Sequence; Solutions; Solution 1: Using the str.index() Function; Solution 2: Using the str.partition() Function; Solution 3: Using a Regular Expression; Going Further; Review
II. Other Programs
15. Seqmagique: Creating and Formatting Reports
Using Seqmagick to Analyze Sequence Files; Checking Files Using MD5 Hashes; Getting Started; Formatting Text Tables Using tabulate(); Solutions; Solution 1: Formatting with tabulate(); Solution 2: Formatting with rich; Going Further; Review
16. FASTX grep: Creating a Utility Program to Select Sequences
Finding Lines in a File Using grep; The Structure of a FASTQ Record; Getting Started; Guessing the File Format; Solution; Going Further; Review
17. DNA Synthesizer: Creating Synthetic Data with Markov Chains
Understanding Markov Chains; Getting Started; Understanding Random Seeds; Reading the Training Files; Generating the Sequences; Structuring the Program; Solution; Going Further; Review
18. FASTX Sampler: Randomly Subsampling Sequence Files
Getting Started; Reviewing the Program Parameters; Defining the Parameters; Nondeterministic Sampling; Structuring the Program; Solutions; Solution 1: Reading Regular Files; Solution 2: Reading a Large Number of Compressed Files; Going Further; Review
19. Blastomatic: Parsing Delimited Text Files
Introduction to BLAST; Using csvkit and csvchk; Getting Started; Defining the Arguments; Parsing Delimited Text Files Using the csv Module; Parsing Delimited Text Files Using the pandas Module; Solutions; Solution 1: Manually Joining the Tables Using Dictionaries; Solution 2: Writing the Output File with csv.DictWriter(); Solution 3: Reading and Writing Files Using pandas; Solution 4: Joining Files Using pandas; Going Further; Review
A. Documenting Commands and Creating Workflows with make
Makefiles Are Recipes; Running a Specific Target; Running with No Target; Makefiles Create DAGs; Using make to Compile a C Program; Using make for a Shortcut; Defining Variables; Writing a Workflow; Other Workflow Managers; Further Reading
B. Understanding $PATH and Installing Command-Line Programs
Epilogue
Index
About the Author

Content preview from Mastering Python for Bioinformatics

Chapter 1. Tetranucleotide Frequency: Counting Things

Counting the bases in DNA is perhaps the “Hello, World!” of bioinformatics. The Rosalind DNA challenge describes a program that will take a sequence of DNA and print a count of how many As, Cs, Gs, and Ts are found. There are surprisingly many ways to count things in Python, and I’ll explore what the language has to offer. I’ll also demonstrate how to write a well-structured, documented program that validates its arguments as well as how to write and run tests to ensure the program works correctly.
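
To make the task concrete, here is a minimal sketch of the counting step, assuming the input is an uppercase DNA string. The function name count_bases and the sample sequence are hypothetical and chosen for illustration; this is not one of the book's solution files.

```python
from typing import Tuple


def count_bases(dna: str) -> Tuple[int, int, int, int]:
    """Return how many As, Cs, Gs, and Ts occur in dna (hypothetical helper)."""
    # str.count() scans the string once per base; any other character is ignored.
    return (dna.count("A"), dna.count("C"), dna.count("G"), dna.count("T"))


# Illustrative input: one A, two Cs, three Gs, four Ts.
counts = count_bases("ACCGGGTTTT")
print(" ".join(map(str, counts)))  # prints: 1 2 3 4
```

Four passes over the string is wasteful for long sequences, but it keeps the sketch short; a single loop or a counting collection are alternatives.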

In this chapter, you’ll learn:

How to start a new program using new.py
How to define and validate command-line arguments using argparse
How to run a test suite using pytest
How to iterate the characters of a string
Ways to count elements in a collection
How to create a decision tree using if/elif statements
How to format strings
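
Several of these topics can be seen together in one small sketch: argparse defines a positional argument, a for loop with an if/elif decision tree counts the bases, an f-string formats the output, and a test_ function is something pytest could discover. All names here (Counts, get_args, count, test_count, main) are assumptions made for illustration and may differ from the book's code.

```python
import argparse
from typing import List, NamedTuple, Optional


class Counts(NamedTuple):
    """Hypothetical container for the four base counts."""
    a: int
    c: int
    g: int
    t: int


def get_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Tetranucleotide frequency")
    parser.add_argument("dna", metavar="DNA", help="Input DNA sequence")
    return parser.parse_args(argv)


def count(dna: str) -> Counts:
    """Iterate the characters and count each base with if/elif."""
    a = c = g = t = 0
    for base in dna:
        if base == "A":
            a += 1
        elif base == "C":
            c += 1
        elif base == "G":
            g += 1
        elif base == "T":
            t += 1
    return Counts(a, c, g, t)


def test_count() -> None:
    """A unit test that pytest would collect by its test_ prefix."""
    assert count("") == Counts(0, 0, 0, 0)
    assert count("ACCGGGTTTT") == Counts(1, 2, 3, 4)


def main(argv: Optional[List[str]] = None) -> None:
    args = get_args(argv)
    counts = count(args.dna)
    print(f"{counts.a} {counts.c} {counts.g} {counts.t}")


# Pass the arguments explicitly so the sketch runs without a real command line.
main(["ACCGGGTTTT"])  # prints: 1 2 3 4
```

Because argparse adds -h and --help automatically, calling get_args(["-h"]) would print the generated usage and exit.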

Getting Started

Before you start, be sure you have read “Getting the Code and Tests” in the Preface. Once you have a local copy of the code repository, change into the 01_dna directory:

$ cd 01_dna

Here you’ll find several solution*.py programs along with tests and input data you can use to see if the programs work correctly. To get an idea of how your program should work, start by copying the first solution to a program called dna.py:

$ cp solution1_iter.py dna.py

Now run the program with no arguments, or with the -h or --help flags. It will print usage documentation (note that usage is ...

Publisher Resources

ISBN: 9781098100872

Errata Page
