Bioinformatics Data Skills

Name: Bioinformatics Data Skills
Author: Vince Buffalo
ISBN: 9781449367503

by Vince Buffalo

July 2015

Intermediate to advanced

538 pages

15h 29m

English

O'Reilly Media, Inc.

Preface
The Approach of This Book; Why This Book Focuses on Sequencing Data; Audience; The Difficulty Level of Bioinformatics Data Skills; Assumptions This Book Makes; Supplementary Material on GitHub; Computing Resources and Setup; Organization of This Book; Code Conventions; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
I. Ideology: Data Skills for Robust and Reproducible Bioinformatics
1. How to Learn Bioinformatics
Why Bioinformatics? Biology’s Growing Data; Learning Data Skills to Learn Bioinformatics; New Challenges for Reproducible and Robust Research; Reproducible Research; Robust Research and the Golden Rule of Bioinformatics; Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too; Recommendations for Robust Research; Pay Attention to Experimental Design; Write Code for Humans, Write Data for Computers; Let Your Computer Do the Work For You; Make Assertions and Be Loud, in Code and in Your Methods; Test Code, or Better Yet, Let Code Test Code; Use Existing Libraries Whenever Possible; Treat Data as Read-Only; Spend Time Developing Frequently Used Scripts into Tools; Let Data Prove That It’s High Quality; Recommendations for Reproducible Research; Release Your Code and Data; Document Everything; Make Figures and Statistics the Results of Scripts; Use Code as Documentation; Continually Improving Your Bioinformatics Data Skills
II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project
2. Setting Up and Managing a Bioinformatics Project
Project Directories and Directory Structures; Project Documentation; Use Directories to Divide Up Your Project into Subprojects; Organizing Data to Automate File Processing Tasks; Markdown for Project Notebooks; Markdown Formatting Basics; Using Pandoc to Render Markdown to HTML
3. Remedial Unix Shell
Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy; Working with Streams and Redirection; Redirecting Standard Out to a File; Redirecting Standard Error; Using Standard Input Redirection; The Almighty Unix Pipe: Speed and Beauty in One; Pipes in Action: Creating Simple Programs with Grep and Pipes; Combining Pipes and Redirection; Even More Redirection: A tee in Your Pipe; Managing and Interacting with Processes; Background Processes; Killing Processes; Exit Status: How to Programmatically Tell Whether Your Command Worked; Command Substitution
4. Working with Remote Machines
Connecting to Remote Machines with SSH; Quick Authentication with SSH Keys; Maintaining Long-Running Jobs with nohup and tmux; nohup; Working with Remote Machines Through Tmux; Installing and Configuring Tmux; Creating, Detaching, and Attaching Tmux Sessions; Working with Tmux Windows
5. Git for Scientists
Why Git Is Necessary in Bioinformatics Projects; Git Allows You to Keep Snapshots of Your Project; Git Helps You Keep Track of Important Changes to Code; Git Helps Keep Software Organized and Available After People Leave; Installing Git; Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes; Git Setup: Telling Git Who You Are; git init and git clone: Creating Repositories; Tracking Files in Git: git add and git status Part I; Staging Files in Git: git add and git status Part II; git commit: Taking a Snapshot of Your Project; Seeing File Differences: git diff; Seeing Your Commit History: git log; Moving and Removing Files: git mv and git rm; Telling Git What to Ignore: .gitignore; Undoing a Stage: git reset; Collaborating with Git: Git Remotes, git push, and git pull; Creating a Shared Central Repository with GitHub; Authenticating with Git Remotes; Connecting with Git Remotes: git remote; Pushing Commits to a Remote Repository with git push; Pulling Commits from a Remote Repository with git pull; Working with Your Collaborators: Pushing and Pulling; Merge Conflicts; More GitHub Workflows: Forking and Pull Requests; Using Git to Make Life Easier: Working with Past Commits; Getting Files from the Past: git checkout; Stashing Your Changes: git stash; More git diff: Comparing Commits and Files; Undoing and Editing Commits: git commit --amend; Working with Branches; Creating and Working with Branches: git branch and git checkout; Merging Branches: git merge; Branches and Remotes; Continuing Your Git Education
6. Bioinformatics Data
Retrieving Bioinformatics Data; Downloading Data with wget and curl; Rsync and Secure Copy (scp); Data Integrity; SHA and MD5 Checksums; Looking at Differences Between Data; Compressing Data and Working with Compressed Data; gzip; Working with Gzipped Compressed Files; Case Study: Reproducibly Downloading Data
III. Practice: Bioinformatics Data Skills

7. Unix Data Tools
Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls; When to Use the Unix Pipeline Approach and How to Use It Safely; Inspecting and Manipulating Text Data with Unix Tools; Inspecting Data with Head and Tail; less; Plain-Text Data Summary Information with wc, ls, and awk; Working with Column Data with cut and Columns; Formatting Tabular Data with column; The All-Powerful Grep; Decoding Plain-Text Data: hexdump; Sorting Plain-Text Data with Sort; Finding Unique Values in Uniq; Join; Text Processing with Awk; Bioawk: An Awk for Biological Formats; Stream Editing with Sed; Advanced Shell Tricks; Subshells; Named Pipes and Process Substitution; The Unix Philosophy Revisited
8. A Rapid Introduction to the R Language
Getting Started with R and RStudio; R Language Basics; Simple Calculations in R, Calling Functions, and Getting Help in R; Variables and Assignment; Vectors, Vectorization, and Indexing; Working with and Visualizing Data in R; Loading Data into R; Exploring and Transforming Dataframes; Exploring Data Through Slicing and Dicing: Subsetting Dataframes; Exploring Data Visually with ggplot2 I: Scatterplots and Densities; Exploring Data Visually with ggplot2 II: Smoothing; Binning Data with cut() and Bar Plots with ggplot2; Merging and Combining Data: Matching Vectors and Merging Dataframes; Using ggplot2 Facets; More R Data Structures: Lists; Writing and Applying Functions to Lists with lapply() and sapply(); Working with the Split-Apply-Combine Pattern; Exploring Dataframes with dplyr; Working with Strings; Developing Workflows with R Scripts; Control Flow: if, for, and while; Working with R Scripts; Workflows for Loading and Combining Multiple Files; Exporting Data; Further R Directions and Resources
9. Working with Range Data
A Crash Course in Genomic Ranges and Coordinate Systems; An Interactive Introduction to Range Data with GenomicRanges; Installing and Working with Bioconductor Packages; Storing Generic Ranges with IRanges; Basic Range Operations: Arithmetic, Transformations, and Set Operations; Finding Overlapping Ranges; Finding Nearest Ranges and Calculating Distance; Run Length Encoding and Views; Storing Genomic Ranges with GenomicRanges; Grouping Data with GRangesList; Working with Annotation Data: GenomicFeatures and rtracklayer; Retrieving Promoter Regions: Flank and Promoters; Retrieving Promoter Sequence: Connection GenomicRanges with Sequence Data; Getting Intergenic and Intronic Regions: Gaps, Reduce, and Setdiffs in Practice; Finding and Working with Overlapping Ranges; Calculating Coverage of GRanges Objects; Working with Ranges Data on the Command Line with BEDTools; Computing Overlaps with BEDTools Intersect; BEDTools Slop and Flank; Coverage with BEDTools; Other BEDTools Subcommands and pybedtools
10. Working with Sequence Data
The FASTA Format; The FASTQ Format; Nucleotide Codes; Base Qualities; Example: Inspecting and Trimming Low-Quality Bases; A FASTA/FASTQ Parsing Example: Counting Nucleotides; Indexed FASTA Files
11. Working with Alignment Data
Getting to Know Alignment Formats: SAM and BAM; The SAM Header; The SAM Alignment Section; Bitwise Flags; CIGAR Strings; Mapping Qualities; Command-Line Tools for Working with Alignments in the SAM Format; Using samtools view to Convert between SAM and BAM; Samtools Sort and Index; Extracting and Filtering Alignments with samtools view; Visualizing Alignments with samtools tview and the Integrated Genomics Viewer; Pileups with samtools pileup, Variant Calling, and Base Alignment Quality; Creating Your Own SAM/BAM Processing Tools with Pysam; Opening BAM Files, Fetching Alignments from a Region, and Iterating Across Reads; Extracting SAM/BAM Header Information from an AlignmentFile Object; Working with AlignedSegment Objects; Writing a Program to Record Alignment Statistics; Additional Pysam Features and Other SAM/BAM APIs
12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
Basic Bash Scripting; Writing and Running Robust Bash Scripts; Variables and Command Arguments; Conditionals in a Bash Script: if Statements; Processing Files with Bash Using for Loops and Globbing; Automating File-Processing with find and xargs; Using find and xargs; Finding Files with find; find’s Expressions; find’s -exec: Running Commands on find’s Results; xargs: A Unix Powertool; Using xargs with Replacement Strings to Apply Commands to Files; xargs and Parallelization; Make and Makefiles: Another Option for Pipelines
13. Out-of-Memory Approaches: Tabix and SQLite
Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix; Compressing Files for Tabix with Bgzip; Indexing Files with Tabix; Using Tabix; Introducing Relational Databases Through SQLite; When to Use Relational Databases in Bioinformatics; Installing SQLite; Exploring SQLite Databases with the Command-Line Interface; Querying Out Data: The Almighty SELECT Command; SQLite Functions; SQLite Aggregate Functions; Subqueries; Organizing Relational Databases and Joins; Writing to Databases; Dropping Tables and Deleting Databases; Interacting with SQLite from Python; Dumping Databases
14. Conclusion
Where to Go From Here?
Glossary
Bibliography
Index

Content preview from Bioinformatics Data Skills

Chapter 11. Working with Alignment Data

In Chapter 9, we learned about range formats such as BED and GTF, which are often used to store genomic range data associated with genomic feature annotation data such as gene models. Other kinds of range-based formats are designed for storing large amounts of alignment data—for example, the results of aligning millions (or billions) of high-throughput sequencing reads to a genome. In this chapter, we’ll look at the most common high-throughput data alignment format: the Sequence Alignment/Mapping (SAM) format for mapping data (and its binary analog, BAM). The SAM and BAM formats are the standard formats for storing sequencing reads mapped to a reference.
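To make the format concrete, here is a minimal sketch (not taken from the book) of what one SAM alignment record holds. It assumes only the 11 mandatory tab-separated fields defined by the SAM specification. The names `SamRecord`, `parse_sam_line`, `is_unmapped`, and `is_reverse` are hypothetical helpers, and the example read is made up for illustration. Real work would normally go through samtools or Pysam, both covered later in the chapter.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory fields of one SAM alignment line (hypothetical helper type)."""
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one alignment line; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


def is_unmapped(rec: SamRecord) -> bool:
    """Flag bit 0x4 marks a read as unmapped."""
    return bool(rec.flag & 0x4)


def is_reverse(rec: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# A made-up alignment line, for illustration only.
example = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.cigar, is_reverse(rec), is_unmapped(rec))
```

Each alignment line is a range (reference name plus position) with per-read details attached, which is why the format sits naturally alongside the range formats from Chapter 9.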

We study SAM and BAM for two reasons. First, a huge part of bioinformatics work is manipulating alignment files. Nearly every high-throughput sequencing experiment involves an alignment step that produces alignment data in the SAM/BAM formats. Because each sequencing read has an alignment entry, alignment data files are massive and require space-efficient complex binary file formats. Furthermore, modern aligners output an incredible amount of useful information about each alignment. It’s vital to have the skills necessary to extract this information and explore data kept in these complex formats.

Second, the skills developed through learning to work with SAM/BAM files are extensible and more widely applicable than to these specific formats. It would be unwise to bet that these formats won’t ...



Publisher Resources

ISBN: 9781449367480
