Bioinformatics Data Skills

by Vince Buffalo

July 2015

Intermediate to advanced

538 pages

15h 29m

English

O'Reilly Media, Inc.

The Approach of This Book; Why This Book Focuses on Sequencing Data; Audience; The Difficulty Level of Bioinformatics Data Skills; Assumptions This Book Makes; Supplementary Material on GitHub; Computing Resources and Setup; Organization of This Book; Code Conventions; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Why Bioinformatics? Biology’s Growing Data; Learning Data Skills to Learn Bioinformatics; New Challenges for Reproducible and Robust Research; Reproducible Research; Robust Research and the Golden Rule of Bioinformatics; Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too; Recommendations for Robust Research; Pay Attention to Experimental Design; Write Code for Humans, Write Data for Computers; Let Your Computer Do the Work For You; Make Assertions and Be Loud, in Code and in Your Methods; Test Code, or Better Yet, Let Code Test Code; Use Existing Libraries Whenever Possible; Treat Data as Read-Only; Spend Time Developing Frequently Used Scripts into Tools; Let Data Prove That It’s High Quality; Recommendations for Reproducible Research; Release Your Code and Data; Document Everything; Make Figures and Statistics the Results of Scripts; Use Code as Documentation; Continually Improving Your Bioinformatics Data Skills
Project Directories and Directory Structures; Project Documentation; Use Directories to Divide Up Your Project into Subprojects; Organizing Data to Automate File Processing Tasks; Markdown for Project Notebooks; Markdown Formatting Basics; Using Pandoc to Render Markdown to HTML
Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy; Working with Streams and Redirection; Redirecting Standard Out to a File; Redirecting Standard Error; Using Standard Input Redirection; The Almighty Unix Pipe: Speed and Beauty in One; Pipes in Action: Creating Simple Programs with Grep and Pipes; Combining Pipes and Redirection; Even More Redirection: A tee in Your Pipe; Managing and Interacting with Processes; Background Processes; Killing Processes; Exit Status: How to Programmatically Tell Whether Your Command Worked; Command Substitution
Connecting to Remote Machines with SSH; Quick Authentication with SSH Keys; Maintaining Long-Running Jobs with nohup and tmux; nohup; Working with Remote Machines Through Tmux; Installing and Configuring Tmux; Creating, Detaching, and Attaching Tmux Sessions; Working with Tmux Windows
Why Git Is Necessary in Bioinformatics Projects; Git Allows You to Keep Snapshots of Your Project; Git Helps You Keep Track of Important Changes to Code; Git Helps Keep Software Organized and Available After People Leave; Installing Git; Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes; Git Setup: Telling Git Who You Are; git init and git clone: Creating Repositories; Tracking Files in Git: git add and git status Part I; Staging Files in Git: git add and git status Part II; git commit: Taking a Snapshot of Your Project; Seeing File Differences: git diff; Seeing Your Commit History: git log; Moving and Removing Files: git mv and git rm; Telling Git What to Ignore: .gitignore; Undoing a Stage: git reset; Collaborating with Git: Git Remotes, git push, and git pull; Creating a Shared Central Repository with GitHub; Authenticating with Git Remotes; Connecting with Git Remotes: git remote; Pushing Commits to a Remote Repository with git push; Pulling Commits from a Remote Repository with git pull; Working with Your Collaborators: Pushing and Pulling; Merge Conflicts; More GitHub Workflows: Forking and Pull Requests; Using Git to Make Life Easier: Working with Past Commits; Getting Files from the Past: git checkout; Stashing Your Changes: git stash; More git diff: Comparing Commits and Files; Undoing and Editing Commits: git commit --amend; Working with Branches; Creating and Working with Branches: git branch and git checkout; Merging Branches: git merge; Branches and Remotes; Continuing Your Git Education
Retrieving Bioinformatics Data; Downloading Data with wget and curl; Rsync and Secure Copy (scp); Data Integrity; SHA and MD5 Checksums; Looking at Differences Between Data; Compressing Data and Working with Compressed Data; gzip; Working with Gzipped Compressed Files; Case Study: Reproducibly Downloading Data

Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls; When to Use the Unix Pipeline Approach and How to Use It Safely; Inspecting and Manipulating Text Data with Unix Tools; Inspecting Data with Head and Tail; less; Plain-Text Data Summary Information with wc, ls, and awk; Working with Column Data with cut and Columns; Formatting Tabular Data with column; The All-Powerful Grep; Decoding Plain-Text Data: hexdump; Sorting Plain-Text Data with Sort; Finding Unique Values in Uniq; Join; Text Processing with Awk; Bioawk: An Awk for Biological Formats; Stream Editing with Sed; Advanced Shell Tricks; Subshells; Named Pipes and Process Substitution; The Unix Philosophy Revisited
Getting Started with R and RStudio; R Language Basics; Simple Calculations in R, Calling Functions, and Getting Help in R; Variables and Assignment; Vectors, Vectorization, and Indexing; Working with and Visualizing Data in R; Loading Data into R; Exploring and Transforming Dataframes; Exploring Data Through Slicing and Dicing: Subsetting Dataframes; Exploring Data Visually with ggplot2 I: Scatterplots and Densities; Exploring Data Visually with ggplot2 II: Smoothing; Binning Data with cut() and Bar Plots with ggplot2; Merging and Combining Data: Matching Vectors and Merging Dataframes; Using ggplot2 Facets; More R Data Structures: Lists; Writing and Applying Functions to Lists with lapply() and sapply(); Working with the Split-Apply-Combine Pattern; Exploring Dataframes with dplyr; Working with Strings; Developing Workflows with R Scripts; Control Flow: if, for, and while; Working with R Scripts; Workflows for Loading and Combining Multiple Files; Exporting Data; Further R Directions and Resources
A Crash Course in Genomic Ranges and Coordinate Systems; An Interactive Introduction to Range Data with GenomicRanges; Installing and Working with Bioconductor Packages; Storing Generic Ranges with IRanges; Basic Range Operations: Arithmetic, Transformations, and Set Operations; Finding Overlapping Ranges; Finding Nearest Ranges and Calculating Distance; Run Length Encoding and Views; Storing Genomic Ranges with GenomicRanges; Grouping Data with GRangesList; Working with Annotation Data: GenomicFeatures and rtracklayer; Retrieving Promoter Regions: Flank and Promoters; Retrieving Promoter Sequence: Connection GenomicRanges with Sequence Data; Getting Intergenic and Intronic Regions: Gaps, Reduce, and Setdiffs in Practice; Finding and Working with Overlapping Ranges; Calculating Coverage of GRanges Objects; Working with Ranges Data on the Command Line with BEDTools; Computing Overlaps with BEDTools Intersect; BEDTools Slop and Flank; Coverage with BEDTools; Other BEDTools Subcommands and pybedtools
The FASTA Format; The FASTQ Format; Nucleotide Codes; Base Qualities; Example: Inspecting and Trimming Low-Quality Bases; A FASTA/FASTQ Parsing Example: Counting Nucleotides; Indexed FASTA Files
Getting to Know Alignment Formats: SAM and BAM; The SAM Header; The SAM Alignment Section; Bitwise Flags; CIGAR Strings; Mapping Qualities; Command-Line Tools for Working with Alignments in the SAM Format; Using samtools view to Convert between SAM and BAM; Samtools Sort and Index; Extracting and Filtering Alignments with samtools view; Visualizing Alignments with samtools tview and the Integrated Genomics Viewer; Pileups with samtools pileup, Variant Calling, and Base Alignment Quality; Creating Your Own SAM/BAM Processing Tools with Pysam; Opening BAM Files, Fetching Alignments from a Region, and Iterating Across Reads; Extracting SAM/BAM Header Information from an AlignmentFile Object; Working with AlignedSegment Objects; Writing a Program to Record Alignment Statistics; Additional Pysam Features and Other SAM/BAM APIs
Basic Bash Scripting; Writing and Running Robust Bash Scripts; Variables and Command Arguments; Conditionals in a Bash Script: if Statements; Processing Files with Bash Using for Loops and Globbing; Automating File-Processing with find and xargs; Using find and xargs; Finding Files with find; find’s Expressions; find’s -exec: Running Commands on find’s Results; xargs: A Unix Powertool; Using xargs with Replacement Strings to Apply Commands to Files; xargs and Parallelization; Make and Makefiles: Another Option for Pipelines
Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix; Compressing Files for Tabix with Bgzip; Indexing Files with Tabix; Using Tabix; Introducing Relational Databases Through SQLite; When to Use Relational Databases in Bioinformatics; Installing SQLite; Exploring SQLite Databases with the Command-Line Interface; Querying Out Data: The Almighty SELECT Command; SQLite Functions; SQLite Aggregate Functions; Subqueries; Organizing Relational Databases and Joins; Writing to Databases; Dropping Tables and Deleting Databases; Interacting with SQLite from Python; Dumping Databases
Where to Go From Here?

Content preview from Bioinformatics Data Skills

Chapter 7. Unix Data Tools

We often forget how science and engineering function. Ideas come from previous exploration more often than from lightning strokes.

John W. Tukey

In Chapter 3, we learned the basics of the Unix shell: using streams, redirecting output, pipes, and working with processes. These core concepts allow us not only to use the shell to run command-line bioinformatics tools, but also to leverage Unix as a modular work environment for working with bioinformatics data. In this chapter, we’ll see how we can combine the Unix shell with command-line data tools to explore and manipulate data quickly.

Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls

Understanding how to use Unix data tools in bioinformatics isn’t only about learning what each tool does; it’s about mastering the practice of connecting tools together to create programs from Unix pipelines. By connecting data tools together with pipes, we can construct programs that parse, manipulate, and summarize data. Unix pipelines can be developed in shell scripts or as “one-liners”: tiny programs built by connecting Unix tools with pipes directly on the shell. Whether in a script or as a one-liner, building more complex programs from small, modular tools capitalizes on the design and philosophy of Unix (discussed in “Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy”). The pipeline approach to building programs is a well-established tradition in Unix (and bioinformatics) ...
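
To make the idea of a one-liner concrete, here is a minimal sketch that is not taken from the book. It assumes a small, hypothetical tab-delimited file, example.bed, with chromosome, start, and end columns, and it chains four standard tools with pipes to count how many ranges fall on each chromosome.

```shell
# Hypothetical three-column data (chromosome, start, end), created inline
# so that the example is self-contained. The file name is an assumption.
printf 'chr1\t100\t200\nchr2\t150\t300\nchr1\t400\t500\nchr3\t10\t90\nchr1\t600\t700\n' > example.bed

# The one-liner: take column 1, sort so identical values are adjacent,
# count each distinct value, then order by count with the largest first.
cut -f1 example.bed | sort | uniq -c | sort -k1,1nr
```

With this sample data, chr1 is listed first with a count of 3. Each stage does one small job and could be swapped or extended independently, for example by adding a grep stage before cut to restrict the count to a subset of lines.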

Publisher Resources

ISBN: 9781449367480
