CHAPTER 8
Processing the Sequencing Data

In this chapter, we run a pipeline that converts sequencing data into gene information. The sequencing data consists of millions of short reads generated by Illumina or BGI short-read DNA sequencing. The gene information includes the DNA variants and mutations found in the genes of the person whose DNA was sequenced. A variant that changes only 1 nucleotide is called a single nucleotide variant (SNV) or, when it is common in a population, a single nucleotide polymorphism (SNP). Short insertions and deletions, up to about 50 nucleotides long, are called indels. Anything bigger than that is called a structural variant (SV); copy number variants (CNVs) are one type of SV.
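
The size classes described above can be sketched as a small function. This is a minimal illustration, not the chapter's own code: it assumes variants are given as VCF-style REF and ALT allele strings, it takes the 50-nucleotide cutoff from the text, and the name `classify_variant` and the "other" label for same-length multi-base substitutions are hypothetical choices made for this example.

```python
def classify_variant(ref: str, alt: str, max_indel: int = 50) -> str:
    """Label a variant by size, given VCF-style REF and ALT alleles.

    Assumption: max_indel=50 mirrors the cutoff mentioned in the text.
    """
    if not ref or not alt:
        raise ValueError("REF and ALT must be non-empty")
    # One base replaced by another single base.
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    # Number of bases gained or lost relative to the reference.
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same-length multi-base substitution; not covered by the text.
        return "other"
    return "indel" if size <= max_indel else "SV"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("ATG", "A"), ("A", "A" + "T" * 60)]:
        print(ref[:5], alt[:5], classify_variant(ref, alt))
```

Real variant callers use more than allele length to decide what counts as a structural variant, so the cutoff here should be read as a rule of thumb.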

Getting from Data to Information

To get from sequencing data to gene information, we can run the following steps, which make up our data processing pipeline:

  1. Align the sequencing reads to the reference genome to produce a BAM file.
  2. Make adjustments and refinements to the aligned reads in the BAM file.
  3. Identify the small differences (SNVs and indels) between this data and the reference genome and record them in a VCF file.
  4. Make adjustments and refinements to the variants in the VCF file.
  5. Annotate the SNVs and indels so that we will know whether they are inside a gene and whether the consequence of the variant might be deleterious.
  6. Prioritize the variants to be able to identify the most consequential ones.
  7. When analyzing a family or trio (mother, father, and child), carry out inheritance analysis to see which variants are in family members ...
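
One way to picture the steps above is as a chain of stages in which each stage reads the file type written by the previous one. The sketch below only models that hand-off and does not run any genomics tool. The names `Stage`, `PIPELINE`, and `check_chain` are hypothetical, FASTQ as the input format for the reads and VCF as the output of the annotation and prioritization steps are assumptions, and step 7 is left out because its description is cut off in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: str  # file type the stage reads
    produces: str  # file type the stage writes


# Steps 1 to 6 from the list above. FASTQ input and VCF output for the
# annotation and prioritization steps are assumptions of this sketch.
PIPELINE = [
    Stage("align reads to the reference genome", "FASTQ", "BAM"),
    Stage("refine aligned reads", "BAM", "BAM"),
    Stage("call SNVs and indels", "BAM", "VCF"),
    Stage("refine variants", "VCF", "VCF"),
    Stage("annotate variants", "VCF", "VCF"),
    Stage("prioritize variants", "VCF", "VCF"),
]


def check_chain(stages: List[Stage]) -> List[str]:
    """Verify each stage consumes what the previous stage produced."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(
                f"'{nxt.name}' expects {nxt.consumes}, "
                f"but '{prev.name}' produces {prev.produces}"
            )
    return [stage.name for stage in stages]


if __name__ == "__main__":
    for number, name in enumerate(check_chain(PIPELINE), start=1):
        print(f"{number}. {name}")
```

Keeping the stage order in data rather than in control flow makes it easy to see where the BAM file gives way to the VCF file, which is the main hand-off in the list.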

