Microsoft Azure was publicly announced in October 2008 and started to take off in the mid-2010s. Over the last decade, Azure has seen a ton of growth and is now the second-largest public cloud, behind only Amazon Web Services. Genomics is a great fit for all the cloud has to offer: from data lakes to machine learning, you can store, process, and analyze huge quantities of genomics data using Azure services with ease.
In this chapter, I’ll introduce you to the basics of Azure along with some core cloud concepts. We’ll cover how your Azure environment is organized and what the different categories of services are, and I’ll walk through how to set up an Azure account and navigate the Azure Portal. While some of this background applies to any cloud, we’ll focus on the Azure cloud specifically for the entire book; this background will help you understand how parts of the bioinformatics workflow fit within a cloud-based environment. At the end of the chapter, we’ll also cover some basics of the bioinformatics workflow to get everyone up to speed on the file formats and analyses common in this field.
Cloud Horsepower
In the late 19th century, as people transitioned their method of transportation from horses to cars, an obvious benefit was an increase in speed from point A to point B. While slow speed wasn’t always considered a problem with horse-based transport, caring for these pre-automobile animals was. Feeding the horses, tending to them, having enough land to keep them, and dealing with the crap (literally) were all reduced or eliminated with the onset of the car.
This is a perfect analogy for cloud-based genomics. The major cloud vendors often talk about the cloud as a means to scale your workloads—that is, speed them up to do what you do today but bigger, better, and faster. This is true with cloud-based genomics, but it is just the tip of the benefits iceberg. In addition to speed, the cloud reduces the need to manage the nitty-gritty components of the infrastructure, security, and segmentation of resources while giving you the ability to automate, collaborate, and centralize your workloads.
So, to belabor this analogy further: you could keep improving your genomics processes by buying equipment that adds horsepower locally, or you could rent a fleet of machines to meet the ever-changing and ever-growing needs of your organization.
Before I move on to describe each of the different types of cloud services in more detail, let’s first discuss some considerations for the cloud, followed by some of Azure’s main benefits.
Considerations for the Cloud
As a consultant, I’ve heard a ton of questions about the cloud, how it works, and why you should or shouldn’t use it. The real answer is: it depends. The “cloud” is a blanket term that describes renting compute time and resources from another organization (i.e., “renting someone else’s computer”). Though I hate to boil the cloud down to such a curt statement, it is a good concept to keep in mind as I try to demystify the cloud for you in this book. By paying for cloud resources for data storage and compute rather than buying your own, you reduce the effort of managing them, thanks to the ability to template and automate. Plus, larger enterprise clouds, like Microsoft Azure, give you global data access and scalability.
Next, I’ll explain why the following three statements are common misconceptions.
“I have to move everything to the cloud at once.”
I’ve heard this fear from IT teams that focus on on-premises (on-prem) architecture. They feel like they need to migrate everything to the cloud at once for the cloud to be useful and adopted at the organization. This simply isn’t true. Your organization can have a “hybrid cloud” environment where some things are housed on-prem and some things are housed on the cloud. This is especially true when talking about organizations that have many legacy systems that would be very cumbersome to migrate. Moving data and systems to the cloud can be time-consuming, challenging, and expensive, so there is a need to plan accordingly before biting off more than you can chew.
I often recommend that organizations start small and with a pain point. For example, if you have an analysis that takes too long to run, work on migrating it and only the required data for it to complete. Then you can start to evaluate the effectiveness of the cloud for your organization, get buy-in from teammates, and demonstrate the capabilities on a small scale. An added benefit of this is that a smaller starter project is easier to estimate in terms of costs and scope. If the smaller project is successful, it will help to create momentum and executive buy-in. If you start small, you spend small.
“The cloud is always cheaper/more expensive.”
When large organizations buy on-prem hardware (like a computing cluster), they can easily spend hundreds of thousands to millions of dollars on physical equipment and then tons more on software licenses. But once this initial capital expense is made, they can use this equipment without worrying about a monthly bill (besides electricity, internet, and staffing costs, of course).
With the cloud, you’re effectively paying for what you use, so being smart when architecting solutions is key to cost savings. Using appropriately sized cloud services and turning things off when they’re not in use are the two most basic principles of cost management. From an accounting perspective, cloud services are an operational expense that is incurred monthly.
If you like analogies, this is similar to buying a home versus renting. You get more freedom when you buy, but you carry more of the risks and responsibilities along with it. You can rent an apartment and not have to mow the lawn or replace the dishwasher, but you don’t own anything at the end of the day. Neither option is best for all people in all situations.
In some cases, owning is cost prohibitive due to people, hardware, and software costs. In contrast, the cloud provides a more pay-as-you-go feel, with less need to manage infrastructure and pay for individual software licenses.
“Our IT security team can manage security better.”
In short, I doubt it. All of the major cloud vendors, including Microsoft, spend billions of dollars on cybersecurity. It behooves them to never have any security incidents, so they do everything they can to ensure that doesn’t happen. This doesn’t mean that cyberattacks and security incidents don’t happen in the cloud, but Azure provides a state-of-the-art barrier to protect their cloud and gives users an easier way to manage and monitor their specific resources.
Also, this is kind of the wrong way to think about it. The cloud security stuff doesn’t replace your cybersecurity team; it augments and enhances their abilities. They’ll still get tons of control over security in your cloud environment, and it’ll be backed by best-in-class cloud security services. The Azure cloud comes with tons of features like automatic threat detection, automated cloud audits, DDoS protection, and more. In fact, many of these features are free or very cheap in the cloud, which is something you don’t necessarily get out of the box using on-prem appliances.
Azure also provides standards-based compliance for organizations that must comply with regulatory standards. Whether it’s HIPAA, HITRUST, SOC 2, NIST, or some ISO standard, Azure provides blueprints to help you implement these controls and meet whatever compliance standard is necessary.
I will also cover compliance considerations more in depth in “Compliance”.
Three Benefits of the Cloud
If you’re used to doing bioinformatics work using your local workstation or your group’s computing equipment, you’ve likely experienced limitations in what you can accomplish. Data is segmented or difficult to find, you have to wait on other people’s analyses to finish before you can run your stuff, and collaborating is limited to everyone crowding around a bench in the lab.
From the various research groups I’ve helped move to the cloud, here are some of the common benefits I’ve seen.
Collaboration
By creating a central location for your genomics data, you’ll improve the accessibility of the data to your team and organization. This helps to reduce redundant and outdated copies of data and cuts down on data silos.
Certain services, like Azure Machine Learning and Databricks, also support collaboration by providing shared workspaces where individuals can contribute together. Outside of the Azure cloud services, tools like Microsoft 365 with Teams and Azure DevOps or GitHub (git-based code repository services) also support collaboration while integrating well with the overall cloud development lifecycle.
Scalability
Scalability is a term used to describe how your performance changes as you try to do more work (such as process more data). Architectures that are not scalable may cause bottlenecks when the amount of data or requests grow above a certain threshold. In contrast (and in a perfect world), scalable architectures handle workload fluctuations with ease and grow to fit the demand in an attempt to keep the overall performance (i.e., speed or efficiency) stable.
Using distributed computing services like Apache Spark in Databricks or Azure Batch, you can reduce time to insight by elastically scaling compute contexts for genomics. Specifically, you can reduce processing times for secondary and tertiary analyses and machine learning by harnessing the power of scalable computing services in the cloud.
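To make that concrete, here’s a tiny PySpark sketch of the kind of distributed work a Databricks cluster parallelizes for you. The storage path is a placeholder, and in Databricks a Spark session already exists; this is only an illustration of the pattern, not a production pipeline:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already provided for you in Databricks

# Read an entire cohort of VCFs as text, spread across the cluster's workers
# (the abfss:// path below is a placeholder for your own storage account)
vcf = spark.read.text("abfss://genomics@myaccount.dfs.core.windows.net/cohort/*.vcf")

# Drop header lines and count variant records across all files in parallel
n = vcf.filter(~F.col("value").startswith("#")).count()
print(f"Total variant records in cohort: {n}")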
Automation
Using cloud services like Azure Data Factory, you can orchestrate and automate processing pipelines. This means that workflows can be triggered when new data becomes available, kicking off and processing your data without the need for a bioinformatician to click “Run.”
In addition, you can streamline the availability of data to other systems. This is perfect for keeping dashboards and reports up-to-date with the latest results.
In this section, we’ve discussed some of the benefits of and considerations for using the Azure cloud. Designing and planning your cloud environment will be crucial to getting the most benefit while minimizing cost and friction.
Types of Cloud Services
Within the cloud, there are different types of services. Why is this important to know? Well, different service types are charged differently to your subscription, as shown in Figure 1-1, and knowing how to use each type appropriately can have large cost implications.
Infrastructure Services
The first and most basic type of service is called infrastructure as a service (IaaS). These are usually the underlying services that power the more abstracted services that we’ll talk about in a moment. IaaS includes things like networking, firewalls, and virtual machines.
With IaaS services, Azure will spin up the infrastructure, but you can install, configure, and manage the services as you wish. This offers the highest level of customization, but at the expense of having to manage the services more intently.
IaaS options are often charged by the minute within the billing period. So you’re paying for them while they are in use. For example, let’s say you have a virtual machine running in Azure. While that machine is on, your subscription is being charged by the minute. However, once you turn it off, the charges stop (except for any disk storage or reserved IP addresses that you’re using).
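If you manage VMs from code, here’s a minimal sketch of stopping those charges with the Azure SDK for Python (the azure-identity and azure-mgmt-compute packages); the subscription ID, resource group, and VM name below are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholders -- substitute your own subscription ID, resource group, and VM name
credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, "<subscription-id>")

# Deallocating (not just shutting down inside the OS) releases the compute
# hardware, so per-minute VM charges stop; disks and reserved IPs still bill
compute.virtual_machines.begin_deallocate("my-rg", "my-vm").result()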
Example: Genomics Data Science Virtual Machine
One example of a useful IaaS offering in Azure for genomics is the Genomics Data Science Virtual Machine (GDSVM). This is a custom virtual machine image that includes many software packages that you as a bioinformatician (or a data scientist) may find useful. You can deploy it as either an Ubuntu 18.04 or a Windows Server 2016 machine. Once the machine is deployed in your subscription, you’ll be able to remote into it (using SSH or RDP), where you’ll find popular data science tools like Jupyter, RStudio, Keras, and TensorFlow, along with bioinformatics tools like Bioconductor and GATK.
Platform Services
Platform as a service (PaaS) refers to services that provide a more on-demand environment without the need to manage all of the underlying infrastructure. Common Azure PaaS options include databases as a service (like Azure SQL Database, Synapse Analytics, and Cosmos DB) and analytics or machine learning platforms (like Azure Machine Learning and Databricks). These are often simply scalable cloud versions of platforms that could, in theory, be installed and managed locally.
With a PaaS service, Azure will manage the underlying infrastructure (such as the storage, compute, and networking parts), but you can often customize the performance tier that you’d like to use. These services are usually very easy to deploy and get started with but still offer a lot of flexibility in their usage and scale. In addition, many Azure services offer “serverless”1 options that allow you to use a particular tool, such as a database, without managing the underlying server infrastructure at all.
PaaS options are often charged at two levels: 1) a smaller fee for using the platform; and 2) charges for the underlying infrastructure. For example, with Databricks, you’ll incur costs for the underlying virtual machines (VMs) that get spun up as part of a cluster plus a “Databricks Unit (DBU)” fee on top of the VM cost per hour. With this consumption-based pricing, you simply pay for the size, performance tier, or usage as you go.
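As a back-of-the-envelope example of this two-level pricing, here’s a quick calculation in Python. All of the rates below are hypothetical; check the Azure and Databricks pricing pages for the real numbers in your region and tier:

vm_cost_per_hour = 0.50   # cost per worker VM per hour (hypothetical rate)
dbus_per_hour = 1.5       # DBUs consumed per VM per hour (hypothetical)
dbu_price = 0.40          # price per DBU (hypothetical)
workers = 4
hours = 3

# Total = the VM charges plus the Databricks platform fee on top of each VM-hour
total = workers * hours * (vm_cost_per_hour + dbus_per_hour * dbu_price)
print(f"Estimated cluster cost: ${total:.2f}")  # 4 * 3 * (0.50 + 0.60) = $13.20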
Example: Azure database for PostgreSQL
Let’s say, for example, you want to use PostgreSQL to house data for a particular application you’re building. You have a couple of options in Azure: (1) spin up a VM and then install PostgreSQL on it (the IaaS option); or (2) use the PaaS option, which is to deploy the Azure Database for PostgreSQL directly.
If you opt for the Azure Database for PostgreSQL option, you’ll have a couple choices with regard to how you want to use the service, as shown in Figure 1-2.
Note how there are a couple of general size options for the database. Once you select one of these, you won’t have to manage the underlying machines on which the database runs. You’ll just be able to scale the database from the Portal UI and interact with it using SQL as always. For this service, your subscription will be charged by the hour (basically just the VM cost since this is an open source/free application).
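Once the database is deployed, you connect to it like any other PostgreSQL server. Here’s a sketch using the psycopg2 package; the server name, database, and credentials below are placeholders:

import psycopg2

conn = psycopg2.connect(
    host="my-server.postgres.database.azure.com",  # placeholder server name
    dbname="genomics",
    user="dbadmin",
    password="<password>",
    sslmode="require",  # Azure requires TLS connections by default
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()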
Software Services
Software as a service (SaaS) refers to software applications that are available directly in the cloud without your managing any underlying infrastructure or installing the application at all. In fact, you’ve probably used more SaaS applications than you realize. Cloud tools like Google Docs or Word Online are basically just SaaS versions of word processing software. The same idea applies in Azure with tools that are made available via APIs or web-based portals. For example, common SaaS offerings in Azure include the Cognitive Services (pretrained AI models as APIs) or, more specific to bioinformatics, the Azure Genomics service (which we’ll cover later).
With a SaaS service, Azure will completely manage the underlying infrastructure, and there is often little opportunity to customize anything.
From a cost perspective, API-based SaaS options are usually charged by the API call (transaction), meaning a fixed fee for a bundle of API requests. For other application-based SaaS options, you may be charged for licensing plus a flat fee per day or month.
Now that you understand the different types of services you might spin up in Azure, we’ll move on to understanding how Azure is organized. It will help to know where you can place all the services as you create them.
Azure Environment Organization
To avoid future confusion, let’s walk through the different levels of organization that occur in Azure. This will help you to understand how services are organized, how billing occurs, and how access is controlled later.
Tenant
The tenant is the top-level grouping—aligned to an Azure Active Directory (set of accounts)—for an organization:
Usually, organizations have only a single tenant for the entire company.
There are uncommon instances where an organization has more than one tenant to delineate accounts as much as possible. For example, an academic medical center may have a tenant for the university and a separate tenant for the medical arm of the organization.
Subscriptions
Subscriptions are the billable units within a tenant. Subscriptions come in a few different plan types, and it’s a little opaque, but I’ll try to cover a few of the common ones here:
Pay-as-You-Go (PAYG)
This plan is available without a long-term commitment and provides access to Azure services at full retail pricing. You can pay for this type of subscription using a credit card only. This is the best type of subscription for teams who are just getting started on Azure. There is no difference in pricing for development workloads versus production workloads with this type of subscription.
Enterprise Agreement
This plan usually requires a 12-month commitment and is common among larger organizations with higher Azure usage. This type of subscription provides special pricing for certain services if the organization commits to a certain financial spend on Azure services. Usually, organizations use this subscription type for production workloads as it offers the best service-level agreements and support.
Enterprise Dev/Test
This plan is similar to the Enterprise Agreement plan but offers much lower costs for development and testing workloads on Azure. Organizations use subscriptions of this type as a cost-saving measure when teams are working on developing a new product, platform, or architecture.2
Resource Groups
Resource groups are like logical containers for your resources. Often, a resource group holds services that are all related to (or used for) a given solution. For example, you may choose to house all of the services related to your enterprise data warehouse (such as a storage account, a data factory instance, and a Synapse Analytics workspace) within the same resource group. Resource groups also let you delete a set of resources in one action, making it easier to manage related services together. Within the billing breakdown under a subscription, you can see costs by resource group, which provides an easy way to understand how much a set of resources for a given purpose is costing you. In addition, you can set budgets by resource group or subscription.
Regions
Resource groups and individual services are placed in a region. This is a geographical location of an Azure data center where your services and data are hosted. Generally, you pick a region that is closest to you (or your customers), which should reduce latency in getting access to your applications or data. You can select from more than 40 regions across the globe. Most regions have access to most Azure services, but there are some regions that may not have some of the brand-new services (yet). There are also special regions that meet specific security and compliance needs, like the Azure government cloud regions, which are for government entities such as the United States Department of Defense.
If you’re just working in your own personal account, this is less important since it’s quite simple: you have a single tenant of one user (you) that has a single subscription (the one that Azure makes for you) with one or more resource groups of services.
If you’re using your organization account (from work or school), the tenant is usually managed by IT and includes all the users in the organization. Then the tenant will have multiple subscriptions: usually one for each group in the organization (or three if the organization separates development, staging/testing, and production levels)—or whatever makes sense from a billing/budget perspective. Finally, each subscription will have one or more resource groups of services within them. See Figure 1-3.
Note how, in Figure 1-3, you can organize your tenant by making subscriptions for dev/test or production purposes for various applications or teams. In addition, for business-critical applications, it’s common to see resource groups mirrored between development and production environments. This allows for a more controlled promotion of applications into a production environment once the application has been deemed stable and ready for wider organizational use.
Now that you have the basics of Azure under your belt, let’s start setting up your very own Azure account.
Getting an Azure Account
If you work for an organization that is already using Azure, you don’t necessarily need to create your own account. Just ask your IT department (or whoever manages your Azure environment) to create a sandbox subscription for you to play in.
However, if you want to have your own personal Azure account to play around in, I’ll show you how:
If you’d like to get $200 in free credits, click the “Start free” button, shown in Figure 1-4. Otherwise, click “Pay as you go” and then “Get started.”3
Either option will take you to a login screen, as shown in Figure 1-5, where you’ll need a Microsoft account. If you already have one (including a GitHub account), go ahead and sign in. Otherwise, create one. (If you are a student, use your university email.)
In Figure 1-6, you’ll input some information to set up a subscription. If you’re opting to get the $200 in free credits, this process will give you a free subscription that you will convert to a PAYG subscription later. If you’re just going straight for the PAYG option, you won’t get the credits, but you won’t have to convert the subscription, and you’ll be able to spin up any Azure service immediately.
Once you fill out this form and click “Sign up,” wait for an email stating that your subscription has been set up. (This could take a little while.) You will then be able to log in to the Azure Portal.
Welcome to the Azure Portal
The Azure Portal is the main entry point to interact with your cloud environment. Once logged in, you can begin spinning up resources as you wish.
Setting Up a Resource Group
Let’s walk through a few facets of the Portal and get some things set up first:
The Azure Portal will automatically show the commonly accessed resources right from the main screen. The first time that you log in, though, it will look something like Figure 1-7.
Clicking the gear icon in the top bar will open “Portal settings,” as shown in Figure 1-8. You may wish to customize the color theme or whether to show the recent items on the startup page or show a dashboard instead.
From the main screen, click “Resource groups” and then click the “+ Create” button to set up your first resource group (see Figure 1-9).
A new blade (this is the name for these panels that open in the portal) will open up where you can name the resource group and pick the region. Generally, I would recommend adding -rg to the end of your desired name. You can also add -dev or -test or -prod to specify if the resource group is to be used for development, testing, or production purposes, respectively. Finally, pick the region that’s closest to you (if you’re using your personal account) or pick the region where your organization keeps its other Azure resources (if you’re using your work or school account). See Figure 1-10.
Once you’ve specified the name and region for the resource group, click the “Review + create” button. This will perform a quick check to make sure the resource group can be created. Finally, click the “Create” button, as shown in Figure 1-11, to have Azure start the process to create the resource group. This will take a minute or so.
Nice! You now have a resource group to work with. Shall we put something in it?
Creating Resources
Back on the main portal screen, click the “+ Create a resource” button. This will open the “Create a resource” menu, as shown in Figure 1-12, where you can search for whatever product you’d like to provision.
You can click any of the popular products that show up on the main list, browse any of the categories on the left menu, or simply search for what you’re looking for. As an example, let’s create a storage account.
In the search box, type “storage” and press the Enter key. This will return a list of results that match this query and, ideally, the first option is “Storage account,” as shown in Figure 1-13. Click it and click the “Create” button on the next screen.
For any resource that you want to provision, there are often lots of options that you must specify. For a basic storage account, you’ll just need to provide a name for it and a resource group and region to place it in. Once you’ve specified these things, you can click over to the “Review + create” tab and click the “Create” button (once it has passed the validation, of course). See Figure 1-14.
Note
There are plenty more options to tweak with Azure Storage. Later in this book, we’ll walk through how to provision a data lake with geo-redundancy and premium performance, which is a much fancier version of what we’re doing now.
Once this storage account has finished being deployed, you should then see it on your main portal page under “Recent resources,” as shown in Figure 1-15. You can also navigate back to your resource group from before, and you’ll see it there as well.
Congrats! You’ve just created your first resource in Azure.
In this book, we’ll be spinning up multiple types of services, but the process is quite similar to what you’ve just done in this example. Feel free to refer to this section anytime you need a refresher on how to create a service in Azure. In addition to creating services using the Azure Portal (the easiest way), experienced developers, cloud engineers, or IT professionals may find it useful to create resources using the Azure CLI or PowerShell or using infrastructure as code tools (which we’ll talk about in Chapter 8).
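For instance, here’s a minimal sketch of creating a resource group from Python with the Azure SDK (the azure-identity and azure-mgmt-resource packages); the subscription ID, resource group name, and region below are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
resources = ResourceManagementClient(credential, "<subscription-id>")

# Create (or update) a resource group, following the -rg naming suggestion earlier
rg = resources.resource_groups.create_or_update(
    "genomics-dev-rg",
    {"location": "eastus"},
)
print(f"Created {rg.name} in {rg.location}")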
Free Services
If this is your first foray into Azure or into the cloud in general, you can play with plenty of services for free.4 This is perfect for learning your way around or prototyping some solutions on the cheap.
From the Azure Portal, you may have a tile at the top under “Azure services” that says “Free services.” Otherwise, you can search for “Free services” under “All services” on the side menu, as shown in Figure 1-16.
This will bring up a list of services that are available at no cost. Notice that some are free for a certain tier for a limited number of hours, like B-tier VMs, or for a specific amount of usage, like Blob storage or some databases. Many of the APIs, such as the Azure Cognitive Services, are free for a certain number of transactions. See Figure 1-17 for some examples of free services you can use in Azure.
You can spin up these services right from this screen. You can also upgrade many of these services once your free time is up or if you need a higher tier of service as you grow.
Also, you don’t necessarily have to spin up services from here. For example, if you want to provision the GDSVM from the section “Infrastructure Services,” you can simply select the “B1s” VM tier, which is free for 750 hours, and give it a try.
Basics of the Bioinformatics Workflow
If you’re reading this book, I’m assuming you know a bit about bioinformatics and genomics already. However, I will provide a quick refresher on some of the basics since the terminology I use in this book may differ from what you’ve learned in the past.
In bioinformatics, it’s common to lay out your analysis as a series of steps in a pipeline. Each step processes the input data and converts the information into a more usable form. Beyond complex pipelines, another aspect that makes bioinformatics challenging from a computing perspective is the esoteric file types you may encounter. So, let’s walk through the standard steps of a bioinformatics pipeline at a high level (shown in Figure 1-18) to understand the forms the data takes at each stage.
Primary Analysis
Once a sample has been collected from a specimen (such as human blood, a bacterial swab, or plant matter), lab scientists can prepare this sample and load it into a sequencing machine. These machines will attempt to read the DNA (or RNA) of the sample and perform “base calling,” converting the biochemical signals from the machine into A’s, C’s, T’s (or U’s), and G’s. This sequencing process is called the primary analysis.
Usually, the output files of this process are raw sequence data in FASTA or FASTQ format.
FASTA
FASTA files are human readable, meaning they can be opened and read using any text editor on your computer. Each sequence record begins with a metadata line that starts with a “>” character, usually followed by a sample ID and then some information about the sample. It’s also common for the metadata to be delimited by some other character (such as a “|”), but there is no standard for this.
For example, the following snippet shows the start of the reference nucleotide sequence of SARS-CoV-2 from Wuhan, Hubei, China (abbreviated here from GenBank record NC_045512.2; check the record itself for the authoritative, full sequence):5
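>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
...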
FASTQ
FASTQ files are very similar to FASTA, except they also include some information about the quality (or confidence) of the base being reported at a given position. Each new sequence starts with an “@” symbol, followed by metadata. Then, the next few lines are 1) the nucleotide sequence, 2) a line with a “+” sign, and 3) a sequence of quality characters equal to the length of the nucleotide line. Here’s a small illustrative record (the read ID and sequence are arbitrary):
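@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65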
The quality line might look a bit obscure, but it ranges from a “!” character as the lowest score to a “~” as the highest. The full range of ASCII characters (from worst quality to best) is as follows:
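!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Because these are just ASCII codes offset by 33 (in the standard Sanger/Illumina 1.8+ encoding), converting a quality string to numeric Phred scores is a one-liner in Python. For example, the quality line from the FASTQ record above decodes like this:

# Convert a FASTQ quality string (Sanger encoding) to numeric Phred scores
qual = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
phred = [ord(ch) - 33 for ch in qual]
print(phred[:5])  # [0, 6, 6, 9, 7]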
Secondary Analysis
Once you have your sequence information, you can derive more insights by aligning those sequences to reference sequences (or whole reference genomes) and then understanding how your samples vary from that reference in a process called variant calling.
For many types of research studies, you may wish to align a whole genome (or exome) to a reference and then determine all of the variants the organism has. This is important in human genetics to understand how particular mutations relate to diseases. The pipelines that support these tasks often have multiple steps and may take a while to run given the amount of data being processed.
In other branches of genomics, you may not have nearly as much sequence information, so the process is much easier or more targeted. For example, when studying viruses (like SARS-CoV-2), the entire genome is quite small in comparison to humans or other organisms. Plus, you may care only about particular genes and how they’ve mutated. In these situations, the secondary analysis pipelines may be shorter and quicker as there’s less work to be done.
SAM (and BAM)
The sequence alignment map (SAM) format is a human-readable file type that stores the alignment results of an input sequence against a provided reference. Here are some example lines of a SAM file (a minimal, hypothetical example; real files have many more header lines and reads):
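@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@RG ID:sample01 SM:sample01
@CO Hypothetical reads for illustration
read001 0  chr1 10468 60 8M * 0 0 ACGTACGT FFFFFFFF
read002 16 chr1 11034 57 8M * 0 0 TTGCACGT ;;FFFDDD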
The lines at the top of the file (beginning with an “@” symbol) contain metadata about the alignment. In the previous example, the @SQ lines provide information about the reference sequences that were used, their names (e.g., chr1), and their length (denoted by the LN). The metadata lines can also hold information about the file itself, any read group information, etc.
The body of a SAM file contains a standard set of fields as shown in Table 1-1.
Table 1-1. SAM format column specifications
Column  Name             Description                                               Data type
1       QNAME            Query NAME of the read or the read pair                   String
2       FLAG             Bitwise FLAG (pairing, strand, mate strand, etc.)         Integer
3       RNAME            Reference sequence NAME                                   String
4       POS              1-based leftmost POSition of clipped alignment            Integer
5       MAPQ             MAPping Quality (phred-scaled)                            Integer
6       CIGAR            Extended CIGAR string                                     String
7       MRNM (or RNEXT)  Mate Reference NaMe or reference name of the NEXT read    String
                         ('=' if same as RNAME)
8       MPOS (or PNEXT)  1-based leftmost mate POSition or position of the         Integer
                         NEXT read
9       ISIZE (or TLEN)  Inferred Insert SIZE or observed template LENgth          Integer
10      SEQ              Query SEQuence on the same strand as the reference        String
11      QUAL             Query QUALity (ASCII-33=phred base quality)               String
However, your file may not have the mate columns 7 and 8 if you don’t have information about the next mapped read. Also, information can be added starting in column 12 if there’s more data to include.
Binary alignment map (BAM) is simply the binary version of a SAM file. These are quite common as they save on storage space and are quicker to read using computers, but they are not human readable without a piece of software (usually Samtools).6
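If you need to work with SAM or BAM files in code rather than at the command line, the pysam package (Python bindings around the same htslib code that powers Samtools) is the usual choice. A short sketch, assuming an indexed BAM file named sample.bam exists:

import pysam

# Open a BAM in binary mode ("rb"); use "r" instead for a plain-text SAM file
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    # fetch() requires a BAM index (.bai); it yields reads overlapping the region
    for read in bam.fetch("chr1", 10000, 11000):
        print(read.query_name, read.reference_start, read.mapping_quality)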
VCF
Variant calling is one of the most important parts of secondary analysis as this is the step that distills the information from the alignment down to the more useful bits of information: positions where your sample’s sequence varies from the reference. Variant calling is used to provide a list of mutations by position for a given sample, which is useful in understanding diseases, evolution, and more.
The most common format you’ll encounter with variant data is the variant call format file (VCF), which looks like this:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
19     111     .         A   C      9.6  .      .                                 GT:HQ       0|0:10,10      0|0:10,10      0/1:3,3
19     112     .         A   G      10   .      .                                 GT:HQ       0|0:10,10      0|0:10,10      0/1:3,3
20     14370   rs6054257 G   A      29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T   A      3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3:.,.
20     1110696 rs6040355 A   G,T    67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4:.,.
20     1230237 .         T   .      47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,.
20     1234567 microsat1 G   GA,GAC 50   PASS   NS=3;DP=9;AA=G;AN=6;AC=3,1        GT:GQ:DP    0/1:.:4        0/2:17:2       1/1:40:3
20     1235237 .         T   .      .    .      .                                 GT          0/0            0|0            ./.
The header lines of a VCF file (denoted by ##) contain metadata information about the file. This usually includes information about the software that generated the VCF file, what reference was used, etc. In addition, these lines may contain additional information about certain ALT, INFO, and FORMAT attributes.
In general, the body of the VCF file contains the columns described in Table 1-2.
Table 1-2. VCF file column specifications
Column  Name     Description
1       CHROM    The name of the sequence, usually an integer referencing a chromosome or a string such as “chr1,” “chrX,” “MT,” etc.
2       POS      The 1-based position of the variation on the given sequence
3       ID       The identifier of the variation, often a dbSNP rs identifier (or a “.” if unknown)
4       REF      The reference base (or bases if describing an indel) at the given position
5       ALT      The alternative allele(s) at this position
6       QUAL     A quality score of the inference of the given alleles
7       FILTER   A flag indicating which of a given set of filters the variation has failed (or “PASS” if all the filters were passed)
8       INFO     A list of key-value pairs (attributes) describing the variation
9       FORMAT   A list of fields describing each of the samples in the file (optional)
10+     SAMPLEs  For each sample described in the file, the values of the fields listed in FORMAT
The VCF format is quite flexible, which can sometimes make it a pain to use if you have VCFs that were generated by different tools and contain lots of different attributes. Specifically, VCFs can describe one sample, multiple samples, one chromosome, multiple chromosomes, a whole genome, etc., and contain a wide range of INFO attributes, which are usually added in the tertiary analysis step.
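Even so, the core of a VCF record is easy to pick apart in code. Here’s a minimal sketch that parses the INFO column of one of the data lines shown earlier into a dictionary (flag attributes like DB carry no value, so they’re stored as True):

# One data line from the example VCF above (tab-delimited; FORMAT/sample columns omitted)
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]

attrs = {}
for field in info.split(";"):
    key, _, value = field.partition("=")  # flag attributes have no "=value" part
    attrs[key] = value if value else True

print(attrs)  # {'NS': '3', 'DP': '14', 'AF': '0.5', 'DB': True, 'H2': True}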
Tertiary Analysis
The final step of the main bioinformatics workflow is tertiary analysis, which usually comprises the processes to annotate the variants discovered in the previous step. It is important to understand not only how your sample may vary from a reference sequence but also what this variation may mean functionally. To answer this question, you can annotate the variants to provide information on their clinical or functional significance for the organism.
Some common annotation tools include SnpEff, VEP, and ANNOVAR. Each annotation tool adds useful information about the variant and may predict the functional effect of a given mutation. This gets appended to the INFO column of your VCF file for each line after the previous attributes from the variant calling process:
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 123456 . A T . . NS=3;DP=14;AF=0.0005;ANN=T|...
chr1 123457 . C G . . AF=0.0005;ANN=C-chr1:123456_A>T|...
This annotation (“ANN”) attribute may tell you the type of variant, the level of impact this variant has (LOW, MODERATE, HIGH, etc.), other ancillary information about the variant’s position (the gene ID or transcript ID), and the respective cDNA nucleotide change and protein amino acid change. Here’s an example of what an ANN attribute for a single VCF line might look like after annotation using SnpEff (the gene and transcript IDs below are made up for illustration):
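ANN=T|missense_variant|MODERATE|GENE1|ENSG00000000001|transcript|ENST00000000001.1|protein_coding|2/10|c.123A>T|p.Lys41Asn|123/1500|123/900|41/299||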
Most of the time, annotation tools are simply performing a database lookup of the mutation and position from each record in your VCF. This means that these tools will simply add attributes to the INFO field, usually delimited by a different character. Notice that, in the previous example, the annotations are pipe-delimited (“|”).
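Splitting an annotation like this back into named fields is straightforward; here’s a sketch using the hypothetical ANN value from above:

# Split a SnpEff-style ANN attribute into its pipe-delimited fields
ann = ("T|missense_variant|MODERATE|GENE1|ENSG00000000001|transcript|"
       "ENST00000000001.1|protein_coding|2/10|c.123A>T|p.Lys41Asn|"
       "123/1500|123/900|41/299||")
allele, effect, impact, gene = ann.split("|")[:4]
print(effect, impact, gene)  # missense_variant MODERATE GENE1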
Other Analyses
Outside of the traditional bioinformatics workflow, there are plenty of other tasks you may find yourself doing. Here are some other analyses that you may perform on the results of one of the previous steps or other data altogether:
Machine Learning
Machine learning models are trained to predict various outcomes. For example, you could predict drug sensitivity based on gene expression or disease status based on diagnostic test results. (Chapters 5 and 6 will cover the tools perfect for machine learning in Azure.)
Biostatistics
Need to understand correlation (which is not equal to causation) or the statistical significance of differences between groups? Biostatistical analyses are important for quantifying that confidence.
Protein Structure Modeling
Outside of the normal nucleotide analysis, you can also look into amino acid sequences and the role proteins play in an organism.
Reporting and Visual Analytics
Usually as the final step, you need to summarize your results and make them digestible for others in reports. Visualizing the outputs of your analyses helps you convey your results in an understandable way.
Other File Formats
In addition to the formats described, you may see other file formats that help to run one of the analyses or serve as a different view of the same data.
GEN (and BGEN)
The GEN format is a file format developed by researchers at Oxford University that mainly holds the genotype information for a sample. This is similar to the VCF format in that it contains single nucleotide polymorphisms (SNPs), an ID for the variant, and the alleles seen in the sample. The binary form of this format, called BGEN, is useful due to its very small file size and fast readability by computers (as compared to human-readable formats like VCF).
GFF
The general feature format (GFF) is one of the most broadly purposed formats you may come across. This format was designed to house many types of features for nucleic acid and protein sequences. This can include features such as microRNAs, open reading frames (ORFs), binding domains, exons, and more. The latest version of the GFF is called GFF3. Here’s an abbreviated, illustrative snippet (based on the canonical example from the GFF3 specification):
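##gff-version 3
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00001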
The standard specification of columns for GFF3 is shown in Table 1-3.
Table 1-3. GFF3 column specifications
Column  Name           Description
1       Sequence ID    The identifier given to the feature’s sequence. Could be an accession number or any unambiguous ID.
2       Source         Describes the algorithm or procedure that generated this feature (e.g., GenBank, Protein Homology, EMBL).
3       Feature Type   The type of feature being described (e.g., mRNA, domain, exon).
4       Feature start  The 1-based starting coordinates of the feature.
5       Feature end    The ending coordinates of the feature.
6       Score          A score for sequence similarity (E-values) or predictions (P-values).
7       Strand         The strand from which the feature was derived.
8       Phase          Indicates where the feature begins with reference to the reading frame, commonly used in CDS features.
9       Attributes     A list of key-value pairs, delimited by a semicolon, that provides additional information about each feature.
PDB
If you’re a researcher who works in protein analysis, you’ll likely come across Protein Data Bank (PDB) files. These files contain the three-dimensional coordinates of the residues in a protein structure.
Here are some sample lines from the SARS-CoV-2 Omicron variant spike structure PDB file:7
HEADER    VIRAL PROTEIN                           19-DEC-21   7T9J
TITLE     CRYO-EM STRUCTURE OF THE SARS-COV-2 OMICRON SPIKE PROTEIN
...
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS
SOURCE   3 2;
SOURCE   4 ORGANISM_TAXID: 2697049;
SOURCE   5 GENE: S, 2;
SOURCE   6 EXPRESSION_SYSTEM: HOMO SAPIENS;
SOURCE   7 EXPRESSION_SYSTEM_TAXID: 9606
KEYWDS    SARS-COV-2, GLYCOPROTEIN, FUSION PROTEIN, VIRAL PROTEIN
EXPDTA    ELECTRON MICROSCOPY
AUTHOR    X.ZHU,D.MANNAR,J.W.SAVILLE,S.S.SRIVASTAVA,A.M.BEREZUK,K.S.TUTTLE,
AUTHOR   2 S.SUBRAMANIAM
REVDAT   1   29-DEC-21 7T9J    0
JRNL        AUTH   D.MANNAR,J.W.SAVILLE,X.ZHU,S.S.SRIVASTAVA,A.M.BEREZUK,
JRNL        AUTH 2 K.S.TUTTLE,C.MARQUEZ,I.SEKIROV,S.SUBRAMANIAM
JRNL        TITL   SARS-COV-2 OMICRON VARIANT: ACE2 BINDING, CRYO-EM STRUCTURE
JRNL        TITL 2 OF SPIKE PROTEIN-ACE2 COMPLEX AND ANTIBODY EVASION
JRNL        REF    TO BE PUBLISHED
JRNL        REFN
Note that there are many different line entry types that can be added in a PDB file. This includes the structure’s title and author information, the journal article (if it’s been published in a paper), and source organism metadata.
In addition, it’s common for a PDB file to have a ton of REMARK lines, which usually describe the process of determining the structure, quality metrics, and really anything else that the author wants to share:
REMARK 2
REMARK 2 RESOLUTION. 2.79 ANGSTROMS.
...
REMARK 245 ELECTRON MICROSCOPE SAMPLE
REMARK 245 SAMPLE TYPE : PARTICLE
REMARK 245 PARTICLE TYPE : POINT
REMARK 245 NAME OF SAMPLE : SARS-COV-2 OMICRON SPIKE
REMARK 245 PROTEIN
REMARK 245 SAMPLE CONCENTRATION (MG ML-1) : NULL
REMARK 245 SAMPLE SUPPORT DETAILS : NULL
REMARK 245 SAMPLE VITRIFICATION DETAILS : NULL
REMARK 245 SAMPLE BUFFER : NULL
REMARK 245 PH : 8.00
REMARK 245 SAMPLE DETAILS : NULL
REMARK 245
...
REMARK 247 ELECTRON MICROSCOPY
REMARK 247 THE COORDINATES IN THIS ENTRY WERE GENERATED FROM ELECTRON
REMARK 247 MICROSCOPY DATA. PROTEIN DATA BANK CONVENTIONS REQUIRE
REMARK 247 THAT CRYST1 AND SCALE RECORDS BE INCLUDED, BUT THE VALUES
REMARK 247 ON THESE RECORDS ARE MEANINGLESS EXCEPT FOR THE CALCULATION
REMARK 247 OF THE STRUCTURE FACTORS.
...
PDB files also often contain a SEQRES section that lists the amino acid sequence of the structure. This is useful if you need the sequence for other bioinformatics analyses such as protein sequence alignment, etc.:
SEQRES 1 A 1285 MET PHE VAL PHE LEU VAL LEU LEU PRO LEU VAL SER SER
SEQRES 2 A 1285 GLN CYS VAL ASN LEU THR THR ARG THR GLN LEU PRO PRO
SEQRES 3 A 1285 ALA TYR THR ASN SER PHE THR ARG GLY VAL TYR TYR PRO
The most important section includes the lines that start with ATOM, which describe the position of the individual atoms of the amino acid residues:
ATOM      1  N   GLN A  14     167.229 153.151 261.437  1.00179.65           N
ATOM      2  CA  GLN A  14     166.857 151.813 260.994  1.00179.65           C
ATOM      3  C   GLN A  14     167.456 151.510 259.625  1.00179.65           C
ATOM      4  O   GLN A  14     167.347 152.316 258.703  1.00179.65           O
ATOM      5  CB  GLN A  14     165.336 151.671 260.947  1.00179.65           C
ATOM      6  CG  GLN A  14     164.672 151.610 262.310  1.00179.65           C
ATOM      7  CD  GLN A  14     163.179 151.365 262.217  1.00179.65           C
ATOM      8  OE1 GLN A  14     162.554 151.648 261.194  1.00179.65           O
ATOM      9  NE2 GLN A  14     162.598 150.838 263.288  1.00179.65           N
ATOM     10  N   CYS A  15     168.062 150.331 259.493  1.00181.02           N
ATOM     11  CA  CYS A  15     168.663 149.888 258.244  1.00181.02           C
ATOM     12  C   CYS A  15     168.213 148.468 257.935  1.00181.02           C
ATOM     13  O   CYS A  15     168.011 147.653 258.839  1.00181.02           O
ATOM     14  CB  CYS A  15     170.199 149.930 258.304  1.00181.02           C
ATOM     15  SG  CYS A  15     170.926 151.538 258.689  1.00181.02           S
...
The last section I’ll mention includes the positions of atoms that are not part of the protein molecule. These lines are denoted by HETATM. The next example describes a small-molecule ligand called 2-acetamido-2-deoxy-beta-D-glucopyranose (or NAG for short):
HETATM21152 C1 NAG D 1 173.940 145.791 256.091 1.00189.94 C
HETATM21153 C2 NAG D 1 174.055 145.911 257.627 1.00189.94 C
HETATM21154 C3 NAG D 1 175.262 146.769 258.009 1.00189.94 C
HETATM21155 C4 NAG D 1 176.533 146.234 257.359 1.00189.94 C
HETATM21156 C5 NAG D 1 176.341 146.159 255.845 1.00189.94 C
HETATM21157 C6 NAG D 1 177.522 145.560 255.117 1.00189.94 C
Both the ATOM and HETATM sections have a standard columnar format that includes the columns described in Table 1-4.
Table 1-4. PDB ATOM and HETATM column specification
Column  Positions  Description
1       1–6        Record name (“ATOM” for this section)
2       7–11       Atom number
3       13–16      Atom name
4       17         Alternate location indicator
5       18–20      Residue name
6       22         Chain identifier
7       23–26      Residue sequence number
8       27         Code for insertion of residues
9       31–38      Orthogonal coordinates for X in Angstroms
10      39–46      Orthogonal coordinates for Y in Angstroms
11      47–54      Orthogonal coordinates for Z in Angstroms
12      55–60      Occupancy percentage
13      61–66      Temperature (“B”) factor (Default = 0.0)
14      73–76      Segment identifier
15      77–78      Element symbol
16      79–80      Charge on the atom
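Because these are fixed column positions rather than delimiters, you parse them by slicing. Here’s a minimal Python sketch that pulls the main fields out of the first ATOM line shown earlier (Table 1-4 uses 1-based column positions, so the Python slices are offset by one):

line = ("ATOM      1  N   GLN A  14     "
        "167.229 153.151 261.437  1.00179.65           N")

record = line[0:6].strip()      # record name: "ATOM" or "HETATM"
serial = int(line[6:11])        # atom number
name = line[12:16].strip()      # atom name, e.g., "N"
res_name = line[17:20].strip()  # residue name, e.g., "GLN"
chain = line[21]                # chain identifier
res_seq = int(line[22:26])      # residue sequence number
x = float(line[30:38])          # X coordinate in Angstroms
y = float(line[38:46])          # Y coordinate in Angstroms
z = float(line[46:54])          # Z coordinate in Angstroms
occupancy = float(line[54:60])  # occupancy
b_factor = float(line[60:66])   # temperature ("B") factor

print(record, serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor)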
To render a PDB file as a 3D image, researchers often use desktop tools such as PyMOL or web-based tools like 3Dmol.js and JSmol. For the previous file, this will generate something like Figure 1-19.
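If you work in Jupyter notebooks, you can get a similar interactive view with the py3Dmol package (Python bindings for 3Dmol.js). A small sketch, assuming py3Dmol is installed and you’re running in a notebook:

import py3Dmol

# Fetch the Omicron spike structure (7T9J) directly from the Protein Data Bank
view = py3Dmol.view(query="pdb:7T9J")
view.setStyle({"cartoon": {"color": "spectrum"}})  # color the cartoon N-to-C
view.show()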
While there are many other file formats I could cover in this section, the ones mentioned are the most common. So I won’t drone on about file formats any longer, though I hope you will take the time to get familiar with the ins and outs of formats that you often use. It’ll help you when you need to come up with logic to process that data with various tools.
Now that you have your own Azure environment ready to go and you’re all up to speed on bioinformatics terminology, the next chapter will cover how to start setting up a data lake to house your genomics data. Care to take a dive in the lake?
In summary, the Microsoft Azure cloud, like other major enterprise cloud providers (Amazon Web Services and Google Cloud Platform), offers several types of services: infrastructure services (IaaS), such as virtual machines and networking; platform services (PaaS), such as Azure Machine Learning and Synapse Analytics; and software services (SaaS), such as hosted applications or APIs. The top level of cloud organization is the tenant, under which you’ll have one or more subscriptions. These subscriptions are what get billed when you use Azure services. In a given subscription, you may have one or more resource groups, which organize the services that you provision. Resource groups are often created for a given use case or group of users.
There are lots of considerations for moving to the cloud. The cloud isn’t always cheaper or faster, but you don’t have to migrate everything to the cloud at once. One of the biggest pros of using the cloud is that it reduces the need to manage the underlying infrastructure as much, including things like security and networks. While Azure is used across the globe in finance, healthcare, retail, government, etc., the use of its enterprise-grade services has long been overlooked for genomics. In this book, we’re going to change that.
3 What’s the difference? The “free credits” option doesn’t allow you to spin up all of the Azure services, so you’ll have to convert to a PAYG subscription later anyway once the credits expire or if you need to spin up services that aren’t covered by the credits.
4 See “Explore free Azure services,” Azure, accessed September 15, 2022, https://oreil.ly/9Xoac.