Chapter 4. Orchestrating Data Movement and Transformation
When you work in a data-driven role (e.g., as a bioinformatician, data scientist, or data analyst), having data readily available is paramount to your success. In a cloud environment, a common hurdle for organizations just starting out is actually getting data into the cloud for everyone to use. Data orchestration encompasses the processes for moving data into and out of cloud resources, as well as managing related data tasks. In this chapter, we’ll learn how to orchestrate data movement in the cloud by connecting to sources (often outside our cloud environment) and copying data over to our data lake or another destination.
In Azure, the standard tool for data orchestration is Data Factory. It blends the capabilities of a traditional extract-transform-load (ETL) tool, such as SQL Server Integration Services, with orchestration capabilities for queuing up external data tasks. With Data Factory, we’ll learn how to extract data from outside our Azure environment into our data lake, and then how to transform and load that data into our data warehouse. See Figure 4-1.
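To make the extract-and-copy pattern concrete, the sketch below builds the JSON body of a Data Factory pipeline containing a single Copy activity, which is the building block Data Factory uses to move data from an external source into a data lake. The pipeline, dataset, and activity names here are illustrative placeholders (not from this chapter), and the source/sink types are just one example of the many connector-specific options.

```python
# A minimal sketch of a Data Factory pipeline definition with one Copy
# activity, expressed as the JSON that Data Factory pipelines are built
# from. All names (pipeline, datasets) are hypothetical placeholders.
import json


def copy_activity(name, source_dataset, sink_dataset):
    """Build the JSON body of a Data Factory 'Copy' activity that reads
    from one dataset (e.g., an external source) and writes to another
    (e.g., a folder in a data lake)."""
    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            # Source and sink types depend on the connector in use;
            # binary copy is shown here as a generic example.
            "source": {"type": "BinarySource"},
            "sink": {"type": "BinarySink"},
        },
    }


pipeline = {
    "name": "IngestExternalData",  # illustrative pipeline name
    "properties": {
        "activities": [
            copy_activity(
                "CopyToDataLake",        # illustrative activity name
                "ExternalSourceDataset", # dataset pointing at the source
                "DataLakeDataset",       # dataset pointing at the data lake
            )
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

In practice you would author this same definition through the Data Factory visual designer rather than by hand, but seeing the underlying JSON clarifies what a pipeline actually is: a named collection of activities, each wiring a source dataset to a sink dataset.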
Outside of Azure, there are certainly other third-party ETL tools available for you to purchase, but it’s worth giving Data Factory a shot as it is very well-integrated with the other Azure services that we cover in this book. Plus, Data Factory supports more than 90 built-in connectors to common enterprise platforms like SAP, ...