
Advanced Analytics with Spark

by Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills

April 2015

Intermediate to advanced

200 pages

7h 25m

English

O'Reilly Media, Inc.


Foreword
Preface
What’s in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
1. Analyzing Big Data
The Challenges of Data Science; Introducing Apache Spark; About This Book
2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists; The Spark Programming Model; Record Linkage; Getting Started: The Spark Shell and SparkContext; Bringing Data from the Cluster to the Client; Shipping Code from the Client to the Cluster; Structuring Data with Tuples and Case Classes; Aggregations; Creating Histograms; Summary Statistics for Continuous Variables; Creating Reusable Code for Computing Summary Statistics; Simple Variable Selection and Scoring; Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set
Data Set; The Alternating Least Squares Recommender Algorithm; Preparing the Data; Building a First Model; Spot Checking Recommendations; Evaluating Recommendation Quality; Computing AUC; Hyperparameter Selection; Making Recommendations; Where to Go from Here
4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression; Vectors and Features; Training Examples; Decision Trees and Forests; Covtype Data Set; Preparing the Data; A First Decision Tree; Decision Tree Hyperparameters; Tuning Decision Trees; Categorical Features Revisited; Random Decision Forests; Making Predictions; Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection; K-means Clustering; Network Intrusion; KDD Cup 1999 Data Set; A First Take on Clustering; Choosing k; Visualization in R; Feature Normalization; Categorical Variables; Using Labels with Entropy; Clustering in Action; Where to Go from Here
6. Understanding Wikipedia with Latent Semantic Analysis
The Term-Document Matrix; Getting the Data; Parsing and Preparing the Data; Lemmatization; Computing the TF-IDFs; Singular Value Decomposition; Finding Important Concepts; Querying and Scoring with the Low-Dimensional Representation; Term-Term Relevance; Document-Document Relevance; Term-Document Relevance; Multiple-Term Queries; Where to Go from Here
7. Analyzing Co-occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis; Getting the Data; Parsing XML Documents with Scala’s XML Library; Analyzing the MeSH Major Topics and Their Co-occurrences; Constructing a Co-occurrence Network with GraphX; Understanding the Structure of Networks; Connected Components; Degree Distribution; Filtering Out Noisy Edges; Processing EdgeTriplets; Analyzing the Filtered Graph; Small-World Networks; Cliques and Clustering Coefficients; Computing Average Path Length with Pregel; Where to Go from Here
8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
Getting the Data; Working with Temporal and Geospatial Data in Spark; Temporal Data with JodaTime and NScalaTime; Geospatial Data with the Esri Geometry API and Spray; Exploring the Esri Geometry API; Intro to GeoJSON; Preparing the New York City Taxi Trip Data; Handling Invalid Records at Scale; Geospatial Analysis; Sessionization in Spark; Building Sessions: Secondary Sorts in Spark; Where to Go from Here

9. Estimating Financial Risk through Monte Carlo Simulation
Terminology; Methods for Calculating VaR; Variance-Covariance; Historical Simulation; Monte Carlo Simulation; Our Model; Getting the Data; Preprocessing; Determining the Factor Weights; Sampling; The Multivariate Normal Distribution; Running the Trials; Visualizing the Distribution of Returns; Evaluating Our Results; Where to Go from Here
10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling; Ingesting Genomics Data with the ADAM CLI; Parquet Format and Columnar Storage; Predicting Transcription Factor Binding Sites from ENCODE Data; Querying Genotypes from the 1000 Genomes Project; Where to Go from Here
11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark; PySpark Internals; Overview and Installation of the Thunder Library; Loading Data with Thunder; Thunder Core Data Types; Categorizing Neuron Types with Thunder; Where to Go from Here
A. Deeper into Spark
Serialization; Accumulators; Spark and the Data Scientist’s Workflow; File Formats; Spark Subprojects; MLlib; Spark Streaming; Spark SQL; GraphX
B. Upcoming MLlib Pipelines API
Beyond Mere Modeling; The Pipelines API; Text Classification Example Walkthrough
Index

Content preview from Advanced Analytics with Spark

Chapter 10. Analyzing Genomics Data and the BDG Project

Uri Laserson

So we need to shoot our SCHPON [...] into the void.

George M. Church

The advent of next-generation DNA sequencing (NGS) technology is rapidly transforming the life sciences into a data-driven field. However, efforts to make the best use of this data run up against a traditional computational ecosystem built on difficult-to-use, low-level primitives for distributed computing (e.g., DRMAA or MPI) and a jungle of semi-structured, text-based file formats.

This chapter serves three primary purposes. First, we introduce the general Spark user to a new set of Hadoop-friendly serialization and file formats (Avro and Parquet) that greatly simplify many problems in data management. We broadly promote the use of these serialization technologies to achieve compact binary representations, service-oriented architectures, and language cross-compatibility. Second, we show the experienced bioinformatician how to perform typical genomics tasks in the context of Spark. Specifically, we will use Spark to process and filter large quantities of genomics data, build a transcription factor binding site prediction model, and join ENCODE genome annotations against the 1000 Genomes Project variants. Finally, this chapter serves as a tutorial on the ADAM project, which comprises a set of genomics-specific Avro schemas, Spark-based APIs, and command-line tools for large-scale genomics analysis. Among other ...


Publisher Resources

ISBN: 9781491912751
