Data Science on the Google Cloud Platform

Name: Data Science on the Google Cloud Platform
Author: Valliappa Lakshmanan
ISBN: 9781491974513

December 2017

Beginner to intermediate

404 pages

11h 12m

English

O'Reilly Media, Inc.

Preface
    Who This Book Is For
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. Making Better Decisions Based on Data
    Many Similar Decisions
    The Role of Data Engineers
    The Cloud Makes Data Engineers Possible
    The Cloud Turbocharges Data Science
    Case Studies Get at the Stubborn Facts
    A Probabilistic Decision
    Data and Tools
    Getting Started with the Code
    Summary
2. Ingesting Data into the Cloud
    Airline On-Time Performance Data
    Knowability
    Training–Serving Skew
    Download Procedure
    Dataset Fields
    Why Not Store the Data in Situ?
    Scaling Up
    Scaling Out
    Data in Situ with Colossus and Jupiter
    Ingesting Data
    Reverse Engineering a Web Form
    Dataset Download
    Exploration and Cleanup
    Uploading Data to Google Cloud Storage
    Scheduling Monthly Downloads
    Ingesting in Python
    Cloud Functions
    Securing the URL
    Scheduling the Cloud Function
    Improving the Cloud Function Design
    Summary
    Code Break
3. Creating Compelling Dashboards
    Explain Your Model with Dashboards
    Why Build a Dashboard First?
    Accuracy, Honesty, and Good Design
    Loading Data into Google Cloud SQL
    Create a Google Cloud SQL Instance
    Interacting with Google Cloud Platform
    Controlling Access to MySQL
    Create Tables
    Populating Tables
    Building Our First Model
    Contingency Table
    Threshold Optimization
    Machine Learning
    Building a Dashboard
    Getting Started with Data Studio
    Creating Charts
    Adding End-User Controls
    Showing Proportions with a Pie Chart
    Explaining a Contingency Table
    Summary
4. Streaming Data: Publication and Ingest
    Designing the Event Feed
    Time Correction
    Apache Beam/Cloud Dataflow
    Parsing Airports Data
    Adding Time Zone Information
    Converting Times to UTC
    Correcting Dates
    Creating Events
    Running the Pipeline in the Cloud
    Publishing an Event Stream to Cloud Pub/Sub
    Get Records to Publish
    Paging Through Records
    Building a Batch of Events
    Publishing a Batch of Events
    Real-Time Stream Processing
    Streaming in Java Dataflow
    Executing the Stream Processing
    Analyzing Streaming Data in BigQuery
    Real-Time Dashboard
    Summary
5. Interactive Data Exploration
    Exploratory Data Analysis
    Loading Flights Data into BigQuery
    Advantages of a Serverless Columnar Database
    Staging on Cloud Storage
    Access Control
    Federated Queries
    Ingesting CSV Files
    Exploratory Data Analysis in Cloud AI Platform Notebooks
    Jupyter Notebooks
    Cloud AI Platform Notebooks
    Installing Packages in Cloud AI Platform Notebooks
    Jupyter Magic for Google Cloud Platform
    Quality Control
    Oddball Values
    Outlier Removal: Big Data Is Different
    Filtering Data on Occurrence Frequency
    Arrival Delay Conditioned on Departure Delay
    Applying Probabilistic Decision Threshold
    Empirical Probability Distribution Function
    The Answer Is...
    Evaluating the Model
    Random Shuffling
    Splitting by Date
    Training and Testing
    Summary
6. Bayes Classifier on Cloud Dataproc
    MapReduce and the Hadoop Ecosystem
    How MapReduce Works
    Apache Hadoop
    Google Cloud Dataproc
    Need for Higher-Level Tools
    Jobs, Not Clusters
    Initialization Actions
    Quantization Using Spark SQL
    JupyterLab on Cloud Dataproc
    Independence Check Using BigQuery
    Spark SQL in JupyterLab
    Histogram Equalization
    Dynamically Resizing Clusters
    Bayes Classification Using Pig
    Running a Pig Job on Cloud Dataproc
    Automating Cloud Dataproc with Workflow Templates
    Limiting to Training Days
    The Decision Criteria
    Evaluating the Bayesian Model
    Summary
7. Machine Learning: Logistic Regression in Spark and BigQuery
    Logistic Regression
    Spark ML Library
    Getting Started with Spark Machine Learning
    Spark Logistic Regression
    Creating a Training Dataset
    Dealing with Corner Cases
    Creating Training Examples
    Training
    Predicting by Using a Model
    Evaluating a Model
    Feature Engineering
    Experimental Framework
    Creating the Held-Out Dataset
    Feature Selection
    Scaling and Clipping Features
    Feature Transforms
    Categorical Variables
    Scalable Machine Learning Models in BigQuery
    Repeatable, Real Time
    Summary
8. Time-Windowed Aggregate Features
    The Need for Time Averages
    Dataflow in Java
    Setting Up Development Environment
    Filtering with Beam
    Pipeline Options and Text I/O
    Run on Cloud
    Parsing into Objects
    Computing Time Averages
    Grouping and Combining
    Parallel Do with Side Input
    Debugging
    BigQueryIO
    Mutating the Flight Object
    Sliding Window Computation in Batch Mode
    Running in the Cloud
    Monitoring, Troubleshooting, and Performance Tuning
    Troubleshooting Pipeline
    Side Input Limitations
    Redesigning the Pipeline
    Removing Duplicates
    Summary
9. Machine Learning Classifier Using TensorFlow
    Toward More Complex Models
    Reading Data into TensorFlow
    Training and Evaluation in Keras
    Model Function
    Input and Features
    Training and Evaluating Input Functions
    Saving and Exporting
    Performing a Training Run
    Training in the Cloud
    Wide-and-Deep Model
    Hyperparameter Tuning
    Deploying the Model
    Predicting with the Model
    Explaining the Model
    Summary

10. Real-Time Machine Learning
    Invoking Prediction Service
    Java Classes for Request and Response
    Post Request and Parse Response
    Client of Prediction Service
    Adding Predictions to Flight Information
    Batch Input and Output
    Data Processing Pipeline
    Identifying Inefficiency
    Batching Requests
    Streaming Pipeline
    Flattening PCollections
    Executing Streaming Pipeline
    Late and Out-of-Order Records
    Watermarks and Triggers
    Transactions, Throughput, and Latency
    Possible Streaming Sinks
    Cloud Bigtable
    Designing Tables
    Designing the Row Key
    Streaming into Cloud Bigtable
    Querying from Cloud Bigtable
    Evaluating Model Performance
    The Need for Continuous Training
    Evaluation Pipeline
    Evaluating Performance
    Marginal Distributions
    Checking Model Behavior
    Identifying Behavioral Change
    Summary
    Book Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
    Handling Sensitive Information
    Identifying Sensitive Data
    Protecting Sensitive Data
    Removing Sensitive Data
    Masking Sensitive Data
    Coarsening Sensitive Data
    Establishing a Governance Policy
Index

Overview

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on top of the Google Cloud Platform (GCP). This hands-on guide shows developers entering the data science field how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on GCP. Over the course of the book, you’ll work through a sample business decision by employing a variety of data science approaches.

Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

You’ll learn how to:

Automate and schedule data ingest, using an App Engine application
Create and populate a dashboard in Google Data Studio
Build a real-time analysis pipeline to carry out streaming analytics
Conduct interactive data exploration with Google BigQuery
Create a Bayesian model on a Cloud Dataproc cluster
Build a logistic regression machine learning model with Spark
Compute time-aggregate features with a Cloud Dataflow pipeline
Create a high-performing prediction model with TensorFlow
Use your deployed model as a microservice you can access from both batch and real-time pipelines

Publisher Resources

ISBN: 9781491974551
Errata Page
