Data Science on the Google Cloud Platform

Name: Data Science on the Google Cloud Platform
Author: Valliappa Lakshmanan
ISBN: 9781491974513

December 2017

Beginner to intermediate

404 pages

11h 12m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Making Better Decisions Based on Data
Many Similar Decisions; The Role of Data Engineers; The Cloud Makes Data Engineers Possible; The Cloud Turbocharges Data Science; Case Studies Get at the Stubborn Facts; A Probabilistic Decision; Data and Tools; Getting Started with the Code; Summary
2. Ingesting Data into the Cloud
Airline On-Time Performance Data; Knowability; Training–Serving Skew; Download Procedure; Dataset Fields; Why Not Store the Data in Situ?; Scaling Up; Scaling Out; Data in Situ with Colossus and Jupiter; Ingesting Data; Reverse Engineering a Web Form; Dataset Download; Exploration and Cleanup; Uploading Data to Google Cloud Storage; Scheduling Monthly Downloads; Ingesting in Python; Cloud Functions; Securing the URL; Scheduling the Cloud Function; Improving the Cloud Function Design; Summary; Code Break
3. Creating Compelling Dashboards
Explain Your Model with Dashboards; Why Build a Dashboard First?; Accuracy, Honesty, and Good Design; Loading Data into Google Cloud SQL; Create a Google Cloud SQL Instance; Interacting with Google Cloud Platform; Controlling Access to MySQL; Create Tables; Populating Tables; Building Our First Model; Contingency Table; Threshold Optimization; Machine Learning; Building a Dashboard; Getting Started with Data Studio; Creating Charts; Adding End-User Controls; Showing Proportions with a Pie Chart; Explaining a Contingency Table; Summary
4. Streaming Data: Publication and Ingest
Designing the Event Feed; Time Correction; Apache Beam/Cloud Dataflow; Parsing Airports Data; Adding Time Zone Information; Converting Times to UTC; Correcting Dates; Creating Events; Running the Pipeline in the Cloud; Publishing an Event Stream to Cloud Pub/Sub; Get Records to Publish; Paging Through Records; Building a Batch of Events; Publishing a Batch of Events; Real-Time Stream Processing; Streaming in Java Dataflow; Executing the Stream Processing; Analyzing Streaming Data in BigQuery; Real-Time Dashboard; Summary
5. Interactive Data Exploration
Exploratory Data Analysis; Loading Flights Data into BigQuery; Advantages of a Serverless Columnar Database; Staging on Cloud Storage; Access Control; Federated Queries; Ingesting CSV Files; Exploratory Data Analysis in Cloud AI Platform Notebooks; Jupyter Notebooks; Cloud AI Platform Notebooks; Installing Packages in Cloud AI Platform Notebooks; Jupyter Magic for Google Cloud Platform; Quality Control; Oddball Values; Outlier Removal: Big Data Is Different; Filtering Data on Occurrence Frequency; Arrival Delay Conditioned on Departure Delay; Applying Probabilistic Decision Threshold; Empirical Probability Distribution Function; The Answer Is...; Evaluating the Model; Random Shuffling; Splitting by Date; Training and Testing; Summary
6. Bayes Classifier on Cloud Dataproc
MapReduce and the Hadoop Ecosystem; How MapReduce Works; Apache Hadoop; Google Cloud Dataproc; Need for Higher-Level Tools; Jobs, Not Clusters; Initialization Actions; Quantization Using Spark SQL; JupyterLab on Cloud Dataproc; Independence Check Using BigQuery; Spark SQL in JupyterLab; Histogram Equalization; Dynamically Resizing Clusters; Bayes Classification Using Pig; Running a Pig Job on Cloud Dataproc; Automating Cloud Dataproc with Workflow Templates; Limiting to Training Days; The Decision Criteria; Evaluating the Bayesian Model; Summary
7. Machine Learning: Logistic Regression in Spark and BigQuery
Logistic Regression; Spark ML Library; Getting Started with Spark Machine Learning; Spark Logistic Regression; Creating a Training Dataset; Dealing with Corner Cases; Creating Training Examples; Training; Predicting by Using a Model; Evaluating a Model; Feature Engineering; Experimental Framework; Creating the Held-Out Dataset; Feature Selection; Scaling and Clipping Features; Feature Transforms; Categorical Variables; Scalable Machine Learning Models in BigQuery; Repeatable, Real Time; Summary
8. Time-Windowed Aggregate Features
The Need for Time Averages; Dataflow in Java; Setting Up Development Environment; Filtering with Beam; Pipeline Options and Text I/O; Run on Cloud; Parsing into Objects; Computing Time Averages; Grouping and Combining; Parallel Do with Side Input; Debugging; BigQueryIO; Mutating the Flight Object; Sliding Window Computation in Batch Mode; Running in the Cloud; Monitoring, Troubleshooting, and Performance Tuning; Troubleshooting Pipeline; Side Input Limitations; Redesigning the Pipeline; Removing Duplicates; Summary
9. Machine Learning Classifier Using TensorFlow
Toward More Complex Models; Reading Data into TensorFlow; Training and Evaluation in Keras; Model Function; Input and Features; Training and Evaluating Input Functions; Saving and Exporting; Performing a Training Run; Training in the Cloud; Wide-and-Deep Model; Hyperparameter Tuning; Deploying the Model; Predicting with the Model; Explaining the Model; Summary
10. Real-Time Machine Learning
Invoking Prediction Service; Java Classes for Request and Response; Post Request and Parse Response; Client of Prediction Service; Adding Predictions to Flight Information; Batch Input and Output; Data Processing Pipeline; Identifying Inefficiency; Batching Requests; Streaming Pipeline; Flattening PCollections; Executing Streaming Pipeline; Late and Out-of-Order Records; Watermarks and Triggers; Transactions, Throughput, and Latency; Possible Streaming Sinks; Cloud Bigtable; Designing Tables; Designing the Row Key; Streaming into Cloud Bigtable; Querying from Cloud Bigtable; Evaluating Model Performance; The Need for Continuous Training; Evaluation Pipeline; Evaluating Performance; Marginal Distributions; Checking Model Behavior; Identifying Behavioral Change; Summary; Book Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
Handling Sensitive Information; Identifying Sensitive Data; Protecting Sensitive Data; Removing Sensitive Data; Masking Sensitive Data; Coarsening Sensitive Data; Establishing a Governance Policy
Index

Content preview from Data Science on the Google Cloud Platform

Chapter 9. Machine Learning Classifier Using TensorFlow

In Chapter 7, we built a machine learning model but ran into problems when trying to scale it out and make it operational. The first problem was how to prevent training–serving skew when using a time-windowed aggregate feature. We solved this in Chapter 8 by computing the aggregates on historical data with the same code that will be used on real-time data. The Cloud Dataflow pipeline that we implemented in Chapter 8 created two sets of files: trainFlights*.csv, which will serve as our training dataset for machine learning, and testFlights*.csv, which we will use to evaluate the model. Both sets of files contain augmented datasets; the purpose of the pipeline was to add the computed time aggregates to the raw data received from the airlines. We want to predict the first column in those files (whether the flight is on time) based on the other columns (departure delay, taxi-out time, distance, average departure and arrival delays, and a few other fields).
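
To make the file layout concrete, here is a minimal sketch of splitting one row into a label (the first column) and features (the remaining columns). The column names, their order, and the sample rows are assumptions made up for illustration; the preview names the kinds of fields but not the exact schema of trainFlights*.csv.

```python
import csv
import io

# Hypothetical feature names and order; the actual schema is not shown in this preview.
FEATURE_NAMES = ["dep_delay", "taxi_out", "distance", "avg_dep_delay", "avg_arr_delay"]


def split_label_and_features(row):
    """Treat the first column as the label and the remaining columns as features."""
    label = float(row[0])
    features = {name: float(value) for name, value in zip(FEATURE_NAMES, row[1:])}
    return label, features


# Made-up rows standing in for the contents of one trainFlights*.csv file.
SAMPLE_CSV = "1.0,5.0,12.0,500.0,3.5,2.0\n0.0,45.0,30.0,1200.0,20.0,25.0\n"

examples = [split_label_and_features(row) for row in csv.reader(io.StringIO(SAMPLE_CSV))]
for label, features in examples:
    print(label, features)
```

The same function would apply unchanged to testFlights*.csv, since both sets of files come out of the same pipeline.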

While we solved the problem of dataset augmentation with time aggregates, the other three problems identified at the end of Chapter 7 remain:

One-hot encoding categorical columns caused an explosion in the size of the dataset.
Embeddings would involve special bookkeeping.
Putting the model into production requires the machine learning library to be portable to environments beyond the cluster on which the model is trained.
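
The first two problems can be illustrated with a small sketch in plain Python. It is not the book's code: the airport vocabulary, the embedding dimension, and the randomly filled table are all assumptions chosen for illustration (in a real model the table values would be learned during training).

```python
import random


def one_hot(index, vocab_size):
    """Return a vector with one slot per category and a single 1.0."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim lookup table; random values stand in for learned ones."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


# Hypothetical vocabulary of airport codes; a real dataset would have hundreds.
VOCAB = ["ATL", "DFW", "JFK", "LAX", "ORD", "SEA"]
INDEX = {code: i for i, code in enumerate(VOCAB)}
EMBED_DIM = 2  # assumed embedding size

table = make_embedding_table(len(VOCAB), EMBED_DIM)

wide = one_hot(INDEX["JFK"], len(VOCAB))  # one column per distinct category
dense = table[INDEX["JFK"]]               # EMBED_DIM columns, however large the vocabulary
print(len(wide), len(dense))
```

The one-hot vector grows with the number of distinct categories, which is where the dataset explosion comes from. The embedding stays at a fixed width, but the vocabulary index and the lookup table must be kept alongside the model, which is the kind of bookkeeping the second item refers to.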

The solution to these three ...

Publisher Resources

ISBN: 9781491974551
