book

Data Science on the Google Cloud Platform

Name: Data Science on the Google Cloud Platform
Author: Valliappa Lakshmanan
ISBN: 9781491974513

by Valliappa Lakshmanan

December 2017

Beginner to intermediate

404 pages

11h 12m

English

O'Reilly Media, Inc.

Read now

Unlock full access

Preface
Who This Book Is ForConventions Used in This BookUsing Code ExamplesO’Reilly Online LearningHow to Contact UsAcknowledgments
1. Making Better Decisions Based on Data
Many Similar DecisionsThe Role of Data EngineersThe Cloud Makes Data Engineers PossibleThe Cloud Turbocharges Data ScienceCase Studies Get at the Stubborn FactsA Probabilistic DecisionData and ToolsGetting Started with the CodeSummary
2. Ingesting Data into the Cloud
Airline On-Time Performance DataKnowabilityTraining–Serving SkewDownload ProcedureDataset FieldsWhy Not Store the Data in Situ?Scaling UpScaling OutData in Situ with Colossus and JupiterIngesting DataReverse Engineering a Web FormDataset DownloadExploration and CleanupUploading Data to Google Cloud StorageScheduling Monthly DownloadsIngesting in PythonCloud FunctionsSecuring the URLScheduling the Cloud FunctionImproving the Cloud Function DesignSummaryCode Break
3. Creating Compelling Dashboards
Explain Your Model with DashboardsWhy Build a Dashboard First?Accuracy, Honesty, and Good DesignLoading Data into Google Cloud SQLCreate a Google Cloud SQL InstanceInteracting with Google Cloud PlatformControlling Access to MySQLCreate TablesPopulating TablesBuilding Our First ModelContingency TableThreshold OptimizationMachine LearningBuilding a DashboardGetting Started with Data StudioCreating ChartsAdding End-User ControlsShowing Proportions with a Pie ChartExplaining a Contingency TableSummary
4. Streaming Data: Publication and Ingest
Designing the Event FeedTime CorrectionApache Beam/Cloud DataflowParsing Airports DataAdding Time Zone InformationConverting Times to UTCCorrecting DatesCreating EventsRunning the Pipeline in the CloudPublishing an Event Stream to Cloud Pub/SubGet Records to PublishPaging Through RecordsBuilding a Batch of EventsPublishing a Batch of EventsReal-Time Stream ProcessingStreaming in Java DataflowExecuting the Stream ProcessingAnalyzing Streaming Data in BigQueryReal-Time DashboardSummary
5. Interactive Data Exploration
Exploratory Data AnalysisLoading Flights Data into BigQueryAdvantages of a Serverless Columnar DatabaseStaging on Cloud StorageAccess ControlFederated QueriesIngesting CSV FilesExploratory Data Analysis in Cloud AI Platform NotebooksJupyter NotebooksCloud AI Platform NotebooksInstalling Packages in Cloud AI Platform NotebooksJupyter Magic for Google Cloud PlatformQuality ControlOddball ValuesOutlier Removal: Big Data Is DifferentFiltering Data on Occurrence FrequencyArrival Delay Conditioned on Departure DelayApplying Probabilistic Decision ThresholdEmpirical Probability Distribution FunctionThe Answer Is...Evaluating the ModelRandom ShufflingSplitting by DateTraining and TestingSummary
6. Bayes Classifier on Cloud Dataproc
MapReduce and the Hadoop EcosystemHow MapReduce WorksApache HadoopGoogle Cloud DataprocNeed for Higher-Level ToolsJobs, Not ClustersInitialization ActionsQuantization Using Spark SQLJupyterLab on Cloud DataprocIndependence Check Using BigQuerySpark SQL in JupyterLabHistogram EqualizationDynamically Resizing ClustersBayes Classification Using PigRunning a Pig Job on Cloud DataprocAutomating Cloud Dataproc with Workflow TemplatesLimiting to Training DaysThe Decision CriteriaEvaluating the Bayesian ModelSummary
7. Machine Learning: Logistic Regression in Spark and BigQuery
Logistic RegressionSpark ML LibraryGetting Started with Spark Machine LearningSpark Logistic RegressionCreating a Training DatasetDealing with Corner CasesCreating Training ExamplesTrainingPredicting by Using a ModelEvaluating a ModelFeature EngineeringExperimental FrameworkCreating the Held-Out DatasetFeature SelectionScaling and Clipping FeaturesFeature TransformsCategorical VariablesScalable Machine Learning Models in BigQueryRepeatable, Real TimeSummary
8. Time-Windowed Aggregate Features
The Need for Time AveragesDataflow in JavaSetting Up Development EnvironmentFiltering with BeamPipeline Options and Text I/ORun on CloudParsing into ObjectsComputing Time AveragesGrouping and CombiningParallel Do with Side InputDebuggingBigQueryIOMutating the Flight ObjectSliding Window Computation in Batch ModeRunning in the CloudMonitoring, Troubleshooting, and Performance TuningTroubleshooting PipelineSide Input LimitationsRedesigning the PipelineRemoving DuplicatesSummary
9. Machine Learning Classifier Using TensorFlow
Toward More Complex ModelsReading Data into TensorFlowTraining and Evaluation in KerasModel FunctionInput and FeaturesTraining and Evaluating Input FunctionsSaving and ExportingPerforming a Training RunTraining in the CloudWide-and-Deep ModelHyperparameter TuningDeploying the ModelPredicting with the ModelExplaining the ModelSummary

10. Real-Time Machine Learning
Invoking Prediction ServiceJava Classes for Request and ResponsePost Request and Parse ResponseClient of Prediction ServiceAdding Predictions to Flight InformationBatch Input and OutputData Processing PipelineIdentifying InefficiencyBatching RequestsStreaming PipelineFlattening PCollectionsExecuting Streaming PipelineLate and Out-of-Order RecordsWatermarks and TriggersTransactions, Throughput, and LatencyPossible Streaming SinksCloud BigtableDesigning TablesDesigning the Row KeyStreaming into Cloud BigtableQuerying from Cloud BigtableEvaluating Model PerformanceThe Need for Continuous TrainingEvaluation PipelineEvaluating PerformanceMarginal DistributionsChecking Model BehaviorIdentifying Behavioral ChangeSummaryBook Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
Handling Sensitive InformationIdentifying Sensitive DataProtecting Sensitive DataRemoving Sensitive DataMasking Sensitive DataCoarsening Sensitive DataEstablishing a Governance Policy
Index

Content preview from Data Science on the Google Cloud Platform

Chapter 4. Streaming Data: Publication and Ingest

In Chapter 3, we developed a dashboard to explain a contingency table–based model of suggesting whether to cancel a meeting. However, the dashboard that we built lacked immediacy because it was not tied to users’ context. Because users need to be able to view a dashboard and see the information that is relevant to them at that point, we need to build a real-time dashboard with location cues.

How would we add context to our dashboard? We’d have to show maps of delays in real time. To do that, we’ll need locations of the airports, and we’ll need real-time data. Airport locations can be obtained from the US Bureau of Transportation Statistics (BTS; the same US government agency from which we obtained our historical flight data). Real-time flight data, however, is a commercial product. If we were to build a business out of predicting flight arrivals, we’d purchase that data feed. For the purposes of this book, however, let’s just simulate it.

Simulating the creation of a real-time feed from historical data has the advantage of allowing us to see both sides of a streaming pipeline (production as well as consumption). In the following section, we look at how we could stream the ingest of data into the database if we were to receive it in real time.

Designing the Event Feed

To create a real-time stream of flight information, we begin by using historical data that is appropriately transformed from what we downloaded from the BTS. What ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

O’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.

Julian F.

Head of Cybersecurity

I wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.

Addison B.

Field Engineer

I’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.

Amir M.

Data Platform Tech Lead

I'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.

Mark W.

Embedded Software Engineer

Practical AI on the Google Cloud Platform

Publisher Resources

ISBN: 9781491974551Errata Page

Cloud Computing

Data Engineering

Data Science

AI & ML

Programming Languages

Software Architecture

IT/Ops

Security

Design

Business

Soft Skills

Data Science on the Google Cloud Platform

by Valliappa Lakshmanan

Chapter 4. Streaming Data: Publication and Ingest

Designing the Event Feed

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.