Data Science on the Google Cloud Platform

Name: Data Science on the Google Cloud Platform
Author: Valliappa Lakshmanan
ISBN: 9781491974513

December 2017

Beginner to intermediate

404 pages

11h 12m

English

O'Reilly Media, Inc.

Preface
  Who This Book Is For
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Making Better Decisions Based on Data
  Many Similar Decisions
  The Role of Data Engineers
  The Cloud Makes Data Engineers Possible
  The Cloud Turbocharges Data Science
  Case Studies Get at the Stubborn Facts
  A Probabilistic Decision
  Data and Tools
  Getting Started with the Code
  Summary
2. Ingesting Data into the Cloud
  Airline On-Time Performance Data
  Knowability
  Training–Serving Skew
  Download Procedure
  Dataset Fields
  Why Not Store the Data in Situ?
  Scaling Up
  Scaling Out
  Data in Situ with Colossus and Jupiter
  Ingesting Data
  Reverse Engineering a Web Form
  Dataset Download
  Exploration and Cleanup
  Uploading Data to Google Cloud Storage
  Scheduling Monthly Downloads
  Ingesting in Python
  Cloud Functions
  Securing the URL
  Scheduling the Cloud Function
  Improving the Cloud Function Design
  Summary
  Code Break
3. Creating Compelling Dashboards
  Explain Your Model with Dashboards
  Why Build a Dashboard First?
  Accuracy, Honesty, and Good Design
  Loading Data into Google Cloud SQL
  Create a Google Cloud SQL Instance
  Interacting with Google Cloud Platform
  Controlling Access to MySQL
  Create Tables
  Populating Tables
  Building Our First Model
  Contingency Table
  Threshold Optimization
  Machine Learning
  Building a Dashboard
  Getting Started with Data Studio
  Creating Charts
  Adding End-User Controls
  Showing Proportions with a Pie Chart
  Explaining a Contingency Table
  Summary
4. Streaming Data: Publication and Ingest
  Designing the Event Feed
  Time Correction
  Apache Beam/Cloud Dataflow
  Parsing Airports Data
  Adding Time Zone Information
  Converting Times to UTC
  Correcting Dates
  Creating Events
  Running the Pipeline in the Cloud
  Publishing an Event Stream to Cloud Pub/Sub
  Get Records to Publish
  Paging Through Records
  Building a Batch of Events
  Publishing a Batch of Events
  Real-Time Stream Processing
  Streaming in Java Dataflow
  Executing the Stream Processing
  Analyzing Streaming Data in BigQuery
  Real-Time Dashboard
  Summary
5. Interactive Data Exploration
  Exploratory Data Analysis
  Loading Flights Data into BigQuery
  Advantages of a Serverless Columnar Database
  Staging on Cloud Storage
  Access Control
  Federated Queries
  Ingesting CSV Files
  Exploratory Data Analysis in Cloud AI Platform Notebooks
  Jupyter Notebooks
  Cloud AI Platform Notebooks
  Installing Packages in Cloud AI Platform Notebooks
  Jupyter Magic for Google Cloud Platform
  Quality Control
  Oddball Values
  Outlier Removal: Big Data Is Different
  Filtering Data on Occurrence Frequency
  Arrival Delay Conditioned on Departure Delay
  Applying Probabilistic Decision Threshold
  Empirical Probability Distribution Function
  The Answer Is...
  Evaluating the Model
  Random Shuffling
  Splitting by Date
  Training and Testing
  Summary
6. Bayes Classifier on Cloud Dataproc
  MapReduce and the Hadoop Ecosystem
  How MapReduce Works
  Apache Hadoop
  Google Cloud Dataproc
  Need for Higher-Level Tools
  Jobs, Not Clusters
  Initialization Actions
  Quantization Using Spark SQL
  JupyterLab on Cloud Dataproc
  Independence Check Using BigQuery
  Spark SQL in JupyterLab
  Histogram Equalization
  Dynamically Resizing Clusters
  Bayes Classification Using Pig
  Running a Pig Job on Cloud Dataproc
  Automating Cloud Dataproc with Workflow Templates
  Limiting to Training Days
  The Decision Criteria
  Evaluating the Bayesian Model
  Summary
7. Machine Learning: Logistic Regression in Spark and BigQuery
  Logistic Regression
  Spark ML Library
  Getting Started with Spark Machine Learning
  Spark Logistic Regression
  Creating a Training Dataset
  Dealing with Corner Cases
  Creating Training Examples
  Training
  Predicting by Using a Model
  Evaluating a Model
  Feature Engineering
  Experimental Framework
  Creating the Held-Out Dataset
  Feature Selection
  Scaling and Clipping Features
  Feature Transforms
  Categorical Variables
  Scalable Machine Learning Models in BigQuery
  Repeatable, Real Time
  Summary
8. Time-Windowed Aggregate Features
  The Need for Time Averages
  Dataflow in Java
  Setting Up Development Environment
  Filtering with Beam
  Pipeline Options and Text I/O
  Run on Cloud
  Parsing into Objects
  Computing Time Averages
  Grouping and Combining
  Parallel Do with Side Input
  Debugging
  BigQueryIO
  Mutating the Flight Object
  Sliding Window Computation in Batch Mode
  Running in the Cloud
  Monitoring, Troubleshooting, and Performance Tuning
  Troubleshooting Pipeline
  Side Input Limitations
  Redesigning the Pipeline
  Removing Duplicates
  Summary
9. Machine Learning Classifier Using TensorFlow
  Toward More Complex Models
  Reading Data into TensorFlow
  Training and Evaluation in Keras
  Model Function
  Input and Features
  Training and Evaluating Input Functions
  Saving and Exporting
  Performing a Training Run
  Training in the Cloud
  Wide-and-Deep Model
  Hyperparameter Tuning
  Deploying the Model
  Predicting with the Model
  Explaining the Model
  Summary

10. Real-Time Machine Learning
  Invoking Prediction Service
  Java Classes for Request and Response
  Post Request and Parse Response
  Client of Prediction Service
  Adding Predictions to Flight Information
  Batch Input and Output
  Data Processing Pipeline
  Identifying Inefficiency
  Batching Requests
  Streaming Pipeline
  Flattening PCollections
  Executing Streaming Pipeline
  Late and Out-of-Order Records
  Watermarks and Triggers
  Transactions, Throughput, and Latency
  Possible Streaming Sinks
  Cloud Bigtable
  Designing Tables
  Designing the Row Key
  Streaming into Cloud Bigtable
  Querying from Cloud Bigtable
  Evaluating Model Performance
  The Need for Continuous Training
  Evaluation Pipeline
  Evaluating Performance
  Marginal Distributions
  Checking Model Behavior
  Identifying Behavioral Change
  Summary
  Book Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
  Handling Sensitive Information
  Identifying Sensitive Data
  Protecting Sensitive Data
  Removing Sensitive Data
  Masking Sensitive Data
  Coarsening Sensitive Data
  Establishing a Governance Policy
Index

Content preview from Data Science on the Google Cloud Platform

Preface

In my current role at Google, I get to work alongside data scientists and data engineers in a variety of industries as they move their data processing and analysis methods to the public cloud. Some try to do the same things they do on-premises, the same way they do them, just on rented computing resources. The visionary users, though, rethink their systems, transform how they work with data, and thereby are able to innovate faster.

As early as 2011, an article in Harvard Business Review recognized that some of cloud computing’s greatest successes come from allowing groups and communities to work together in ways that were not previously possible. This is now much more widely recognized—an MIT survey in 2017 found that more respondents (45%) cited increased agility rather than cost savings (34%) as the reason to move to the public cloud.

In this book, we walk through an example of this new transformative, more collaborative way of doing data science. You will learn how to implement an end-to-end data pipeline—we will begin with ingesting the data in a serverless way and work our way through data exploration, dashboards, relational databases, and streaming data all the way to training and making operational a machine learning model. I cover all these aspects of data-based services because data engineers will be involved in designing the services, developing the statistical and machine learning models and implementing them in large-scale production and in real time.

Who ...

Publisher Resources

ISBN: 9781491974551