Data Science on the Google Cloud Platform

Name: Data Science on the Google Cloud Platform
Author: Valliappa Lakshmanan
ISBN: 9781491974513

by Valliappa Lakshmanan

December 2017

Beginner to intermediate

404 pages

11h 12m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Making Better Decisions Based on Data
Many Similar Decisions; The Role of Data Engineers; The Cloud Makes Data Engineers Possible; The Cloud Turbocharges Data Science; Case Studies Get at the Stubborn Facts; A Probabilistic Decision; Data and Tools; Getting Started with the Code; Summary
2. Ingesting Data into the Cloud
Airline On-Time Performance Data; Knowability; Training–Serving Skew; Download Procedure; Dataset Fields; Why Not Store the Data in Situ?; Scaling Up; Scaling Out; Data in Situ with Colossus and Jupiter; Ingesting Data; Reverse Engineering a Web Form; Dataset Download; Exploration and Cleanup; Uploading Data to Google Cloud Storage; Scheduling Monthly Downloads; Ingesting in Python; Cloud Functions; Securing the URL; Scheduling the Cloud Function; Improving the Cloud Function Design; Summary; Code Break
3. Creating Compelling Dashboards
Explain Your Model with Dashboards; Why Build a Dashboard First?; Accuracy, Honesty, and Good Design; Loading Data into Google Cloud SQL; Create a Google Cloud SQL Instance; Interacting with Google Cloud Platform; Controlling Access to MySQL; Create Tables; Populating Tables; Building Our First Model; Contingency Table; Threshold Optimization; Machine Learning; Building a Dashboard; Getting Started with Data Studio; Creating Charts; Adding End-User Controls; Showing Proportions with a Pie Chart; Explaining a Contingency Table; Summary
4. Streaming Data: Publication and Ingest
Designing the Event Feed; Time Correction; Apache Beam/Cloud Dataflow; Parsing Airports Data; Adding Time Zone Information; Converting Times to UTC; Correcting Dates; Creating Events; Running the Pipeline in the Cloud; Publishing an Event Stream to Cloud Pub/Sub; Get Records to Publish; Paging Through Records; Building a Batch of Events; Publishing a Batch of Events; Real-Time Stream Processing; Streaming in Java Dataflow; Executing the Stream Processing; Analyzing Streaming Data in BigQuery; Real-Time Dashboard; Summary
5. Interactive Data Exploration
Exploratory Data Analysis; Loading Flights Data into BigQuery; Advantages of a Serverless Columnar Database; Staging on Cloud Storage; Access Control; Federated Queries; Ingesting CSV Files; Exploratory Data Analysis in Cloud AI Platform Notebooks; Jupyter Notebooks; Cloud AI Platform Notebooks; Installing Packages in Cloud AI Platform Notebooks; Jupyter Magic for Google Cloud Platform; Quality Control; Oddball Values; Outlier Removal: Big Data Is Different; Filtering Data on Occurrence Frequency; Arrival Delay Conditioned on Departure Delay; Applying Probabilistic Decision Threshold; Empirical Probability Distribution Function; The Answer Is...; Evaluating the Model; Random Shuffling; Splitting by Date; Training and Testing; Summary
6. Bayes Classifier on Cloud Dataproc
MapReduce and the Hadoop Ecosystem; How MapReduce Works; Apache Hadoop; Google Cloud Dataproc; Need for Higher-Level Tools; Jobs, Not Clusters; Initialization Actions; Quantization Using Spark SQL; JupyterLab on Cloud Dataproc; Independence Check Using BigQuery; Spark SQL in JupyterLab; Histogram Equalization; Dynamically Resizing Clusters; Bayes Classification Using Pig; Running a Pig Job on Cloud Dataproc; Automating Cloud Dataproc with Workflow Templates; Limiting to Training Days; The Decision Criteria; Evaluating the Bayesian Model; Summary
7. Machine Learning: Logistic Regression in Spark and BigQuery
Logistic Regression; Spark ML Library; Getting Started with Spark Machine Learning; Spark Logistic Regression; Creating a Training Dataset; Dealing with Corner Cases; Creating Training Examples; Training; Predicting by Using a Model; Evaluating a Model; Feature Engineering; Experimental Framework; Creating the Held-Out Dataset; Feature Selection; Scaling and Clipping Features; Feature Transforms; Categorical Variables; Scalable Machine Learning Models in BigQuery; Repeatable, Real Time; Summary
8. Time-Windowed Aggregate Features
The Need for Time Averages; Dataflow in Java; Setting Up Development Environment; Filtering with Beam; Pipeline Options and Text I/O; Run on Cloud; Parsing into Objects; Computing Time Averages; Grouping and Combining; Parallel Do with Side Input; Debugging; BigQueryIO; Mutating the Flight Object; Sliding Window Computation in Batch Mode; Running in the Cloud; Monitoring, Troubleshooting, and Performance Tuning; Troubleshooting Pipeline; Side Input Limitations; Redesigning the Pipeline; Removing Duplicates; Summary
9. Machine Learning Classifier Using TensorFlow
Toward More Complex Models; Reading Data into TensorFlow; Training and Evaluation in Keras; Model Function; Input and Features; Training and Evaluating Input Functions; Saving and Exporting; Performing a Training Run; Training in the Cloud; Wide-and-Deep Model; Hyperparameter Tuning; Deploying the Model; Predicting with the Model; Explaining the Model; Summary
10. Real-Time Machine Learning
Invoking Prediction Service; Java Classes for Request and Response; Post Request and Parse Response; Client of Prediction Service; Adding Predictions to Flight Information; Batch Input and Output; Data Processing Pipeline; Identifying Inefficiency; Batching Requests; Streaming Pipeline; Flattening PCollections; Executing Streaming Pipeline; Late and Out-of-Order Records; Watermarks and Triggers; Transactions, Throughput, and Latency; Possible Streaming Sinks; Cloud Bigtable; Designing Tables; Designing the Row Key; Streaming into Cloud Bigtable; Querying from Cloud Bigtable; Evaluating Model Performance; The Need for Continuous Training; Evaluation Pipeline; Evaluating Performance; Marginal Distributions; Checking Model Behavior; Identifying Behavioral Change; Summary; Book Summary
A. Considerations for Sensitive Data within Machine Learning Datasets
Handling Sensitive Information; Identifying Sensitive Data; Protecting Sensitive Data; Removing Sensitive Data; Masking Sensitive Data; Coarsening Sensitive Data; Establishing a Governance Policy
Index

Content preview from Data Science on the Google Cloud Platform

Appendix A. Considerations for Sensitive Data within Machine Learning Datasets

Note

The content of this appendix, written by the author and Brad Svee, was published as a solution paper on the Google Cloud Platform documentation website.

When you are developing a machine learning (ML) program, it’s important to balance data access within your company against the security implications of that access. You want insights contained in the raw dataset to guide ML training even as access to sensitive data is limited. To achieve both goals, it’s useful to train ML systems on a subset of the raw data, or on the entire dataset after partial application of any number of aggregation or obfuscation techniques.

For example, you might want your data engineers to train an ML model to weigh customer feedback on a product, but you don’t want them to know who submitted the feedback. However, information such as delivery address and purchase history is critically important for training the ML model. After the data is provided to the data engineers, they will need to query it for data exploration purposes, so it is important to protect the sensitive fields before making the data available. This type of dilemma is also common in ML models that involve recommendation engines. To create a model that returns user-specific results, you typically need access to user-specific data.
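
To make the customer-feedback scenario concrete, here is a minimal Python sketch of protecting a record before it reaches the data engineers. It is an illustration under stated assumptions, not the procedure the appendix itself describes: the field names, the secret key, and the choice of a three-character postal prefix are all hypothetical. It drops direct identifiers, replaces the customer ID with a keyed hash, and coarsens the delivery location.

```python
import hashlib
import hmac

# Hypothetical secret; it would need to be stored where data engineers cannot read it.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (hex) that stands in for the raw value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record: dict) -> dict:
    """Build a copy of one feedback record that is safer to share."""
    return {
        # Masked: the same customer always maps to the same token.
        "customer_token": pseudonymize(record["customer_id"]),
        # Coarsened: keep only the first three characters of the postal code.
        "postal_prefix": record["postal_code"][:3],
        # Kept as-is because the model needs them.
        "purchase_count": record["purchase_count"],
        "feedback_text": record["feedback_text"],
        # Removed: "name" and "street_address" are simply not copied over.
    }


# Hypothetical example record.
raw = {
    "customer_id": "C-1001",
    "name": "Jane Doe",
    "street_address": "1 Example Street",
    "postal_code": "98101",
    "purchase_count": 7,
    "feedback_text": "Arrived late but works well.",
}

protected = protect_record(raw)
print(sorted(protected))
```

A keyed hash is used here instead of a plain hash so that someone who knows a customer ID cannot recompute its token without the key. Free-text fields such as the feedback itself can still contain names or addresses, so this sketch alone would not be sufficient for real data.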

Fortunately, there are techniques you can use to remove some sensitive data from your datasets while still training effective ...

Publisher Resources

ISBN: 9781491974551
