Data Science on the Google Cloud Platform, 2nd Edition

by Valliappa Lakshmanan

March 2022

Beginner to intermediate

459 pages

12h 19m

English

O'Reilly Media, Inc.

Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
Many Similar Decisions; The Role of Data Scientists; Scrappy Environment; Full Stack Cloud Data Scientists; Collaboration; Best Practices; Simple to Complex Solutions; Cloud Computing; Serverless; A Probabilistic Decision; Probabilistic Approach; Probability Density Function; Cumulative Distribution Function; Choices Made; Choosing Cloud; Not a Reference Book; Getting Started with the Code; Agile Architecture for Data Science on Google Cloud; What Is Agile Architecture?; No-Code, Low-Code; Use Managed Services; Summary; Suggested Resources
Airline On-Time Performance Data; Knowability; Causality; Training–Serving Skew; Downloading Data; Hub-and-Spoke Architecture; Dataset Fields; Separation of Compute and Storage; Scaling Up; Scaling Out with Sharded Data; Scaling Out with Data-in-Place; Ingesting Data; Reverse Engineering a Web Form; Dataset Download; Exploration and Cleanup; Uploading Data to Google Cloud Storage; Loading Data into Google BigQuery; Advantages of a Serverless Columnar Database; Staging on Cloud Storage; Access Control; Ingesting CSV Files; Partitioning; Scheduling Monthly Downloads; Ingesting in Python; Cloud Run; Securing Cloud Run; Deploying and Invoking Cloud Run; Scheduling Cloud Run; Summary; Code Break; Suggested Resources
Explain Your Model with Dashboards; Why Build a Dashboard First?; Accuracy, Honesty, and Good Design; Loading Data into Cloud SQL; Create a Google Cloud SQL Instance; Create Table of Data; Interacting with the Database; Querying Using BigQuery; Schema Exploration; Using Preview; Using Table Explorer; Creating BigQuery View; Building Our First Model; Contingency Table; Threshold Optimization; Building a Dashboard; Getting Started with Data Studio; Creating Charts; Adding End-User Controls; Showing Proportions with a Pie Chart; Explaining a Contingency Table; Modern Business Intelligence; Digitization; Natural Language Queries; Connected Sheets; Summary; Suggested Resources
Designing the Event Feed; Transformations Needed; Architecture; Getting Airport Information; Sharing Data; Time Correction; Apache Beam/Cloud Dataflow; Parsing Airports Data; Adding Time Zone Information; Converting Times to UTC; Correcting Dates; Creating Events; Reading and Writing to the Cloud; Running the Pipeline in the Cloud; Publishing an Event Stream to Cloud Pub/Sub; Speed-Up Factor; Get Records to Publish; How Many Topics?; Iterating Through Records; Building a Batch of Events; Publishing a Batch of Events; Real-Time Stream Processing; Streaming in Dataflow; Windowing a Pipeline; Streaming Aggregation; Using Event Timestamps; Executing the Stream Processing; Analyzing Streaming Data in BigQuery; Real-Time Dashboard; Summary; Suggested Resources
Exploratory Data Analysis; Exploration with SQL; Reading a Query Explanation; Exploratory Data Analysis in Vertex AI Workbench; Jupyter Notebooks; Creating a Notebook; Jupyter Commands; Installing Packages; Jupyter Magic for Google Cloud; Exploring Arrival Delays; Basic Statistics; Plotting Distributions; Quality Control; Arrival Delay Conditioned on Departure Delay; Evaluating the Model; Random Shuffling; Splitting by Date; Training and Testing; Summary; Suggested Resources
MapReduce and the Hadoop Ecosystem; How MapReduce Works; Apache Hadoop; Google Cloud Dataproc; Need for Higher-Level Tools; Jobs, Not Clusters; Preinstalling Software; Quantization Using Spark SQL; JupyterLab on Cloud Dataproc; Independence Check Using BigQuery; Spark SQL in JupyterLab; Histogram Equalization; Bayesian Classification; Bayes in Each Bin; Evaluating the Model; Dynamically Resizing Clusters; Comparing to Single Threshold Model; Orchestration; Submitting a Spark Job; Workflow Template; Cloud Composer; Autoscaling; Serverless Spark; Summary; Suggested Resources
Logistic Regression; How Logistic Regression Works; Spark ML Library; Getting Started with Spark Machine Learning; Spark Logistic Regression; Creating a Training Dataset; Training the Model; Predicting Using the Model; Evaluating a Model; Feature Engineering; Experimental Framework; Feature Selection; Feature Transformations; Feature Creation; Categorical Variables; Repeatable, Real Time; Summary; Suggested Resources
Logistic Regression; Presplit Data; Interrogating the Model; Evaluating the Model; Scale and Simplicity; Nonlinear Machine Learning; XGBoost; Hyperparameter Tuning; Vertex AI AutoML Tables; Time Window Features; Taxi-Out Time; Compounding Delays; Causality; Time Features; Departure Hour; Transform Clause; Categorical Variable; Feature Cross; Summary; Suggested Resources
Toward More Complex Models; Preparing BigQuery Data for TensorFlow; Reading Data into TensorFlow; Training and Evaluation in Keras; Model Function; Features; Inputs; Training the Keras Model; Saving and Exporting; Deep Neural Network; Wide-and-Deep Model in Keras; Representing Air Traffic Corridors; Bucketing; Feature Crossing; Wide-and-Deep Classifier; Deploying a Trained TensorFlow Model to Vertex AI; Concepts; Uploading Model; Creating Endpoint; Deploying Model to Endpoint; Invoking the Deployed Model; Summary; Suggested Resources

Developing and Deploying Using Python; Writing model.py; Writing the Training Pipeline; Predefined Split; AutoML; Hyperparameter Tuning; Parameterize Model; Shorten Training Run; Metrics During Training; Hyperparameter Tuning Pipeline; Best Trial to Completion; Explaining the Model; Configuring Explanations Metadata; Creating and Deploying Model; Obtaining Explanations; Summary; Suggested Resources
Time Averages; Apache Beam and Cloud Dataflow; Reading and Writing; Time Windowing; Machine Learning Training; Machine Learning Dataset; Training the Model; Streaming Predictions; Reuse Transforms; Input and Output; Invoking Model; Reusing Endpoint; Batching Predictions; Streaming Pipeline; Writing to BigQuery; Executing Streaming Pipeline; Late and Out-of-Order Records; Possible Streaming Sinks; Summary; Suggested Resources
Four Years of Data; Creating Dataset; Training Model; Evaluation; Summary; Suggested Resources
Handling Sensitive Information; Sensitive Data in Columns; Sensitive Data in Natural Language Datasets; Sensitive Data in Free-Form Unstructured Data; Sensitive Data in a Combination of Fields; Sensitive Data in Unstructured Content; Protecting Sensitive Data; Removing Sensitive Data; Masking Sensitive Data; Coarsening Sensitive Data; Establishing a Governance Policy

Preface

In my current role at Google, I get to work alongside data scientists and data engineers in a variety of industries as they move their data processing and analysis methods to the public cloud. Some try to do the same things they do on premises, the same way they do them, just on rented computing resources. The visionary users, though, rethink their systems, transform how they work with data, and thereby are able to innovate faster.

As early as 2011, an article in Harvard Business Review recognized that some of cloud computing’s greatest successes come from allowing groups and communities to work together in ways that were not previously possible. This is now much more widely recognized. An MIT survey in 2017 found that more respondents cited increased agility (45%) than cost savings (34%) as the reason to move to the public cloud. However, this kind of transformation is still not widely achieved. McKinsey estimated in 2021 that companies are leaving behind nearly $1 trillion of value by not looking at the public cloud as a source of transformative value. Therefore, being able to work on a data science project in the cloud is a skill well worth investing in.

In this book, we walk through an example of a cloud-native, transformative, collaborative way of doing data science. You will learn how to implement an end-to-end data pipeline—we will begin with ingesting the data in a serverless way and work our way through data exploration, dashboards, relational databases, and streaming data all ...