Data Science on the Google Cloud Platform, 2nd Edition

by Valliappa Lakshmanan

March 2022

Beginner to intermediate

459 pages

12h 19m

English

O'Reilly Media, Inc.

Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
Many Similar Decisions; The Role of Data Scientists; Scrappy Environment; Full Stack Cloud Data Scientists; Collaboration; Best Practices; Simple to Complex Solutions; Cloud Computing; Serverless; A Probabilistic Decision; Probabilistic Approach; Probability Density Function; Cumulative Distribution Function; Choices Made; Choosing Cloud; Not a Reference Book; Getting Started with the Code; Agile Architecture for Data Science on Google Cloud; What Is Agile Architecture?; No-Code, Low-Code; Use Managed Services; Summary; Suggested Resources
Airline On-Time Performance Data; Knowability; Causality; Training–Serving Skew; Downloading Data; Hub-and-Spoke Architecture; Dataset Fields; Separation of Compute and Storage; Scaling Up; Scaling Out with Sharded Data; Scaling Out with Data-in-Place; Ingesting Data; Reverse Engineering a Web Form; Dataset Download; Exploration and Cleanup; Uploading Data to Google Cloud Storage; Loading Data into Google BigQuery; Advantages of a Serverless Columnar Database; Staging on Cloud Storage; Access Control; Ingesting CSV Files; Partitioning; Scheduling Monthly Downloads; Ingesting in Python; Cloud Run; Securing Cloud Run; Deploying and Invoking Cloud Run; Scheduling Cloud Run; Summary; Code Break; Suggested Resources
Explain Your Model with Dashboards; Why Build a Dashboard First?; Accuracy, Honesty, and Good Design; Loading Data into Cloud SQL; Create a Google Cloud SQL Instance; Create Table of Data; Interacting with the Database; Querying Using BigQuery; Schema Exploration; Using Preview; Using Table Explorer; Creating BigQuery View; Building Our First Model; Contingency Table; Threshold Optimization; Building a Dashboard; Getting Started with Data Studio; Creating Charts; Adding End-User Controls; Showing Proportions with a Pie Chart; Explaining a Contingency Table; Modern Business Intelligence; Digitization; Natural Language Queries; Connected Sheets; Summary; Suggested Resources
Designing the Event Feed; Transformations Needed; Architecture; Getting Airport Information; Sharing Data; Time Correction; Apache Beam/Cloud Dataflow; Parsing Airports Data; Adding Time Zone Information; Converting Times to UTC; Correcting Dates; Creating Events; Reading and Writing to the Cloud; Running the Pipeline in the Cloud; Publishing an Event Stream to Cloud Pub/Sub; Speed-Up Factor; Get Records to Publish; How Many Topics?; Iterating Through Records; Building a Batch of Events; Publishing a Batch of Events; Real-Time Stream Processing; Streaming in Dataflow; Windowing a Pipeline; Streaming Aggregation; Using Event Timestamps; Executing the Stream Processing; Analyzing Streaming Data in BigQuery; Real-Time Dashboard; Summary; Suggested Resources
Exploratory Data Analysis; Exploration with SQL; Reading a Query Explanation; Exploratory Data Analysis in Vertex AI Workbench; Jupyter Notebooks; Creating a Notebook; Jupyter Commands; Installing Packages; Jupyter Magic for Google Cloud; Exploring Arrival Delays; Basic Statistics; Plotting Distributions; Quality Control; Arrival Delay Conditioned on Departure Delay; Evaluating the Model; Random Shuffling; Splitting by Date; Training and Testing; Summary; Suggested Resources
MapReduce and the Hadoop Ecosystem; How MapReduce Works; Apache Hadoop; Google Cloud Dataproc; Need for Higher-Level Tools; Jobs, Not Clusters; Preinstalling Software; Quantization Using Spark SQL; JupyterLab on Cloud Dataproc; Independence Check Using BigQuery; Spark SQL in JupyterLab; Histogram Equalization; Bayesian Classification; Bayes in Each Bin; Evaluating the Model; Dynamically Resizing Clusters; Comparing to Single Threshold Model; Orchestration; Submitting a Spark Job; Workflow Template; Cloud Composer; Autoscaling; Serverless Spark; Summary; Suggested Resources
Logistic Regression; How Logistic Regression Works; Spark ML Library; Getting Started with Spark Machine Learning; Spark Logistic Regression; Creating a Training Dataset; Training the Model; Predicting Using the Model; Evaluating a Model; Feature Engineering; Experimental Framework; Feature Selection; Feature Transformations; Feature Creation; Categorical Variables; Repeatable, Real Time; Summary; Suggested Resources
Logistic Regression; Presplit Data; Interrogating the Model; Evaluating the Model; Scale and Simplicity; Nonlinear Machine Learning; XGBoost; Hyperparameter Tuning; Vertex AI AutoML Tables; Time Window Features; Taxi-Out Time; Compounding Delays; Causality; Time Features; Departure Hour; Transform Clause; Categorical Variable; Feature Cross; Summary; Suggested Resources
Toward More Complex Models; Preparing BigQuery Data for TensorFlow; Reading Data into TensorFlow; Training and Evaluation in Keras; Model Function; Features; Inputs; Training the Keras Model; Saving and Exporting; Deep Neural Network; Wide-and-Deep Model in Keras; Representing Air Traffic Corridors; Bucketing; Feature Crossing; Wide-and-Deep Classifier; Deploying a Trained TensorFlow Model to Vertex AI; Concepts; Uploading Model; Creating Endpoint; Deploying Model to Endpoint; Invoking the Deployed Model; Summary; Suggested Resources

Developing and Deploying Using Python; Writing model.py; Writing the Training Pipeline; Predefined Split; AutoML; Hyperparameter Tuning; Parameterize Model; Shorten Training Run; Metrics During Training; Hyperparameter Tuning Pipeline; Best Trial to Completion; Explaining the Model; Configuring Explanations Metadata; Creating and Deploying Model; Obtaining Explanations; Summary; Suggested Resources
Time Averages; Apache Beam and Cloud Dataflow; Reading and Writing; Time Windowing; Machine Learning Training; Machine Learning Dataset; Training the Model; Streaming Predictions; Reuse Transforms; Input and Output; Invoking Model; Reusing Endpoint; Batching Predictions; Streaming Pipeline; Writing to BigQuery; Executing Streaming Pipeline; Late and Out-of-Order Records; Possible Streaming Sinks; Summary; Suggested Resources
Four Years of Data; Creating Dataset; Training Model; Evaluation; Summary; Suggested Resources
Handling Sensitive Information; Sensitive Data in Columns; Sensitive Data in Natural Language Datasets; Sensitive Data in Free-Form Unstructured Data; Sensitive Data in a Combination of Fields; Sensitive Data in Unstructured Content; Protecting Sensitive Data; Removing Sensitive Data; Masking Sensitive Data; Coarsening Sensitive Data; Establishing a Governance Policy

Content preview from Data Science on the Google Cloud Platform, 2nd Edition

Chapter 9. Machine Learning with TensorFlow in Vertex AI

In Chapter 7, we built a machine learning model in Spark but ran into problems when trying to scale it out and make it operational. We addressed the scalability challenge by using BigQuery ML in Chapter 8, but the operationalization challenges remain. In addition, although BigQuery ML was scalable, we were not able to build the most expressive ML model possible. Briefly, we identified four challenges:

1. One-hot encoding of categorical columns caused an explosion in the size of the dataset because of the increased number of columns. BigQuery ML was able to handle this, but Spark wasn’t.
2. Embeddings would have involved special bookkeeping in Spark, and this was not an option in BigQuery ML.
3. Putting the model into production requires the machine learning library to be portable to environments beyond the Hadoop cluster or BigQuery data warehouse on which the model is trained.
4. Preventing training–serving skew when using a time-windowed aggregate feature requires being able to use the same data preparation code for both historical data (which is batch) and real-time data (which is streaming).
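To make the first two challenges concrete, here is a minimal sketch in plain Python (not Spark, BigQuery ML, or TensorFlow). It assumes a hypothetical vocabulary of 300 airport codes and compares the width of a one-hot encoding with that of a fixed-size embedding lookup. The function names, the vocabulary, and the embedding dimension are illustrative assumptions, and the embedding values are random placeholders rather than learned weights.

```python
# Illustrative sketch only: one-hot width versus embedding width for a
# single categorical column. All names and sizes here are hypothetical.
import random


def one_hot(value, vocabulary):
    """Return a list with a 1 in the slot for `value` and 0 elsewhere."""
    return [1 if value == v else 0 for v in vocabulary]


def make_embedding_table(vocabulary, dim, seed=0):
    """Map each category to a small dense vector.

    The vectors are random here; in a real model they would be learned.
    """
    rng = random.Random(seed)
    return {v: [rng.uniform(-1, 1) for _ in range(dim)] for v in vocabulary}


# Hypothetical vocabulary of 300 airport codes.
airports = [f"A{i:03d}" for i in range(300)]

encoded = one_hot("A007", airports)
table = make_embedding_table(airports, dim=4)

print(len(encoded))        # one column per distinct category: 300
print(len(table["A007"]))  # fixed width regardless of vocabulary size: 4
```

The one-hot width grows with the number of distinct categories, whereas the embedding width is chosen once. The cost is the bookkeeping of maintaining the lookup table alongside the model.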

We will solve the fourth problem, of time-windowed aggregates, in Chapter 11 by using Apache Beam and its ability to employ the same code for both batch and stream.
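The idea of sharing one data preparation routine between batch and streaming inputs can be sketched in plain Python. This is a stand-in for illustration, not Apache Beam code. The hypothetical function `windowed_average` accepts any time-ordered iterable of `(timestamp_seconds, value)` pairs, so a list of historical records and a generator simulating a live feed go through exactly the same logic.

```python
# Illustrative sketch only: one trailing-window average used for both a
# batch input (a list) and a simulated stream (a generator).
from collections import deque


def windowed_average(events, window_seconds):
    """Yield (timestamp, average of values in the trailing window).

    `events` must be an iterable of (timestamp_seconds, value) pairs in
    time order, and `window_seconds` must be positive.
    """
    window = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have fallen out of the trailing window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


# Hypothetical records: (seconds since start, delay in minutes).
historical = [(0, 10.0), (30, 20.0), (90, 30.0)]


def live_feed():
    """Simulate a stream by yielding the same records one at a time."""
    for event in historical:
        yield event


batch_features = list(windowed_average(historical, window_seconds=60))
stream_features = list(windowed_average(live_feed(), window_seconds=60))

print(batch_features == stream_features)  # True
```

Because both paths call the same function, the feature values seen in training and in serving cannot diverge through duplicated logic. This sketch ignores late and out-of-order records, which a real streaming system has to handle.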

The solution to the first three problems requires a portable machine learning library that is (1) powerful enough to carry out training using ...


Publisher Resources

ISBN: 9781098118945
