
Data Science on the Google Cloud Platform, 2nd Edition

by Valliappa Lakshmanan

March 2022

Beginner to intermediate

459 pages

12h 19m

English

O'Reilly Media, Inc.




Chapter 7. Logistic Regression Using Spark ML

In Chapter 6, we created a model based on two variables—distance and departure delay—to predict the probability that a flight will be more than 15 minutes late. We found that we could get a finer-grained decision if we used a second variable (distance) instead of using just one variable (departure delay).

Why not use all the variables in the dataset? Or at least many more of them? In particular, I’d like to use the TAXI_OUT variable: if it is too high, the flight will be stuck on the runway waiting for the airport tower to allow the plane to take off, and so the flight is likely to be delayed. The Naive Bayes approach in Chapter 6 was quite limited in its ability to incorporate additional variables. As we add variables, we would need to keep slicing the dataset into smaller and smaller bins. Many of those bins would then contain very few samples, resulting in decision surfaces that are not well behaved. Remember that, after we binned the data by distance, we found that the departure delay decision boundary was quite well behaved: departure delays above a certain threshold were associated with the flight not arriving on time. Our simplification of the Bayesian classification surface to a simple threshold that varied by bin would not have been possible if the decision boundary had been noisier.¹ The more variables we use, the more bins we will have, and this good behavior will begin to break down. This ...
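
To make the binning problem concrete, here is a minimal sketch of the arithmetic. The sample count (one million flights) and the choice of 10 bins per variable are assumptions picked purely for illustration; they are not figures from the book, and the function names are hypothetical.

```python
# Illustrative arithmetic only: how quickly bins multiply as variables are added.
# N_SAMPLES and BINS_PER_VARIABLE are assumed values, not taken from the dataset.

N_SAMPLES = 1_000_000      # assumed number of flights in the training data
BINS_PER_VARIABLE = 10     # assumed number of quantile bins per variable


def total_bins(bins_per_variable: int, n_variables: int) -> int:
    """Number of cells when every variable is binned independently."""
    return bins_per_variable ** n_variables


def average_samples_per_bin(n_samples: int, bins_per_variable: int,
                            n_variables: int) -> float:
    """Average number of samples that land in each cell."""
    return n_samples / total_bins(bins_per_variable, n_variables)


for n_variables in range(1, 7):
    bins = total_bins(BINS_PER_VARIABLE, n_variables)
    avg = average_samples_per_bin(N_SAMPLES, BINS_PER_VARIABLE, n_variables)
    print(f"{n_variables} variable(s): {bins:>9,} bins, "
          f"about {avg:,.0f} samples per bin on average")
```

Under these assumed numbers, two variables leave about 10,000 samples per bin, but six variables leave about one sample per bin on average, which is far too few to estimate a stable threshold in each bin.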