4 More exploratory data analysis and data preparation

This chapter covers

  • Analyzing summary statistics of the DC taxi data set
  • Evaluating alternative data set sizes for machine learning
  • Using statistical measures to choose the right machine learning data set size
  • Implementing data set sampling in a PySpark job

In the last chapter, you began analyzing the DC taxi fare data set. After converting the data set to the analysis-friendly Apache Parquet format, you crawled the data schema and used the Athena interactive querying service to explore the data. These first steps of data exploration surfaced numerous data quality issues, motivating you to establish a rigorous approach to deal with the garbage-in, garbage-out problem in your ...
