3 Exploring and preparing the data set

This chapter covers

  • Getting started with Amazon Athena for interactive querying
  • Choosing between manually specified and discovered data schemas
  • Approaching data quality with VACUUM normative principles
  • Analyzing DC taxi data quality through interactive querying
  • Implementing data quality processing in PySpark

In the previous chapter, you imported the DC taxi data set into AWS and stored it in your project’s S3 object storage bucket. You created, configured, and ran an AWS Glue data catalog crawler that analyzed the data set and discovered its data schema. You also learned about column-oriented data storage formats (e.g., Apache Parquet) and their advantages over row-oriented formats for analytical workloads. ...
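
With the data set cataloged by the Glue crawler, the table can be queried interactively with Athena, which is the starting point for this chapter. The snippet below is a minimal sketch using the boto3 Athena client; the database name (dc_taxi_db), table name (dc_taxi_parquet), and S3 output location are placeholders, so substitute the names created in your own project.

import time

import boto3

# Placeholder names for the Glue database, crawled table, and query output
# location; replace them with the values from your own project setup.
ATHENA_DATABASE = "dc_taxi_db"
ATHENA_OUTPUT = "s3://your-project-bucket/athena/"

athena = boto3.client("athena")

# Submit an interactive query against the crawled DC taxi table.
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS row_count FROM dc_taxi_parquet",
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the header row and the count returned by the query.
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])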
