3 Exploring and preparing the data set

This chapter covers

  • Getting started with Amazon Athena for interactive querying
  • Choosing between manually specified and discovered data schemas
  • Approaching data quality with VACUUM normative principles
  • Analyzing DC taxi data quality through interactive querying
  • Implementing data quality processing in PySpark

In the previous chapter, you imported the DC taxi data set into AWS and stored it in your project’s S3 object storage bucket. You created, configured, and ran an AWS Glue data catalog crawler that analyzed the data set and discovered its data schema. You also learned about column-oriented data storage formats (e.g., Apache Parquet) and their advantages over row-oriented formats for analytical workloads. ...
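
With the data set cataloged by the Glue crawler, the table can be queried interactively with Athena, which is the starting point for this chapter. The snippet below is a minimal sketch using the boto3 Athena client; the database name (dc_taxi_db), table name (dc_taxi_parquet), and S3 output location are placeholders, so substitute the names created in your own project.

import time

import boto3

# Placeholder names for the Glue database, crawled table, and query output
# location; replace them with the values from your own project setup.
ATHENA_DATABASE = "dc_taxi_db"
ATHENA_OUTPUT = "s3://your-project-bucket/athena/"

athena = boto3.client("athena")

# Submit an interactive query against the crawled DC taxi table.
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS row_count FROM dc_taxi_parquet",
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the header row and the count returned by the query.
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])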
