© Robert Ilijason 2020
R. Ilijason, Beginning Apache Spark Using Azure Databricks, https://doi.org/10.1007/978-1-4842-5781-4_8

8. ETL and Advanced Data Wrangling

Robert Ilijason
Viken, Sweden

In this chapter, it’s time to dig a little deeper into Python tricks that’ll make your life easier. We’ll revisit a lot of topics that we’ve already talked about, but take them a step further. First up, we’ll remind ourselves of why ETL is important.

After we’ve reacquainted ourselves with ETL, we’ll look into the Spark UI and how that tool can help us monitor what’s happening in the system when we run a query. Then we’ll take a deep dive into a lot of new functions and features available in PySpark.

Finally, we’ll look at how to handle data stored on the file ...
