Summary
In this chapter, we discussed the origin of DataFrames and how Spark SQL provides a SQL interface on top of them. DataFrame-based computations can run many times faster than the equivalent RDD-based computations, and the simple SQL-like interface makes that performance easy to reach. We also looked at the various APIs for creating and manipulating DataFrames, and dug deeper into the more sophisticated aggregation features, including groupBy, Window, rollup, and cube. Finally, we looked at joining datasets and the various types of joins available, such as inner, outer, and cross joins.
In the next chapter, we will explore the exciting world of real-time data processing ...