Raju Kumar Mishra, PySpark Recipes, https://doi.org/10.1007/978-1-4842-3141-8_8

8. PySparkSQL

Raju Kumar Mishra¹

(1) Bangalore, Karnataka, India

Most of the data that a data scientist deals with is either structured or semistructured. Nowadays, the amount of unstructured data is also increasing rapidly. The PySparkSQL module is a higher-level abstraction over PySpark Core for processing structured and semistructured datasets. PySparkSQL also lets us run SQL and HiveQL code, which makes the module popular among database programmers and Apache Hive users. The APIs provided by PySparkSQL are optimized: queries are planned by Spark's Catalyst optimizer before they execute. PySparkSQL can read data from many sources, such as CSV files, JSON files, and other databases.

The DataFrame abstraction ...

