May 2017
Intermediate to advanced
270 pages
6h 18m
English
For numerical and analytical tasks, Spark provides a convenient interface through the pyspark.sql module (also called SparkSQL). The module includes a pyspark.sql.DataFrame class that supports efficient SQL-style queries similar to those of Pandas. Access to the SQL interface is provided through the SparkSession class:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkSession can then be used to create a DataFrame through its createDataFrame method, which accepts an RDD, a list, or a pandas.DataFrame.
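As a minimal sketch of the list case, the snippet below passes a list of tuples and a list of column names to createDataFrame. The sample records, the column names, and the helper names make_dataframe and main are hypothetical and chosen only for illustration; the sketch also assumes a working local PySpark installation.

```python
# Hypothetical sample data: (name, age) pairs invented for illustration.
records = [("alice", 34), ("bob", 45), ("carol", 29)]
columns = ["name", "age"]


def make_dataframe(spark, records, columns):
    """Build a DataFrame from a list of tuples using an existing SparkSession."""
    # Passing a list of column names as the schema lets Spark infer the types.
    return spark.createDataFrame(records, schema=columns)


def main():
    # Imported here so the helper above can be read without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = make_dataframe(spark, records, columns)
    df.show()          # prints the rows as a small text table
    df.printSchema()   # shows the inferred column types
    spark.stop()
```

Calling main() on a machine with PySpark available should display the three rows. Keeping the session as a parameter of make_dataframe is a design choice of this sketch, so the same helper can be reused with any session obtained from SparkSession.builder.getOrCreate().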
In the following example, we will create a pyspark.sql.DataFrame by converting an RDD, rows, which contains a collection of Row instances. The ...