March 2019
Beginner to intermediate
182 pages
4h 6m
English
In this section, we will learn more about DataFrames and how to use Spark SQL.
The Spark SQL interface is very simple. Taking away labels puts us in unsupervised learning territory, and Spark has great support for clustering and dimensionality reduction algorithms. We can tackle these learning problems effectively by using Spark SQL to give big data a structure.
Let's take a look at the code that we will be using in our Jupyter Notebook. To maintain consistency, we will be using the same KDD Cup data:
raw_data = sc.textFile("./kddcup.data.gz")
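To show what giving this data a structure might look like, here is a minimal sketch rather than a definitive implementation. It assumes that each line is a comma-separated record whose first three fields are the connection duration, the protocol, and the service, and that a SparkSession named spark is available alongside sc. The helper names parse_line and long_tcp_connections, the view name connections, and the 2000 threshold are hypothetical and chosen only for illustration.

```python
def parse_line(line):
    """Split one comma-separated record and keep the first three fields.

    Assumption: field 0 is the duration, field 1 the protocol,
    and field 2 the service.
    """
    fields = line.split(",")
    return {
        "duration": int(fields[0]),
        "protocol": fields[1],
        "service": fields[2],
    }


def long_tcp_connections(spark, raw_data, min_duration=2000):
    """Turn the raw RDD into a DataFrame and query it with Spark SQL.

    `spark` is assumed to be a SparkSession and `raw_data` the RDD of
    text lines loaded with sc.textFile.
    """
    # Imported here so that parse_line can be used without Spark installed.
    from pyspark.sql import Row

    # Give each record named columns, then let Spark infer the schema.
    rows = raw_data.map(lambda line: Row(**parse_line(line)))
    df = spark.createDataFrame(rows)

    # Register the DataFrame under a name that SQL queries can refer to.
    df.createOrReplaceTempView("connections")

    return spark.sql(
        "SELECT duration FROM connections "
        "WHERE protocol = 'tcp' AND duration > {}".format(int(min_duration))
    )
```

In a notebook, calling long_tcp_connections(spark, raw_data).show() would display the matching rows. Keeping the parsing in a plain function makes it easy to check on a single line before running it over the whole dataset.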