We start by importing the relevant packages that will be used. Since the dataset is very large, we choose to use Spark.
Spark is an open-source distributed cluster-computing system designed for handling big data:
import os
import sys
import re
import time
from operator import add
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as func
# from pyspark.sql.functions import *
from pyspark.mllib.clustering import KMeans, KMeansModel
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd
import numpy as np
...
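Before any of these imports can be used for analysis, a Spark context and SQL context need to be created. The original text does not show this step, so the following is a minimal sketch of the typical setup; the application name "bigdata_analysis" is a placeholder and not taken from the source.

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Entry point for RDD operations (used by pyspark.mllib's KMeans);
# "bigdata_analysis" is an assumed, illustrative application name.
sc = SparkContext(appName="bigdata_analysis")

# Entry point for DataFrame and SQL operations on top of the SparkContext.
sqlContext = SQLContext(sc)

With sc and sqlContext in place, the RDD-based MLlib routines (such as KMeans) and the DataFrame functions imported above can both be used against the same cluster.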