December 2018
Intermediate to advanced
318 pages
8h 28m
English
We start by importing the relevant packages. Since the dataset is very large, we may choose to use Spark, an open-source distributed cluster-computing system designed for handling big data:
import os
import sys
import re
import time
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
# from pyspark.sql.functions import *
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pyspark.sql.functions as func
import matplotlib.patches as mpatches
from operator import add
from pyspark.mllib.clustering import KMeans, KMeansModel
...