Let's take a look at the following steps:
- Let's start by making sure that we can access PySpark:
import sys
sys.path.append('/PATH_TO/spark-2.3.2-bin-hadoop2.7/python/')  # Not conda
# Careful with Java version
# conda install py4j
Be sure to change PATH_TO to the path of your own Spark installation.
- Now, let's import pyspark:
import pyspark as spark
from pyspark.sql.functions import col, round as round_
We will be using the round function, renamed to round_ to avoid a clash with Python's built-in round function.
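To see why the alias matters, here is a small, hypothetical sketch of the name clash (the _spark_round stand-in below is not part of PySpark; it just mimics how an unaliased from pyspark.sql.functions import round would shadow the built-in):

```python
import builtins

# Hypothetical stand-in for pyspark.sql.functions.round, which builds a
# column expression rather than rounding a number.
def _spark_round(column, scale=0):
    return f"round({column}, {scale})"

round = _spark_round      # what an unaliased import would do to the name
expr = round("value", 2)  # round() now builds an expression, not a number
n = builtins.round(2.5)   # the built-in is only reachable via builtins

# Aliasing the import as round_ sidesteps the shadowing entirely:
round_ = _spark_round
```

With the alias, both round_ (for column expressions) and the ordinary round (for plain numbers) stay usable in the same module.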
- Let's connect to our Spark server:
sc = spark.SparkContext('spark://127.0.1.1:7077')
- Now, we will create an SQLContext:
sqlc = spark.SQLContext(sc)
There are other contexts for Spark, and we will discuss ...