How to do it...

Let's take a look at the following steps:

  1. Let's start by making sure that we can access PySpark:
import sys
sys.path.append('/PATH_TO/spark-2.3.2-bin-hadoop2.7/python/')  # Not conda
# Careful with Java version
# conda install py4j

Be sure to change PATH_TO to whatever path you have for your Spark installation.
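As a quick sanity check of the path change above, here is a minimal sketch that is not part of the original recipe. It uses only the standard library; the helper names `add_spark_python_path` and `pyspark_available` are hypothetical, and the path is the same placeholder as in the step above.

```python
import importlib.util
import os
import sys


def add_spark_python_path(spark_home):
    """Append <spark_home>/python to sys.path if that directory exists.

    Returns True if the directory was found (and is now on sys.path),
    False otherwise. Hypothetical helper, for illustration only.
    """
    python_dir = os.path.join(spark_home, 'python')
    if not os.path.isdir(python_dir):
        return False
    if python_dir not in sys.path:
        sys.path.append(python_dir)
    return True


def pyspark_available():
    """Report whether the pyspark package can be located for import."""
    return importlib.util.find_spec('pyspark') is not None


# Placeholder path: replace PATH_TO with your own Spark installation path.
found = add_spark_python_path('/PATH_TO/spark-2.3.2-bin-hadoop2.7')
print('Spark python directory found:', found)
print('pyspark importable:', pyspark_available())
```

If the first line prints `False`, the path is probably still the placeholder or points to the wrong directory.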

  2. Now, let's import pyspark:
import pyspark as spark
from pyspark.sql.functions import col, round as round_

We will be using Spark's round function, but we import it as round_ to avoid clashing with Python's built-in round function.
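To illustrate why the alias matters, the following sketch (an addition, not from the original recipe) checks that the name `round` still refers to Python's built-in after the aliased import. The `try`/`except` guard is an assumption added only so that the snippet also runs on a machine without pyspark.

```python
import builtins

try:
    # Same aliased import as in the recipe.
    from pyspark.sql.functions import round as round_
except ImportError:
    # pyspark is not installed here; only the built-in is demonstrated.
    round_ = None

# Because of the alias, the name `round` still refers to the built-in.
print(round is builtins.round)  # True
print(round(2.567, 1))          # 2.6
```

Had we written `from pyspark.sql.functions import round` without the alias, the name `round` in our module would refer to the Spark function, which operates on columns rather than on plain Python numbers.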

  3. Let's connect to our Spark server:
sc = spark.SparkContext('spark://127.0.1.1:7077')

  4. Now, we will create an SQLContext:
sqlc = spark.SQLContext(sc)

There are other contexts for Spark, and we will discuss ...
