Chapter 1. Setting Up a Spark Virtual Environment

In this chapter, we will build an isolated virtual environment for development. The environment will be powered by Spark and the PyData libraries provided by the Anaconda Python distribution. These libraries include Pandas, Scikit-Learn, Blaze, Matplotlib, Seaborn, and Bokeh. We will perform the following activities:

  • Setting up the development environment using the Anaconda Python distribution. This will include enabling the IPython Notebook environment powered by PySpark for our data exploration tasks.
  • Installing and enabling Spark, and the PyData libraries such as Pandas, Scikit-Learn, Blaze, Matplotlib, and Bokeh.
  • Building a word count example app to ensure that everything is working ...
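Before running an example app, it can help to confirm that the libraries listed above are visible to the Python interpreter in the new environment. The following is a minimal sketch of such a check, not part of the chapter's own setup steps. The import names in `LIBRARIES` (for example `sklearn` for Scikit-Learn and `pyspark` for Spark) and the helper name `check_environment` are assumptions introduced here for illustration.

```python
from importlib.util import find_spec

# Display name -> import name. The import names are assumptions:
# the text names the libraries, not the modules they install.
LIBRARIES = {
    "Spark (PySpark)": "pyspark",
    "IPython": "IPython",
    "Pandas": "pandas",
    "Scikit-Learn": "sklearn",
    "Blaze": "blaze",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Bokeh": "bokeh",
}


def check_environment(modules=LIBRARIES):
    """Return a dict mapping each display name to True if its module can be found.

    find_spec only locates the module; it does not import it, so the check
    stays fast and does not trigger side effects in heavy packages.
    """
    return {name: find_spec(module) is not None for name, module in modules.items()}


if __name__ == "__main__":
    # Print a simple report, one library per line.
    for name, available in check_environment().items():
        status = "OK" if available else "missing"
        print(f"{name:<16} {status}")
```

Any library reported as missing would need to be installed into the environment before it can be used. Note that locating a module does not guarantee it is correctly configured; for Spark in particular, a working Java runtime is also required.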
