Setting up the Spark-powered environment

In this section, we will learn to set up Spark:

  • Create a segregated development environment in a virtual machine running Ubuntu 14.04, so that it does not interfere with any existing system.
  • Install Spark 1.3.0 with its dependencies.
  • Install the Anaconda Python 2.7 environment with all the required libraries such as Pandas, Scikit-Learn, Blaze, and Bokeh, and enable PySpark, so it can be accessed through IPython Notebooks.
  • Set up the backend or data stores of our environment. We will use MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database.
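Once the installation steps above are complete, a quick sanity check can confirm that the Python libraries are visible to the interpreter. The following is a minimal sketch, not part of the original setup: the helper name `check_environment` is hypothetical, and the module names (`pandas`, `sklearn`, `blaze`, `bokeh`, `pyspark`) are assumed to be the import names of the libraries listed above. It only tests whether each module can be imported; it does not verify versions or start Spark.

```python
from __future__ import print_function

import importlib

# Import names assumed for the libraries mentioned in the list above.
REQUIRED_MODULES = ["pandas", "sklearn", "blaze", "bokeh", "pyspark"]


def check_environment(module_names):
    """Return a dict mapping each module name to True if it imports cleanly.

    Hypothetical helper for illustration: any exception raised during
    import is treated as "not available".
    """
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            status[name] = False
    return status


if __name__ == "__main__":
    # Print a short report, one line per library.
    for name, ok in sorted(check_environment(REQUIRED_MODULES).items()):
        print("{:<10} {}".format(name, "OK" if ok else "MISSING"))
```

Running the script inside the virtual machine (or in an IPython Notebook cell) lists each library as OK or MISSING, which makes it easier to spot an incomplete Anaconda or PySpark setup before moving on to the data stores.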

Each storage backend serves a specific purpose depending on the nature of the data to be handled. The MySQL RDBMS ...
