Spark for Python Developers

Book Description

A concise guide to implementing Spark Big Data analytics for Python developers and building a real-time, insightful, data-intensive trend tracker app.

About This Book

  • Set up real-time streaming and batch data-intensive infrastructure using Spark and Python
  • Deliver insightful visualizations in a web app using Spark (PySpark)
  • Inject live data using Spark Streaming with real-time events

Who This Book Is For

This book is for data scientists and software developers with a focus on Python who want to work with the Spark engine. It will also benefit enterprise architects. All you need is a good background in Python and an inclination to work with Spark.

What You Will Learn

  • Create a Python development environment powered by Spark (PySpark), Blaze, and Bokeh
  • Build a real-time, data-intensive trend tracker app
  • Visualize the trends and insights gained from data using Bokeh
  • Generate insights from data using machine learning through Spark MLlib
  • Juggle data using Blaze
  • Create training datasets and train the machine learning models
  • Test the machine learning models on test datasets
  • Deploy the machine learning algorithms and models and scale them for real-time events

In Detail

Looking for a cluster computing system that provides high-level APIs? Apache Spark is your answer: an open source, fast, and general-purpose cluster computing system. Spark's multi-stage in-memory primitives provide performance up to 100 times faster than Hadoop MapReduce for certain applications, and Spark is also well suited to machine learning algorithms.

Are you a Python developer inclined to work with the Spark engine? If so, this book will be your companion as you create a data-intensive app using Spark as a processing engine, Python visualization libraries, and web frameworks such as Flask.

To begin with, you will learn the most effective way to install the Python development environment powered by Spark, Blaze, and Bokeh. You will then find out how to connect with data stores such as MySQL, MongoDB, Cassandra, and Hadoop.

You’ll expand your skills throughout, becoming familiar with the various data sources (GitHub, Twitter, Meetup, and blogs), their data structures, and solutions that effectively tackle their complexities. You’ll explore datasets using IPython Notebook and discover how to optimize the data models and pipeline. Finally, you’ll learn how to create training datasets and train the machine learning models.

By the end of the book, you will have created a real-time and insightful trend tracker data-intensive app with Spark.

Style and approach

This is a comprehensive guide packed with easy-to-follow examples that will take your skills to the next level and will get you up and running with Spark.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Spark for Python Developers
    1. Table of Contents
    2. Spark for Python Developers
    3. Credits
    4. About the Author
    5. Acknowledgment
    6. About the Reviewers
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    9. 1. Setting Up a Spark Virtual Environment
      1. Understanding the architecture of data-intensive applications
        1. Infrastructure layer
        2. Persistence layer
        3. Integration layer
        4. Analytics layer
        5. Engagement layer
      2. Understanding Spark
        1. Spark libraries
          1. PySpark in action
          2. The Resilient Distributed Dataset
      3. Understanding Anaconda
      4. Setting up the Spark powered environment
        1. Setting up an Oracle VirtualBox with Ubuntu
        2. Installing Anaconda with Python 2.7
        3. Installing Java 8
        4. Installing Spark
        5. Enabling IPython Notebook
      5. Building our first app with PySpark
      6. Virtualizing the environment with Vagrant
      7. Moving to the cloud
        1. Deploying apps in Amazon Web Services
        2. Virtualizing the environment with Docker
      8. Summary
    10. 2. Building Batch and Streaming Apps with Spark
      1. Architecting data-intensive apps
        1. Processing data at rest
        2. Processing data in motion
        3. Exploring data interactively
      2. Connecting to social networks
        1. Getting Twitter data
        2. Getting GitHub data
        3. Getting Meetup data
      3. Analyzing the data
        1. Discovering the anatomy of tweets
      4. Exploring the GitHub world
        1. Understanding the community through Meetup
      5. Previewing our app
      6. Summary
    11. 3. Juggling Data with Spark
      1. Revisiting the data-intensive app architecture
      2. Serializing and deserializing data
      3. Harvesting and storing data
        1. Persisting data in CSV
        2. Persisting data in JSON
        3. Setting up MongoDB
          1. Installing the MongoDB server and client
          2. Running the MongoDB server
          3. Running the Mongo client
          4. Installing the PyMongo driver
          5. Creating the Python client for MongoDB
        4. Harvesting data from Twitter
      4. Exploring data using Blaze
        1. Transferring data using Odo
      5. Exploring data using Spark SQL
        1. Understanding Spark dataframes
        2. Understanding the Spark SQL query optimizer
        3. Loading and processing CSV files with Spark SQL
        4. Querying MongoDB from Spark SQL
      6. Summary
    12. 4. Learning from Data Using Spark
      1. Contextualizing Spark MLlib in the app architecture
      2. Classifying Spark MLlib algorithms
        1. Supervised and unsupervised learning
        2. Additional learning algorithms
      3. Spark MLlib data types
      4. Machine learning workflows and data flows
        1. Supervised machine learning workflows
        2. Unsupervised machine learning workflows
      5. Clustering the Twitter dataset
        1. Applying Scikit-Learn on the Twitter dataset
        2. Preprocessing the dataset
        3. Running the clustering algorithm
        4. Evaluating the model and the results
      6. Building machine learning pipelines
      7. Summary
    13. 5. Streaming Live Data with Spark
      1. Laying the foundations of streaming architecture
        1. Spark Streaming inner working
        2. Going under the hood of Spark Streaming
        3. Building in fault tolerance
      2. Processing live data with TCP sockets
        1. Setting up TCP sockets
        2. Processing live data
      3. Manipulating Twitter data in real time
        1. Processing Tweets in real time from the Twitter firehose
      4. Building a reliable and scalable streaming app
        1. Setting up Kafka
          1. Installing and testing Kafka
          2. Developing producers
          3. Developing consumers
          4. Developing a Spark Streaming consumer for Kafka
        2. Exploring Flume
        3. Developing data pipelines with Flume, Kafka, and Spark
      5. Closing remarks on the Lambda and Kappa architecture
        1. Understanding Lambda architecture
        2. Understanding Kappa architecture
      6. Summary
    14. 6. Visualizing Insights and Trends
      1. Revisiting the data-intensive apps architecture
      2. Preprocessing the data for visualization
      3. Gauging words, moods, and memes at a glance
        1. Setting up wordcloud
        2. Creating wordclouds
      4. Geo-locating tweets and mapping meetups
        1. Geo-locating tweets
        2. Displaying upcoming meetups on Google Maps
      5. Summary
    15. Index