The objective of the book is to teach you about PySpark and the PyData libraries by building an app that analyzes the Spark community's interactions on social networks. We will gather information on Apache Spark from GitHub, check the relevant tweets on Twitter, and get a feel for the buzz around Spark in the broader open source software communities using Meetup.
In this chapter, we will outline the various sources of data and information. We will get an understanding of their structure. We will outline the data processing pipeline, from collection to batch and streaming processing.
In this section, we will cover the following points: