Chapter 15.  Understanding Data Processing using Apache Spark

In this chapter, we will present the main features of a data processing architecture and the Cloudera platform distribution. Then, we will explore how to use a distributed filesystem and how to manage files from the terminal and through a web interface. Finally, we will describe the use of Apache Spark, an open source big data processing framework built to be fast and easy to use. Apache Spark provides a unified framework for big data processing requirements such as data streaming, machine learning, and analytics.

In this chapter, we will cover these topics:

  • Understanding data processing
  • Platform for data processing
  • An introduction to the distributed ...
