Use PySpark to productionize analytics over Big Data and easily crush messy data at scale
About This Video
- Work with large amounts of data with agility using distributed datasets and in-memory caching
- Source data from popular storage systems and formats, including HDFS, Hive, JSON, and S3
- Deploy Big Data analytics to production using PySpark’s easy-to-use API
Data is an incredible asset, especially when there is a lot of it. Exploratory data analysis, business intelligence, and machine learning all depend on processing and analyzing Big Data at scale.
How do you go from working on prototypes on your local machine, to handling messy data in production and at scale?
This is a practical, hands-on course that shows you how to use Spark and its Python API to create performant analytics on large-scale data. Don't reinvent the wheel; wow your clients by building robust and responsive applications on Big Data.
All the code and supporting files for this course are available on GitHub at https://github.com/PacktPublishing/Hands-On-Pyspark-for-Big-Data-Analysis
Table of Contents
- Chapter 1 : Install PySpark and Setup Your Development Environment
- Chapter 2 : Getting Your Big Data into the Spark Environment Using RDDs
- Chapter 3 : Big Data Cleaning and Wrangling with Spark Notebooks
- Chapter 4 : Aggregating and Summarizing Data into Useful Reports
- Chapter 5 : Powerful Exploratory Data Analysis with MLlib
- Chapter 6 : Putting Structure on Your Big Data with SparkSQL
- Title: Hands-On PySpark for Big Data Analysis
- Release date: December 2018
- Publisher(s): Packt Publishing
- ISBN: 9781789530056