Video description
If you have some Python experience and want to take it to the next level, this practical, hands-on course is a helpful resource. Its video tutorials show you how to use Python for distributed task processing and how to perform large-scale data processing in Spark using the PySpark API.
Table of contents
Building Data Pipelines with Python
- Welcome To The Course
- About The Author
- Introduction To Automation
- Adventures With Servers
- Being A Good Systems Caretaker
- What Is A Queue?
- What Is A Consumer? What Is A Producer?
- Why Celery?
- Celery Architecture Set Up
- Writing Your First Tasks
- Deploying Your Tasks
- Scaling Your Workers
- Monitoring With Flower
- Advanced Celery Features
- Why Dask?
- First Steps With Dask
- Dask Bags
- Dask Distributed
- What Are Data Pipelines? What Is A DAG?
- Luigi And Airflow: A Comparison
- First Steps With Luigi
- More Complex Luigi Tasks
- Introduction To Hadoop
- First Steps With Airflow
- Custom Tasks With Airflow
- Advanced Airflow: Subdags And Branches
- Using Luigi With Hadoop
- Apache Spark
- Apache Spark Streaming
- Django Channels
- And Many More
- Introduction To Testing With Python
- Property-Based Testing With Hypothesis
- What's Next?
Introduction to PySpark
- Introduction And Course Overview
- About The Author
- Installing Python
- Installing iPython And Using Notebooks
- Download And Setup
- Running The Spark Shell
- Running The Spark Shell With iPython
- What Is A Resilient Distributed Dataset - RDD?
- Reading A Text File
- Actions
- Transformations
- Persisting Data
- Map
- Filter
- Flatmap
- MapPartitions
- MapPartitionsWithIndex
- Sample
- Union
- Intersection
- Distinct
- Cartesian
- Pipe
- Coalesce
- Repartition
- RepartitionAndSortWithinPartitions
- Reduce
- Collect
- Count
- First
- Take
- TakeSample
- TakeOrdered
- SaveAsTextFile
- CountByKey
- ForEach
- GroupByKey
- ReduceByKey
- AggregateByKey
- SortByKey
- Join
- CoGroup
- WholeTextFile
- Pickle Files
- HadoopInputFormat
- HadoopOutputFormat
- Broadcast Variables
- Accumulators
- Using A Custom Accumulator
- Partitioning
- Spark Standalone Cluster
- Mesos
- Yarn
- Client Versus Cluster Mode
- Spark Streaming
- Dataframes And SQL
- MLlib
- Resources And Where To Go From Here
- Wrap Up
Product information
- Title: Scaling Python for Big Data
- Author(s):
- Release date: December 2016
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491977798