Data Science with Python and Dask

Book Description

Data Science with Python and Dask teaches you to build scalable projects that can handle massive datasets. After an introduction to the Dask framework, you’ll analyze data in the NYC Parking Ticket database and use DataFrames to streamline your process. Then, you’ll create machine learning models with Dask-ML, build interactive visualizations, and deploy clusters using AWS and Docker.

Table of Contents

  1. Cover
  2. Titlepage
  3. Copyright
  4. contents in brief
  5. contents
  6. Dedication
  7. preface
  8. acknowledgments
  9. about this book
    1. Who should read this book
    2. How this book is organized: A roadmap
    3. About the code
    4. liveBook discussion forum
  10. about the author
  11. about the cover illustration
  12. Part 1: The building blocks of scalable computing
    1. Chapter 1: Why scalable computing matters
      1. 1.1 Why Dask?
      2. 1.2 Cooking with DAGs
      3. 1.3 Scaling out, concurrency, and recovery
        1. 1.3.1 Scaling up vs. scaling out
        2. 1.3.2 Concurrency and resource management
        3. 1.3.3 Recovering from failures
      4. 1.4 Introducing a companion dataset
      5. Summary
    2. Chapter 2: Introducing Dask
      1. 2.1 Hello Dask: A first look at the DataFrame API
        1. 2.1.1 Examining the metadata of Dask objects
        2. 2.1.2 Running computations with the compute method
        3. 2.1.3 Making complex computations more efficient with persist
      2. 2.2 Visualizing DAGs
        1. 2.2.1 Visualizing a simple DAG using Dask Delayed objects
        2. 2.2.2 Visualizing more complex DAGs with loops and collections
        3. 2.2.3 Reducing DAG complexity with persist
      3. 2.3 Task scheduling
        1. 2.3.1 Lazy computations
        2. 2.3.2 Data locality
      4. Summary
  13. Part 2: Working with structured data using Dask DataFrames
    1. Chapter 3: Introducing Dask DataFrames
      1. 3.1 Why use DataFrames?
      2. 3.2 Dask and Pandas
        1. 3.2.1 Managing DataFrame partitioning
        2. 3.2.2 What is the shuffle?
      3. 3.3 Limitations of Dask DataFrames
      4. Summary
    2. Chapter 4: Loading data into DataFrames
      1. 4.1 Reading data from text files
        1. 4.1.1 Using Dask datatypes
        2. 4.1.2 Creating schemas for Dask DataFrames
      2. 4.2 Reading data from relational databases
      3. 4.3 Reading data from HDFS and S3
      4. 4.4 Reading data in Parquet format
      5. Summary
    3. Chapter 5: Cleaning and transforming DataFrames
      1. 5.1 Working with indexes and axes
        1. 5.1.1 Selecting columns from a DataFrame
        2. 5.1.2 Dropping columns from a DataFrame
        3. 5.1.3 Renaming columns in a DataFrame
        4. 5.1.4 Selecting rows from a DataFrame
      2. 5.2 Dealing with missing values
        1. 5.2.1 Counting missing values in a DataFrame
        2. 5.2.2 Dropping columns with missing values
        3. 5.2.3 Imputing missing values
        4. 5.2.4 Dropping rows with missing data
        5. 5.2.5 Imputing multiple columns with missing values
      3. 5.3 Recoding data
      4. 5.4 Elementwise operations
      5. 5.5 Filtering and reindexing DataFrames
      6. 5.6 Joining and concatenating DataFrames
        1. 5.6.1 Joining two DataFrames
        2. 5.6.2 Unioning two DataFrames
      7. 5.7 Writing data to text files and Parquet files
        1. 5.7.1 Writing to delimited text files
        2. 5.7.2 Writing to Parquet files
      8. Summary
    4. Chapter 6: Summarizing and analyzing DataFrames
      1. 6.1 Descriptive statistics
        1. 6.1.1 What are descriptive statistics?
        2. 6.1.2 Calculating descriptive statistics with Dask
        3. 6.1.3 Using the describe method for descriptive statistics
      2. 6.2 Built-in aggregate functions
        1. 6.2.1 What is correlation?
        2. 6.2.2 Calculating correlations with Dask DataFrames
      3. 6.3 Custom aggregate functions
        1. 6.3.1 Testing categorical variables with the t-test
        2. 6.3.2 Using custom aggregates to implement the Brown-Forsythe test
      4. 6.4 Rolling (window) functions
        1. 6.4.1 Preparing data for a rolling function
        2. 6.4.2 Using the rolling method to apply a window function
      5. Summary
    5. Chapter 7: Visualizing DataFrames with Seaborn
      1. 7.1 The prepare-reduce-collect-plot pattern
      2. 7.2 Visualizing continuous relationships with scatterplot and regplot
        1. 7.2.1 Creating a scatterplot with Dask and Seaborn
        2. 7.2.2 Adding a linear regression line to the scatterplot
        3. 7.2.3 Adding a nonlinear regression line to a scatterplot
      3. 7.3 Visualizing categorical relationships with violinplot
        1. 7.3.1 Creating a violinplot with Dask and Seaborn
        2. 7.3.2 Randomly sampling data from a Dask DataFrame
      4. 7.4 Visualizing two categorical relationships with heatmap
      5. Summary
    6. Chapter 8: Visualizing location data with Datashader
      1. 8.1 What is Datashader and how does it work?
        1. 8.1.1 The five stages of the Datashader rendering pipeline
        2. 8.1.2 Creating a Datashader visualization
      2. 8.2 Plotting location data as an interactive heatmap
        1. 8.2.1 Preparing geographic data for map tiling
        2. 8.2.2 Creating the interactive heatmap
      3. Summary
  14. Part 3: Extending and deploying Dask
    1. Chapter 9: Working with Bags and Arrays
      1. 9.1 Reading and parsing unstructured data with Bags
        1. 9.1.1 Selecting and viewing data from a Bag
        2. 9.1.2 Common parsing issues and how to overcome them
        3. 9.1.3 Working with delimiters
      2. 9.2 Transforming, filtering, and folding elements
        1. 9.2.1 Transforming elements with the map method
        2. 9.2.2 Filtering Bags with the filter method
        3. 9.2.3 Calculating descriptive statistics on Bags
        4. 9.2.4 Creating aggregate functions using the foldby method
      3. 9.3 Building Arrays and DataFrames from Bags
      4. 9.4 Using Bags for parallel text analysis with NLTK
        1. 9.4.1 The basics of bigram analysis
        2. 9.4.2 Extracting tokens and filtering stopwords
        3. 9.4.3 Analyzing the bigrams
      5. Summary
    2. Chapter 10: Machine learning with Dask-ML
      1. 10.1 Building linear models with Dask-ML
        1. 10.1.1 Preparing the data with binary vectorization
        2. 10.1.2 Building a logistic regression model with Dask-ML
      2. 10.2 Evaluating and tuning Dask-ML models
        1. 10.2.1 Evaluating Dask-ML models with the score method
        2. 10.2.2 Building a naïve Bayes classifier with Dask-ML
        3. 10.2.3 Automatically tuning hyperparameters
      3. 10.3 Persisting Dask-ML models
      4. Summary
    3. Chapter 11: Scaling and deploying Dask
      1. 11.1 Building a Dask cluster on Amazon AWS with Docker
        1. 11.1.1 Getting started
        2. 11.1.2 Creating a security key
        3. 11.1.3 Creating the ECS cluster
        4. 11.1.4 Configuring the cluster’s networking
        5. 11.1.5 Creating a shared data drive in Elastic File System
        6. 11.1.6 Allocating space for Docker images in Elastic Container Repository
        7. 11.1.7 Building and deploying images for scheduler, worker, and notebook
        8. 11.1.8 Connecting to the cluster
      2. 11.2 Running and monitoring Dask jobs on a cluster
      3. 11.3 Cleaning up the Dask cluster on AWS
      4. Summary
  15. appendix: Software installation
    1. Installing additional packages with Anaconda
    2. Installing packages without Anaconda
    3. Starting a Jupyter Notebook server
    4. Configuring NLTK
  16. Index
  17. List of Figures
  18. List of Tables
  19. List of Listings