
Mastering Spark with R

Book Description

If you’re like most R users, you have deep knowledge of and love for statistics. But as your organization continues to collect huge amounts of data, adding tools such as Apache Spark makes a lot of sense. With this practical book, data scientists and professionals working with large-scale data applications will learn how to use Spark from R to tackle big data and big compute problems.

Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to use R with Spark to solve different data analysis problems. This book covers relevant data science topics, cluster computing, and issues that should interest even the most advanced users.

  • Analyze, explore, transform, and visualize data in Apache Spark with R
  • Create statistical models to extract information and predict outcomes; automate the process in production-ready workflows
  • Perform analysis and modeling across many machines using distributed computing techniques
  • Use large-scale data from multiple sources and different formats with ease from within Spark
  • Learn about alternative modeling frameworks for graph processing, geospatial analysis, and genomics at scale
  • Dive into advanced topics including custom transformations, real-time data processing, and creating custom Spark extensions
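As a taste of the workflow the book covers, a minimal sparklyr session might look like the sketch below: connect to a local Spark instance, copy a small dataset into Spark, and analyze it with familiar dplyr verbs. (This assumes Spark has been installed locally, e.g. with `spark_install()`; the dataset and column names are illustrative.)

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
# (first run spark_install() if Spark is not yet installed)
sc <- spark_connect(master = "local")

# Copy a built-in R dataset into Spark as a Spark DataFrame
cars <- copy_to(sc, mtcars)

# Analyze it with dplyr verbs, which sparklyr translates to Spark SQL
cars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Close the connection when done
spark_disconnect(sc)
```

The same `dplyr` code runs unchanged against a remote cluster by pointing `spark_connect()` at a cluster master instead of `"local"`, which is the pattern Chapters 2, 3, and 6 build on.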

Table of Contents

  1. Foreword
  2. Preface
    1. Formatting
    2. Acknowledgments
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
  3. 1. Introduction
    1. Overview
    2. Hadoop
    3. Spark
    4. R
    5. sparklyr
    6. Recap
  4. 2. Getting Started
    1. Overview
    2. Prerequisites
      1. Installing sparklyr
      2. Installing Spark
    3. Connecting
    4. Using Spark
      1. Web Interface
      2. Analysis
      3. Modeling
      4. Data
      5. Extensions
      6. Distributed R
      7. Streaming
      8. Logs
    5. Disconnecting
    6. Using RStudio
    7. Resources
    8. Recap
  5. 3. Analysis
    1. Overview
    2. Import
    3. Wrangle
      1. Built-in Functions
      2. Correlations
    4. Visualize
      1. Using ggplot2
      2. Using dbplot
    5. Model
      1. Caching
    6. Communicate
    7. Recap
  6. 4. Modeling
    1. Overview
    2. Exploratory Data Analysis
    3. Feature Engineering
    4. Supervised Learning
      1. Generalized Linear Regression
      2. Other Models
    5. Unsupervised Learning
      1. Data Preparation
      2. Topic Modeling
    6. Recap
  7. 5. Pipelines
    1. Overview
    2. Creation
    3. Use Cases
      1. Hyperparameter Tuning
    4. Operating Modes
    5. Interoperability
    6. Deployment
      1. Batch Scoring
      2. Real-Time Scoring
    7. Recap
  8. 6. Clusters
    1. Overview
    2. On-Premises
      1. Managers
      2. Distributions
    3. Cloud
      1. Amazon
      2. Databricks
      3. Google
      4. IBM
      5. Microsoft
      6. Qubole
    4. Kubernetes
    5. Tools
      1. RStudio
      2. Jupyter
      3. Livy
    6. Recap
  9. 7. Connections
    1. Overview
      1. Edge Nodes
      2. Spark Home
    2. Local
    3. Standalone
    4. YARN
      1. YARN Client
      2. YARN Cluster
    5. Livy
    6. Mesos
    7. Kubernetes
    8. Cloud
    9. Batches
    10. Tools
    11. Multiple Connections
    12. Troubleshooting
      1. Logging
      2. Spark Submit
      3. Windows
    13. Recap
  10. 8. Data
    1. Overview
    2. Reading Data
      1. Paths
      2. Schema
      3. Memory
      4. Columns
    3. Writing Data
    4. Copying Data
    5. File Formats
      1. CSV
      2. JSON
      3. Parquet
      4. Others
    6. File Systems
    7. Storage Systems
      1. Hive
      2. Cassandra
      3. JDBC
    8. Recap
  11. 9. Tuning
    1. Overview
      1. Graph
      2. Timeline
    2. Configuring
      1. Connect Settings
      2. Submit Settings
      3. Runtime Settings
      4. sparklyr Settings
    3. Partitioning
      1. Implicit Partitions
      2. Explicit Partitions
    4. Caching
      1. Checkpointing
      2. Memory
    5. Shuffling
    6. Serialization
    7. Configuration Files
    8. Recap
  12. 10. Extensions
    1. Overview
    2. H2O
    3. Graphs
    4. XGBoost
    5. Deep Learning
    6. Genomics
    7. Spatial
    8. Troubleshooting
    9. Recap
  13. 11. Distributed R
    1. Overview
    2. Use Cases
      1. Custom Parsers
      2. Partitioned Modeling
      3. Grid Search
      4. Web APIs
      5. Simulations
    3. Partitions
    4. Grouping
    5. Columns
    6. Context
    7. Functions
    8. Packages
    9. Cluster Requirements
      1. Installing R
      2. Apache Arrow
    10. Troubleshooting
      1. Worker Logs
      2. Resolving Timeouts
      3. Inspecting Partitions
      4. Debugging Workers
    11. Recap
  14. 12. Streaming
    1. Overview
    2. Transformations
      1. Analysis
      2. Modeling
      3. Pipelines
      4. Distributed R
    3. Kafka
    4. Shiny
    5. Recap
  15. 13. Contributing
    1. Overview
    2. The Spark API
    3. Spark Extensions
    4. Using Scala Code
    5. Recap
  16. A. Supplemental Code References
    1. Preface
      1. Formatting
    2. Chapter 1
      1. The World’s Capacity to Store Information
      2. Daily Downloads of CRAN Packages
    3. Chapter 2
      1. Prerequisites
    4. Chapter 3
      1. Hive Functions
    5. Chapter 4
      1. MLlib Functions
    6. Chapter 6
      1. Google Trends for On-Premises (Mainframes), Cloud Computing, and Kubernetes
    7. Chapter 12
      1. Stream Generator
      2. Installing Kafka
  17. Index