R High Performance Programming

Book Description

Overcome performance difficulties in R with a range of exciting techniques and solutions

In Detail

As data plays a growing role in every area of business and science, R provides an easy and powerful way to analyze and process the large volumes involved. It is one of the most popular tools today for data exploration, statistical analysis, and statistical modeling, and it can generate useful insights and discoveries from large amounts of data.

Through this practical and varied guide, you will become equipped to solve a range of performance problems in R programming. You will learn how to profile and benchmark R programs, identify bottlenecks, determine whether performance is limited by the CPU, memory, or disk input/output, and speed up your R programs with techniques such as vectorizing computations. You will then move on to more advanced techniques, such as compiling code, tapping into the computing power of GPUs, optimizing memory consumption, and handling larger-than-memory data sets using disk-based memory and chunking.

What You Will Learn

  • Benchmark and profile R programs to solve performance bottlenecks
  • Understand how CPU, memory, and disk input/output constraints can limit the performance of R programs
  • Optimize R code to run faster and use less memory
  • Compile R code and call code written in compiled languages such as C to speed up computations
  • Harness the power of GPUs for computational speed
  • Process data sets that are larger than memory using disk-based memory and chunking
  • Tap into the capacity of multiple CPUs using parallel computing
  • Leverage the power of advanced database systems and Big Data tools from within R

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. R High Performance Programming
    1. Table of Contents
    2. R High Performance Programming
    3. Credits
    4. About the Authors
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    8. 1. Understanding R's Performance – Why Are R Programs Sometimes Slow?
      1. Three constraints on computing performance – CPU, RAM, and disk I/O
      2. R is interpreted on the fly
      3. R is single-threaded
      4. R requires all data to be loaded into memory
      5. Algorithm design affects time and space complexity
      6. Summary
    9. 2. Profiling – Measuring Code's Performance
      1. Measuring total execution time
        1. Measuring execution time with system.time()
        2. Repeating time measurements with rbenchmark
        3. Measuring distribution of execution time with microbenchmark
      2. Profiling the execution time
        1. Profiling a function with Rprof()
        2. The profiling results
      3. Profiling memory utilization
      4. Monitoring memory utilization, CPU utilization, and disk I/O using OS tools
      5. Identifying and resolving bottlenecks
      6. Summary
    10. 3. Simple Tweaks to Make R Run Faster
      1. Vectorization
      2. Use of built-in functions
      3. Preallocating memory
      4. Use of simpler data structures
      5. Use of hash tables for frequent lookups on large data
      6. Seeking fast alternative packages in CRAN
      7. Summary
    11. 4. Using Compiled Code for Greater Speed
      1. Compiling R code before execution
        1. Compiling functions
        2. Just-in-time (JIT) compilation of R code
      2. Using compiled languages in R
        1. Prerequisites
        2. Including compiled code inline
        3. Calling external compiled code
        4. Considerations for using compiled code
          1. R APIs
          2. R data types versus native data types
          3. Creating R objects and garbage collection
          4. Allocating memory for non-R objects
      3. Summary
    12. 5. Using GPUs to Run R Even Faster
      1. General purpose computing on GPUs
      2. R and GPUs
        1. Installing gputools
      3. Fast statistical modeling in R with gputools
      4. Summary
    13. 6. Simple Tweaks to Use Less RAM
      1. Reusing objects without taking up more memory
      2. Removing intermediate data when it is no longer needed
      3. Calculating values on the fly instead of storing them persistently
      4. Swapping active and nonactive data
      5. Summary
    14. 7. Processing Large Datasets with Limited RAM
      1. Using memory-efficient data structures
        1. Smaller data types
        2. Sparse matrices
        3. Symmetric matrices
        4. Bit vectors
      2. Using memory-mapped files and processing data in chunks
        1. The bigmemory package
        2. The ff package
      3. Summary
    15. 8. Multiplying Performance with Parallel Computing
      1. Data parallelism versus task parallelism
      2. Implementing data parallel algorithms
      3. Implementing task parallel algorithms
        1. Running the same task on workers in a cluster
        2. Running different tasks on workers in a cluster
      4. Executing tasks in parallel on a cluster of computers
      5. Shared memory versus distributed memory parallelism
      6. Optimizing parallel performance
      7. Summary
    16. 9. Offloading Data Processing to Database Systems
      1. Extracting data into R versus processing data in a database
      2. Preprocessing data in a relational database using SQL
      3. Converting R expressions to SQL
        1. Using dplyr
        2. Using PivotalR
      4. Running statistical and machine learning algorithms in a database
      5. Using columnar databases for improved performance
      6. Using array databases for maximum scientific-computing performance
      7. Summary
    17. 10. R and Big Data
      1. Understanding Hadoop
      2. Setting up Hadoop on Amazon Web Services
      3. Processing large datasets in batches using Hadoop
        1. Uploading data to HDFS
        2. Analyzing HDFS data with RHadoop
        3. Other Hadoop packages for R
      4. Summary
    18. Index