Scalable data science with R
You’ve got three options: Scaling up, scaling out, or using R as an abstraction layer.
R is among the top five data science tools in use today according O’Reilly research; the latest kdnuggets survey puts it in first, and IEEE Spectrum ranks it as the fifth most popular programming language.
The latest Rexer Data Miner survey revealed that in the past eight years, there has been an three-fold increase in the number of respondents using R, and a seven-fold increase in the number of analysts/scientists who have said that R is their primary tool.
Despite its popularity, the main drawback of vanilla R is its inherently “single threaded” nature and its need to fit all the data being processed in RAM. But nowadays, data sets are typically in the range of GBs and they are growing quickly to TBs. In short, current growth in data volume and variety is demanding more efficient tools by data scientists.
Every data science analysis starts with preparing, cleaning, and transforming the raw input data into some tabular data that can be further used in machine learning models.
In the particular case of R, data size problems usually arise when the input data do not fit in the RAM of the machine and when data analysis takes a long time because parallelism does not happen automatically. Without making the data smaller (through sampling, for example) this problem can be solved in two different ways:
- Scaling-out vertically, by using a machine with more available RAM. For some data scientists leveraging cloud environments like AWS, this can be as easy as changing the instance type of the machine (for example, AWS recently provided an instance with 2TB of RAM). However most companies today are using their internal data infrastructure that relies on commodity hardware to analyze data—they’ll have more difficulty increasing their available RAM.
- Scaling-out horizontally: In this context, it is necessary to change the default R behaviour of loading all required data in memory and access the data differently by using a distributed or parallel schema with a divide-and-conquer (or in R terms, split-apply-combine) approach like MapReduce.
While the first approach is obvious and can use the same code to deal with different data sizes, it can only scale to the memory limits of the machine being used. The second approach, by contrast, is more powerful but it is also more difficult to set up and adapt to existing legacy code.
There is a third approach—scaling-out horizontally can be solved by using R as an interface to the most popular distributed paradigms:
- Hadoop: through using the set of libraries or packages known as RHadoop. These R packages allow users to analyze data with Hadoop through R code. They consist on rhdfs to interact with HDFS systems; rhbase to connect with HBase; plyrmr to perform common data transformation operations over large datasets; rmr2 that provides a map-reduce API; and ravro that writes and reads avro files.
- Spark: with SparkR it is possible to use Spark’s distributed computation engine to enable large-scale data analysis from the R shell. It provides a distributed data frame implementation that supports operations like selection, filtering, aggregation, etc., on large data sets.
- Programming with Big Data in R: (pbdr) is based on MPI and can be used on high-performance computing (HPC) systems, providing a true parallel programming environment in R.
Novel distributed platforms also combine batch and stream processing, providing a SQL-like expression language—for instance, Apache Flink. There are also higher levels of abstraction that allow you to create a data processing language, such as the recently open sourced project Apache Beam from Google. However, these novel projects are still under development, and so far do not include R support.
After the data preparation step, the next common data science phase consists of training machine learning models, which can also be performed on a single machine or distributed among different machines. In the case of distributed machine learning frameworks, the most popular approaches using R, are the following:
- Spark MLlib: through SparkR, some of the machine learning functionalities of Spark are exported in the R package. In particular, the following machine learning models are supported from R: generalized linear model (GLM), survival regression, naive Bayes and k-means.
- H2o framework: a Java-based framework that allows building scalable machine learning models in R or Python. It can run as standalone platform or with an existing Hadoop or Spark implementation. It provides a variety of supervised learning models, such as GLM, gradient boosting machine (GBM), deep learning, Distributed Random Forest, naive Bayes and unsupervised learning implementations like PCA and k-means.
Sidestepping the coding and customization issues of these approaches, you can seek out a commercial solution that uses R to access data on the front-end but uses its own big-data-native processing under the hood.
- Teradata Aster R is a massively parallel processing (MPP) analytic solution that facilitates the data preparation and modeling steps in a scalable way using R. It supports a variety of data sources (text, numerical, time series, graphs) and provides an R interface to Aster’s data science library that scales by using a distributed/parallel environment, avoiding the technical complexities to the user. Teradata also has a partnership with Revolution Analytics (now Microsoft R) where users can execute R code inside of Teradata’s platform.
- HP Vertica is similar to Aster, but it provides On-Line Analytical Processing (OLAP) optimized for large fact tables, whereas Teradata provides On-Line Transaction Processing (OLTP) or OLAP that can handle big volumes of data. To scale out R applications, HP Vertica relies on the open source project distributed R.
- Oracle also includes an R interface in its advanced analytics solution, known as Oracle R Advanced Analytics for Hadoop (ORAAH), and it provides an interface to interact with HDFS and access to Spark MLlib algorithms.
Teradata has also released an open source package in CRAN called toaster that allows users to compute, analyze, and visualize data with (on top of) the Teradata Aster database. It allows computing data in Aster by taking advantage of Aster distributed and parallel engines, and then creates visualizations of the results directly in R. For example, it allows users to execute K-Means or run several cross-validation iterations of a linear regression model in parallel.
Also related is MADlib, an open source library for scalable in-database analytics currently in incubator at Apache. There are other open source CRAN packages to deal with big data, such as biglm, bigpca, biganalytics, bigmemory or pbdR—but they are focused on specific issues rather than addressing the data science pipeline in general.
Big data analysis presents a lot of opportunities to extract hidden patterns when you are using the right algorithms and the underlying technology that will help to gather insights. Connecting new scales of data with familiar tools is a challenge, but tools like Aster R offer a way to combine the beauty and elegance of the R language within a distributed environment to allow processing data at scale.