Efficient data processing with R by Robin Lovelace, Colin Gillespie

Working with Databases

Unlike R, which loads all data into RAM, databases query data from the hard disk. This allows a subset of a very large dataset to be defined and read into R quickly, without having to load the whole dataset first. R can connect to databases in a number of ways, which are briefly touched on below. The subject of databases is a large area undergoing rapid evolution. Rather than aiming at comprehensive coverage, we will provide pointers to developments that enable efficient access to a wide range of database types. An up-to-date history of R’s interfaces to databases can be found in the README of the DBI package, which provides a common interface and set of classes for driver packages (such as RSQLite).

RODBC is a veteran package for querying external databases from within R, using the Open Database Connectivity (ODBC) API. The functionality of RODBC is described in the package’s vignette (see vignette("RODBC")), and today its main use is to provide an R interface to SQL Server databases, which lack a DBI interface.

The DBI package is a unified framework for accessing databases that allows for other drivers to be added as modular packages. Thus new packages that build on DBI, such as RMySQL, RPostgreSQL, and RSQLite, can be seen as partial replacements for RODBC (see vignette("backend") for more on how DBI drivers work). Because the DBI syntax applies to a wide range of database types, we use it here with a worked example.

Imagine you have access to a database that contains the ghg_ems dataset.

# Connect to a database driver
library("RSQLite")
con = dbConnect(SQLite(), dbname = ghg_db) # Other drivers also accept username & password arguments
dbListTables(con)
rs = dbSendQuery(con, "SELECT * FROM `ghg_ems` WHERE (`Country` != 'World')")
df_head = dbFetch(rs, n = 6) # extract first 6 rows

The preceding code chunk shows how the function dbConnect connects to an external database, in this case an SQLite database. For client-server databases such as MySQL, username and password arguments are also needed to establish the connection. Next, we query which tables are available with dbListTables, query the database (without yet extracting the results to R) with dbSendQuery, and, finally, load the results into R with dbFetch.
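If you do not have access to the ghg_ems database, the same DBI workflow can be tried with an in-memory SQLite database and a small made-up table (the contents below are illustrative, not real emissions data):

```r
# A self-contained sketch of the DBI workflow using an in-memory database
library("RSQLite")

con = dbConnect(SQLite(), dbname = ":memory:")
# Write a tiny illustrative table so the queries below have something to hit
dbWriteTable(con, "ghg_ems", data.frame(
  Country = c("World", "UK", "France"),
  Electricity = c(100, 30, 20)
))

dbListTables(con)  # lists the available tables, here just "ghg_ems"
rs = dbSendQuery(con, "SELECT * FROM `ghg_ems` WHERE (`Country` != 'World')")
df_head = dbFetch(rs, n = 6)  # a data frame with the UK and France rows
dbClearResult(rs)             # free the result set before disconnecting
dbDisconnect(con)
```

Because SQLite stores the database in a single file (or, here, in memory), no username or password is needed; drivers for client-server databases take these as extra arguments to dbConnect().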

Tip

Never expose your password by typing it directly into a command. Instead, we recommend saving sensitive information such as database passwords and API keys in .Renviron. Assuming you have saved your password as the environment variable PSWRD, you can enter pwd = Sys.getenv("PSWRD") to minimize the risk of exposing it through accidentally sharing your code or session history.
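For example, assuming your .Renviron file contains a line such as PSWRD=your-password (the variable name PSWRD is only an illustration), the secret can be read at runtime instead of being hard-coded:

```r
# Read a secret from the environment rather than hard-coding it in a script.
# PSWRD is an illustrative name, set in ~/.Renviron as: PSWRD=your-password
pwd = Sys.getenv("PSWRD")

# Sys.getenv() returns "" when a variable is unset, so check before using it
if (pwd == "") {
  warning("PSWRD is not set; add it to ~/.Renviron and restart R")
}
```

Note that R reads .Renviron only at startup, so restart your session after editing the file.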

Recently there has been a shift toward NoSQL approaches for storing large datasets. This is illustrated by the emergence and uptake of software such as MongoDB and Apache Cassandra, which have R interfaces via the packages mongolite and RJDBC, respectively; RJDBC can connect to Apache Cassandra data stores and to any source compliant with the Java Database Connectivity (JDBC) API.

MonetDB is a recent alternative to relational and noSQL approaches that offers substantial efficiency advantages for handling large datasets (Kersten et al. 2011). A tutorial on the MonetDB website provides an excellent introduction to handling databases from within R.

There are many wider considerations in relation to databases that we will not cover here: who will manage and maintain the database? How will it be backed up locally (local copies should be stored to reduce reliance on the network)? What is the appropriate database for your project? These issues can have major effects on efficiency, especially on large, data-intensive projects. However, we will not cover them here because it is a fast-moving field. Instead, we direct the interested reader toward resources on the subject, including:

  • The website for sparklyr, a recently created package for efficiently interfacing with the Apache Spark stack.

  • db-engines.com/en/, a website comparing the relative merits of different databases.

  • The databases vignette from the dplyr package.

  • Getting started with MongoDB in R, an introductory vignette on nonrelational databases and map reduce from the mongolite package.

Databases and dplyr

To access a database in R via dplyr, you must use one of the src_*() functions to create a source. Continuing with the SQLite example previously given, you would create a tbl object that can be queried by dplyr as follows:

library("dplyr")
ghg_db = src_sqlite(ghg_db) # ghg_db was the file path; it is now a src object
ghg_tbl = tbl(ghg_db, "ghg_ems")

The ghg_tbl object can then be queried in much the same way as a standard data frame. For example, suppose we wish to filter by Country; we use the filter() function as before:

rm_world = ghg_tbl %>%
  filter(Country != "World")

In this code, dplyr has actually generated the necessary SQL command, which can be examined using explain(rm_world). When working with databases, dplyr uses lazy evaluation: the data is only fetched at the last moment when it’s needed. The SQL command associated with rm_world hasn’t yet been executed; this is why tail(rm_world) doesn’t work. By using lazy evaluation, dplyr is more efficient at handling large data structures because it avoids unnecessary copying. When you want your SQL command to be executed, use collect(rm_world).
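The lazy behavior can be demonstrated with a self-contained sketch using an in-memory SQLite database. This assumes the dbplyr backend package is installed; in recent versions of dplyr, show_query() has replaced explain() for printing the generated SQL, and the table contents below are made up for illustration:

```r
library("dplyr")
library("dbplyr")   # dplyr's database backend
library("RSQLite")

con = DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# A tiny illustrative table standing in for ghg_ems
DBI::dbWriteTable(con, "ghg_ems", data.frame(
  Country = c("World", "UK", "France"),
  Electricity = c(100, 30, 20)
))

ghg_tbl = tbl(con, "ghg_ems")
rm_world = ghg_tbl %>%
  filter(Country != "World")  # no data fetched yet: only SQL is built

show_query(rm_world)          # prints the SQL that dplyr generated
res = collect(rm_world)       # now the query runs and returns a tibble
DBI::dbDisconnect(con)
```

Until collect() is called, rm_world is just a description of a query, which is why functions that need the full result, such as tail(), do not work on it.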

The final stage when working with databases in R is to disconnect:

dbDisconnect(conn = con)

Exercises

Follow the worked example here to create and query a database on land prices in the UK using dplyr as a frontend to an SQLite database. The first stage is to read in the data:

# See help("land_df", package="efficient") for details
data(land_df, package = "efficient")

The next stage is to create an SQLite database to hold the data:

# install.packages("RSQLite") # Requires RSQLite package
my_db = src_sqlite("land.sqlite3", create = TRUE)
land_sqlite = copy_to(my_db, land_df, indexes = list("postcode", "price"))

  1. What class is the new object land_sqlite?

  2. Why did we use the indexes argument?

    From the preceding code, we can see that we have created a tbl. This can be accessed using dplyr in the same way as any data frame. Now we can query the data: you can use SQL code to query the database directly, or use standard dplyr verbs on the table.

    # Method 1: using sql
    tbl(my_db, sql('SELECT "price", "postcode", "old/new"  FROM land_df'))
    #> Source:   query [?? x 3]
    #> Database: sqlite 3.8.6 [land.sqlite3]
    #>
    #>    price postcode `old/new`
    #>    <int>    <chr>     <chr>
    #> 1  84000  CW9 5EU         N
    #> 2 123500 TR13 8JH         N
    #> 3 217950 PL33 9DL         N
    #> 4 147000 EX39 5XT         N
    #> # ... with more rows
  3. How would you perform the same query using select()? Try it to see if you get the same result (hint: use backticks for the old/new variable name).

    #> Source:   query [?? x 3]
    #> Database: sqlite 3.8.6 [land.sqlite3]
    #>
    #>    price postcode `old/new`
    #>    <int>    <chr>     <chr>
    #> 1  84000  CW9 5EU         N
    #> 2 123500 TR13 8JH         N
    #> 3 217950 PL33 9DL         N
    #> 4 147000 EX39 5XT         N
    #> # ... with more rows

References

Wickham, Hadley. 2014b. “Tidy Data.” Journal of Statistical Software 59 (10).

Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.

Spector, Phil. 2008. Data Manipulation with R. Springer Science & Business Media.

Sanchez, Gaston. 2013. “Handling and Processing Strings in R.” Trowchez Editions. http://bit.ly/handlingstringsR.

Grolemund, G., and H. Wickham. 2016. R for Data Science. O’Reilly Media.

Wickham, Hadley. 2010. “Stringr: Modern, Consistent String Processing.” The R Journal 2 (2): 38–40.

Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds.” PVLDB Challenges and Visions 3.
