Instead of loading all the data into RAM, as R does, databases query data from the hard disk. This can allow a subset of a very large dataset to be defined and read into R quickly, without having to load it first. R can connect to databases in a number of ways, which are briefly touched on below. The subject of databases is a large area undergoing rapid evolution. Rather than aiming at comprehensive coverage, we will provide pointers to developments that enable efficient access to a wide range of database types. An up-to-date history of R’s interfaces to databases can be found in the README of the DBI package, which provides a common interface and set of classes for driver packages (such as RSQLite).
RODBC is a veteran package for querying external databases from within R, using the Open Database Connectivity (ODBC) API. The functionality of RODBC is described in the package's vignette (see vignette("RODBC")), and today its main use is to provide an R interface to SQL Server databases, which lack a DBI interface.
The DBI package is a unified framework for accessing databases
that allows for other drivers to be added as modular packages. Thus new
packages that build on DBI can be seen partly as replacements for RODBC (RMySQL, RPostgreSQL, and RSQLite) (see
vignette("backend") for more on how DBI drivers work). Because the
DBI syntax applies to a wide range of database types, we use it here
with a worked example.
Imagine you have access to a database that contains the ghg_ems table:
```r
# Connect to a database driver
library("RMySQL")
# user, password, host, and dbname are placeholder arguments here
con = dbConnect(MySQL(), user = "user", password = "password",
                host = "localhost", dbname = "emissions")
dbListTables(con)  # list the tables available in the database
rs = dbSendQuery(con, "SELECT * FROM `ghg_ems` WHERE (`Country` != 'World')")
df_head = dbFetch(rs, n = 6)  # extract first 6 rows
```
The preceding code chunk shows how the function dbConnect connects to an external database, in this case a MySQL database. The user and password arguments are used to establish the connection. Next, we query which tables are available with dbListTables, query the database (without yet extracting the results to R) with dbSendQuery, and finally load the results into R with dbFetch.
Be sure never to release your password by entering it directly into the
command. Instead, we recommend saving sensitive information such as
database passwords and API keys in .Renviron.
Assuming you had saved your password as the environment variable PSWRD, you could enter pwd = Sys.getenv("PSWRD") to minimize the risk of exposing your password through accidentally releasing the code or your session history.
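This pattern can be sketched as follows (the variable name PSWRD follows the text above; the commented connection call is illustrative only):

```r
# .Renviron (in your home directory) would contain a line such as:
#   PSWRD=my-secret-password
# R reads this file at startup, so the value is available via:
pwd = Sys.getenv("PSWRD")
# The password itself never appears in your script or session history:
# con = dbConnect(MySQL(), user = "user", password = pwd, ...)
```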
Recently there has been a shift to the NoSQL approach for storing large datasets. This is illustrated by the emergence and uptake of software such as MongoDB and Apache Cassandra, which have R interfaces via the packages mongolite and RJDBC; the latter can connect to Apache Cassandra data stores and to any source compliant with the Java Database Connectivity (JDBC) API.
MonetDB is a recent alternative to relational and NoSQL approaches that offers substantial efficiency advantages for handling large datasets (Kersten et al. 2011). A tutorial on the MonetDB website provides an excellent introduction to handling databases from within R.
There are many wider considerations in relation to databases that we will not cover here: who will manage and maintain the database? How will it be backed up locally (local copies should be stored to reduce reliance on the network)? What is the appropriate database for your project? These issues can have major effects on efficiency, especially on large, data-intensive projects. However, we will not cover them here because it is a fast-moving field. Instead, we direct the interested reader toward resources on the subject, including:
The website for sparklyr, a recently created package for efficiently interfacing with the Apache Spark stack.
db-engines.com/en/, a website comparing the relative merits of different databases.
The databases vignette from the dplyr package.
Getting started with MongoDB in R, an introductory vignette on nonrelational databases and map reduce from the mongolite package.
To access a database in R via dplyr, you must use one of the src_() functions (such as src_sqlite()) to create a source. Continuing with the SQLite example previously given, you would create a tbl object that can be queried by dplyr as follows.
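A minimal sketch of this step, assuming the ghg_ems table was saved earlier to a file named ghg.sqlite3 (the filename is an assumption):

```r
library("dplyr")
# src_sqlite() creates a source from an existing SQLite file; newer dplyr
# versions favour a DBI connection instead:
#   ghg_tbl = tbl(DBI::dbConnect(RSQLite::SQLite(), "ghg.sqlite3"), "ghg_ems")
ghg_db = src_sqlite("ghg.sqlite3")
ghg_tbl = tbl(ghg_db, "ghg_ems")
```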
The ghg_tbl object can then be queried in a similar way to a standard data frame. For example, suppose we wished to filter by Country; we use the filter() function as before:
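The elided chunk presumably resembled the following sketch (the object name rm_world and the Country condition come from the surrounding text, not from code shown here):

```r
# Build a lazy query that drops the aggregate "World" rows
rm_world = ghg_tbl %>%
  filter(Country != "World")
```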
In this code, dplyr has actually generated the necessary SQL command, which can be examined using explain(rm_world). When working with databases, dplyr uses lazy evaluation: the data is only fetched at the last moment when it's needed. The SQL command associated with rm_world hasn't yet been executed; this is why tail(rm_world) doesn't work. By using lazy evaluation, dplyr is more efficient at handling large data structures because it avoids unnecessary copying. When you want your SQL command to be executed, use collect().
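Putting this together (a sketch; rm_world is the lazy table defined above):

```r
explain(rm_world)                 # print the generated SQL without running it
rm_world_df = collect(rm_world)   # execute the query and return a tibble in R
```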
Follow the worked example here to create and query a database on land prices in the UK using dplyr as a frontend to an SQLite database. The first stage is to read in the data:
```r
# See help("land_df", package = "efficient") for details
data("land_df", package = "efficient")
```
The next stage is to create an SQLite database to hold the data:
```r
# install.packages("RSQLite") # Requires RSQLite package
my_db = src_sqlite("land.sqlite3", create = TRUE)
land_sqlite = copy_to(my_db, land_df, indexes = list("postcode", "price"))
```
What class is the new object land_sqlite? Why did we use the indexes argument?
From the preceding code, we can see that we have created a tbl. This can be accessed using dplyr in the same way as any data frame. Now we can query the data: you can use SQL code to query the database directly, or use standard dplyr verbs on the table.
```r
# Method 1: using sql
tbl(my_db, sql('SELECT "price", "postcode", "old/new" FROM land_df'))
#> Source: query [?? x 3]
#> Database: sqlite 3.8.6 [land.sqlite3]
#>
#>    price postcode `old/new`
#>    <int>    <chr>     <chr>
#> 1  84000  CW9 5EU         N
#> 2 123500 TR13 8JH         N
#> 3 217950 PL33 9DL         N
#> 4 147000 EX39 5XT         N
#> # ... with more rows
```
How would you perform the same query using select()? Try it to see if you get the same result (hint: use backticks for the old/new variable name).
```r
#> Source: query [?? x 3]
#> Database: sqlite 3.8.6 [land.sqlite3]
#>
#>    price postcode `old/new`
#>    <int>    <chr>     <chr>
#> 1  84000  CW9 5EU         N
#> 2 123500 TR13 8JH         N
#> 3 217950 PL33 9DL         N
#> 4 147000 EX39 5XT         N
#> # ... with more rows
```
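One possible answer to the exercise, as a sketch (land_sqlite denotes the table object created when the data were copied into the database):

```r
# Method 2: using dplyr verbs; backticks protect the non-syntactic column name
select(land_sqlite, price, postcode, `old/new`)
```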
Wickham, Hadley. 2014b. “Tidy Data.” Journal of Statistical Software 59 (10).
Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.
Spector, Phil. 2008. Data Manipulation with R. Springer Science & Business Media.
Sanchez, Gaston. 2013. “Handling and Processing Strings in R.” Trowchez Editions. http://bit.ly/handlingstringsR.
Grolemund, G., and H. Wickham. 2016. R for Data Science. O’Reilly Media.
Wickham, Hadley. 2010. “Stringr: Modern, Consistent String Processing.” The R Journal 2 (2): 38–40.
Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds.” PVLDB Challenges and Visions 3.