Working with Databases
Unlike R, which loads all of a dataset into RAM, databases query data from the hard disk. This allows a subset of a very large dataset to be defined and read into R quickly, without having to load the whole dataset first. R can connect to databases in a number of ways, which are briefly touched on below. The subject of databases is a large area undergoing rapid evolution. Rather than aiming at comprehensive coverage, we will provide pointers to developments that enable efficient access to a wide range of database types. An up-to-date history of R's interfaces to databases can be found in the README of the DBI package, which provides a common interface and set of classes for driver packages (such as RSQLite).
RODBC is a veteran package for querying external databases from within R, using the Open Database Connectivity (ODBC) API. The functionality of RODBC is described in the package's vignette (see vignette("RODBC")), and today its main use is to provide an R interface to SQL Server databases, which lack a DBI interface.
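As a minimal sketch of the RODBC workflow, the following assumes an ODBC data source name (DSN) called "ghg_dsn" has already been configured on your system; the DSN and credentials here are hypothetical:

```r
# Hypothetical sketch of RODBC usage; assumes a configured DSN "ghg_dsn"
library("RODBC")
ch = odbcConnect("ghg_dsn", uid = "user", pwd = Sys.getenv("PSWRD"))
sqlTables(ch)                               # list the tables in the database
res = sqlQuery(ch, "SELECT * FROM ghg_ems") # run a query, returning a data frame
odbcClose(ch)                               # release the connection
```

Note that RODBC returns plain data frames directly, rather than the send/fetch two-step used by DBI below.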
The DBI package is a unified framework for accessing databases that allows for other drivers to be added as modular packages. New packages that build on DBI, such as RMySQL, RPostgreSQL, and RSQLite, can thus be seen partly as replacements for RODBC (see vignette("backend") for more on how DBI drivers work). Because the DBI syntax applies to a wide range of database types, we use it here with a worked example.
Imagine you have access to a database that contains the ghg_ems dataset.
# Connect to a database driver
library("RSQLite")
con = dbConnect(SQLite(), dbname = ghg_db) # Also username & password arguments
dbListTables(con)
rs = dbSendQuery(con, "SELECT * FROM `ghg_ems` WHERE (`Country` != 'World')")
df_head = dbFetch(rs, n = 6) # Extract first 6 rows
The preceding code chunk shows how the function dbConnect connects to an external database, in this case an SQLite database. For databases that require authentication, the username and password arguments are used to establish the connection. Next, we query which tables are available with dbListTables, query the database (without yet extracting the results to R) with dbSendQuery, and, finally, load the results into R with dbFetch.
Tip
Be sure never to release your password by entering it directly into the command. Instead, we recommend saving sensitive information such as database passwords and API keys in .Renviron. Assuming you had saved your password as the environment variable PSWRD, you could enter pwd = Sys.getenv("PSWRD") to minimize the risk of exposing your password through accidentally releasing the code or your session history.
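The pattern can be sketched as follows. In real use you would put a line such as PSWRD=your-password in ~/.Renviron (read once at R startup); here we set the variable with Sys.setenv() only to make the example self-contained:

```r
# Sketch of the .Renviron pattern; Sys.setenv() stands in for a line
# like PSWRD=your-password in ~/.Renviron
Sys.setenv(PSWRD = "dummy-secret")
pwd = Sys.getenv("PSWRD")  # read the secret at run time
# pwd can now be passed as the password argument to a connection
# function without the secret appearing in your script or history
```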
Recently there has been a shift toward NoSQL approaches for storing large datasets. This is illustrated by the emergence and uptake of software such as MongoDB and Apache Cassandra, which have R interfaces via the packages mongolite and RJDBC; the latter can connect to Apache Cassandra data stores and to any source compliant with the Java Database Connectivity (JDBC) API.
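For a flavor of the NoSQL style, here is a hypothetical sketch with mongolite; it assumes a MongoDB server running locally on the default port, with an invented "ghg_ems" collection in an invented "ghg" database:

```r
# Hypothetical sketch; assumes a local MongoDB server with this collection
library("mongolite")
m = mongo(collection = "ghg_ems", db = "ghg")
# Queries are expressed in JSON rather than SQL:
# all documents whose Country field is not "World", first 6 only
m$find('{"Country": {"$ne": "World"}}', limit = 6)
```

Note how the query language is JSON-based rather than SQL, reflecting MongoDB's document-oriented data model.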
MonetDB is a recent alternative to relational and noSQL approaches that offers substantial efficiency advantages for handling large datasets (Kersten et al. 2011). A tutorial on the MonetDB website provides an excellent introduction to handling databases from within R.
There are many wider considerations in relation to databases that we will not cover here: who will manage and maintain the database? How will it be backed up locally (local copies should be stored to reduce reliance on the network)? What is the appropriate database for your project? These issues can have major effects on efficiency, especially on large, data-intensive projects. However, we will not cover them here because it is a fast-moving field. Instead, we direct the interested reader toward resources on the subject, including:
- The website for sparklyr, a recently created package for efficiently interfacing with the Apache Spark stack.
- db-engines.com/en/, a website comparing the relative merits of different databases.
- The databases vignette from the dplyr package.
- Getting started with MongoDB in R, an introductory vignette on nonrelational databases and map reduce from the mongolite package.
Databases and dplyr
To access a database in R via dplyr, you must use one of the src_*() functions to create a source. Continuing with the SQLite example previously given, you would create a tbl object that can be queried by dplyr as follows:
library("dplyr")
ghg_db = src_sqlite(ghg_db)
ghg_tbl = tbl(ghg_db, "ghg_ems")
The ghg_tbl object can then be queried in a similar way as a standard data frame. For example, suppose we wished to filter by Country. Then we use the filter() function as before:
rm_world = ghg_tbl %>%
  filter(Country != "World")
In this code, dplyr has actually generated the necessary SQL command, which can be examined using explain(rm_world). When working with databases, dplyr uses lazy evaluation: the data is only fetched at the last moment when it's needed. The SQL command associated with rm_world hasn't yet been executed; this is why tail(rm_world) doesn't work. By using lazy evaluation, dplyr is more efficient at handling large data structures because it avoids unnecessary copying. When you want your SQL command to be executed, use collect(rm_world).
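The lazy-evaluation workflow can be sketched end to end with a small in-memory database. This is a self-contained sketch assuming the dplyr, dbplyr, and RSQLite packages; the two-row dataset is invented, and show_query() is the newer dplyr equivalent of explain():

```r
# Sketch: lazy evaluation with an invented in-memory table
library("dplyr")
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "ghg_ems",
                  data.frame(Country = c("World", "UK"),
                             Electricity = c(10, 1)))
rm_world = tbl(con, "ghg_ems") %>%
  filter(Country != "World")      # no SQL has been executed yet
show_query(rm_world)              # inspect the SQL dplyr generated
rm_world_df = collect(rm_world)   # now the query runs; results reach R
DBI::dbDisconnect(con)
```

After collect(), rm_world_df is an ordinary local data frame, so functions such as tail() work on it as usual.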
The final stage when working with databases in R is to disconnect:
dbDisconnect(conn = con)
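When connections are opened inside functions, a common pattern is to guarantee the disconnect with on.exit(), so it happens even if the query fails. A minimal sketch, assuming the DBI and RSQLite packages; the function name query_once is our own invention:

```r
library("DBI")
library("RSQLite")

# Open a connection, run one query, and always disconnect afterward
query_once = function(dbname, statement) {
  con = dbConnect(SQLite(), dbname = dbname)
  on.exit(dbDisconnect(con))  # runs on exit, whether or not an error occurs
  dbGetQuery(con, statement)
}

# Usage with a throwaway in-memory database
query_once(":memory:", "SELECT 1 AS x")
```

This avoids leaking connections when a query errors partway through a script.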
Exercises
Follow the worked example here to create and query a database on land prices in the UK using dplyr as a frontend to an SQLite database. The first stage is to read in the data:
# See help("land_df", package = "efficient") for details
data(land_df, package = "efficient")
The next stage is to create an SQLite database to hold the data:
# install.packages("RSQLite") # Requires RSQLite package
my_db = src_sqlite("land.sqlite3", create = TRUE)
land_sqlite = copy_to(my_db, land_df, indexes = list("postcode", "price"))
- What class is the new object land_sqlite?
- Why did we use the indexes argument?

From the preceding code, we can see that we have created a tbl. This can be accessed using dplyr in the same way as any data frame can. Now we can query the data. You can use SQL code to query the database directly or use standard dplyr verbs on the table.

# Method 1: using sql
tbl(my_db, sql('SELECT "price", "postcode", "old/new" FROM land_df'))
#> Source: query [?? x 3]
#> Database: sqlite 3.8.6 [land.sqlite3]
#>
#> price postcode `old/new`
#> <int> <chr> <chr>
#> 1 84000 CW9 5EU N
#> 2 123500 TR13 8JH N
#> 3 217950 PL33 9DL N
#> 4 147000 EX39 5XT N
#> # ... with more rows
- How would you perform the same query using select()? Try it to see if you get the same result (hint: use backticks for the old/new variable name).

#> Source: query [?? x 3]
#> Database: sqlite 3.8.6 [land.sqlite3]
#>
#>    price postcode `old/new`
#>    <int>    <chr>     <chr>
#> 1  84000  CW9 5EU         N
#> 2 123500 TR13 8JH         N
#> 3 217950 PL33 9DL         N
#> 4 147000 EX39 5XT         N
#> # ... with more rows
References
Wickham, Hadley. 2014b. “Tidy Data.” Journal of Statistical Software 59 (10).
Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.
Spector, Phil. 2008. Data Manipulation with R. Springer Science & Business Media.
Sanchez, Gaston. 2013. “Handling and Processing Strings in R.” Trowchez Editions. http://bit.ly/handlingstringsR.
Grolemund, G., and H. Wickham. 2016. R for Data Science. O’Reilly Media.
Wickham, Hadley. 2010. “Stringr: Modern, Consistent String Processing.” The R Journal 2 (2): 38–40.
Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds.” PVLDB Challenges and Visions 3.