Chapter 12. Data Science and R
12.0 Introduction
Data science is a relatively new discipline that first came to the attention of many with a 2010 article by O’Reilly’s Mike Loukides. While there are many definitions in the field, Loukides distills his detailed observation of and participation in data science into this definition:
A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.
One of the main open source ecosystems for data science software is at Apache and includes Hadoop (which includes the Hadoop Distributed File System [HDFS], Hadoop MapReduce, the Ozone object store, and the YARN scheduler), the Cassandra distributed database, and the Spark compute engine. Read the Modules and Related projects sections of the Hadoop page for a current list.
What’s interesting here is that a great deal of this infrastructure, which is taken for granted by data scientists, is written in Java and Scala (a JVM language). Much of the rest is written in Python, a language that complements Java. Many users see only the Python side of things and don’t realize that Java is behind some of the infrastructure.
Data science (DS) problems can involve a lot of setup, so we’ll give only one example from traditional DS, using the Spark framework. Spark is written in Scala, which compiles to JVM bytecode, so it can be used directly by Java code.
In the rest of the chapter I’ll ...