Mastering Spark with R

by Javier Luraschi, Kevin Kuo, Edgar Ruiz
October 2019
Beginner to intermediate
293 pages
6h 55m
English
O'Reilly Media, Inc.

Chapter 8. Data

Has it occurred to you that she might not have been a reliable source of information?

—Jon Snow

With the knowledge acquired in previous chapters, you are now equipped to start doing analysis and modeling at scale! So far, however, we haven’t really explained much about how to read data into Spark. We’ve used copy_to() to upload small datasets, and functions like spark_read_csv() and spark_write_csv() to read and write files, without explaining in detail how or why they work.
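
As a reminder of how these functions fit together, here is a minimal sketch, assuming sparklyr is installed and a local Spark installation is available (for example through spark_install()). The table names "cars" and "cars_back" and the temporary output path are illustrative choices, not names taken from the book.

```r
library(sparklyr)

# Assumption: a local Spark installation is available to sparklyr.
sc <- spark_connect(master = "local")

# Upload a small in-memory R data frame to Spark.
cars <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

# Write the Spark table out as CSV. Spark creates a directory of part
# files at this path rather than a single file.
csv_path <- file.path(tempdir(), "cars_csv")  # hypothetical location
spark_write_csv(cars, csv_path, mode = "overwrite")

# Read the CSV directory back into Spark under a new table name.
cars_back <- spark_read_csv(sc, name = "cars_back", path = csv_path)

# Count rows on the Spark side to confirm the round trip.
sdf_nrow(cars_back)
```

When you are done, spark_disconnect(sc) closes the connection. Note that copy_to() is meant for small datasets that already fit in R's memory, whereas spark_read_csv() lets Spark read the files directly.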

So, you are about to learn how to read and write data using Spark. And, while this is important on its own, this chapter will also introduce you to the data lake—a repository of data stored in its natural or raw format that provides various benefits over existing storage architectures. For instance, you can easily integrate data from external systems without transforming it into a common format and without assuming those sources are as reliable as your internal data sources.

We will also discuss how to extend Spark’s capabilities to work with data that is not accessible out of the box, and make several recommendations for improving performance when reading and writing data. Reading large datasets often requires you to fine-tune your Spark cluster configuration, but that’s the topic of Chapter 9.

Overview

In Chapter 1, you learned that beyond big data and big compute, you can also use Spark to improve velocity, variety, and veracity in data tasks. While you can use the learnings of this chapter for any ...

ISBN: 9781492046363