Chapter 4.  Unified Data Access

Integrating data from disparate sources has always been a daunting task. The three V's of big data (volume, velocity, and variety) and ever-shrinking processing time frames have made it even more challenging. Delivering a clear view of well-curated data in near real time is extremely important for business. However, what is emerging as a key business differentiator is real-time curated data combined with the ability to perform different operations, such as ETL, ad hoc querying, and machine learning, in a unified fashion.

Apache Spark was created to offer a single general-purpose engine that can process data from a variety of sources and support large-scale processing for a range of operations. Spark enables developers to ...
