Over the past few years, Spark has quickly grown from a research project to probably the most active open source project in parallel computing today. Spark started with a simple idea: accelerating distributed computing by leveraging the increasingly larger memory banks provided in modern computing nodes.
Today, as Apache Spark reaches the 2.0 milestone, it has become a rich framework where multiple techniques such as SQL, Machine Learning, Graph Analytics, and Streaming Computing can be combined together to solve complex data processing problems for a wide range of industries and business domains. Moreover, Spark connectors allow data engineers and data scientists to transfer data between Spark and a number of very diverse computing and storage systems, both via batch as well as real-time data interfaces.
Spark is quickly becoming a fundamental component in any modern data processing pipeline for startups and enterprises alike, thanks to its unique multi-language programming paradigm and the wide availability of connectors to other systems both for batch and real-time computing.
The evolution of Spark
In the early days, Spark programming revolved around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. The original Spark core API did not always feel natural for the larger population of data analysts and data engineers, who worked mainly with SQL and statistical languages such as R.
Today, Spark provides higher level APIs for advanced analytics and data science, and supports five different languages, including SQL and R. What makes Spark quite special in the distributed computing arena is the fact that different techniques such as SQL queries and machine learning can be mixed and combined together, even within the same script.
By using Spark, data scientists and engineers do not have to switch to different environments and tools for data pre-processing, SQL queries or machine learning algorithms. This fact boosts the productivity of data professionals and delivers better and simpler data processing solutions.
SQL vs. Scala vs. Java vs. Python vs. R
According to the last kdnuggets poll, SQL is still one of the most popular languages for data processing and data science. On the other side, increasingly more data scientists and engineers prefer to work programmatically on data sets with languages such as Scala, Python, or R.
To harmonize these two camps, Spark has recently introduced the Apache Spark Datasets as a high-level table-like data abstraction. Datasets feel more natural when reasoning about analytics and machine learning tasks, and can be addressed both via SQL queries as well as programmatically via Java/Scala/Python/R APIs.
Datasets behave in a similar way to SQL tables and undergo more aggressive optimizations compared to the early Spark RDD API. Therefore, data engineers are no longer bound to a specific language or programming style in Spark because of performance reasons. This has a dramatic effect on performance, as these optimizations can significantly reduce the memory footprint and the computing time of Spark jobs.
SQL functions for machine learning
Following this line of thought, it makes sense to provide even more advanced analytics and machine learning algorithms as Datasets and SQL functions. Extending SQL with specialized functions for advanced analytics, machine learning, and data transformations has already been explored in both open and closed-source tools. Teradata’s Aster Analytics, Fuzzy Logix DB Lytix, Hivemall, and Spark ML are some projects hinting in this direction.
Fuzzy Logix provides more than 800 in-database advanced analytics functions, including many machine learning algorithms for clustering, forecasting, and classification. It supports many traditional EDW systems as well as Teradata Aster and Spark. Instead of moving the data elsewhere, this engine uses the underlying SQL engine to execute those analytics functions.
On the open source front, Hivemall is a project that aims to extend Hadoop Hive with a library of SQL User Defined Functions (UDF) to support algorithms such as regression, classification, clustering, recommendation, and feature engineering. Although the initial idea was developed for Hive, the project also has some initial support for Spark SQL.
Spark ML is an evolution of the original Spark MLlib module, which uses DataFrames and Datasets instead of Spark RDD objects to process machine learning functions. Most of the Spark modules such as Graph, Streaming and ML in the imminent 2.0 release indeed use Datasets as the default data abstraction because of the speed-up factor and the reduce-memory footprint provided by the Spark Catalyst optimization engine.
Aster Analytics, a Teradata multi-genre advanced analytics system, provides a number of pre-built analytics modules that include Graph, Text & Sentiment, Machine Learning, Path & Pattern, statistics, and other analytics techniques. Teradata is introducing an Aster Connector for Spark, to be released in Q3 this year. This connector will allow Aster users to execute machine learning and advanced analytics algorithms on Spark as “remote” Aster SQL functions. Aster’s approach to advanced analytics does not require knowledge of any other programming language aside from SQL—some in the Aster community feel there is still a barrier to entry with Spark for analytics jobs, citing its reliance on memory as the core unit of data and analytics processing, and its instability in being able to troubleshoot and implement a repeatable/reliable user experience. Therefore, this connector will enable the widest possible group of end users, beyond just the technically proficient data scientists, to deliver actionable insights across a variety of use cases and verticals.
Ben Lorica recently spoke with Teradata’s Sri Raghavan about the evolution of data analytics in Spark. They discussed why some feel there is still a high barrier for entry with Spark for data analytics, and some tools to help overcome this barrier. Listen to their conversation below:
This post and podcast is a collaboration between O'Reilly and Teradata. See our statement of editorial independence.