Introducing Spark andKafka | 143
6.2 WORKING WITH KAFKA
6.2.1 What is Apache Kafka
Apache Kafka is used to enable communication between producers and consumers through
message-based topics. It is a fast, scalable, fault-tolerant, publish-subscribe messaging
system that serves as a platform for a new generation of distributed applications.
In addition, it supports a large number of permanent or ad hoc consumers. One of the best
features of Kafka is that it is highly available, resilient to node failures and supports
automatic recovery. These features make Apache Kafka well suited for communication and
integration between the components of large-scale, real-world data systems.
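The publish-subscribe idea described above can be illustrated with a small in-memory model. This is only a conceptual sketch and not the Kafka client API: the class `ToyBroker`, its methods `publish` and `poll`, and the topic and consumer names are all hypothetical, chosen for illustration. It assumes that each topic is an append-only log and that each consumer tracks its own read position, so consumers can join at any time and read independently.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)      # topic name -> list of messages
        self._offsets = defaultdict(int)    # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its position."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, consumer, topic):
        """Return messages this consumer has not yet seen, then advance its offset."""
        start = self._offsets[(consumer, topic)]
        unread = self._logs[topic][start:]
        self._offsets[(consumer, topic)] = len(self._logs[topic])
        return unread


broker = ToyBroker()

# A producer publishes two messages to the (hypothetical) "clicks" topic.
broker.publish("clicks", "page=/home")
broker.publish("clicks", "page=/cart")

# A permanent consumer reads, more data arrives, and it reads again.
first_read = broker.poll("billing", "clicks")
broker.publish("clicks", "page=/pay")
second_read = broker.poll("billing", "clicks")

# An ad hoc consumer that joins late still sees the whole log.
late_joiner = broker.poll("audit", "clicks")
```

Because the producer only appends to the topic and never addresses a consumer directly, adding the second consumer required no change on the producing side. A real deployment would add the replication and recovery behaviour that this toy model leaves out.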
Points to Ponder
Spark SQL is so named because it lets you work with your data in a fashion similar to SQL.
Spark SQL is generally much faster than traditional Hive query execution because it computes in memory.
Spark Streaming divides a data stream into batches of X seconds. The resulting stream of
batches is called a DStream (discretized stream), which internally is a sequence of RDDs.
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams.
Spark Streaming processes and cleanses the data generated by different sources. The
processing is executed as per the requirements.
PySpark is the Python API for Spark. It allows complex programs to be developed and
executed in a flexible manner.
Python is dynamically typed, so RDDs can hold objects of multiple types. This is an important
difference from statically typed Java.
Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe,
object-oriented programming interface.
Datasets can also efficiently process structured and unstructured data. A Dataset represents
data as JVM objects, either a single row object or a collection of row objects, which encoders
map to a tabular form. It provides compile-time type safety.
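The micro-batching idea from the Spark Streaming point above can be sketched in plain Python. This is not the Spark API: the function `micro_batches`, the parameter `batch_seconds` and the sample events are hypothetical, and each returned list only plays the role that an RDD plays inside a DStream. The sketch assumes events arrive as (timestamp in seconds, value) pairs.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) pairs into consecutive fixed-length time windows."""
    grouped = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to the window it falls in.
        grouped[int(timestamp // batch_seconds)].append(value)
    # Emit batches in time order; this sketch skips windows with no events.
    return [grouped[index] for index in sorted(grouped)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (1.7, "b"), (2.1, "c"), (2.9, "d"), (6.5, "e")]

# With X = 2 seconds, the stream becomes a sequence of small batches.
batches = micro_batches(events, batch_seconds=2)

# Each batch can then be processed independently, here by counting its events.
batch_sizes = [len(batch) for batch in batches]
```

The point of the design is that once the stream has been cut into batches, ordinary batch operations can be applied to each one in turn.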
