Chapter 1. Introduction
Until recently, big data systems have been batch oriented, where data is captured in distributed filesystems or databases and then processed in batches or studied interactively, as in data warehousing scenarios. Now it is a competitive disadvantage to rely exclusively on batch-mode processing, in which arriving data sits idle until the next batch run instead of yielding valuable information immediately.
Hence, big data systems are evolving to be more stream oriented, where data is processed as it arrives, leading to so-called fast data systems that ingest and process continuous, potentially infinite data streams.
Ideally, such systems still support batch-mode and interactive processing, because traditional uses, such as data warehousing, haven’t gone away. In many cases, we can rework batch-mode analytics to use the same streaming infrastructure, where we treat our batch data sets as finite streams.
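To illustrate treating a batch data set as a finite stream, here is a minimal, hypothetical sketch in Python. The pipeline function and record format are illustrative only, not tied to any particular framework; the point is that one processing path serves both modes:

```python
from typing import Iterable, Iterator

def parse_and_filter(records: Iterable[str]) -> Iterator[int]:
    """One processing pipeline, whether records arrive all at once or one at a time."""
    for record in records:
        value = int(record.strip())
        if value >= 0:           # drop invalid (negative) readings
            yield value * 2      # example transformation

# Batch mode: the "stream" is a finite, already-captured data set.
batch_input = ["1", "-5", "3"]
print(list(parse_and_filter(batch_input)))   # [2, 6]

# Streaming mode: the same pipeline consumes records as they arrive,
# e.g., from a socket or message queue; here a generator stands in.
def incoming() -> Iterator[str]:
    yield "10"
    yield "20"

print(list(parse_and_filter(incoming())))    # [20, 40]
```

Real stream processors such as Spark and Flink apply the same idea at scale: a bounded input is just a stream that happens to end.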
This is an example of another general trend, the desire to reduce operational overhead and maximize resource utilization across the organization by replacing lots of small, special-purpose clusters with a few large, general-purpose clusters, managed using systems like Kubernetes or Mesos. While isolation of some systems and workloads is still desirable for performance or security reasons, most applications and development teams benefit from the ecosystems around larger clusters, such as centralized logging and monitoring, universal CI/CD (continuous integration/continuous delivery) pipelines, and the option to scale the applications up and down on demand.
In this report, I’ll make the following core points:
- Fast data architectures need a stream-oriented data backplane for capturing incoming data and serving it to consumers. Today, Kafka is the most popular choice for this backplane, but alternatives exist, too.
- Stream processing applications are “always on,” which means they require greater resiliency, availability, and dynamic scalability than their batch-oriented predecessors. The microservices community has developed techniques for meeting these requirements. Hence, streaming systems need to look more like microservices.
- To extract and exploit information more quickly, we need a more integrated environment between our microservices and stream processors, requiring fast data architectures that are flexible enough to support heterogeneous workloads. This requirement dovetails with the trend toward large, heterogeneous clusters.
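To make the data backplane idea concrete, here is a toy, in-memory sketch of the core abstraction a backplane like Kafka provides: an append-only log to which producers write and from which each consumer reads at its own offset. This is an illustration of the concept only, not Kafka’s actual API:

```python
class Log:
    """A single-partition, append-only log, the heart of a stream-oriented backplane."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Producers append records; the record's offset in the log is returned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10):
        """Consumers poll from their own offset; the log keeps no per-consumer state."""
        return self._records[offset : offset + max_records]

log = Log()
for event in ["click:/home", "click:/pricing", "purchase:42"]:
    log.append(event)

# Two independent consumers, each tracking its own position in the stream:
analytics_offset, billing_offset = 0, 2
print(log.read(analytics_offset))  # all three events
print(log.read(billing_offset))    # ['purchase:42']
```

Because the log retains records and consumers track their own offsets, many independent consumers can read the same data at different paces, which is what lets one backplane serve both streaming and batch-style consumers.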
I’ll finish this chapter with a review of the history of big data and batch processing, especially the classic Hadoop architecture for big data. In subsequent chapters, I’ll discuss how the changing landscape has fueled the emergence of stream-oriented, fast data architectures and explore a representative example architecture. I’ll describe the requirements these architectures must support and the characteristics of specific tools available today. I’ll finish the report with a look at an example IoT (Internet of Things) application that leverages machine learning.
A Brief History of Big Data
The emergence of the internet in the mid-1990s drove the creation of data sets of unprecedented size. Existing tools were neither scalable enough for these data sets nor cost-effective, forcing the invention of new tools and techniques. The “always on” nature of the internet also raised the bar for availability and reliability. The big data ecosystem emerged in response to these pressures.
At its core, a big data architecture requires three components:
- Storage: A scalable and available storage mechanism, such as a distributed filesystem or database
- Compute: A distributed compute engine for processing and querying the data at scale
- Control plane: Tools for managing system resources and services
Other components layer on top of this core. Big data systems come in two general forms: databases, especially the NoSQL variety, that integrate and encapsulate these components into a database system, and more general environments like Hadoop, where these components are more exposed, providing greater flexibility, with the trade-off of requiring more effort to use and administer.
In 2007, the now-famous Dynamo paper accelerated interest in NoSQL databases, leading to a “Cambrian explosion” of databases that offered a wide variety of persistence models, such as document storage (XML or JSON), key/value storage, and others. The CAP theorem emerged as a way of understanding the trade-offs between data consistency and availability guarantees in distributed systems when a network partition occurs. For the always-on internet, it often made sense to accept eventual consistency in exchange for greater availability. As in the original Cambrian explosion of life, many of these NoSQL databases have fallen by the wayside, leaving behind a small number of databases now in widespread use.
In recent years, SQL as a query language has made a comeback as people have reacquainted themselves with its benefits, including conciseness, widespread familiarity, and the performance of mature query optimization techniques.
But SQL can’t do everything. For many tasks, such as data cleansing during ETL (extract, transform, and load) processes and complex event processing, a more flexible model was needed. Also, not all data fits a well-defined schema. Hadoop emerged as the most popular open-source suite of tools for general-purpose data processing at scale.
Why did we start with batch-mode systems instead of streaming systems? I think you’ll see as we go that streaming systems are much harder to build. When the internet’s pioneers were struggling to gain control of their ballooning data sets, building batch-mode architectures was the easiest problem to solve, and it served us well for a long time.
Batch-Mode Architecture
Figure 1-1 illustrates the “classic” Hadoop architecture for batch-mode analytics and data warehousing, focusing on the aspects that are important for our discussion.
In this figure, logical subsystem boundaries are indicated by dashed rectangles. Each is a cluster that spans physical machines; in practice, the HDFS and YARN (Yet Another Resource Negotiator) services share the same machines so that jobs benefit from data locality.
Data is ingested into the persistence tier, into one or more of the following: HDFS (Hadoop Distributed File System), AWS S3, SQL and NoSQL databases, search engines like Elasticsearch, and other systems. Usually this is done using special-purpose services such as Flume for log aggregation and Sqoop for interoperating with databases.
Later, analysis jobs written in Hadoop MapReduce, Spark, or other tools are submitted to the YARN Resource Manager, which decomposes each job into tasks that run on the worker nodes under the control of Node Managers. Even interactive tools like Hive and Spark SQL use this same job submission process when their queries execute as jobs.
Table 1-1 gives an idea of the capabilities of such batch-mode systems.
Metric | Sizes and units
---|---
Data sizes per job | TB to PB
Time between data arrival and processing | Minutes to hours
Job execution times | Seconds to hours
So, the newly arrived data waits in the persistence tier until the next batch job starts to process it.
In a way, Hadoop is a database deconstructed: the storage, compute, and resource-management subsystems are explicitly separated, whereas a regular database hides them inside a “black box.” This separation gives us more flexibility and reduces cost, but it requires more administrative effort on our part.