Chapter 4: Data-Flow Management in Streaming Analysis

Chapter 3, “Service Configuration and Coordination,” introduces the concept and difficulties of maintaining a distributed state. One of the most common reasons to require this distributed state is the collection and processing of data in a scalable way.

Distributed data flows, which include processing and collection, have been around for a long time. Generally, the systems designed to handle this task have been bespoke applications developed either in-house or through consulting agreements. More recently, the technologies used to implement these data-flow systems have reached the point of common infrastructure. Data-flow systems can be split into a separate service in much the same way that coordination and configuration can. They are now general enough in their interfaces and their assumptions that they can be used outside of their originally intended applications.

The earliest of these systems were arguably the queuing systems, such as ActiveMQ, which started to come onto the scene in the early 2000s. However, they were not really designed for high-throughput workloads (although many of them can now achieve fairly good performance) and tended to be very Java-centric.

The next systems on the scene were those open-sourced by the large Internet companies such as Facebook. One of the best-known systems of this generation was a tool called Scribe, which was released in 2008. It used an RPC-like mechanism to concentrate data from ...
