Chapter 7. Geo-Distributed Data Streams

For our final example of how to design stream-based systems, we focus on a specific requirement: geo-distributed replication of data streams. This capability is needed in a wide variety of sectors, including telecommunications, oil and gas exploration, retail, and banking, but we’ve chosen a transportation example—international container shipping—to show you how to plan the data flow for systems that require data to be replicated efficiently across distant locations.

For this example, we focus on how the design would work with MapR Streams because it has special capabilities that make it particularly well suited for this class of use cases. MapR Streams is distinctive in being able to:

  • Handle huge numbers of topics (hundreds of thousands or more) with high throughput

  • Organize a group of topics into a stream, which makes data management much easier since many topics can be managed together

  • Provide uni- and bi-directional replication easily and reliably across geo-distributed data centers
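
To make the last two capabilities concrete, here is a minimal sketch of topics grouped into a stream, with replication configured on the stream rather than on each topic. It is a toy in-memory model and not the MapR Streams API: the `Stream` class, its methods, the stream paths, the port names, and the topic name are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class Stream:
    """Toy stand-in for a stream: a named group of topics managed as one unit."""

    def __init__(self, path):
        self.path = path                  # hypothetical stream path
        self.topics = defaultdict(list)   # topic name -> ordered list of messages
        self.replicas = []                # streams that receive copies of our messages

    def replicate_to(self, other):
        """Add a one-way link. Adding the reverse link as well makes it two-way."""
        self.replicas.append(other)

    def publish(self, topic, message, _seen=None):
        """Append a message locally, then forward it to every linked stream."""
        seen = _seen if _seen is not None else set()
        if self.path in seen:
            return                        # already delivered here; stop the echo
        seen.add(self.path)
        self.topics[topic].append(message)
        for replica in self.replicas:
            replica.publish(topic, message, _seen=seen)


# Two hypothetical sites, each holding its own copy of the stream.
tokyo = Stream("/shipping/tokyo")
singapore = Stream("/shipping/singapore")

tokyo.replicate_to(singapore)      # uni-directional: Tokyo -> Singapore
singapore.replicate_to(tokyo)      # reverse link makes it bi-directional

tokyo.publish("container-001", "loaded")
singapore.publish("container-001", "unloaded")
```

The replication links are declared once per stream, so every topic in the stream is covered without per-topic setup; that is the management benefit the second bullet describes. The `_seen` set is only this sketch's way of keeping a message from bouncing back to a site that already has it.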

In our shipping example (or examples from any of the other sectors), many different processes in addition to the messaging could be taking place on the same cluster, since MapR’s messaging feature is integrated into the data platform. For simplicity, and to keep our explanation focused on how the data streams are replicated to distant sites, we examine only the messaging aspect of the architecture rather than all the analytics and ...
