From big data to fast data
Designing application architectures for real-time decisions.
Enterprise data needs change constantly but at inconsistent rates, and in recent years change has come at an increasing clip. Tools once considered useful for big data applications are not longer sufficient. When batch operations predominated, Hadoop could handle most of an organization’s needs. Development in other IT areas (think IoT, geolocation, etc.) have changed the way data is collected, stored, distributed, processed and analyzed. Real-time decision needs complicate this scenario and new tools and architectures are needed to handle these challenges efficiently.
Think of the 3 V’s of data: volume, velocity, and variety. For a while big data emphasized data volume; now fast data applications mean velocity and variety are key. Two tendencies have emerged from this evolution: first, the variety and velocity of data that enterprise needs for decision making continues to grow. This data includes not only transactional information, but also business data, IoT metrics, operational information, and application logs. Second, modern enterprise needs to make those decisions in real time, based on all that collected data. This need is best clarified by looking at how modern shopping websites work.
Websites are about what we can offer to a particular client at a precise moment. We need to know the “heat” zones of the page — the most clicked items — and why there are cold zones. If a person is viewing section A, the system must offer section B, because data shows many customers moved from section A to section B. Again, we have two challenges: collect all this data in an online way, clean it, process it, and analyze it in an online way (there is no more overnight processing). Second, based on this information we need to make changes to the web page immediately, so the analysis must be accurate and tied dynamically to the customer for a precise moment. It is indeed a real-time world.
Beyond Hadoop: designing architectures for fast data
Big data used to mean Hadoop and NoSQL databases. Hadoop was born in the “batch mode” and “off-line” processing era, when data was captured, stored and processed periodically with batch jobs. In those years, search engines worked by having data gathered by web crawlers and then processed overnight to offer updated results the next day. Big data was focused on data capture and off-line batch mode operation. As our earlier website example shows, modern “fast data” reduces the time between data arriving and data value extraction. Real-time event processing is the opposite of batch data. In real-time fast data architectures individual events are processed as they arrive, and we talk about processing times of milliseconds, even microseconds.
Building fast data architectures that can do this kind of millisecond processing means using systems and approaches that deliver timely and cost-efficient data processing focused on developer productivity. Any successful a fast data architecture must satisfy these high-level requirements:
- Performant and reliable data acquisition or ingestion
- Flexible storage and querying
- Sophisticated analysis tools
It’s also important to mention that architecture components must comply with the R’s: reactive (scaling up and down based on demand), resilient (against failures in all the distributed systems), and responsive (even if the failures limit the ability to deliver services). Related to the R’s is the fact that fast data processing has also changed data center operations, so data center management has become a key tool in fast data architecture that meets the demands of real-time decision making.
Modern fast data systems are composed of four transformation stages that provide the following data capabilities:
- Processing and analysis
- Presentation and visualization
Let’s look at each of these stages and their tools in turn.
Data acquisition: pipeline for performance
In this step, data enters the system from diverse sources. The key focus of this stage is performance, as this step impacts of how much data the whole system can receive at any given point in time. The steps of the data acquisition process are show in the figure below.
Ensuring a high performing data acquisition component means paying attention to some key principles:
Data transfer should be asynchronous and avoid back pressure.
When a synchronous system becomes asynchronous we have to consider two factors: the tokens subsystem and the transaction subsystem. This sounds easy, but there are two ways to implement an asynchronous system: using a file-feed transfer or using a Message-Oriented-Middleware (MOM), an infrastructure that supports messaging in distributed systems. Imagine data is generated faster than it is consumed; in distributed systems design this is called back pressure. One main objective of any MOM system and distributed queues is to deal with the back pressure.
Parsing can be expensive so parallelize when possible
Using the right parser can be a crucial factor because conversion between data formats (binary, XML, text) should be optimal for the APIs implemented by the MOM. The most expensive step of data acquisition is data transformation — the step where time and resources are consumed, so If you can use a parallel process to transform your data before processing, it is most valuable here. Another time saver is to filter invalid data as early as possible. Processing must work with valid data sets, so clean and remove duplicates before transforming data.
For this stage you should consider streaming APIs and messaging solutions like:
- Apache Kafka – open-source stream processing platform
- Akka Streams – open-source stream processing based on Akka
- Amazon Kinesis – Amazon data stream processing solution
- ActiveMQ – open-source message broker with a JMS client in Java
- RabbitMQ – open-source message broker with a JMS client in Erlang
- JBoss AMQ – lightweight MOM developed by JBoss
- Oracle Tuxedo – middleware message platform by Oracle
- Sonic MQ – messaging system platform by Sonic
For handling many of these key principles of data acquisition, the winner is Apache Kafka because it’s open source, focused on high-throughput, low-latency, and handles real-time data feeds.
Data storage: flexible experimentation leads to solutions
There are a lot of points of view for designing this layer, but all should consider two perspectives: logical (i.e. the model) and physical data storage. The key focus for this stage is “experimentation” and flexibility.
Every problem has its solution but don’t reinvent the wheel.
The term database is huge, always keep in mind that different solutions have different capabilities: some are better with reading, some for inserting, and some at updating.
Another important consideration — not all data is relational because not all data is text. Remember efficiency at every step makes all the processes faster and improves resource consumption for things like network transfer and disk space.
I configure then I am, and when configured, reconfigure
There is no more default, “out-of-the-box” configuration. The same solution can have two distinct configurations making the same tool behave totally orthogonally. To be competitive, a storage system must have the ability to modify two key settings: replication level and consistency level. The art of the solution is trying new configurations till you get the desired behavior.
Does relational solve your problem and is normalization still an option?
Think big, out of the box. In the RDBM systems you have to adapt your problem to the relational model, so you have to find your entities, relationships, keys and foreign keys. In the fast data era, data systems are modeled based on use cases solved. Another question, that seemed absurd in the past but now makes sense, is deciding between normalizing data and performance. The solution you choose has a direct impact on performance.
For this stage consider distributed database storage solutions like:
- Apache Cassandra – distributed NoSQL DBMS
- Couchbase – NoSQL document-oriented database
- Amazon DynamoDB – fully managed proprietary NoSQL database
- Apache Hive – data warehouse built on Apache Hadoop
- Redis – distributed in-memory key-value store
- Riak – distributed NoSQL key-value data store
- Neo4J – graph database management system
- MariaDB – with Galera form a replication cluster based on MySQL
- MongoDB – cross-platform document-oriented database
- MemSQL – distributed in-memory SQL RDBMS
For handling many of key principles of data storage just explained, the most balanced option is Apache Cassandra. It is open source, distributed, NoSQL, and designed to handle large data across many commodity servers with no single point of failure.
Data processing: combining tools and approaches
Years ago, there was discussion about whether big data systems should be (modern) stream processing or (traditional) batch processing. Today we know the correct answer for fast data is that most systems must be hybrid — both batch and stream at the same time. The type of processing is now defined by the process itself, not by the tool. The key focus of this stage is “combination.”
Segment your processing and consider micro batching.
To be competitive fast data tools must offer exceptional batch processing, excellent stream processing, or both. The point is to use the right tool for the right task of your processing. Don’t use batch tools when online processing is needed, and don’t try to move all the traditional ETL processes to modern streaming architectures. Some tools divide data into smaller pieces, where every chunk is processed independently in individual tasks. Some frameworks make batching in every chunk, which is called micro batching, and another model is to process it like a single stream of data.
Considering in-memory or in-disk and data locality
There are many traditional disk-based solutions, and there are a lot of modern in-memory solutions. You have to evaluate if your system has the infrastructure necessary for in-memory processing. Not all processes should be in-memory; combine resources to get optimal results. Above all, never lose sight of data proximity — the best performance happens when data is available locally. When data is not near there is transfer cost. The art of this stage is evaluating between data transferring or replication, and among processing locations.
For this stage, you should consider data processing solutions like:
- Apache Spark – engine for large-scale data processing
- Apache Flink – open-source stream processing framework
- Apache Storm – open-source distributed realtime computation system
- Apache Beam – open-source, unified model for batch and streaming data
- Tensorflow – open-source library for machine intelligence
Visualization communicates data or information by encoding it as visual objects in graphs, to clearly and efficiently get information to users. This stage is not easy; it’s both an art and a science.
Don’t process and minimize calculations
This step should avoid processing. Some reports will need runtime calculations, but always work in the highest level of data possible. data should always be in summarized output tables, with groupings that could be temporal (by time period) or categorical (based on business classes).
Parallelize and performance
Performance is better if reports are parallelized. Also, the use of caching always has a positive impact on the performance of the visualization layer.
For this layer you should consider visualization solutions in these three categories:
Data center management: the fast data infrastructure
Even after you have a solid architecture designed to handle the four data transformation stages of, it’s important to recognize that modern fast data processing has also changed data center operations.
From scale-up to scale-out and the open-source reign
Business is moving from specialized, proprietary, and expensive supercomputers to deployment in clusters made of commodity machines connected with a low-cost network. Now, the total cost of ownership (TCO) determines the destiny, quality, and size of a data center. One common practice is to create a dedicated cluster for each technology: a Kafka cluster, a Spark cluster, a Cassandra cluster, etc., because the overall TCO tends to increase.
The tendency is to adopt open source and avoid two dependencies: vendor lock-in and external entity support. Transparency is ensured through community-defined groups, such as the Apache Software Foundation or the Eclipse Foundation, which provide guidelines, infrastructure, and tooling for sustainable and fair technology development.
Data store diversification, data gravity, and data locality
The NoSQL proliferation has the consequence that we now have to deal with data store synchronization among data centers. Data gravity mean considering the overall cost associated with data transfer, in terms of volume and tooling. Data locality is the idea of moving computation to the data rather than data to the computation.
DevOps means best practices for collaboration between software development and operations. To avoid reworking, the developer’s team should have the same for local testing as the used in production environments.
For this layer you should consider the following management technologies:
- Apache Mesos – open-source project to manage computer clusters
- Mesosphere DC/OS – platform for automating rollout and production for containers and data services
- Amazon Web Services – cloud computing services
- Docker – open platform to build, ship, and run distributed applications
- Kubernetes – open-source system for automating, deploying, scaling and managing containerized applications
- Spinnaker – open-source, multi-cloud continuous delivery platform for releasing software changes
Paying attention to the principles outlined above as you consider your fast data architecture and its tools will get you closer to unlocking real value in your data. Managing and using that data effectively — moving it through the stages of acquisition, storage, processing and analysis, and visualization will get you actionable insights and make you competitive in this real-time world.
This post is a collaboration between O’Reilly and Mesosphere. See our statement of editorial independence.