As more and more organizations need insights into streams of data as they’re produced, stream processing is quickly becoming standard in modern data architectures. There are many solutions for processing real-time streams, but the majority of them target developers and assume proficiency in Java, Scala, or other programming languages.
This reliance on developers and data engineers is fine for sophisticated algorithms and complex systems, but the first step in stream processing is often simple filter, extract, and transform pipelines to prepare raw data for further analysis. Putting control of filter, extract, and transform into the hands of administrators and analysts increases the velocity of analysis and the flexibility of deployed systems. In addition, these preparation phases need to be embedded in multiple contexts, especially if your data pipelines support both streaming and batch computations over the same source streams. Writing the same data preparation pipeline in four different frameworks is expensive and increases long-term maintenance costs.
Stream processing: A primer
Stream processing is a method of analyzing a stream of events or records to derive a conclusion. The key differentiators between stream processing systems and batch processing systems, such as Apache Hadoop, are how data is organized and the latency between when data is produced and when computation over that data can take place. Stream processing focuses on providing low-latency access to streams of records and enables complex processing over those streams such as aggregation, joining, and modeling. Over the last five years, there has been an explosion of open-source stream processing systems. I’ll describe a few of these systems below as example stream processing contexts where you’d want to reuse data preparation pipelines.
Apache Storm was one of the first streaming systems to be open sourced. Storm is described as a “distributed realtime computation system.” Applications in Storm’s world are packaged into topologies, which are analogous to MapReduce jobs in Hadoop. Topologies operate over streams of tuples, Storm’s word for a record. Streams are sourced into a topology through a component called a spout. Spouts allow Storm to take input from external systems. The processing itself is done using components called bolts. A bolt can perform any operation such as filtering, aggregating, joining, or executing arbitrary functions.
Apache Spark started as a research project and has recently gained a lot of popularity in the data processing world. Spark is notable in that it was one of the first execution engines to support both batch processing and stream processing. Stream processing is done using a framework called Spark Streaming. One of the key differences between Spark Streaming and Storm is how streams are consumed by the application. In Storm, individual tuples can be processed one at a time whereas in Spark Streaming, a continuous stream of records is discretized into an abstraction called a DStream. A DStream is a sequence of record batches and those form the input into your streaming application logic. This approach is called micro-batching and it generally improves throughput at the cost of latency.
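To make the micro-batching idea concrete, here is a minimal sketch in plain Python. It is not Spark Streaming code: the function name `micro_batches` is hypothetical, and it groups records by count, whereas Spark Streaming groups them by a time interval. The point is only that application logic sees a sequence of small batches rather than individual records.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Discretize a (possibly unbounded) stream into small lists of records.

    Assumption: batches are cut by record count to keep the sketch simple.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        # Pull up to batch_size records; an empty list means the stream ended.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(stream: Iterable[str], batch_size: int) -> List[int]:
    """Example application logic: count the records in each batch."""
    return [len(batch) for batch in micro_batches(stream, batch_size)]
```

A record that arrives just after a batch is cut has to wait for the next batch, which is where the extra latency comes from. The per-batch overhead, on the other hand, is amortized over every record in the batch.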
Apache Flink is similar to Spark in that it is a single framework that includes both batch processing and stream processing support. In Flink, the basic abstraction is the DataStream, which is an unbounded collection of records. Operations can be applied to individual records or windows of records depending on the desired outcome. Like Storm but unlike Spark, Flink supports record-at-a-time processing and Flink doesn’t require micro-batching.
All three of these stream processing systems need to source their streams from somewhere. There are multiple options for sourcing streams, but increasingly the most popular is Apache Kafka. Kafka is a pub-sub messaging system implemented as a distributed commit log. Kafka is popular as a source and sink for data streams due to its scalability, durability, and easy-to-understand delivery guarantees. All three of the systems I described above have connectors to read from and write to Kafka. You can also write consumer applications that read from Kafka directly.
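The commit log model is easier to picture with a toy example. The sketch below is an in-memory, single-process stand-in and not the Kafka client API; the class names `CommitLog` and `Consumer` are hypothetical. It shows the core idea: records are appended to an ordered log, and each consumer tracks its own read position (offset) independently.

```python
from typing import List


class CommitLog:
    """Append-only, ordered sequence of records (hypothetical toy model)."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[str]:
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads from a log and remembers how far it has read."""

    def __init__(self, log: CommitLog) -> None:
        self._log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[str]:
        records = self._log.read(self.offset, max_records)
        # Advance only past what was actually returned.
        self.offset += len(records)
        return records
```

Because reading does not remove records from the log, a slow consumer and a fast consumer can read the same stream at their own pace.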
Filtering, extraction, and transformation
In stream processing architectures, it’s common to send raw streams of data to Kafka before consuming those streams in real time or writing the data to long-term storage. In both cases, the raw stream needs to be prepared for analysis and storage. This data preparation is the first task of the stream consumer, be it a custom Kafka consumer, a Spark Streaming job, or a Storm topology. There are multiple ways that you can think about this data preparation phase, but we’ll focus on the three aspects of data preparation that are the most common: filtering, extraction, and transformation.
Filtering is the act of limiting what data should be forwarded to the next phase of a stream processing pipeline. You may filter out sensitive data that needs to be handled carefully or that has a limited audience. It’s also common to use filtering for data quality and schema matching. Lastly, filtering is often a special case of routing a raw stream into multiple streams for further analysis.
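As a rough illustration of filtering as a special case of routing, the sketch below sends each record to every named output whose predicate accepts it. The field names in `REQUIRED_FIELDS` and the function names are assumptions made up for this example.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

# Hypothetical schema check: the fields a downstream consumer expects.
REQUIRED_FIELDS = ("ts", "host", "body")


def has_required_fields(record: Record) -> bool:
    """Data-quality filter: keep only records that match the expected shape."""
    return all(field in record for field in REQUIRED_FIELDS)


def route(
    records: Iterable[Record],
    predicates: Dict[str, Callable[[Record], bool]],
) -> Dict[str, List[Record]]:
    """Split one raw stream into several named streams.

    A record is copied to every output whose predicate returns True, so a
    plain filter is just routing with a single output.
    """
    routed: Dict[str, List[Record]] = {name: [] for name in predicates}
    for record in records:
        for name, predicate in predicates.items():
            if predicate(record):
                routed[name].append(record)
    return routed
```

Routing rejected records to their own output, rather than dropping them, keeps malformed data available for later inspection.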
Raw streams of data often contain unstructured or semi-structured records or fields. Before structured data analysis like aggregation or model building can take place, you need to extract some structured fields from these raw records. It’s common to use regular expressions or libraries like Grok to do this parsing and extraction. After extraction, it’s valuable to reassemble records using a serialization library that supports schemas, such as Apache Avro. Those structured records create a new stream for downstream consumers.
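A small sketch of regular expression-based extraction, using Python's standard `re` module with named groups. The log line layout (timestamp, host, level, message separated by spaces) is an assumption chosen for illustration, and the result is a plain dictionary rather than an Avro record.

```python
import re
from typing import Dict, Optional

# Assumed raw format: "<timestamp> <host> <LEVEL> <free-text message>"
LINE_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)"
)


def extract(line: str) -> Optional[Dict[str, str]]:
    """Turn one raw line into structured fields, or None if it doesn't parse."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        return None
    return match.groupdict()
```

Returning `None` for lines that don't match lets the caller decide whether to drop them or route them elsewhere.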
After a stream has been transformed into structured records, there may still be the need to restructure the stream using projection or flattening operations. These operations are examples of transformations. The most common case for these kinds of transformation is to transform records with a particular source schema into records with a desired target schema. This can be used for merging multiple streams into a single stream with a common schema, or for writing a stream to long-term storage using a specific schema for offline analysis.
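Flattening and projection can be sketched in a few lines. The helper names and the dotted-key convention below are assumptions for this example: `flatten` collapses nested records into a single level, and `project` maps fields from a source schema onto the names a target schema expects.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Collapse nested dictionaries into one level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def project(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build a target record; mapping is {target_field: source_field}.

    Source fields missing from the record become None in the target.
    """
    return {target: record.get(source) for target, source in mapping.items()}
```

Applying a different mapping to each source stream is one way to merge several streams into a single stream with a common schema.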
While all of these patterns can be implemented directly in code in the streaming systems we’ve talked about, we’d ideally like to reuse these data preparation actions in multiple contexts. It’s also valuable to be able to expand the audience for building these actions beyond your team of developers.
Data preparation for all
Rocana Transform addresses the problem of creating reusable data preparation actions. It is a configuration-driven transformation library that can be embedded in any JVM-based stream processing or batch processing system such as Spark Streaming, Storm, Flink, or Apache MapReduce. I use the library in Kafka consumers to transform streams of data before they’re written to HDFS, Apache Solr, or other downstream systems. By making data transformation a configuration problem, admins, analysts, and other non-developers can transform streams of data without the need for custom code.
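To show what "transformation as configuration" can look like in general, here is a minimal sketch. It does not use Rocana Transform's configuration language or API, neither of which is shown in this article; the action names (`filter`, `extract`, `rename`) and the config keys are hypothetical. The pipeline is described as data, and a small interpreter turns that description into a function.

```python
import re
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Action = Callable[[Record], Optional[Record]]


def build_action(spec: Dict[str, Any]) -> Action:
    """Turn one config entry into a function; None means 'drop the record'."""
    kind = spec["action"]
    if kind == "filter":
        field, expected = spec["field"], spec["equals"]
        return lambda record: record if record.get(field) == expected else None
    if kind == "extract":
        source, pattern = spec["field"], re.compile(spec["pattern"])

        def extract(record: Record) -> Optional[Record]:
            match = pattern.search(str(record.get(source, "")))
            # Merge the named groups into the record; drop it if nothing matched.
            return {**record, **match.groupdict()} if match else None

        return extract
    if kind == "rename":
        mapping = spec["fields"]
        return lambda record: {mapping.get(k, k): v for k, v in record.items()}
    raise ValueError(f"unknown action: {kind}")


def build_pipeline(config: List[Dict[str, Any]]) -> Action:
    """Compose the configured actions, stopping as soon as one drops the record."""
    actions = [build_action(spec) for spec in config]

    def run(record: Record) -> Optional[Record]:
        current: Optional[Record] = record
        for action in actions:
            current = action(current)
            if current is None:
                return None
        return current

    return run
```

Because the configuration is plain data, the same description could be loaded from a file and handed to a Kafka consumer, a streaming job, or a batch job without rewriting the logic.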
For those who know Java, Rocana Transform supports custom actions written with Java code. I’ve seen custom actions that look up IP addresses in a geolocation database to enrich records, parse custom log formats that aren’t suitable for regular expression-based parsing, and join streams of data with reference datasets. In general, custom actions give you the ability to add reusable logic that can be driven by the same configuration language as the built-in filter, extract, and transform actions. This lets developers focus on building reusable tools that administrators and analysts can leverage for their needs.
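One common way to let developers plug custom logic into a configuration-driven system is a registry of named action factories. The sketch below illustrates that general pattern in Python; it is not Rocana Transform's extension API (which is Java-based and not shown here), and the `geo_lookup` action with its in-memory lookup table is a hypothetical stand-in for a real geolocation database.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]
Action = Callable[[Record], Record]

# Registry mapping an action name used in configuration to its factory.
ACTION_FACTORIES: Dict[str, Callable[[Dict[str, Any]], Action]] = {}


def register(name: str):
    """Decorator that makes a developer-written factory available by name."""
    def decorator(factory: Callable[[Dict[str, Any]], Action]):
        ACTION_FACTORIES[name] = factory
        return factory
    return decorator


@register("geo_lookup")
def geo_lookup(spec: Dict[str, Any]) -> Action:
    """Hypothetical enrichment: add a field looked up from a reference table."""
    source, target, table = spec["field"], spec["target"], spec["table"]

    def action(record: Record) -> Record:
        # Unknown keys enrich the record with None rather than failing.
        return {**record, target: table.get(record.get(source))}

    return action


def build(spec: Dict[str, Any]) -> Action:
    """Resolve a config entry to a registered custom action."""
    return ACTION_FACTORIES[spec["action"]](spec)
```

The developer writes and registers the action once. After that, anyone editing the configuration can use it by name, with the same syntax as a built-in action.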
Stream processing is changing the way we analyze data, but limiting the accessibility of the data preparation side of stream processing increases costs and decreases velocity. Helping administrators and analysts to build data transformation pipelines with reusable configuration files reduces your reliance on developers to code custom pipelines before the data is available for further analysis.