Editor’s note: Ben Lorica is an advisor to Databricks and Graphistry.
There’s renewed interest in stream processing and analytics. I write this based on some data points (attendance in webcasts and conference sessions; a recent meetup), and many conversations with technologists, startup founders, and investors. Certainly, applications are driving this recent resurgence. I’ve written previously about systems that come from IT operations as well as how the rise of cheap sensors are producing stream mining solutions from wearables (mostly health-related apps) and the IoT (consumer, industrial, and municipal settings). In this post, I’ll provide a short update on some of the systems that are being built to handle large amounts of event data.
Apache projects (Kafka, Storm, Spark Streaming, Flume) continue to be popular components in stream processing stacks (I’m not yet hearing much about Samza). Over the past year, many more engineers started deploying Kafka alongside one of the two leading distributed stream processing frameworks (Storm or Spark Streaming). Among the major Hadoop vendors, Hortonworks has been promoting Storm, Cloudera supports Spark Streaming, and MapR supports both. Kafka is a high-throughput distributed pub/sub system that provides a layer of indirection between “producers” that write to it and “consumers” that take data out of it. A new startup (Confluent) founded by the creators of Kafka should further accelerate the development of this already very popular system. Apache Flume is used to collect, aggregate, and move large amounts of streaming data, and is frequently used with Kafka (Flafka or Flume + Kafka). Spark Streaming continues to be one of the more popular components within the Spark ecosystem, and its creators have been adding features at a rapid pace (most recently Kafka integration, a Python API, and zero data loss).
Apache HBase, Apache Cassandra, and Elasticsearch are popular open source options for storing event data (the Team Apache stack of Cassandra, Kafka, Spark Streaming is an increasingly common combination). Time-series databases built on top of open-source nosql data stores — OpenTSDB (HBase) and Kairos — continue to have their share of users. The organizers of HBaseCon recently told me that OpenTSDB remains popular in their community. Advanced technical sessions on the ELK stack (Elasticsearch, Logstash, Kibana) were among the best-attended presentations at the recent Elasticon conference.
Database software vendors have taken note of the growing interest in systems for handling large amounts of real-time event data. The use of memory and SSDs have given Aerospike, memsql, VoltDB, and other vendors interesting products in this space.
At the end of the day, companies are interested in using these software components to build data products or improve their decision-making. Various combinations of the components I’ve listed above appear in the stream processing platforms of many companies. Big Internet firms such as Netflix, Twitter, and Linkedin describe their homegrown (streaming) data platforms to packed audiences at conferences such as Strata + Hadoop World.
Designing and deploying data platforms based on distributed, open source, software components requires data engineers who know how to evaluate and administer many pieces of technology. I have been noticing that small-to-medium-sized companies are becoming much more receptive to working with cloud providers: Amazon and Google have components that mirror popular open source projects used for stream processing and analysis, Databricks Cloud is an option that quite a few startups that use Spark are turning to.
As far as focused solutions, I’ve always felt that some of the best systems will emerge from IT operations and data centers. There are many solutions for collecting, storing, and analyzing event data (“logs”) from IT operations. Some companies piece together popular open source components and add proprietary technologies that address various elements of the streaming pipeline (move, refine, store, analyze, visualize, reprocess, streaming “joins”). Companies such as Splunk, SumoLogic, ScalingData, and CloudPhysics use some of these open source software components in their IT monitoring products. Some users of Graphite have turned to startups such as Anodot or SignalFx because they need to scale to larger data sizes. Another set of startups such as New Relic provide SaaS for monitoring software app performance (“software analytics”).
While IT operations and data centers are the areas I’m most familiar with, note that there are interesting stream processing and analysis software systems that power solutions in other domains (we devoted a whole day to them at Strata + Hadoop World in NYC in 2014). As an example, GE’s Predix is used in industrial settings, and I’ve seen presentations on similar systems for smart cities, agriculture, health care, transportation, and logistics.
One topic that I’ve been hearing more about lately is the use of graph mining techniques. Companies such as Graphistry, SumoLogic, and Anodot have exploited the fact that log file entries are related to one another, and these relationships can be represented as network graphs. Thus network visualization and analysis tools can be brought to bear on some types of event data (“from time-series to graphs”).
Stream processing and data management continues to be an area of intense activity and interest. Over the next year, I’ll be monitoring progress on the stream mining and analytics front. Most of the tools and solutions remain focused on simple (approximate) counting algorithms (such as identifying heavy hitters). Companies such as Numenta are tackling real-time pattern recognition and machine-learning. I’d like to see similar efforts built on top of the popular distributed, open source frameworks data engineers have come to embrace. The good news is the leaders of key open source stream processing projects plan to tack on more analytic capabilities.