Specialized and hybrid data management and processing engines

A new crop of interesting solutions for the complexity of operating multiple systems in a distributed computing setting.

By Ben Lorica
September 30, 2015
Shinkyo (Sacred Bridge), Nikko, Japan, by Paul Mannix (source: Flickr)

The 2004 holiday shopping season marked the start of Amazon’s investigation into alternative database technologies that led to the creation of Dynamo, a key-value storage system that went on to inspire several NoSQL projects. A new group of startups began shifting away from the general-purpose systems favored by companies just a few years earlier. In recent years, we’ve seen a diverse set of DBMS technologies that specialize in handling particular workloads and data models such as OLTP, OLAP, search, RDF, XML, scientific applications, etc. The success and popularity of such systems reinforced the belief that in order to scale and “go fast,” specialized systems are preferable.

In distributed computing, the complexity of maintaining and operating multiple specialized systems has recently led to systems that bridge multiple workloads and data models. Aside from multi-model databases, there is a growing number of storage and compute engines adept at handling different workloads and problems. At this week’s Strata + Hadoop World conference in NYC, I had a chance to interact with the creators of some of these new solutions.

OLTP (transactions) and OLAP (analytics)

One of the key announcements at Strata + Hadoop World this week was Project Kudu, an open source storage engine that’s good at both table scans (analytics) and random access (updates and inserts). Its creators are quick to point out that they aren’t out to beat specialized OLTP and OLAP systems. Rather, they’re aiming to build a system that’s “70-80% of the way there on both axes.” The project is very young and lacks enterprise features, but judging from the reaction at the conference, it’s something the big data community will be watching. Leading technology research firms have created a category for systems with related capabilities: HTAP (Gartner) and Trans-analytics (Forrester).
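
To make the two access patterns concrete, here is a minimal Python sketch of a table that keeps data in columns (cheap scans) and also maintains a primary-key index (cheap point updates). It is an illustration only and says nothing about how Kudu is actually implemented; the class `ToyHybridTable` and its methods are hypothetical names invented for this example.

```python
class ToyHybridTable:
    """Toy columnar table with a primary-key index (illustrative only)."""

    def __init__(self, columns):
        # One Python list per column, so a scan can read a single column.
        self.columns = {name: [] for name in columns}
        self.key_to_row = {}  # primary key -> row position
        self.row_count = 0

    def upsert(self, key, **values):
        """Random-access path: insert a new row or update one in place."""
        row = self.key_to_row.get(key)
        if row is None:
            self.key_to_row[key] = self.row_count
            for name, col in self.columns.items():
                col.append(values.get(name))
            self.row_count += 1
        else:
            for name, value in values.items():
                self.columns[name][row] = value

    def get(self, key):
        """Random-access path: fetch a single row by primary key."""
        row = self.key_to_row[key]
        return {name: col[row] for name, col in self.columns.items()}

    def scan_sum(self, column):
        """Analytic path: aggregate one column without touching the others."""
        return sum(v for v in self.columns[column] if v is not None)


table = ToyHybridTable(["user", "amount"])
table.upsert(1, user="ann", amount=10)
table.upsert(2, user="bob", amount=25)
table.upsert(1, amount=40)          # in-place update of an existing row
total = table.scan_sum("amount")    # scan over the "amount" column only
```

A real engine has to make both paths fast on disk and across machines, which is where the hard trade-offs live; the sketch only shows why a single structure can serve both kinds of request.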

Search and interactive analytics (SQL)

If you had a chance to walk around the large Strata + Hadoop World expo hall, you probably noticed many companies positioning themselves to handle large-scale, real-time, machine-generated data. Many of these companies specifically target log files; given the success of companies like Splunk, there is a proven market for such tools. Moreover, it turns out that search and interactive analytics (SQL) are used by analysts wanting to make sense of massive amounts of log files. A few startups have attempted to build on open source ecosystem components by combining a search tool (Lucene) and a SQL-on-Hadoop engine.

A while back, I played around with SenseiDB, an open source project that adds a query language (and faceted search) to a search engine, and that experience made me appreciate the power of combining search and SQL. More recently, a new San Francisco Bay Area startup called X15 Software built an engine that combines search and SQL capabilities, and aimed it specifically at analysts who work with log files (and other machine-generated data).
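
As a rough illustration of why the combination is useful, the Python sketch below narrows a set of log lines with a keyword search and then aggregates the matches with SQL. It is a toy under stated assumptions, not how SenseiDB or X15 work: the log records, the in-memory inverted index, and the helpers `search` and `count_by_host` are all made up for this example, and SQLite (from the standard library) stands in for the SQL engine.

```python
import sqlite3
from collections import defaultdict

# Hypothetical log records: (id, host, status, message).
logs = [
    (1, "web01", 500, "upstream timeout while reading response"),
    (2, "web02", 200, "request served from cache"),
    (3, "web01", 500, "upstream timeout on connect"),
    (4, "web03", 404, "file not found"),
]

# Search side: a toy inverted index from token to the ids of matching lines.
index = defaultdict(set)
for log_id, _host, _status, message in logs:
    for token in message.lower().split():
        index[token].add(log_id)

# SQL side: the same records in an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, host TEXT, status INTEGER, message TEXT)"
)
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", logs)


def search(*terms):
    """Return ids of log lines that contain every term (AND semantics)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def count_by_host(ids):
    """Aggregate the search hits with SQL: number of matching lines per host."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    query = (
        f"SELECT host, COUNT(*) FROM logs WHERE id IN ({placeholders}) "
        "GROUP BY host ORDER BY host"
    )
    return conn.execute(query, sorted(ids)).fetchall()


hits = search("upstream", "timeout")
per_host = count_by_host(hits)
```

Search answers "which lines mention this?" and SQL answers "how are those lines distributed?", which is the kind of two-step question a log analyst tends to ask.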

Bounded and unbounded data processing and analytics

One of the takeaways from Tyler Akidau’s extremely popular article on streaming is that our labels — batch and streaming — are fast becoming outdated. Batch and streaming traditionally have been used to describe compute engines, but with the rise of engines that can do both, we’re better off describing the type of data in question: bounded and unbounded/continuous. These “unified” engines come in two flavors: batch engines that can handle streaming problems (e.g., Spark Streaming), and streaming engines that can also be used for batch computations (e.g., Google Dataflow).
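
The bounded/unbounded distinction can be sketched in a few lines of Python: the same windowed count runs unchanged over a finite list and over a generator that never ends. This is a simplified illustration, not the API of Spark Streaming or Google Dataflow; the function `tumbling_counts`, the `(timestamp, key)` event shape, and the assumption that events arrive in timestamp order are all choices made for this example.

```python
from collections import Counter
from itertools import count, islice


def tumbling_counts(events, width):
    """Yield (window_start, counts) for each fixed-width window.

    Assumes events are (timestamp, key) pairs arriving in timestamp order.
    Real engines also handle late and out-of-order data; this sketch does not.
    """
    current_start = None
    counts = Counter()
    for ts, key in events:
        start = (ts // width) * width
        if current_start is None:
            current_start = start
        if start != current_start:
            # The window is complete, so emit it and open the next one.
            yield current_start, dict(counts)
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Only reached when the input is bounded.
        yield current_start, dict(counts)


# Bounded data: a finite list, processed to completion.
bounded = [(0, "a"), (3, "b"), (7, "a"), (12, "a"), (14, "b")]
batch_result = list(tumbling_counts(bounded, width=10))


def endless_events():
    """Unbounded data: one event per tick, forever."""
    for t in count():
        yield (t, "a" if t % 2 == 0 else "b")


# The same function works on the endless source; we stop after two windows.
stream_result = list(islice(tumbling_counts(endless_events(), width=10), 2))
```

The computation is identical in both cases; the only difference is whether the input ever ends, which is the point of describing the data rather than the engine.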

Streaming, of course, arises in the context of real-time processing and analytics (a major focus at this year’s conference). One side note: whenever someone tells you that few companies use or need real-time, they’re likely referring to settings where human decision-makers are in the loop (“human real-time”). That misses the mark because the true impact of these technologies will be in applications with no humans in the loop. As UC Berkeley Professor Joe Hellerstein noted a while back, “real-time is for robots.”

Post topics: Data science