In recent months, I've been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera's Todd Lipcon unveiled an open source storage layer — Kudu — that's good at both table scans (analytics) and random access (updates and inserts). During the latest episode of the O'Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released.
HDFS and Hbase
[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can't edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by- row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you're putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you're kind of out of luck on HDFS.
... HBase is this layer on top that gives you those capabilities. Kind of the downside — why wouldn't you always use HBase — is that HBase's performance, when you're doing these large scans, like looking over your entire data set to compute an aggregate, the performance is significantly worse than the throughput-oriented performance that HDFS gives you. Occasionally, we make the comparison that HBase is more like a Ferrari, and HDFS is more like a freight train. The freight train is much better for moving a lot of data across the country; the Ferrari is a lot better for zipping around — but you're not going to transport a load of oil on a Ferrari.
Building a new storage system that complements HBase and HDFS
As I mentioned, you're kind of stuck between a rock and a hard place with HBase being good at random access but not great at scans. HDFS is terrific at scans, can't do random access at all. The idea with Kudu is, we're building a data store that [is] pretty darn good at both. ... If you're 70-80% of the way there on both axis, then the convenience you get out of having a single system, for most people, will win out because engineering time is expensive and computers are cheap.
... In the IoT use case, you're probably less interested in updates, but one thing that is popular is random access in that workload. You may have a bunch of time series … You do some big analytics to do some modeling, but then someone actually comes and wants to repair a particular machine, and they need to get pretty quick access to scan that time series from a particular date range, and they need to do that in a sub-second kind of timescale. Scanning an entire data set to find that data is not going to work. If you put this data in Kudu, you get very low latency random access to individual time series, individual time ranges, and also very, very good storage efficiency because we're a true column store, so we can do things like delta encoding and bit packing, do some more advanced techniques for compression that can get us down to maybe one byte per record.
... I should mention the actual access mechanisms for Kudu. We're just a data store. We don't have our own SQL parser or anything like that. If you want to query the Kudu data, you can use Impala, and you'll get very good query performance from Impala. You could also just use Spark, so we can use RDD in Spark, and map it against the Kudu table.
Specialized systems: Convenience vs. performance
I think specialized systems will always have their place. There are some workloads where you're really resource constrained and performance is the most important metric, or performance per dollar is the most important metric. I sort of see this as a comparison with operating systems, where almost everybody uses general purpose operating systems, and if you look back five or 10 years, mobile phones didn't use general purpose operating systems. Now, most of them are Linux-based. ... That said, there are still embedded systems out there; there are still companies that produce embedded real-time OSes, and those aren't going to go anywhere, either.
... I think the same thing will happen with data storage, where people say, 'Hey, if this other system gets me 80% of the way there, in most cases that's fine. In the rare case where I'm actually running this on 1,000 nodes or 5,000 nodes or something, and I have only this one application, I don't care about convenience of three different access mechanisms; I just want this one specialized thing, and it's going to save me 300 machines' worth of budget to do it, maybe it's worth it.' For most folks, the engineering time is just going to win out over buying another couple machines. I think in distributed systems in particular, this is the case, where really the performance tradeoff can just be solved by buying more boxes up to a certain point.
- Kudu: Resolving transactional and analytic trade-offs in Hadoop, video of Todd Lipcon's recent presentation at Strata + Hadoop World NYC.
- Specialized and hybrid data management and processing engines
- Coming full circle with Bigtable and HBase: Data Show episode featuring Michael Stack
- From search to distributed computing to large-scale information extraction: Data Show episode featuring Mike Cafarella, co-founder of Hadoop, Nutch, and ClearCutAnalytics.