AI & ML Business Data Innovation Research Security

Try the O’Reilly learning platform

With the O’Reilly learning platform, you get the resources and guidance to keep your skills sharp and stay ahead. Try it free for up to 14 days.

Start trial

Try a course for free

Join a live online event on the O’Reilly platform to learn from the experts shaping tech.

See what’s coming soon

Get the Radar Trends newsletter

Your email

Country

Please read our privacy policy.

Content > Topics > Data

Interactive big data analysis using approximate answers

As data sizes continue to grow, interactive query systems may start adopting the sampling approach central to BlinkDB.

By Ben Lorica August 17, 2013 • 3 minute read

LinkedIn X Facebook Threads Bluesky Reddit

Fractal art. (source: Pixabay)

Interactive query analysis for (Hadoop scale data) has recently attracted the attention of many companies and open source developers – some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, and a combination of other techniques including data co-partitioning, caching (into main memory), runtime code generation, and columnar storage.

One approach that hasn’t been exploited as much is sampling. By this I mean employing samples to generate approximate answers, and speed up execution. Database researchers have written papers on approximate answers, but few working (downloadable) systems are actually built on this approach.

Approximate query engine from U.C. Berkeley’s Amplab

An interesting, open source database released yesterday⁰ uses sampling to scale to big data. BlinkDB is a massively-parallel, approximate query system from UC Berkeley’s Amplab. It uses a series of data samples to generate approximate answers. Users compose queries by specifying either error bounds or time constraints, BlinkDB uses sufficiently large random samples to produce answers. Because random samples are stored in memory¹, BlinkDB is able to provide interactive response times:

Decision making and uncertainty

BlinkDB uses sampling to provide fast query response times. A side-effect is that query results are accompanied by error bars². While it’s not the central goal of BlinkDB, approximate query systems change (in a good way) how decision makers think of business intelligence. Since answers in BlinkDB come with error bars, focus shifts from obsessing over the “one true answer”, to recognizing that variability and noise are present in data sets³. This requires a certain amount of reorientation as error representation lead to more nuanced discussions (imagine a scenario⁴ where an error bar spans a region between “fine” and “disaster”).

Sampling as an alternative

As data sets continue to grow, other interactive query systems may start adopting the sampling approach central to BlinkDB. Other areas of data science routinely use sampling: stats students learn early on the difference between a “statistic” & a “parameter”. Within machine-learning, a popular approach to scaling up algorithms is sampling⁵. When prototyping a data processing pipeline, I sometimes use small random samples to get a sense⁶ of the data. Sampling is also starting to appear in commercial big data systems. Users of Datameer build their analysis workflows using samples, and when they’re ready, run their MapReduce jobs against massive data sets.

To learn more about BlinkDB, Spark, Shark, Mesos, and other components of the Berkeley Data Analytics Stack, come to the third AMP Camp Big Data Bootcamp at the end of August.

Related posts:

Post topics: Data

Cloud Computing

Data Engineering

Data Science

AI & ML

Programming Languages

Software Architecture

IT/Ops

Security

Design

Business

Soft Skills

Try the O’Reilly learning platform

Try a course for free

Get the Radar Trends newsletter

Thank you for subscribing to the O’Reilly Radar Trends to Watch newsletter.

Interactive big data analysis using approximate answers