The re-emergence of time-series

Researchers begin to scale up pattern recognition, machine-learning, and data management tools.

By Ben Lorica April 6, 2013 • 3 minute read


My first job after leaving academia was as a quant¹ for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional forays into machine learning (clustering, classification, anomaly detection). More recently, I’ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.

Time series and big data

Over the last six months I’ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported; the growing importance of time series, much less so. Sources include data from consumer devices (gesture recognition & user interface design), sensors (apps for “self-tracking”), machines (systems in data centers), and health care. In fact, some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.

Search and machine-learning at scale

Before doing anything else, one has to be able to run queries at scale. Last year I wrote about a team of researchers at UC Riverside who took an existing search algorithm (dynamic time warping²) and got it to scale to time series with trillions of points. There are many potential applications of their research. One that I highlighted is from health care:

… a doctor who needs to search through EEG data (with hundreds of billions of points), for a “prototypical epileptic spike”, where the input query is a time-series snippet with thousands of points.

As the data grows, the UCR dynamic time-warping algorithm takes longer to finish (a few hours for time series with trillions of points). In general, (academic) researchers who’ve spent weeks or months collecting data are fine waiting a few hours for a pattern recognition algorithm to finish. But users who come from different backgrounds (e.g., web companies) may not be as patient. Fortunately, “search” is an active research area, and faster (distributed) pattern recognition systems will likely emerge soon.
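To make the search problem concrete, here is a minimal Python sketch of dynamic time warping used to locate a short query inside a longer series. It is a textbook, brute-force version for illustration only, not the UCR team's implementation: the function names (dtw_distance, best_match) and the optional window parameter are assumptions of this sketch, and it omits the optimizations a trillion-point search would need.

```python
import math

def dtw_distance(a, b, window=None):
    """Dynamic time warping distance between two numeric sequences.

    Uses squared pointwise cost and returns the square root of the
    cheapest warping path. `window` (hypothetical parameter) limits how
    far the two indices may drift apart; None means no limit.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the final cell.
    window = max(window, abs(n - m))

    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: step in a, step in b, step in both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])

def best_match(series, query, window=None):
    """Slide the query over the series; return (offset, distance) of the
    closest equal-length subsequence under DTW. Brute force."""
    q = len(query)
    best_offset, best_dist = -1, math.inf
    for start in range(len(series) - q + 1):
        d = dtw_distance(series[start:start + q], query, window)
        if d < best_dist:
            best_offset, best_dist = start, d
    return best_offset, best_dist

# Toy data: a "spike" embedded in an otherwise flat signal.
series = [0, 0, 0, 1, 5, 1, 0, 0, 0, 0]
query = [1, 5, 1]
offset, dist = best_match(series, query)
```

On this toy input the spike is found at offset 3 with distance 0.0. Each comparison costs time proportional to the product of the two lengths, and the scan repeats it at every offset, which is why a naive version like this one cannot keep up once the series reaches billions of points.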

Once you scale up search, other interesting problems can be tackled. The UCR team is using their dynamic time-warping algorithm in tasks like classification, clustering, and motif³ discovery. Other teams are investigating techniques from signal processing, pattern recognition, and trajectory tracking.

Some data management tools that target time series

One of the more popular sessions at last year’s HBase Conference was on OpenTSDB, a distributed time-series database built on top of HBase. It’s used to store and serve time-series metrics, and comes with tools (based on GNUPlot) for charting. Originally named OpenTSDB2, KairosDB was written primarily for Cassandra (but also works with HBase). OpenTSDB emphasizes tools for readying data for charts (interpolating to fill in missing values), whereas KairosDB distinguishes between data and the presentation of data.
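As a rough illustration of what “interpolating to fill in missing values” can mean when preparing a series for a chart, here is a small Python sketch of linear interpolation over a regular time grid. It is not OpenTSDB's or KairosDB's code: the function name fill_gaps, the (timestamp, value) tuple layout, and the fixed step are all assumptions made for this example.

```python
def fill_gaps(points, step):
    """Fill missing samples by linear interpolation.

    points: list of (timestamp, value) tuples sorted by timestamp,
            with timestamps assumed to lie on a grid spaced by `step`.
    Returns a new list that also contains the interpolated points.
    """
    if not points:
        return []
    filled = [points[0]]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        t = t0 + step
        while t < t1:
            # Fraction of the way from the left sample to the right one.
            frac = (t - t0) / (t1 - t0)
            filled.append((t, v0 + frac * (v1 - v0)))
            t += step
        filled.append((t1, v1))
    return filled

# Samples every 10 seconds, with three readings missing between 10 and 50.
raw = [(0, 1.0), (10, 2.0), (50, 6.0)]
charted = fill_gaps(raw, step=10)
```

Here charted gains the points (20, 3.0), (30, 4.0), and (40, 5.0). Filling gaps this way changes what the chart shows rather than what was measured, which is one reason a system might prefer to keep the stored data separate from its presentation.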

Startup TempoDB offers a reasonably priced, cloud-based service for storing, retrieving, and visualizing time-series data. Still a work in progress, SciDB is an open source database project designed specifically for data-intensive science problems. The designers of the system plan to make time-series analysis easy to express within SciDB.

Post topics: Data
