Working with time series: Denoising and imputation frameworks to improve data density
conference

by Anjali Samani
February 2020
Intermediate
39m
English
O'Reilly Media, Inc.
Closed Captioning available in German, English, Spanish, French, Japanese, Korean, Portuguese (Portugal, Brazil), Chinese (Simplified), Chinese (Traditional)

Overview

Increasingly, organizations are looking beyond conventional data provided by data aggregators and vendors in their industry. But alternative data, because of the way it’s generated and collected, is typically noisy and often ephemeral. A model’s ability to learn and correctly predict future outcomes is greatly influenced by the underlying data. Clean, complete data can make the difference between deriving correct and incorrect conclusions. Incomplete data can restrict analysis to only a small set of techniques. And with alternative data sources, missing data is almost impossible to recover.

Anjali Samani (CircleUp) explains two simple frameworks for evaluating a dataset’s candidacy for smoothing and quantitatively determining the optimal imputation strategy and the number of consecutive missing values that can be imputed without material degradation in signal quality.

To extract meaningful signals from alternative data, it’s necessary to apply denoising and imputation to generate clean and complete time series. There are numerous ways to smooth a noisy data series and impute missing values, each with relative strengths and weaknesses. Smoothing removes noise from the data and allows patterns and trends to be identified more easily. It can, however, make a series appear less volatile than it is and may mask the very patterns you’re seeking to identify. So you have to know when you should and shouldn’t smooth a series, and if it is smoothed, what type of smoothing you should apply.
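As a rough sketch of the trade-off described above (this example uses pandas with a synthetic series; the window and span of 7 are arbitrary assumptions, not recommendations from the session):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", periods=120, freq="D")
# Synthetic noisy daily series: a linear trend plus Gaussian noise
s = pd.Series(np.linspace(100.0, 130.0, 120) + rng.normal(0.0, 5.0, 120), index=idx)

# Three common smoothers; a window/span of 7 is an arbitrary choice here
sma = s.rolling(window=7, center=True).mean()     # simple moving average
ewma = s.ewm(span=7).mean()                       # exponentially weighted moving average
med = s.rolling(window=7, center=True).median()   # median filter (robust to spikes)

# Smoothing lowers the measured volatility of the series -- which is exactly
# why it can also mask genuine patterns if applied indiscriminately
print(s.std(), sma.std(), ewma.std(), med.std())
```

Comparing the standard deviation of the raw and smoothed series makes the "appears less volatile than it is" caveat concrete: every smoother reduces the measured dispersion.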

Similarly, missing observations in time series can be imputed in many ways. These are covered in detail in both academic and practitioner literature. What caused the missing values in the first place and how the data is going to be used in downstream applications can often inform the most appropriate strategy for imputation. However, when there are multiple options to choose from, you need an objective way to select between strategies and to determine how many consecutive missing values can be safely imputed.
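One common way to make that choice objective is to hide values you actually have, impute them with each candidate strategy, and score the estimates against the known truth. A minimal sketch along those lines, assuming a synthetic series and an arbitrary set of candidate strategies (not necessarily those covered in the session):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2019-01-01", periods=100, freq="D")
# Synthetic 'ground truth': a smooth seasonal-looking signal
truth = pd.Series(10.0 * np.sin(np.arange(100) / 10.0) + 50.0, index=idx)

# Hide some known values so each strategy can be scored against the truth
hidden = rng.choice(np.arange(1, 99), size=15, replace=False)
observed = truth.copy()
observed.iloc[hidden] = np.nan

# Candidate imputation strategies (an assumed, illustrative shortlist)
candidates = {
    "forward_fill": observed.ffill(),
    "linear": observed.interpolate(method="linear"),
    "series_mean": observed.fillna(observed.mean()),
}

# Score each strategy by RMSE on the hidden points only
rmse = {
    name: float(np.sqrt(((est.iloc[hidden] - truth.iloc[hidden]) ** 2).mean()))
    for name, est in candidates.items()
}
best = min(rmse, key=rmse.get)
```

For a smooth series like this one, linear interpolation wins; for a series whose missingness is informative, or whose downstream use penalizes certain errors more, a different strategy (and a different scoring metric) might be the right pick.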

Prerequisite knowledge

  • A basic understanding of techniques such as simple and exponentially weighted moving averages, median filters, and linear interpolation and metrics such as root mean square error, mean absolute and percent errors, and relative error
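For reference, the error metrics listed above can be computed directly with NumPy; the actual and predicted values here are made up purely for illustration:

```python
import numpy as np

# Made-up actual vs. predicted values, purely for illustration
actual = np.array([100.0, 102.0, 98.0, 105.0])
predicted = np.array([101.0, 100.0, 99.0, 103.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))              # root mean square error
mae = np.mean(np.abs(errors))                     # mean absolute error
mape = np.mean(np.abs(errors / actual)) * 100.0   # mean absolute percent error
relative_error = np.abs(errors) / np.abs(actual)  # per-point relative error
```

RMSE weights large errors more heavily than MAE (so RMSE >= MAE always holds), while MAPE and relative error scale by the actual values, which matters when series sit at very different levels.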

What you'll learn

  • Gain frameworks for evaluating a dataset’s candidacy for smoothing and quantitatively determining the optimal imputation strategy and the number of consecutive missing values that can be imputed without material degradation in signal quality
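One plausible reading of the second framework — determining how many consecutive missing values can be imputed without material degradation — is to carve progressively longer artificial gaps into a complete reference series and measure when the imputation error exceeds a tolerance. A sketch under stated assumptions (a noiseless synthetic series, linear interpolation as the strategy, and an arbitrary RMSE tolerance — none of these specifics come from the session):

```python
import numpy as np
import pandas as pd

# Synthetic 'complete' series standing in for a dense historical sample
truth = pd.Series(10.0 * np.sin(np.arange(200) / 8.0) + 50.0)

tolerance = 1.0   # assumed acceptable RMSE; in practice tied to downstream use
start = 100       # carve artificial gaps from the middle of the series
max_safe_gap = 0

for gap in range(1, 20):
    test = truth.copy()
    test.iloc[start:start + gap] = np.nan
    imputed = test.interpolate(method="linear")
    gap_rmse = np.sqrt(((imputed.iloc[start:start + gap]
                         - truth.iloc[start:start + gap]) ** 2).mean())
    if gap_rmse <= tolerance:
        max_safe_gap = gap   # this gap length still imputes within tolerance
    else:
        break
```

The resulting `max_safe_gap` gives a quantitative cutoff: runs of missing values longer than this should be left unimputed (or flagged) rather than filled, because the reconstruction would materially degrade signal quality.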

This session is from the 2019 O'Reilly Strata Conference in New York, NY.

Publisher Resources

ISBN: 0636920372066