The ability to accurately forecast a sequence into the future is critical in many industries: finance, supply chain, and manufacturing are just a few examples. Classical time series techniques have served this task for decades, but now deep learning methods—similar to those used in computer vision and automatic translation—have the potential to revolutionize time series forecasting as well.

Deep learning neural networks have gained a lot of attention in recent years because they apply to many real-life problems (such as fraud detection, spam email filtering, finance, and medical diagnosis) and produce actionable results. In forecasting, however, deep learning methods have generally been developed and applied to univariate time series scenarios, where the series consists of single observations recorded sequentially over equal time increments. In these settings, they have often performed worse than naïve and classical forecasting methods, such as exponential smoothing (ETS) and autoregressive integrated moving average (ARIMA). This has led to a general misconception that deep learning models are ineffective for time series forecasting, and many data scientists wonder whether it’s really necessary to add another class of methods, such as convolutional neural networks or recurrent neural networks, to their time series toolkit.

In this post, I'll discuss some of the practical reasons why data scientists may still want to think about deep learning when they build time series forecasting solutions.

## Deep learning neural networks: Some foundational concepts

The goal of machine learning is to find features to train a model that transforms input data (such as pictures, time series, or audio) into a given output (such as captions, price values, or transcriptions). Deep learning is a subset of machine learning algorithms that learn to extract these features by representing input data as vectors and transforming them with a series of linear algebra operations into a given output.

Data scientists then evaluate whether the output is what they expected using an equation called a loss function. The goal of the process is to use the result of the loss function from each training input to guide the model to extract features that will result in a lower loss value on the next pass. This process has been used to cluster and classify large volumes of information: for example, millions of satellite images; thousands of video and audio recordings from YouTube; and historical, textual, and sentiment data from Twitter.
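As a rough illustration of how a loss value guides training, the sketch below fits a one-feature linear model by gradient descent on a mean squared error loss. The model, the toy data, the learning rate, and the names `mse_loss` and `fit_linear` are assumptions chosen for illustration; a deep network runs the same kind of loop over many more parameters.

```python
def mse_loss(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def fit_linear(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y ~ w * x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []  # loss value recorded at the start of every pass
    for _ in range(epochs):
        preds = [w * x + b for x in xs]
        history.append(mse_loss(preds, ys))
        # Gradients of the MSE with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Step against the gradient so the next pass has a lower loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, history = fit_linear(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, first loss={history[0]:.3f}, last loss={history[-1]:.6f}")
```

The recorded `history` shrinks from pass to pass, which is the "lower loss value on the next pass" behavior described above.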

Deep learning neural networks have three main intrinsic capabilities:

- They can learn arbitrary mappings from inputs to outputs
- They support multiple inputs and outputs
- They can automatically extract patterns in input data that span long sequences

Thanks to these three characteristics, deep learning neural networks can offer a lot of help when data scientists deal with more complex but still very common problems, such as time series forecasting.

Here are three reasons data scientists should consider adding deep learning to their time series toolkits.

## Reason #1: Deep learning neural networks are capable of automatically learning and extracting features from raw and imperfect data

Time series is a type of data that measures how things change over time. In time series, *time* isn’t just a metric, but a primary axis. This additional dimension represents both an opportunity and a constraint for time series data because it provides a source of additional information but makes time series problems more challenging, as specialized handling of the data is required. Moreover, this temporal structure can carry additional information, like trends and seasonality, that data scientists need to deal with in order to make their time series easier to model with classical forecasting methods.

Neural networks can be useful for time series forecasting problems because they eliminate the immediate need for massive feature engineering, data scaling procedures, and differencing to make the data stationary.
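For context, this is the kind of manual preparation that classical methods typically call for. The sketch below shows plain lag differencing and its inverse; the function names and the toy series are illustrative assumptions, and the choice of `lag` is exactly the sort of judgement call the analyst has to make.

```python
def difference(series, lag=1):
    """Subtract the value `lag` steps earlier from each observation."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def invert_difference(first_values, diffs, lag=1):
    """Rebuild the original series from its first `lag` values and the differences."""
    restored = list(first_values)
    for d in diffs:
        restored.append(restored[-lag] + d)
    return restored


# Toy series with an accelerating upward trend (hypothetical values).
series = [10.0, 12.0, 15.0, 19.0, 24.0]
first_order = difference(series)        # removes the level, a trend remains
second_order = difference(first_order)  # constant, so the trend is removed
restored = invert_difference(series[:1], first_order)

print(first_order, second_order, restored)
```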

In real-world time series scenarios—for example, weather forecasting, air quality and traffic flow forecasting, and forecasting scenarios based on streaming IoT devices like geo-sensors—irregular temporal structures, missing values, heavy noise, and complex interrelationships between multiple variables present limitations for classical forecasting methods. These techniques typically rely on clean, complete data sets in order to perform well: missing values, outliers, and other imperfect features are generally unsupported.

Even on cleaner, more artificial data sets, classical forecasting methods are based on the assumption that a linear relationship and a fixed temporal dependence exist among the variables of a data set, and this assumption by default excludes the possibility of exploring more complex (and probably more interesting) relationships among variables. Data scientists must make subjective judgements when preparing data for classical analysis (like the lag period used to remove trends), which is time consuming and introduces human biases to the process. In contrast, neural networks are robust to noise in input data and in the mapping function, and can even support learning and prediction in the presence of missing values.
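The post does not say how missing values are presented to a network. One common approach, shown here purely as an assumption, is to replace each gap with a placeholder and pass a parallel mask so the model can learn to tell real observations from filled ones. The name `encode_missing` and the fill value are hypothetical.

```python
import math


def encode_missing(series, fill_value=0.0):
    """Return (values, mask): gaps become fill_value and are flagged with 0.0 in the mask."""
    values, mask = [], []
    for v in series:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        values.append(fill_value if missing else float(v))
        mask.append(0.0 if missing else 1.0)
    return values, mask


# Sensor readings with two gaps (hypothetical values).
readings = [1.0, None, 3.0, float("nan")]
values, mask = encode_missing(readings)
print(values, mask)
```

Both lists would then be fed to the network as two input channels for the same time steps.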

Convolutional neural networks (CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. CNNs have been successful in identifying faces, objects, and traffic signs in addition to powering vision in robots and self-driving cars. CNNs derive their name from the “convolution” operator. The primary purpose of convolution in the case of CNNs is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. In other words, the model learns how to automatically extract the features from the raw data that are directly useful for the problem being addressed. This is called "representation learning" and the CNN achieves this in such a way that the features are extracted regardless of how they occur in the data, so-called "transform" or "distortion" invariance.

The ability of CNNs to learn and automatically extract features from raw input data can be applied to time series forecasting problems. A sequence of observations can be treated like a one-dimensional image that a CNN model can read and refine into the most relevant elements. This capability of CNNs has been demonstrated to great effect on time series classification tasks, such as indoor movement prediction using wireless sensor strength data to predict the location and motion of subjects within a building.
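To make the "one-dimensional image" idea concrete, here is a minimal sketch of what a single convolutional filter does to a sequence. In a real CNN the kernel weights are learned during training; the hand-picked kernel below, which responds to upward jumps, and the toy series are assumptions for illustration.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding) and return one value per position."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Keep positive responses and zero out the rest."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Summarize each non-overlapping window by its strongest response."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# A flat series with a temporary step up (hypothetical values).
series = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 1.0, 1.0]
kernel = [-1.0, 1.0]  # responds to a rise between neighboring points

feature_map = conv1d(series, kernel)
activated = relu(feature_map)
pooled = max_pool1d(activated, size=2)

print(feature_map)
print(activated)
print(pooled)
```

The pooled output still flags the jump but no longer depends on its exact position inside each window, which is a small-scale version of the invariance described above.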

## Reason #2: Deep learning supports multiple inputs and outputs

Real-world time series forecasting is challenging for several reasons, such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites. Deep learning algorithms can be applied to time series forecasting problems and offer benefits such as the ability to handle multiple input variables with noisy complex dependencies. Specifically, neural networks can be configured to support an arbitrary but fixed number of inputs and outputs in the mapping function. This means that neural networks can directly support multivariate inputs, providing direct support for multivariate forecasting.

A univariate time series, as the name suggests, is a series with a single time-dependent variable. For example, suppose we want to predict the next energy consumption value for a specific location. In a univariate time series scenario, our data set will be based on two variables: time values and historical energy consumption observations.

A multivariate time series has more than one time-dependent variable. Each variable depends not only on its past values, but also has some dependency on other variables. This dependency is used for forecasting future values. Let’s consider the above example again. Now suppose our data set includes weather data, such as temperature values, dew point, wind speed, cloud cover percentage, etc., along with the energy consumption value for the past four years. In this case, there are multiple variables to be considered to optimally predict an energy consumption value. A series like this would fall under the category of a multivariate time series.
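The difference between the two cases shows up in the shape of the training samples. The sketch below frames a series as supervised learning with a sliding window; `make_windows`, the window length, and the energy and temperature values are all hypothetical.

```python
def make_windows(rows, n_lags, target_index=0):
    """Turn time-ordered rows into (X, y) pairs.

    Each X item holds `n_lags` consecutive rows; y is the target
    variable observed at the following time step.
    """
    X, y = [], []
    for t in range(n_lags, len(rows)):
        X.append(rows[t - n_lags:t])
        y.append(rows[t][target_index])
    return X, y


# Hypothetical hourly observations.
energy = [30.0, 32.0, 35.0, 33.0, 31.0]
temperature = [18.0, 21.0, 25.0, 22.0, 19.0]

# Univariate: one variable per time step.
uni_X, uni_y = make_windows([[e] for e in energy], n_lags=2)

# Multivariate: energy plus temperature at every time step.
multi_rows = [[e, t] for e, t in zip(energy, temperature)]
multi_X, multi_y = make_windows(multi_rows, n_lags=2)

print(uni_X[0], uni_y[0])      # two lagged energy values and the next one
print(multi_X[0], multi_y[0])  # same window, with temperature alongside
```

The targets are identical in both cases; only the width of each input row changes, which is why a network with a configurable input size handles either.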

With neural networks, an arbitrary number of output values can be specified, offering direct support for more complex time series scenarios that require multivariate forecasting and even multi-step forecast methods. There are two main approaches to using deep learning methods to make multi-step forecasts: 1) direct, where a separate model is developed to forecast each forecast lead time; and 2) recursive, where a single model is developed to make one-step forecasts, and the model is used recursively where prior forecasts are used as input to forecast the subsequent lead time.

The recursive approach can make sense when forecasting a short contiguous block of lead times, whereas the direct approach may make more sense when forecasting discontiguous lead times. The direct approach may be more appropriate when we need to forecast a mixture of multiple contiguous and discontiguous lead times over a period of a few days; such is the case, for example, with air pollution forecasting problems or for anticipatory shipping forecasting, used to predict what customers want and then ship the products automatically.
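The two strategies can be sketched with any one-step learner. To keep the example short, the code below uses a simple least-squares line on a single lag as a stand-in for a neural network; that substitution, the function names, and the toy series are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def fit_horizon(series, horizon):
    """Fit a model that maps the value at time t to the value at t + horizon."""
    return fit_line(series[:-horizon], series[horizon:])


def recursive_forecast(series, steps):
    """One one-step model; each forecast is fed back in as the next input."""
    slope, intercept = fit_horizon(series, 1)
    forecasts, last = [], series[-1]
    for _ in range(steps):
        last = slope * last + intercept
        forecasts.append(last)
    return forecasts


def direct_forecast(series, steps):
    """A separate model per lead time, all using the last observed value."""
    forecasts = []
    for horizon in range(1, steps + 1):
        slope, intercept = fit_horizon(series, horizon)
        forecasts.append(slope * series[-1] + intercept)
    return forecasts


# Toy series with a steady upward trend (hypothetical values).
series = [float(t) for t in range(1, 11)]
recursive = recursive_forecast(series, steps=3)
direct = direct_forecast(series, steps=3)
print(recursive, direct)
```

On this noiseless series both strategies agree. With real data they diverge: recursive forecasts accumulate their own errors step by step, while direct forecasts need one fitted model for every lead time of interest.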

Key to the use of deep learning algorithms for time series forecasting is the choice of multiple input data. We can think about three main sources of data that can be used as input and mapped to each forecast lead time for a target variable; they are: 1) univariate data, such as lag observations from the target variable that is being forecasted; 2) multivariate data, such as lag observations from other variables (for example, weather and targets in case of air pollution forecasting problems); 3) metadata, such as data about the date or time being forecast. Data can be drawn from across all chunks, providing a rich data set for learning a mapping from inputs to the target forecast lead time.
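A minimal sketch of combining the three input sources into one feature row per forecast is shown below. The feature names, the single exogenous variable, and the hourly timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def build_features(timestamps, target, exog, n_lags, lead):
    """Return (features, label) pairs for forecasting `lead` steps ahead."""
    samples = []
    for t in range(n_lags - 1, len(target) - lead):
        forecast_time = timestamps[t + lead]
        features = {}
        for k in range(n_lags):
            features[f"target_lag_{k}"] = target[t - k]  # 1) univariate lags
            features[f"exog_lag_{k}"] = exog[t - k]      # 2) lags of another variable
        features["hour"] = forecast_time.hour            # 3) metadata about the
        features["day_of_week"] = forecast_time.weekday()  # time being forecast
        samples.append((features, target[t + lead]))
    return samples


# Hypothetical hourly data starting on Monday, 1 January 2024.
start = datetime(2024, 1, 1, 0)
timestamps = [start + timedelta(hours=h) for h in range(6)]
target = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]      # e.g. pollutant level
exog = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]         # e.g. wind speed

samples = build_features(timestamps, target, exog, n_lags=2, lead=1)
first_features, first_label = samples[0]
print(first_features, first_label)
```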

## Reason #3: Deep learning networks are good at extracting patterns in input data that span over relatively long sequences

Deep learning is an active research area, and CNNs are not the only class of neural network architectures being used for time series and sequential data. Recurrent neural networks (RNNs) were created in the 1980s but have only recently gained popularity, helped by the increased computational power of graphics processing units. They are especially useful with sequential data because each neuron or unit can use its internal memory to maintain information about the previous input. An RNN has loops that allow information to be carried across neurons while reading in input.
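The loop can be written down in a few lines. This sketch uses a single recurrent unit with scalar weights; the weight values are arbitrary assumptions, whereas a trained network would learn them.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent update: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Read the sequence one value at a time, carrying the hidden state forward."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# Two sequences that end with the same inputs but start differently.
with_spike = run_rnn([1.0, 0.0, 0.0])
without_spike = run_rnn([0.0, 0.0, 0.0])
print(with_spike[-1], without_spike[-1])
```

The final state differs between the two runs even though the last two inputs are identical, because the hidden state still carries a trace of the first input.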

However, a simple recurrent network suffers from a fundamental problem of not being able to capture long-term dependencies in a sequence. This is a major reason why RNNs faded from practice for a while until some great results were achieved using a long short-term memory (LSTM) unit inside the neural network. Adding the LSTM to the network is like adding a memory unit that can remember context from the very beginning of the input.
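One widely cited cause of this limitation is the vanishing gradient: the training signal that links a late state to an early one is a product of many factors, each typically smaller than one. The sketch below computes that product for a single-unit tanh RNN; the weights and the input sequence are assumptions chosen for illustration.

```python
import math


def gradient_wrt_initial_state(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Track d h_t / d h_0 over time for a scalar tanh RNN using the chain rule."""
    h = 0.0
    grad = 1.0
    grads = []
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h + b)
        # Derivative of tanh(z) is 1 - tanh(z)^2, and dz/dh_prev is w_h.
        grad *= w_h * (1.0 - h * h)
        grads.append(grad)
    return grads


grads = gradient_wrt_initial_state([1.0] * 50)
print(f"after 1 step: {grads[0]:.4f}, after 50 steps: {grads[-1]:.2e}")
```

After 50 steps the influence of the initial state on the current one is close to zero, so gradient-based training has very little signal with which to learn a dependency that long.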

LSTM neural networks are a particular type of RNN that have internal contextual state cells that act as long-term or short-term memory cells. The output of the LSTM network is modulated by the state of these cells. This is a very important property when we need the prediction of the neural network to depend on the historical context of inputs, rather than only on the very last input. LSTMs add native support for input data composed of sequences of observations. The sequence adds a new dimension to the function being approximated: instead of mapping inputs to outputs alone, the network can learn a mapping function for the inputs over time to an output. Video processing is a useful example for understanding how LSTM networks work: in a movie, what happens in the current frame is heavily dependent on what was in the previous frame. Over a period of time, an LSTM network tries to learn what to keep and how much to keep from the past, and how much information to keep from the present state, which makes it powerful compared to other types of neural networks.
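The "what to keep and how much" behavior comes from gates. Below is a minimal sketch of the standard LSTM update for a single unit with scalar weights. The parameter values are hand-picked assumptions (a trained network would learn them), chosen so that the forget gate stays close to one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps a gate name to (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return activation(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the new candidate to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # new information from the current input
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: remember strongly, write mainly when x is large.
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (4.0, 0.0, -2.0),
    "output": (0.0, 0.0, 2.0),
    "candidate": (2.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
cell_states = []
for x in [1.0] + [0.0] * 19:  # one event followed by 19 quiet steps
    h, c = lstm_step(x, h, c, params)
    cell_states.append(c)

print(f"cell state after the event: {cell_states[0]:.3f}")
print(f"cell state 19 steps later: {cell_states[-1]:.3f}")
```

With these settings the cell state still holds most of the value written at the first step, which is the context retention a simple recurrent unit struggles to provide.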

This capability can be used in any time series forecasting context, where it can be extremely helpful to automatically learn the temporal dependence from the data. In the simplest case, the network is shown one observation at a time from a sequence and can learn which prior observations are important and how they are relevant to forecasting. The model learns a mapping from inputs to outputs, learns what context from the input sequence is useful for that mapping, and can dynamically change this context as needed. Not surprisingly, this approach has often been used in the finance industry to build models that forecast exchange rates, based on the idea that past behavior and price patterns may affect currency movements and can be used to predict future price behavior and patterns.

On the other hand, there are downsides that data scientists need to be careful about with neural network architectures. Large volumes of data are required, and models require hyper-parameter tuning and multiple optimization cycles.

## Conclusion

Deep learning neural networks are powerful engines capable of learning arbitrary mappings from inputs to outputs, supporting multiple inputs and outputs, and automatically extracting patterns in input data that span long sequences of time. All these characteristics together make neural networks helpful tools when dealing with more complex time series forecasting problems that involve large amounts of data, multiple variables with complicated relationships, and even multi-step time series tasks. A lot of research has been invested in using neural networks for time series forecasting, with modest results. Perhaps the most promising area in the application of deep learning methods to time series forecasting is in the use of CNNs, LSTMs, and hybrid models.

## Useful resources

Recent improvements in tools and technologies have meant that techniques like deep learning are now being used to solve common problems, including forecasting, text mining, language understanding, and personalization. Below are some useful resources and presentations involving deep learning:

- Forecasting Financial Time Series with Deep Learning on Azure
- Deep Learning and AI Frameworks
- Silver, Gold & Electrum: 3 Data Techniques for Multi-Task Deep Learning
- Neural Networks for Forecasting Financial and Economic Time Series
- Deep Learning Virtual Machine
- PyTorch on Azure: Deep learning in the Oil and Gas Industry
- AI and Machine Learning in the Enterprise
- Deep Learning
- Temporal Data and Time Series Analytics