8.4. CLEANING DATA

Having established the types and importance of data, we now turn to the kinds of problems quants face in managing these raw materials and how they handle such flaws. Despite the efforts of primary, secondary, and sometimes even tertiary data vendors, data are often either missing or incorrect in some way. If ignored, this problem can lead to disastrous consequences for the quant. This section addresses some of the common types of data errors and some of the better-known approaches used to deal with them. It's worth noting that although some of the following data problems seem egregious or obvious to a human, they can be hard to notice in a trading system that is processing millions of data points hourly (or even within one minute, as in the case of high-frequency traders).

The first common type of data problem is missing data, as we alluded to. Missing data occur when a piece of information existed in reality but for some reason was not provided by the data supplier. This is obviously an issue because without data, the system has nothing to go on. Worse still, when just some portion of the data is withheld, systems can make erroneous computations. Two common approaches are used to solve the problem of missing data. The first is to build the system so that it "understands" that data can in fact go missing, in which case the system doesn't act rashly when there are no data over some limited time period. For example, many databases ...
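
The first approach, a system that tolerates short gaps rather than reacting to them, might be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `latest_usable_price`, the `max_gap` parameter, and the choice to carry the last valid price forward are all hypothetical and are not taken from the text. A missing observation is represented here as `None` or NaN.

```python
import math
from typing import Optional, Sequence


def latest_usable_price(
    prices: Sequence[Optional[float]],
    max_gap: int = 3,
) -> Optional[float]:
    """Return the most recent valid price, or None if it is too stale.

    Assumption (not from the text): a gap of up to `max_gap` missing
    observations is tolerated by carrying the last valid price forward.
    A longer gap returns None, which the caller should treat as
    "take no action" rather than as a price of zero.
    """
    gap = 0  # number of missing observations seen, newest first
    for price in reversed(prices):
        if price is not None and not math.isnan(price):
            return price if gap <= max_gap else None
        gap += 1
    return None  # no valid observation at all


# Hypothetical usage: two missing ticks are tolerated, four are not.
recent = latest_usable_price([100.0, 101.0, None, None])
stale = latest_usable_price([100.0, None, None, None, None])
```

The design point is that a gap is reported explicitly as `None` instead of being silently read as a price of zero, so downstream logic can decline to trade rather than act on a spurious value.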
