CHAPTER 17 Outlier Detection

Outlier detection, or anomaly detection, is a data science technique that aims to identify observations which do not fit into the data set and differ markedly from the “normal” data. In general, there are two reasons why such nonconformity may exist. First, anomalous data points can be produced by a technical error: for example, the data recording technique failed, or a survey was not scanned correctly. A particular example is limit order book data in which the bid price is recorded as a negative number. Such a datum corresponds to a failed print rather than to a situation where the interest in the particular asset is negative. Such a transaction would not be technically possible on a properly working venue.
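
The first kind of outlier can often be caught with a plain validity check before any statistics are involved. The following is a minimal Python sketch, assuming for illustration that a failed print shows up as a non-positive bid or ask price; the `Quote` record, its field names, and the `split_failed_prints` helper are hypothetical and not taken from any particular data feed.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    # Hypothetical top-of-book record; field names are illustrative only.
    sym: str
    bid: float
    ask: float


def split_failed_prints(quotes):
    """Separate quotes with a non-positive bid or ask from the rest.

    Assumption: a price at or below zero cannot occur on a properly
    working venue, so it is treated as a recording error.
    """
    valid, failed = [], []
    for q in quotes:
        if q.bid > 0 and q.ask > 0:
            valid.append(q)
        else:
            failed.append(q)
    return valid, failed


quotes = [
    Quote("ABC", 100.25, 100.30),
    Quote("ABC", -1.0, 100.31),  # illustrative failed print
    Quote("ABC", 100.26, 100.31),
]
valid, failed = split_failed_prints(quotes)
```

Keeping the rejected records in a separate list, rather than dropping them silently, makes it possible to audit how often the recording process fails.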

Second, and more critically, the nonconformity may reflect genuinely anomalous behaviour in the data. An example is bank fraud in the context of bank activity records. When a fraudster commits the unlawful act, she often tries to make the transaction look usual. This, however, is generally not possible, as the knowledge of what usual transactions look like is not publicly available. The outliers may thus appear in the data analysis as records whose transaction size, timing, or frequency does not fit the common patterns.
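
To make the idea of an unusual transaction size concrete, the sketch below flags values that lie far from the bulk of the data. It uses the modified z-score based on the median and the median absolute deviation (MAD), which is one common robust rule and not necessarily the method developed in this chapter. The function name `mad_outliers`, the sample amounts, and the conventional cut-off of 3.5 are assumptions chosen for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median. The cut-off of 3.5 is
    a common convention, used here as an assumption.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values coincide; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical transaction sizes with one conspicuous record.
sizes = [120.0, 95.0, 130.0, 110.0, 105.0, 9800.0, 115.0, 100.0]
flagged = mad_outliers(sizes)
```

The median and the MAD are used instead of the mean and the standard deviation because a single extreme record barely moves them, so the outlier cannot mask itself by inflating the scale estimate. The same rule could be applied to the time between trades or to daily transaction counts.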

The outlier detection is by its nature a data science technique, ...
