## Earth is an outlier — the theory

## What are outliers?

We live on an outlier. Earth is the only lump of rock known to harbor life in the Milky Way galaxy. The other planets in our galaxy are inliers: the normal data points in a giant dataset of stars and planets.

There are many definitions of outliers. In simple terms, outliers are data points that are significantly different from the majority of a dataset. They are the rare, extreme samples that don't conform to the pattern set by the inliers.

Statistically speaking, outliers come from a different distribution than the rest of the samples in a feature. They represent statistically significant abnormalities.

These definitions depend on what we consider “normal”. For example, it is perfectly normal for CEOs to make millions of dollars, but if we add their salary information to a dataset of household incomes, they become abnormal.

Outlier detection is the field of statistics and machine learning that uses various techniques and algorithms to detect such extreme samples.

## Why bother with outlier detection?

But why, though? Why do we need to find them? What’s the harm in them? Well, consider this distribution of 13 numbers, 12 of which range from 50 to 100. The remaining data point, 2534, is clearly an outlier.

```python
import numpy as np

array = [97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65]
```

Mean and standard deviation are two of the most heavily used and critical attributes of a distribution, so we must feed realistic values of these two metrics to our machine learning models.

Let’s calculate them for our sample distribution.

The mean:

```python
np.mean(array)
# 260.9230769230769
```

The standard deviation:

```python
np.std(array)
# 656.349984212042
```

Now, let’s do the same, removing the outlier:

```python
# Array without the outlier
array_wo = [97, 87, 95, 62, 53, 66, 60, 68, 90, 52, 63, 65]

np.mean(array_wo)
# 71.5

np.std(array_wo)
# 15.510748961069977
```

As you can see, the outlier-free distribution has a roughly 3.6 times smaller mean and a roughly 42 times smaller standard deviation.

Apart from skewing the actual values of the mean and standard deviation, outliers also create noise in training data. They introduce trends and attributes that distract machine learning models from the actual patterns in the data, resulting in performance losses.

Therefore, it is paramount to find outliers, explore the reasons for their presence, and remove them if appropriate.

## What you will learn in this tutorial

Once you understand the important theory behind the process, outlier detection is straightforward to perform in code with libraries like PyOD or Sklearn. For example, here is how to perform outlier detection with the popular Isolation Forest algorithm.

```python
from pyod.models.iforest import IForest

iforest = IForest().fit(training_features)

# 0 for inliers, 1 for outliers
labels = iforest.labels_
outliers = training_features[labels == 1]

len(outliers)
# 136
```

It only takes a few lines of code.

Therefore, this tutorial will focus more on theory. Specifically, we will look at outlier detection in the context of unsupervised learning, the concept of contamination in datasets, the difference between anomalies, outliers, and novelties, and univariate/multivariate outliers.

Let’s get started.

## Outlier detection is an unsupervised problem

Unlike many other ML tasks, outlier detection is an unsupervised learning problem. What do we mean by that?

For example, in classification, we have a set of features that map to specific outputs. We have labels that tell us which sample is a dog and which one is a cat.

In outlier detection, that’s not the case. We have no prior knowledge of outliers when we are presented with a new dataset. This causes several challenges (but nothing we can’t handle).

First, we won’t have an easy way of measuring the effectiveness of outlier detection methods. In classification, we use metrics such as accuracy or precision to measure how well a model fits the training dataset. In outlier detection, we can’t use these metrics because we have no labels that allow us to compare predictions to the ground truth.

And since we can’t use traditional metrics to measure performance, we can’t efficiently perform hyperparameter tuning either. This makes it even harder to find the best outlier classifier (an algorithm that returns inlier/outlier labels for each dataset row) for the task at hand.

However, don’t despair. We will see two excellent workarounds in the next tutorial.

## Anomalies vs. outliers vs. novelties

You’ll see the terms “anomalies” and “novelties” often cited next to outliers in many sources. Even though they are close in meaning, there are important distinctions.

An anomaly is a general term that encompasses anything out of the ordinary and abnormal. Anomalies can refer to irregularities in either training or test sets.

As for outliers, they exist only in training data. Outlier detection refers to finding abnormal data points in the training set. Outlier classifiers only perform a `fit` on the training data and return inlier/outlier labels.

On the other hand, novelties exist only in the test set. In novelty detection, you have a clean, outlier-free dataset, and you are trying to see if new, unseen observations have different attributes than the training samples. Hence, irregular instances in a test set become novelties.
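To make the distinction concrete, here is a minimal sketch of novelty detection using scikit-learn’s `LocalOutlierFactor` (the data and parameter values are made up for illustration). With `novelty=True`, the classifier fits on a clean training set and then scores only new, unseen samples:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# A clean, outlier-free training set: simulated measurements around 70
X_train = rng.normal(loc=70, scale=10, size=(200, 1))

# novelty=True switches the classifier to scoring new samples
# instead of labeling the training set itself
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

# Two unseen observations: one typical, one extreme
X_new = np.array([[72.0], [350.0]])
preds = lof.predict(X_new)
print(preds)  # the second sample is flagged as a novelty (-1)
```

Note that with `novelty=True`, you call `predict` on new data rather than `fit_predict` on the training data.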

In short, anomaly detection is the parent field of both outlier and novelty detection. While outliers only refer to abnormal samples in the training data, novelties exist in the test set.

This distinction is essential for when we start using outlier classifiers in the next tutorial.

## Univariate vs. multivariate outliers

Univariate and multivariate outliers refer to outliers in different types of data.

As the name suggests, univariate outliers only exist in single distributions. An example is a very tall person in a dataset of height measurements.

Multivariate outliers are a bit tricky. They refer to outliers with two or more attributes, which, when looked at individually, don’t appear anomalous but only become outliers when all attributes are considered in unison.

An example multivariate outlier can be an old car with very low mileage. The attributes of this car may be normal when looked at individually, but when combined, you’ll realize that old cars usually have high mileage proportional to their age. (There are many old cars and many cars with low mileage, but there are few cars that are both old and have low mileage).
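A quick sketch with hypothetical car data shows the effect. Each column’s z-scores look unremarkable for the old low-mileage car, but the Mahalanobis distance, which accounts for the correlation between age and mileage, singles it out (the numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical fleet: (age in years, mileage in thousands of miles)
cars = np.array([
    [2, 25], [4, 50], [6, 70], [8, 95],
    [10, 120], [12, 145], [14, 170],
    [12, 15],  # old car with suspiciously low mileage
])

# Column-wise z-scores: the last car looks normal in each column alone
z = np.abs((cars - cars.mean(axis=0)) / cars.std(axis=0))
print(z[-1])  # both values well under the usual threshold of 3

# Mahalanobis distance accounts for the correlation between the columns
diff = cars - cars.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(cars, rowvar=False))
mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))
print(mahal.argmax())  # 7: the old low-mileage car stands out the most
```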

When choosing an algorithm to detect them, the distinction between types of outliers becomes important.

As univariate outliers exist in datasets with only one column, you can use simple and lightweight methods such as z-scores or modified z-scores.
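For instance, a simple z-score check flags the 2534 in our earlier array, using the common threshold of 3 standard deviations:

```python
import numpy as np

array = np.array([97, 87, 95, 62, 53, 66, 2534, 60, 68, 90, 52, 63, 65])

# z-score: how many standard deviations each point sits from the mean
z_scores = (array - array.mean()) / array.std()

# Flag points more than 3 standard deviations away
print(array[np.abs(z_scores) > 3])  # [2534]
```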

Multivariate outliers pose a more significant challenge since they may only surface across many dataset columns. For that reason, you must bring out the big guns: algorithms such as Isolation Forest, KNN, or Local Outlier Factor.
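As a rough sketch (simulated data, illustrative parameters), here is scikit-learn’s Local Outlier Factor catching the old-car-with-low-mileage pattern across two correlated columns:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Simulated fleet: mileage grows roughly linearly with age
age = rng.uniform(1, 15, 200)
mileage = 12_000 * age + rng.normal(0, 8_000, 200)
X = np.column_stack([age, mileage])

# Inject one multivariate outlier: an old car with very low mileage
X = np.vstack([X, [14.0, 5_000.0]])

# Scale first so both columns contribute to the distance computations
X_scaled = StandardScaler().fit_transform(X)

labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_scaled)
print(labels[-1])  # -1: the injected car is flagged as an outlier
```

Scaling matters here: without it, the mileage column would dominate the distances and the age column would barely count.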

In the coming tutorials, we’ll see how to use some of the above methods.

## Conclusion

There you go! You now know all the essential terminology and theory behind outlier detection, and the only thing left is applying them in practice using outlier classifiers.

In the next parts of the article, we will cover some of the most popular and robust outlier classifiers using the PyOD library. Stay tuned!
