“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” — Hawkins 1980
We can use a rule-based approach if anomalies can be accurately identified by several rules.
- Possible to filter normal points
- Cannot quantify the degree of anomalies
2.1 Inter Quartile Range (IQR)
Anomalies are the ones outside IQR (Q3-Q1)
q3, q1 = np.percentile(x, [75 ,25])
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
return lower, upper
Suppose the data is normally distributed.
where x_i is a data point, mu is the average of all x_i, sigma is the standard deviation of all x_i. After standardizing all data, the ones greater than the threshold (e.g. 2.5, 3.0, 3.5) are anomalies.
z_score = (x - np.mean(x)) / np.std(x)
2.3 Extreme-Value Analysis
The extreme values in a probability distribution are collectively referred to as the distribution tail. Various statistical methods quantify the probabilities in the tails of distributions. A very low probability value of a tail indicates that a data value inside it should be considered anomalous. A number of tail inequalities bound these probabilities in cases where the actual distribution is not available.
2.4 Minimum Covariance Determinant (MCD)/Gaussian Mixture (GM)
MCD: Assume inliers are generated from a multivariate Gaussian distribution. Fit a hypersphere (ellipsoid) that covers the normal data, data that falls outside the ellipsoid is an outlier. MCD tries to estimate the mean and covariance matrix in a way that minimizes the influence of anomalies.
GM: Inliers are generated from a mixture of Gaussian distribution.
# Source: https://machinelearningmastery.com/model-based-outlier-detection-and-removal-in-python/
# The scikit-learn library provides access to this method via the EllipticEnvelope classfrom sklearn.covariance import EllipticEnvelope# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train) # inliers = yhat != -1
3. Distance-based Methods
Calculate the average distance of each point to its K-nearest-neighbors, and compare the distance with a threshold. If it’s greater than the threshold, it is considered an outlier.
- Clustering algorithms are optimized to find clusters rather than outliers
- A set of many abnormal data objects that are similar to each other would be recognized as a cluster rather than as noise/outliers
- Cannot quantify the degree of anomalies of each data point
from pyod.models.knn import KNNknn = KNN(method='mean', n_neighbors=5)
y_train_pred = clf.labels_ # 0: inlier, 1: outlier
y_train_scores = clf.decision_scores
4.1 Principal Component Analysis for Anomaly Detection (Shyu et al. 2003)
Approach 1: Map the data to a low-dimensional feature space, and then look at the deviation of data. The eigenvectors from PCA reflect the different directions of variance of original data, and the eigenvalues are the magnitude of variance in the corresponding direction. The eigenvector corresponding to the largest eigenvalue is the direction with the largest data variance, which may indicate abnormal data points.
Approach 2: Map the data to a low-dimensional feature space, and then map back to the original space. Try to reconstruct the original data with low-dimensional features to see the size of the reconstruction error (X_reconstructed = X_low_dim * eigenvector). If a data point is not easily reconstructed, that indicates the feature of the data point is inconsistent with the feature of the overall data points, therefore an abnormal point. We can use reconstruction errors as outlier scores.
- PCA is generally more stable to the presence of a few outliers than the dependent variable analysis methods because PCA computes the errors with respect to the optimal hyperplane, rather than a particular variable.
- PCA can sometimes yield poor results when the scales of the different# dimensions are very different.
# Source: https://github.com/jeffprosise/Machine-Learning/blob/master/Anomaly%20Detection%20(PCA).ipynbfrom sklearn.decomposition import PCApca = PCA(n_components=1, random_state=0)def is_anomaly(data, pca, threshold):
pca_data = pca.transform(data)
restored_data = pca.inverse_transform(pca_data)
loss = np.sum((data - restored_data) ** 2)
return loss > threshold
4.2 One-Class Support Vector Machines (SVMs)
The idea is to find a hyperplane to separate data from the origin and maximizes the distance from the hyperplane to the origin.
We can derive One-Class SVM using SVDD (Support Vector Domain Description). For SVDD, we expect that all samples that are non-abnormalies as inliers. It also uses a hypersphere instead of a hyperplane to separate the data. The algorithm obtains a spherical boundary around the data in the feature space and then minimizes the volume of this hypersphere.
Pros: No assumptions on the data distribution
Cons: Sensitive to outliers
from sklearn import svm
clf = svm.OneClassSVM(gamma=0.1)
y_pred = clf.predict(X) # 1: inlier, -1: outlier
n_error_outlier = y_pred[y_pred == -1].size
5.1 Local Outlier Factor (Breuning et al. 2000)
Density-based algorithm — the algorithm compares the density of a sample to its neighboring points. Specifically, for each data point, find its k-nearest-neighbor and calculate the LOF score.
- K-distance(A)= distance between point A to its k-th nearest neighbor
- Reachability distance (RD) of point A from another point B, RD(A,B) = max(K-distance(B), distance(A,B))
- Local reachability density (LRD) of a point A is the inverse of the average RD of A from its neighbors
- Local outlier factor (LOF) is the ratio of the average LRD of the k-neighbors of A to the LRD of A
If LOF(A) is around 1, it means the local density of data point A is similar to that of its neighbors. If LOF(A) < 1, it means A is in a relatively dense area and not like an anomaly point. If LOF(A) > 1, it means that A is more distant from other points, and it is likely to be an outlier.
- Good when the density of the data is not consistent in the dataset
- Can quantify the degree of anomalies of each data point
- Not require assumptions on data distribution
- Accuracy may be affected in high dimension
- LOF score is a ratio and hence hard to interpret
- Need to calculate the distance between data points, resulting in a O(n²) time complexity (Later there are algorithms that improved the efficiency, e.g. FastLOF (Goldstein, 2012) first randomly divides the entire data into multiple subsets, and then calculates the LOF value in each subset. It removes the LOF score <1, and updates KNN and LOF score)
# Source: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.htmlfrom sklearn.neighbors import LocalOutlierFactorX = [[-1.1], [0.2], [101.1], [0.3]]
clf = LocalOutlierFactor(n_neighbors=2)
# array([ 1, 1, -1, 1])
# array([ -0.9821..., -1.0370..., -73.3697..., -0.9821...])
6.1 Isolation Forest (Liu et al. 2008)
Tree-based method. It calculates how many node splits are required for each data point to be separated into an independent space. Given outliers are far away from inliers, they can be easily separated into an independent space with few splits.
Because each split is random, therefore we need an ensemble method to get convergence (Monte Carlo method), repeatedly split from the beginning, then average the result. iForest consists of t iTree — each of which is a binary tree structure.
- Randomly choose X samples from the training data as subsamples, and put them into the root node of the tree
- Randomly specify a dimension and randomly generate a cut point p (between minimum and maximum of that dimension)
- A hyperplane is generated and divides the current space into two subspaces — samples on subspace A will go to node A’, and samples on subspace B will go to node B’.
- Repeat steps 2 and 3 until there is only one sample in the node or the leaf node has reached the limit depth.
Once we have t iTree, we can use it to evaluate the test data. For a data point x, let x traverse each iTree, see what depth is required for x to be separated into an independent space, then get the average depth.
Once we have the average depth for each data point, set a threshold. The ones with depth < threshold are anomalies.
- No assumptions about the data distribution
- Works with high-dimensional data
- Implementation can be long and requires higher computation power if not appropriately optimized
- Sensitive only to global anomalies
# Source: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
from sklearn.ensemble import IsolationForest
X = [[-1.1], [0.3], [0.5], ]clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.05, max_features=4, bootstrap=False, random_state=0).fit(X)clf.predict([[0.1], , ])
#array([ 1, 1, -1])
7. Graph-based Models
Utilize the account and device information collected from users, login, order placement, etc. we can build relationship networks between users and find frauds by
- identifying unusual network structures and communities within the network
- determining good/bad users through network relationships using e.g. label propagation, SIR model, etc.
We can create an autoencoder with an encoding part and a decoding part, scale the data, fit on normal data (no anomalies), get embeddings for normal data and fraud data, then visualize the embedding/train a classifier to classify embeddings.
# Source: https://www.zhihu.com/question/30508773?sort=createdinput_layer = Input(shape=(X.shape,))## encoding part
encoded = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='relu')(encoded)## decoding part
decoded = Dense(50, activation='tanh')(encoded)
decoded = Dense(100, activation='tanh')(decoded)## output layer
output_layer = Dense(X.shape, activation='relu')(decoded)autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="adadelta", loss="mse")x = data.drop(["Class"], axis=1)
y = data["Class"].values
x_scale = preprocessing.MinMaxScaler().fit_transform(x.values) x_norm, x_fraud = x_scale[y == 0], x_scale[y == 1]autoencoder.fit(x_norm[0:2000], x_norm[0:2000], batch_size = 256, epochs = 10, shuffle = True, validation_split = 0.20);hidden_representation = Sequential() hidden_representation.add(autoencoder.layers) hidden_representation.add(autoencoder.layers) hidden_representation.add(autoencoder.layers)norm_hid_rep = hidden_representation.predict(x_norm[:3000])
fraud_hid_rep = hidden_representation.predict(x_fraud)