
**Unsupervised learning** differs from supervised learning because there is no correct output in the training data. For example, in a cluster analysis, the goal is to separate the data into several disjoint groups so that samples in the same group are similar and samples from different groups are ‘different.’

For example:

**Customer clustering for better marketing**

Different types of customers may behave differently. Thus, different marketing plans/strategies can be applied to maximize the return. For example, in the display advertising space, customers are usually clustered into different types of intention/interest groups, and ads that are more likely to be related to their interest are served to maximize the ads’ campaign ROI.

**Outlier detection for pricing errors**

Some retail products’ price-demand elasticity may differ from that of the other 99%; in an automated pricing engine, we can use popular methods such as **Tukey’s IQR (interquartile range)** or **KDE (kernel density estimation)** to identify products whose prices were modeled incorrectly. For example, if the model suggests we should price a digital camera at a very low price due to some input data issue, outlier detection can help prevent this from happening for an online e-commerce store.

Yes, it’s possible to run into overfitting in an unsupervised learning setting.

For example: when you build a K-means cluster model and choose K equal to the number of samples, every point becomes its own cluster; the within-cluster distance is zero, but the model captures no structure at all.

There are 3 common methods that I have used for cluster analysis:

**1. K-means clustering**

We pick a number K and start with K randomly selected points. We assign each of the remaining data points to whichever of those K points it is closest to. We then recalculate the centroid of each of the K newly formed clusters and repeat the process until the total sum of squared distances from every point to its cluster centroid no longer improves.
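As a minimal NumPy sketch of this loop (the toy data, the choice K = 2, and the convergence check are illustrative assumptions, not part of the original text):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: assign points to the nearest centroid, then recompute."""
    rng = np.random.default_rng(seed)
    # Start with K randomly selected data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # no further improvement
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

In practice, a library implementation such as scikit-learn’s `KMeans` also handles multiple random restarts, which this sketch omits.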

**2. Hierarchical clustering**

There are two ways to perform a hierarchical clustering:

**a. Agglomerative** (bottom-up approach, more common): each data point starts as its own cluster; the two closest clusters are then merged, and the process is repeated until certain stopping criteria are reached.

**b. Divisive** (top-down approach, less common): all data points start in the same cluster; the point that is furthest away from the centroid is split off to form another cluster, and the process is repeated until certain stopping criteria are reached.
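A short sketch of the more common agglomerative (bottom-up) variant, assuming scikit-learn is available (the toy data and the linkage choice are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two well-separated toy blobs.
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])

# Bottom-up: each point starts as its own cluster; the closest pair of
# clusters is merged repeatedly until n_clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = agg.fit_predict(X)
```

Note that scikit-learn implements only the agglomerative variant, not the divisive one.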

**3. Spectral clustering**

Spectral clustering works much better than the other two methods when the data points are not linearly separable.

Rather than trying to separate the data points in their original input space, we represent them as nodes in a graph, and the problem is transformed into a graph partitioning problem.

The cut/separation of the data points is done in a new mapped space where they are separable.

A common way to find the ‘optimal’ number of clusters is to use the ‘Elbow method’ and try different values of K, for example, 3 to 10.

For each value of K, we calculate the total sum of squared distances from every point to its centroid.

At a certain value of K, the total squared distance will stop improving significantly, and that ‘elbow’ point is the number we can choose.
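A sketch of the Elbow method described above, assuming scikit-learn (the toy data with three true groups and the K range of 1 to 7 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data with three well-separated groups.
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 5, 10)])

# Total within-cluster sum of squared distances (inertia) for each K.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 8)
}
# The 'elbow' is where inertia stops improving sharply -- here around K = 3.
```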

The K-means algorithm starts with K randomly selected points and assigns each remaining data point to whichever of those K points it is closest to. It then recalculates the centroid of each of the K newly formed clusters and repeats the process.

When the total squared distance from every point to its centroid no longer improves, we can stop iterating.

There are two ways to perform a hierarchical clustering:

The **Agglomerative** (bottom-up) approach is more commonly used; here is how it works:

Each data point starts as its own cluster; the two closest clusters are then merged to form a bigger cluster, and the process is repeated until certain stopping criteria are reached.

The stopping criterion for hierarchical clustering is usually reaching a pre-determined number of clusters; the most common way to decide that number is the Elbow method, which gives us a heuristic value to use.

For any number beyond this heuristic, the amount of variance explained by the number of clusters will no longer improve significantly.

There are several methods to represent the distance between a point and a cluster.

- **Single linkage:** the distance between two clusters is represented by their two closest points.
- **Complete linkage:** the distance between two clusters is represented by their two most distant points.
- **Average linkage:** the distance between two clusters is represented by the average distance between every pair of data points from the two clusters.
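The three linkage criteria can be compared with SciPy’s hierarchical clustering routines (the toy data and parameters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated toy blobs.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Each criterion defines the distance between two clusters differently:
#   single   -> their two closest points
#   complete -> their two most distant points
#   average  -> the average over all cross-cluster pairs
results = {}
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)            # bottom-up merge tree
    results[method] = fcluster(Z, t=2, criterion="maxclust")
```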

We can use spectral clustering to cluster non-linearly separable data points.

The data points are treated as nodes in a graph, and the clustering problem is recast as a graph partitioning problem; the points are projected into another space in which they can be separated.

There are several steps involved in such a projection:

- We first create an adjacency matrix **A**, where entry (i, j) represents the edge between data points “i” and “j.” This is a symmetric matrix.
- We then create a degree matrix **D**, where only the diagonal of the matrix is non-zero; each diagonal entry represents the number of data points connected to that particular data point.
- We then compute the graph Laplacian as **L = D - A**.
- Finally, we compute the eigenvalues and eigenvectors of **L**, using the first few non-zero eigenvectors to project our data points into another space, where we can apply K-means after the projection to form our new clusters.

**Dimensionality reduction**, or **dimension reduction**, is the transformation of data from a high-dimensional space into a low-dimensional space, such that the low-dimensional representation retains the important information of the original data.

Working in high-dimensional spaces can be undesirable for many reasons; for example, raw data are often sparse as a consequence of the curse of dimensionality.

Principal component analysis (PCA) is a linear method that finds a new set of orthogonal directions such that, when we rotate our original data onto these linear projections, each successive direction captures the maximum remaining variance.

The newly formed variables (linear combinations of the original variables) also have nice properties; for example, they are all uncorrelated.

PCA is performed by computing the eigenvectors and eigenvalues of the covariance matrix (or the correlation matrix, if the original variables are on different scales).

The first few eigenvectors corresponding to the largest eigenvalues can then be used to project our data points to the new space (rotating).
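A minimal NumPy sketch of this procedure (the correlated toy data are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data: most of the variance lies along one direction.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)                    # center before the covariance

cov = np.cov(Xc, rowvar=False)             # covariance matrix
vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
order = np.argsort(vals)[::-1]             # largest eigenvalues first
vals, vecs = vals[order], vecs[:, order]

# Rotate (project) the data into the new space spanned by the eigenvectors;
# keeping only the first column(s) gives the reduced representation.
X_rot = Xc @ vecs
```

The rotated components are uncorrelated, and their variances equal the eigenvalues, which is exactly the property described above.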

There are different ways to detect anomalies. Some of the common methods include:

- Using key statistics such as the Interquartile Range (IQR), which is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile); any points outside of Q3 + 1.5*IQR or Q1 - 1.5*IQR are considered outliers. This method works really well for a single variable and can be implemented programmatically, e.g., checking the daily number of rows of a certain table in a data engineering ETL pipeline.
- We can also collect labels for outliers and non-outliers and treat this as a binary classification problem.
- In a semi-supervised learning fashion, we can use a model to predict outliers, having human experts manually check some of the labels and then retrain the model to optimize based on the feedback.
- Time series-based method: the idea is that if we build a good time series forecast model and, at some points, the predicted values are far away from the actual values, those points can be considered outliers. Facebook Research’s **Prophet** is an advanced open-source forecasting library that can be used for outlier detection based on this idea.
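As a sketch of the IQR method from the first bullet, applied to the ETL row-count example (the synthetic counts and the two injected bad days are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Daily row counts for a table, with two obviously bad days appended.
counts = np.append(rng.normal(10_000, 300, 90), [2_000.0, 25_000.0])

q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1                               # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = counts[(counts < lower) | (counts > upper)]
```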
