In data analysis and machine learning, anomalies are the outliers that can hide game-changing insights or raise red flags about data quality. These data points that stand out from the crowd often require a keen eye and a reliable tool to be detected accurately. Enter the Z-score, a simple but powerful statistical instrument available to data analysts and machine learning practitioners alike.

In this blog post, we take a journey into the world of anomaly detection, guided by Z-scores. We will demystify the concept behind Z-scores, illustrate their real-world applications, and give you a clear understanding of their pivotal role in data analysis and machine learning.

*Z-score is like a special number that tells us how far away something is from the usual or average thing. It’s like saying if you did something really well, just okay, or not so great compared to what most people do.*

Let’s take an example that can help us understand this more clearly:

*Imagine in your class, if everyone usually gets about 70 on a test, and you got a 90, your Z-score would be high because you did way better than most people. But if you got a 50, your Z-score would be low because you did worse than most people. It helps us understand how different or similar something is compared to what’s typical.*

**Mathematical Formula**

Let’s understand this using the formula:

z = (X - μ) / σ

Here’s what each component means:

- Z is the Z-score we’re calculating.
- X is the specific data point we want to evaluate.
- μ (mu) is the mean (average) of the dataset.
- σ (sigma) is the standard deviation, which measures how spread out the data is.

**Example:** Suppose you have data on the test scores of a group of students. The mean (average) score in the class is 60, and the standard deviation is 8. You want to find the Z-score for a student who scored 70 on the test.

Let’s put the values into the Z-score formula:

z = (X - μ) / σ = (individual score - average score) / standard deviation

= (70 - 60) / 8 = 10/8 = 1.25

So, the Z-score for the student who scored 70 on the test is 1.25.

*What does the Z-score we found in the example above mean?*

*A Z-score of 1.25 means that the student’s score is 1.25 standard deviations above the class average. This suggests that the student performed better than most of their classmates, as their score is higher than the average by 1.25 standard deviations.*
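The arithmetic above is easy to check in a couple of lines of Python (a minimal sketch using the numbers from the example):

```python
# Numbers from the worked example: class mean 60, standard deviation 8,
# and a student who scored 70
mean_score = 60
std_score = 8
student_score = 70

# z = (X - mean) / std
z = (student_score - mean_score) / std_score
print(z)  # 1.25
```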

*So how large does a Z-score have to be before we call a data point an anomaly? That is where the threshold comes in. A threshold value is a predetermined limit or cutoff point that helps determine what is considered an anomaly or outlier within a dataset. It’s the point at which a Z-score is considered significant enough to label a data point as unusual or different from the rest.*

Typically, there are two common threshold values used when working with Z-scores:

- **Z-score greater than 2 (or less than -2):** This threshold suggests that data points with Z-scores greater than 2 or less than -2 are considered unusual or outliers. In other words, they are significantly different from the mean (average) of the dataset. This threshold is often used in practice for detecting moderate outliers.
- **Z-score greater than 3 (or less than -3):** Using a threshold of Z-scores greater than 3 or less than -3 is a stricter criterion for identifying outliers. Data points that exceed this threshold are considered highly unusual and are typically reserved for identifying extreme outliers.

The choice of threshold depends on the specific needs of your analysis and the level of sensitivity you want in detecting outliers or unusual data points. Using a higher threshold like 3 (or -3) will identify fewer data points as outliers, while a lower threshold like 2 (or -2) will flag more data points as potentially unusual.

In practice, it’s typically a good idea to experiment with different threshold values and assess the impact on your analysis to find the most suitable threshold for your particular dataset and objectives. The choice should be guided by the level of scrutiny required for identifying anomalies in your specific context.
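To see the two thresholds side by side, here is a small sketch on made-up data (the toy values are my own, chosen so that one point is a moderate outlier and one is an extreme outlier):

```python
import numpy as np

# Toy data: 20 typical values plus one moderate (64) and one extreme (75) outlier
# (illustrative values only; not from any real dataset)
values = np.array([50.0] * 20 + [64.0, 75.0])

# Standardize: subtract the mean, divide by the standard deviation
z = (values - values.mean()) / values.std()

moderate = np.abs(z) > 2  # flags moderate AND extreme outliers
extreme = np.abs(z) > 3   # stricter: flags only the extreme outlier

print(moderate.sum(), extreme.sum())  # 2 1
```

As expected, the looser threshold of 2 flags both unusual points, while the stricter threshold of 3 flags only the most extreme one.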

Now that we have covered the basics of the Z-score, it’s time to put it into practice to detect anomalies (outliers).

Now, let’s outline a practical approach to detecting anomalies using Z-scores:

## Step 1: Data Preparation

Begin by gathering and preprocessing your data. Ensure that your dataset is cleaned, missing values are handled, and any necessary feature engineering is performed.

## Step 2: Calculate Z-Scores

For each data point in your dataset, calculate its Z-score using the formula mentioned earlier. This step standardizes the data, making it suitable for comparison and anomaly detection.

## Step 3: Set a Threshold

Determine an appropriate threshold for anomaly detection. Commonly used thresholds are Z-scores greater than 2 or 3, which correspond to data points that are two or three standard deviations away from the mean. Adjust the threshold based on the specific requirements of your application.

## Step 4: Identify Anomalies

Any data point with a Z-score greater than the chosen threshold is considered an anomaly. These data points are significantly different from the rest of the dataset and warrant further investigation.

## Step 5: Visualize Anomalies

To gain a better understanding of the anomalies detected, visualize them in your dataset. Create plots or charts that highlight the anomalous data points, making it easier to interpret the results.
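The five steps above can be condensed into one small helper function. This is a sketch of my own (the name `zscore_outliers` and the toy scores are illustrative, not part of any library):

```python
import numpy as np

def zscore_outliers(values, threshold=2.0):
    """Return a boolean mask marking values whose |Z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # Step 2: standardize
    return np.abs(z) > threshold                 # Steps 3-4: apply threshold

# Toy scores: seven typical values and one far-away point
scores = [50, 51, 49, 52, 48, 50, 51, 90]
mask = zscore_outliers(scores, threshold=2.0)
print(mask)  # only the last value (90) is flagged
```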

Let’s work with the Diabetes dataset from Scikit-learn. This dataset consists of ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) and a quantitative measure of disease progression one year after baseline.

We will calculate Z-scores for some features and visualize the data. Here’s how you can do it:

First, we import the libraries we need:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
```

Next, we load the dataset, select a few features (age, bmi, bp), and look at the first 5 records:

```python
# Load the Diabetes dataset
diabetes = load_diabetes()
data = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
data['TARGET'] = diabetes.target

# Select a few features for Z-score calculation and visualization
selected_features = ['age', 'bmi', 'bp']

data.head(5)
```

Next, we set the threshold value to 2, compute the mean and standard deviation of each selected feature, and calculate the Z-scores using the formula from earlier:

```python
# Set a threshold value for the Z-score
# (an absolute value greater than this is considered an outlier)
zscore_threshold = 2

# Calculate mean and standard deviation for the selected features
means = data[selected_features].mean()
stds = data[selected_features].std()

# Calculate Z-scores for the selected features
z_scores = (data[selected_features] - means) / stds
```
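As an aside, if SciPy is available, `scipy.stats.zscore` computes the same standardization in one call; passing `ddof=1` makes it match the sample standard deviation that pandas’ `.std()` uses above (the small `bmi` column here is made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative values, not from the Diabetes dataset
df = pd.DataFrame({'bmi': [21.0, 23.5, 22.0, 24.0, 35.0]})

# Manual calculation, as in the snippet above (pandas .std() defaults to ddof=1)
manual = (df['bmi'] - df['bmi'].mean()) / df['bmi'].std()

# SciPy equivalent; ddof=1 makes the two results agree
scipy_z = stats.zscore(df['bmi'], ddof=1)

print(np.allclose(manual, scipy_z))  # True
```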

Now we detect the outliers and plot the dataset both with and without them:

```python
# Detect outliers based on the threshold
outliers = (np.abs(z_scores) > zscore_threshold).any(axis=1)

# Filter outliers from the dataset
filtered_data = data[~outliers]

# Visualize the original dataset
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(data['bmi'], data['TARGET'], color='blue', label='Original Data')
plt.xlabel('Body Mass Index (bmi)')
plt.ylabel('Disease Progression')
plt.title('Original Diabetes Dataset')

# Visualize the filtered dataset without outliers
plt.subplot(1, 2, 2)
plt.scatter(filtered_data['bmi'], filtered_data['TARGET'], color='green', label='Filtered Data')
plt.xlabel('Body Mass Index (bmi)')
plt.ylabel('Disease Progression')
plt.title('Diabetes Dataset without Outliers (Threshold = {})'.format(zscore_threshold))

plt.tight_layout()
plt.show()
```

Let’s analyze the graphs and discuss what conclusions we can draw and what actions can be taken regarding the outliers or anomalies detected in the Diabetes dataset.

**Graph 1: Original Diabetes Dataset**

*In the first graph, we plotted the relationship between the Body Mass Index (BMI) and Disease Progression. Here’s what we can observe from the original dataset:*

- **Spread of Data:** The data points are scattered across the graph, indicating a diverse range of disease progression values for various BMI values.
- **Outliers:** There seem to be a few outliers where disease progression is notably higher for specific BMI values. These outliers are the points that fall far away from the main cluster of data points.

**Graph 2: Filtered Diabetes Dataset (After Removing Outliers)**

*In the second graph, we visualized the dataset after removing outliers based on Z-scores. Here’s what we can conclude:*

- **Reduced Outliers:** The number of outliers has been reduced in the filtered dataset. Outliers that were far away from the main cluster have been removed, leading to a cleaner and more focused dataset.
- **Impact on Analysis:** Removing outliers can help in creating a more accurate model. Outliers can disproportionately influence statistical measures, and removing them can lead to a more representative analysis of the majority of the data points.

*We can conclude that:*

- **Data Quality:** Outliers can often be caused by errors in data collection or measurement. It’s essential to assess the quality of the data source and, if possible, correct or remove the erroneous data points.
- **Model Accuracy:** Outliers can significantly impact the accuracy of machine learning models. Models trained on data with outliers might not perform well when applied to new, unseen data. Removing outliers can enhance model accuracy and generalizability.
- **Further Analysis:** After removing outliers, it’s important to reanalyze the dataset to draw meaningful insights. The relationships between variables might change, and new patterns might emerge in the absence of outliers.

**What actions can we take on these outliers?**

*Let’s look at the options for handling outliers:*

- **Understanding the Cause:** Investigate the cause of the outliers. Are they genuine data points, or are they errors? Understanding the cause helps in deciding the appropriate action.
- **Imputation:** If outliers are genuine but extreme, consider imputing their values with more typical values from the dataset instead of removing them completely. This maintains data integrity while reducing the impact of extreme values.
- **Domain Expert Consultation:** Consult domain experts to determine the relevance of outliers. In some cases, outliers might carry crucial information about rare events or unique scenarios that are valuable for analysis.
- **Robust Models:** Use robust machine learning models that are less sensitive to outliers. Algorithms like Random Forests or robust regression techniques can handle outliers better than some other models.

**Thank you for reading this blog and supporting me.**

*You can connect with me, I’m attaching my social media links below:*

*https://www.linkedin.com/in/akash-srivastava-1595811b4/*