Hello, everyone! Let’s apply machine learning to customer segmentation on the Online Retail dataset. The segments in this walkthrough come from K-Means clustering; *K-Nearest Neighbors* (KNN) is a supervised classifier that needs labels, so it can only come into play once the segments exist. We will describe each segment’s characteristics using RFM analysis. Before we get to the analysis, let’s see **why it is important to segment customers**.

Customer segmentation describes the **process of identifying groups or segments of a company’s customers who have similar characteristics or factors.**

The goal of this segmentation is **to optimize marketing for each segment**. Customer segmentation is vital for:

- **Optimizing marketing strategies**
- **Maximizing customer value** to the business
- **Improving customer satisfaction and experience**

Grouping customers and prospects into segments with similar characteristics helps a business identify its target customer base. **That way, its marketing strategies can be effective and appropriate** (relevant, efficient, and not offensive). Knowing your customer segments **not only saves time and money but also increases the benefits.**

Now that you understand why it’s important to segment your customers, let’s get started. This project uses the Online Retail dataset from Kaggle.

Let’s explore the dataset before we dive into the analysis. First, import the libraries we need.

    import math
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy as shc
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from imblearn.over_sampling import SMOTE

After that, read the dataset, which can be downloaded from the Online Retail dataset page on Kaggle.

    df = pd.read_csv("online-retail/Online_Retail.csv", encoding='windows-1254')
    df

The data has 8 columns (features): InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. Before we use this dataset, let’s check for null/missing values.

`df.isna().sum()`

Because we want to group customers, rows with a null CustomerID should be deleted.

`df.dropna(inplace=True)`
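Note that `dropna()` with no arguments drops a row if any column is missing, not only `CustomerID`. If the intent is to filter on `CustomerID` alone, a minimal sketch could look like the following. It runs on a small made-up frame (the `toy` name and its values are assumptions for illustration), not on the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the retail file (values are made up for illustration).
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["LANTERN", None, "MUG", "CLOCK"],
    "CustomerID": [17850.0, 13047.0, np.nan, 12583.0],
})

# Drop a row if ANY column is missing (the default behaviour of dropna()).
dropped_any = toy.dropna()

# Drop a row only when CustomerID is missing.
dropped_id_only = toy.dropna(subset=["CustomerID"])

print(len(toy), len(dropped_any), len(dropped_id_only))
```

Which variant is appropriate depends on whether rows with a missing Description but a valid CustomerID should still count toward that customer's totals.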

After dropping these rows, the row count decreases from 541909 to 401604. Next, let’s check whether the numeric columns contain outliers. If they do, we will replace each outlier with its column’s mean.

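    def detect_outlier(data):
        """Return the values whose absolute z-score exceeds the threshold."""
        outliers = []
        threshold = 3
        mean = np.mean(data)
        std = np.std(data)
        for y in data:
            z_score = (y - mean) / std
            if np.abs(z_score) > threshold:
                outliers.append(y)
        return outliers

    # num_cols is not defined in the original post; assumed here to be the numeric columns
    num_cols = df.select_dtypes(include='number').columns

    for item in num_cols:
        if item != 'CustomerID':
            mean = np.mean(df[item])
            print(f'Mean of {item}: {mean}')
            outliers = detect_outlier(df[item])
            # Assign the result back instead of using inplace=True on a column selection
            df[item] = df[item].replace(outliers, mean)
            print(outliers)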

The code above flags a data point as an outlier when its z-score is greater than 3 in absolute value, meaning it lies more than three standard deviations from the mean. For a normal distribution, approximately:

- 68% of the data points lie within ±1 standard deviation of the mean
- 95% of the data points lie within ±2 standard deviations
- 99.7% of the data points lie within ±3 standard deviations
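The same z-score rule can also be written without an explicit Python loop. The sketch below is one way to do it with pandas on a made-up series (the `quantity` values are assumptions chosen so that exactly one point is extreme). As in the loop version, the replacement mean is computed before the outlier is removed.

```python
import pandas as pd

# Made-up quantities: twenty ordinary values and one extreme one.
quantity = pd.Series([10.0] * 20 + [1000.0])

# z-score of every value at once (population standard deviation, like np.std).
z_scores = (quantity - quantity.mean()) / quantity.std(ddof=0)

# Boolean mask of outliers under the same threshold of 3.
is_outlier = z_scores.abs() > 3

# Replace flagged values with the column mean, leaving the rest untouched.
cleaned = quantity.mask(is_outlier, quantity.mean())

print(int(is_outlier.sum()))
print(cleaned.tail(3))
```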

Before we jump to modeling, let’s segment our retail customers with RFM analysis and see how it works.

RFM stands for *Recency, Frequency, and Monetary*, three key customer traits. These metrics describe customer behavior: frequency and monetary value affect a customer’s lifetime value, and recency affects retention.

RFM analysis is therefore a marketing technique used to quantitatively rank and group customers based on the recency, frequency, and monetary total of their recent transactions, in order to identify the best customers and run targeted marketing campaigns. All three of these measures have proven to be effective predictors of a customer’s willingness to engage with marketing messages and offers.

Conducting an RFM analysis on our customer base and sending personalized campaigns to high-value targets has several benefits for our eCommerce store:

- **Personalization**: By creating effective customer segments, you can create relevant, personalized offers.
- **Improve conversion rates**: Personalized offers will yield higher conversion rates because our customers are engaging with products they care about.
- **Improve unit economics**
- **Increase revenue and profits**

Let’s transform the data into recency, frequency, and monetary columns!

    # Recency: days between the latest invoice in the data and each customer's latest invoice
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    df['Diff'] = df['InvoiceDate'].max() - df['InvoiceDate']
    recency = df.groupby('CustomerID')['Diff'].min()
    recency = recency.dt.days
    recency = recency.reset_index()

    # Frequency: number of invoice rows per customer
    frequency = df.groupby('CustomerID')['InvoiceNo'].count()
    frequency = frequency.reset_index()
    frequency

    # Monetary: total amount spent per customer
    df['Amount'] = df['Quantity'] * df['UnitPrice']
    monetary = df.groupby('CustomerID')['Amount'].sum()
    monetary = monetary.reset_index()
    monetary

After we calculate recency, frequency, and monetary, we can merge them into one dataframe.

    rfm = pd.merge(recency, frequency, on='CustomerID')
    rfm = pd.merge(rfm, monetary, on='CustomerID')
    rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
    rfm
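As an alternative, the three group-bys and two merges could be collapsed into a single named aggregation. This is a sketch on made-up transactions, not the author's code. The `toy` frame, its values, and the `snapshot` variable are assumptions, and `Frequency` counts rows per customer just as `count()` does in the code above.

```python
import pandas as pd

# Made-up transactions for three customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-09", "2011-11-01",
        "2011-11-01", "2011-11-20", "2011-06-15",
    ]),
    "Quantity": [2, 1, 5, 3, 1, 10],
    "UnitPrice": [2.5, 10.0, 1.0, 2.0, 4.0, 0.5],
})

toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]
snapshot = toy["InvoiceDate"].max()  # reference date: latest invoice in the data

rfm_toy = (
    toy.groupby("CustomerID")
    .agg(
        # Days between the snapshot and the customer's most recent invoice.
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        Frequency=("InvoiceNo", "count"),
        Monetary=("Amount", "sum"),
    )
    .reset_index()
)
print(rfm_toy)
```

If Frequency should count distinct invoices rather than line items, `"nunique"` could be used in place of `"count"`.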

I used a z-score threshold of 3.0 because the values that remain under that threshold still look reasonable for retail transactions and can be used for modeling. The three RFM features are on very different scales, though, so we standardize them before clustering.

Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, so that the distribution has a mean of zero and a standard deviation of one. The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the feature and s is its standard deviation.

    from sklearn.preprocessing import StandardScaler

    rfm_scaled = rfm[['Recency', 'Frequency', 'Monetary']]
    rfm_scaled = StandardScaler().fit_transform(rfm_scaled)
    rfm_scaled = pd.DataFrame(rfm_scaled)
    rfm_scaled.columns = ['Recency', 'Frequency', 'Monetary']
    rfm_scaled
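To see that `StandardScaler` really applies z = (x - u) / s per column, the result can be compared with a hand-computed version. The sketch below uses a small made-up RFM table (the `rfm_toy` values are assumptions); the one detail worth noting is that the scaler uses the population standard deviation, which corresponds to `ddof=0` in pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RFM values for four customers.
rfm_toy = pd.DataFrame({
    "Recency": [0, 19, 177, 40],
    "Frequency": [2, 3, 1, 10],
    "Monetary": [15.0, 15.0, 5.0, 250.0],
})

scaled = StandardScaler().fit_transform(rfm_toy)

# Same computation by hand: z = (x - u) / s with the population standard deviation.
manual = (rfm_toy - rfm_toy.mean()) / rfm_toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```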

After we calculate the RFM values, we group the customers into clusters. To choose the number of clusters, I use the elbow method: plot the inertia for a range of k values and pick the *k* where the curve bends (the elbow), after which the inertia decreases only slowly.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    distortions = []
    range_n_clusters = range(1, 10)
    for num_cluster in range_n_clusters:
        kmeans = KMeans(n_clusters=num_cluster)
        kmeans.fit(rfm_scaled)
        distortions.append(kmeans.inertia_)

    plt.figure(figsize=(16, 8))
    plt.plot(range_n_clusters, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('The Elbow Method showing the optimal k')
    plt.show()
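The code above imports `silhouette_score` but does not use it. As a complementary check on the elbow plot, one could compare silhouette scores across k, where a higher score indicates tighter, better-separated clusters. The sketch below does this on synthetic blobs rather than on the RFM table; the `make_blobs` data, the chosen centers, and the `random_state` and `n_init` values are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score needs at least two clusters, so k starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real RFM data the scores are usually less clear-cut than on blobs, so this works best as a second opinion next to the elbow plot rather than a replacement for it.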

The graph below shows how the RFM values relate to each class (segment). For example, customers in class 2 are not the most recent visitors, but they have the highest total purchase amount and buy often, so we can call them loyal customers.

The 3D visualization shows the distribution of each feature for K-Means clustering with n_clusters = 3. The number of customers in each class is:

- Class 0: 3264
- Class 1: 1096
- Class 2: 12
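The plotting code that follows reads from a `result_df` with a `Cluster` column, but the post does not show how it is built. One plausible way is to fit K-Means with three clusters on the scaled features and attach the labels to the unscaled values. The sketch below does this on a small made-up RFM table; the data and the `n_init` and `random_state` values are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up RFM table with three visibly different groups of customers.
rfm = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Recency": [5, 8, 3, 300, 320, 310, 20, 15, 25],
    "Frequency": [50, 55, 60, 2, 1, 3, 900, 950, 1000],
    "Monetary": [500.0, 550.0, 600.0, 20.0, 10.0, 30.0,
                 90000.0, 95000.0, 99000.0],
})

features = ["Recency", "Frequency", "Monetary"]
rfm_scaled = StandardScaler().fit_transform(rfm[features])

# Cluster on the scaled features so no single column dominates the distance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(rfm_scaled)

# Keep the original (unscaled) values next to the labels for plotting.
result_df = rfm.copy()
result_df["Cluster"] = labels

print(result_df["Cluster"].value_counts().sort_index())
```

The cluster numbers that K-Means assigns are arbitrary, so which group ends up as class 0, 1, or 2 can change between runs unless `random_state` is fixed.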

    fig = plt.figure(figsize=(20, 20))
    ax = plt.axes(projection='3d')

    # One scatter call per cluster so each gets its own color and legend entry.
    # cmap is dropped because it has no effect unless per-point colors (c=...) are passed.
    for cluster in result_df['Cluster'].unique():
        result_cluster = result_df.loc[result_df['Cluster'] == cluster]
        ax.scatter(result_cluster['Recency'], result_cluster['Frequency'],
                   result_cluster['Monetary'], label=cluster, s=100, alpha=0.5)

    ax.set_xlabel('Recency', fontsize=18)
    ax.set_ylabel('Frequency', fontsize=18)
    ax.set_zlabel('Monetary', fontsize=18)
    ax.legend()
    plt.show()

Based on the result, we have three customer segments in our online retail data. Almost 75% of customers fall into class 0, called **[Slipping: Once Loyal, Now Gone]**. This segment has **high potential value** because most of our customers are in it. With the right strategies, we can move these customers into the **[Requires Activation]** segment and, beyond that, into the **[Whales: Your Most Loyal and Highest Paying Customers]** segment.

We can analyze each of these segments in more detail based on the chart above.

0 [Slipping: Once Loyal, Now Gone]: Great past customers who haven’t bought in a while.

**Marketing Strategies:** Customers leave for a variety of reasons, but this segment has high potential. We can reach out with promotions, share content about everyday needs, and approach customers directly through social media, email, and so on. Depending on your situation, use price deals, new product launches, or other retention strategies. Engaging this segment is a must for higher revenue, since most of our customers are in it.

1 [Requires Activation / Rookies: Your Newest Customers]:

**Marketing Strategies:** This is the segment with the highest recency, meaning that on average these customers last visited almost 1 year ago. That makes this segment hard to reach. Still, we can try introducing our core business, and having clear strategies in place for first-time buyers, such as triggered welcome emails, will pay dividends.

2 [Whales: Your Most Loyal and Highest Paying Customers]:

**Marketing Strategies:** These customers have demonstrated a high willingness to pay. They are our premium customers: not the most recent visitors, but the ones with the highest total purchase amount. Consider premium offers, subscription tiers, luxury products, or value-add cross-sells and up-sells to increase average order value (AOV). Don’t waste margin on discounts.
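The segment descriptions above rest on how recency, frequency, and monetary value differ between clusters. A quick way to inspect that is a per-cluster summary table. The sketch below uses a tiny made-up `result_df`; its values and cluster labels are assumptions and do not reproduce the post's results.

```python
import pandas as pd

# Made-up clustered RFM values for six customers.
result_df = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 2],
    "Recency": [30, 45, 60, 300, 340, 20],
    "Frequency": [40, 60, 50, 5, 3, 900],
    "Monetary": [800.0, 1200.0, 1000.0, 100.0, 60.0, 90000.0],
})

# Size and average RFM values of each cluster.
profile = result_df.groupby("Cluster").agg(
    Customers=("Recency", "size"),
    AvgRecency=("Recency", "mean"),
    AvgFrequency=("Frequency", "mean"),
    AvgMonetary=("Monetary", "mean"),
)

# Share of all customers that falls into each cluster, in percent.
profile["Share"] = (profile["Customers"] / len(result_df) * 100).round(1)

print(profile)
```

Reading the rows side by side makes it easier to justify a label such as "highest recency" or "highest total purchase" for a given class.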

**Conclusion**

The same RFM-based approach can be extended and applied to real-time data, which would be very helpful for marketing and commerce companies. Because of limited time and data, I could only do a basic segmentation here.
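As one possible extension, the imports at the top of this post (`KNeighborsClassifier`, `train_test_split`, `classification_report`) suggest training a KNN classifier on the K-Means labels so that new customers can be assigned to an existing segment. The post does not show this step, so the following is only a sketch on synthetic data; the blob centers, `n_neighbors=5`, the split ratio, and the `random_state` values are all assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Recency/Frequency/Monetary features.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0], [10, 10, 10], [-10, 10, 0]],
    cluster_std=1.0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Step 1: K-Means provides the segment labels (unsupervised).
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 2: KNN learns to reproduce those labels (supervised).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, segments, test_size=0.2, stratify=segments, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(classification_report(y_test, knn.predict(X_test)))
```

With real customers, `knn.predict` would be called on the scaled RFM values of a new customer to look up the segment of its nearest neighbors.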
