
## Customer Segmentation Using the K-Means Clustering Algorithm

### Business Context

Businesses are always in the process of devising methods to segment their customers. **The segmentation process ensures that the business can create customer-specific strategies and products or services that suit their needs.** This is a MUST-DO activity before running any online marketing campaign.

Customer segmentation is a **popular application of unsupervised learning**. Clustering allows a company to define and place its customers in different groups, categorized on the basis of **region, gender, age, preferences,** and so on.

### Problem Statement

Assume you work for an organization that sells a product, and you want to know how well that product is selling.

You have data that you can analyze, but what kind of analysis should you do? One option is to segment customers based on their buying behavior.

Keep in mind that the data is huge and cannot be analyzed by eye; we need machine learning algorithms and computing power. This article will show you how to cluster customers into segments based on their behavior, using the K-Means algorithm in Python.

I hope this article helps you carry out customer segmentation step by step, from preparing the data to clustering it.

Before we get into the process, here is a brief overview of the steps we will take:

1. **Gather the data**
2. **Create a Recency, Frequency, Monetary (RFM) table**
3. **Manage skewness and scale each variable**
4. **Explore the data**
5. **Cluster the data**
6. **Interpret the result**

First, we will gather the data. For this case, we will use the Online Retail dataset from the UCI Machine Learning Repository.

The dataset is transactional data containing transactions from December 1st, 2010 until December 9th, 2011 for a UK-based online retailer.

Each row represents a transaction. It includes the product name, quantity, price, and other columns representing IDs.

You can access the dataset here.

Here is the size of the dataset.

`(541909, 8)`

For this case, we won't use all of the rows. Instead, we will sample 10,000 rows from the dataset and treat them as the complete set of transactions made by the customers.

The code will look like this,

```python
# Import the libraries
# ! pip install xlrd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Import the dataset
df = pd.read_excel('dataset.xlsx')
df = df[df['CustomerID'].notna()]

# Sample the dataset
df_fix = df.sample(10000, random_state=42)
```

Here is a glimpse of the dataset:

After sampling the data, we will make it easier to analyze.

To segment customers, there are several metrics we can use, such as when a customer last bought the product, how frequently the customer buys the product, and how much the customer pays for the product. We will call this RFM segmentation.

To make the RFM table, we create three columns: Recency, Frequency, and Monetary Value.

To get the number of days for the recency column, we subtract each customer's last transaction date from the snapshot date.

To create the frequency column, we count the number of transactions made by each customer.

Lastly, to create the monetary value column, we can sum all transactions for each customer.

The code looks like this,

```python
import datetime

# Convert InvoiceDate to show the date only
df_fix["InvoiceDate"] = df_fix["InvoiceDate"].dt.date

# Create the TotalSum column
df_fix["TotalSum"] = df_fix["Quantity"] * df_fix["UnitPrice"]

# Create a snapshot date to measure recency against
snapshot_date = max(df_fix.InvoiceDate) + datetime.timedelta(days=1)

# Aggregate the data by each customer
customers = df_fix.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'})

# Rename the columns
customers.rename(columns={'InvoiceDate': 'Recency',
                          'InvoiceNo': 'Frequency',
                          'TotalSum': 'MonetaryValue'}, inplace=True)
```

Here is a glimpse of the dataset:

Right now, the dataset consists of the Recency, Frequency, and Monetary Value columns, but we cannot use it yet; we have to preprocess the data further.

The data should meet the assumptions that the variables are not skewed and that they have the same mean and variance.

Because of that, we have to manage the skewness of the variables first.

Here are the visualizations of each variable,

As we can see from above, we have to transform the data so it takes a more symmetrical form.

There are several methods we can use to manage the skewness:

- **log transformation**
- **square root transformation**
- **box-cox transformation**

Note: We can use these transformations if and only if the variable has only positive values.

Below are the visualizations of each variable with and without transformations. From the top left, clockwise, each panel shows the variable without transformation, with log transformation, with square root transformation, and with box-cox transformation.

Based on that visualization, the box-cox transformed variables show a more symmetrical form than the other transformations.

To make sure, we calculate the skewness of each variable using the skew function. The result looks like this:

| variable | without | log | sqrt | box-cox |
| --- | --- | --- | --- | --- |
| Recency | 14.77 | 0.85 | 3.67 | 0.16 |
| Frequency | 0.93 | -0.72 | 0.32 | -0.1 |
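Skew values like these can be computed with `scipy.stats.skew`. Here is a hedged sketch on a hypothetical right-skewed, strictly positive variable (the numbers will of course differ from the table above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=3, sigma=1, size=1000)  # hypothetical right-skewed, positive data

print(round(float(stats.skew(x)), 2))                   # without transformation
print(round(float(stats.skew(np.log(x))), 2))           # log transformation
print(round(float(stats.skew(np.sqrt(x))), 2))          # square root transformation
print(round(float(stats.skew(stats.boxcox(x)[0])), 2))  # box-cox transformation
```

The box-cox transformation searches for the power that best symmetrizes the data, which is why it usually gives the skewness closest to 0.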

Here is how to interpret the skewness value: if the value is close to 0, the variable tends to have a symmetrical form; otherwise, the variable is skewed. Based on that calculation, we will use the box-cox transformed variables, except for the Monetary Value variable, because it includes negative values and box-cox requires strictly positive inputs. To handle this variable, we can apply a cubic root transformation instead, so the comparison looks like this,

By using this transformation, the data becomes less skewed: the skewness value declines from 16.63 to 1.16. Therefore, we can transform the RFM table with this code,

```python
from scipy import stats

customers_fix = pd.DataFrame()
customers_fix["Recency"] = stats.boxcox(customers['Recency'])[0]
customers_fix["Frequency"] = stats.boxcox(customers['Frequency'])[0]
customers_fix["MonetaryValue"] = pd.Series(np.cbrt(customers['MonetaryValue'])).values
customers_fix.tail()
```

It will look like this,

Can we use the data right now? Not yet. If we look at the plot once more, the variables don't have the same mean and variance, so we have to normalize them. To normalize, we can use the StandardScaler object from the scikit-learn library. The code will look like this,

```python
# Import the library
from sklearn.preprocessing import StandardScaler

# Initialize the object
scaler = StandardScaler()

# Fit and transform the data
scaler.fit(customers_fix)
customers_normalized = scaler.transform(customers_fix)

# Assert that it has mean 0 and variance 1
print(customers_normalized.mean(axis=0).round(2))  # [0. -0. 0.]
print(customers_normalized.std(axis=0).round(2))   # [1. 1. 1.]
```

The data will look like this,

Finally, we can do clustering using that data.

### Solution Developed

Now that we have preprocessed the data, we can focus on modelling. To segment the data, we can use the K-Means algorithm.

K-Means is an unsupervised learning algorithm that uses geometric principles to determine which cluster each data point belongs to. Given a set of centroids, we calculate each point's distance to every centroid, and the point is assigned to the centroid it is closest to. The centroids are then recomputed, and the process repeats until the total distance no longer changes significantly.
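The loop described above can be sketched from scratch. This is only an illustrative implementation under simple assumptions (random initialization, Euclidean distance), not the scikit-learn version used in this article:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """A bare-bones K-Means: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Distance from every point to every centroid, then nearest-centroid assignment
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop once centroids stabilize
            break
        centroids = new_centroids
    return labels, centroids
```

scikit-learn's `KMeans` adds smarter initialization (k-means++) and multiple restarts on top of this basic loop, which is why we use it in practice.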

Implementing K-Means in Python is easy: we can use the KMeans class from scikit-learn.

To make our clustering reach its maximum performance, we have to determine which hyperparameter fits the data. Here, the key hyperparameter is the number of clusters, and we can use the elbow method to decide it. The code will look like this,

```python
from sklearn.cluster import KMeans
import seaborn as sns

sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(customers_normalized)
    sse[k] = kmeans.inertia_  # SSE to the closest cluster centroid

plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
```

Here is the result,

How do we interpret the plot? The x-axis is the value of k, and the y-axis is the SSE of the data. We pick the best parameter by looking for the k after which the curve starts to follow a linear trend for the next consecutive values.
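One hedged way to make "starts to follow a linear trend" concrete is to look at the relative SSE drop between consecutive values of k. The numbers below are purely illustrative, not computed from the dataset:

```python
# Illustrative SSE values only; in practice, use the `sse` dict from the elbow-method code
sse = {1: 3000.0, 2: 1500.0, 3: 700.0, 4: 620.0, 5: 580.0}

# Percentage improvement from k-1 to k; the elbow is where this flattens out
for k in range(2, 6):
    drop = (sse[k - 1] - sse[k]) / sse[k - 1] * 100
    print(f"k={k}: SSE drops by {drop:.0f}%")
```

In this illustration the drops are large up to k=3 and small afterwards, which matches the elbow we read off the plot.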

### Improvements to the Solution

Based on our observation, k = 3 is the best hyperparameter for our model because the SSE curve tends to become linear for the next consecutive k-values. Therefore, our best model for the data is **K-Means with 3 clusters**.

Now, we can fit the model with this code,

```python
model = KMeans(n_clusters=3, random_state=42)
model.fit(customers_normalized)
model.labels_.shape
```

By fitting the model, we obtain the cluster each data point belongs to, and with that we can analyze the data.

We can summarize the RFM table based on clusters and calculate the mean of each variable. The code will look like this,

```python
customers["Cluster"] = model.labels_
customers.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']}).round(2)
```

The output from the code looks like this,

Besides that, we can analyze the segments using a **snake plot**. It requires the normalized dataset and the cluster labels. This plot gives a good visualization of how the clusters differ from each other. We can make the plot using this code,

```python
# Create the dataframe
df_normalized = pd.DataFrame(customers_normalized,
                             columns=['Recency', 'Frequency', 'MonetaryValue'])
df_normalized['ID'] = customers.index
df_normalized['Cluster'] = model.labels_

# Melt the data
df_nor_melt = pd.melt(df_normalized.reset_index(),
                      id_vars=['ID', 'Cluster'],
                      value_vars=['Recency', 'Frequency', 'MonetaryValue'],
                      var_name='Attribute',
                      value_name='Value')
df_nor_melt.head()

# Visualize it
sns.lineplot(x='Attribute', y='Value', hue='Cluster', data=df_nor_melt)
```

And here is the result,

By using this plot, we can see how each segment differs; it describes the clusters better than the summary table alone.

We infer that cluster 0 buys frequently, spends more, and has bought recently. Therefore, it could be the cluster of **loyal customers**.

Cluster 1 buys less frequently and spends less, but has bought recently. Therefore, it could be the cluster of **new customers**.

Finally, cluster 2 buys less frequently, spends less, and last bought a long time ago. Therefore, it could be the cluster of **churned customers**.
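Following this interpretation, a natural final step is to attach segment names to the cluster labels. This is a hypothetical sketch: the label values below stand in for `model.labels_`, and the mapping assumes the cluster numbering from this particular run:

```python
import pandas as pd

# Hypothetical stand-in for model.labels_; in the article this comes from the fitted model
labels = pd.Series([0, 1, 2, 0, 1, 2, 0])

# Mapping assumed from the interpretation above: 0=loyal, 1=new, 2=churned
segment_map = {0: 'Loyal', 1: 'New', 2: 'Churned'}
segments = labels.map(segment_map)
print(segments.value_counts())
```

With named segments attached to each customer, marketing campaigns can be targeted per group rather than per raw cluster number.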

In conclusion, customer segmentation is really necessary for knowing the characteristics of each customer group. This article has shown you how to implement it using Python. I hope it is useful to you and that you can apply it to your own case.

### Link to the working project

If you want to see the full code, you can check the Google Colab notebook here.

