This story is divided into the following sections:
- Problem Understanding
- Exploratory Data Analysis
- Data Preprocessing
- Machine Learning Model Development
- Business Insights & Recommendations
With that overview in place, let us start with the first section: Problem Understanding.
We don’t like to admit it but…
Many startups compete to grow their volume and scale in various ways, whereas as businesspeople we should be pursuing profit. To understand more, we need to look at what has been happening lately.
We started from the potential we saw in e-grocery data: conversion rate (CR) rose by 36% from 2020 to 2021, and higher conversion rates are something we can turn into profit. In addition, a startup bubble has emerged recently, and one of its causes is too much dependence on investors. So we need to think about how to make our e-commerce ecosystem profitable so that it survives intense competition.
Our company’s conversion rate for the whole of 2021 was only 15.63%. Therefore, we aim to raise the conversion rate to 20% and increase gross profit by up to 15% in 2023 by:
- Predicting whether visitors will convert using predictive modeling
- Analyzing the factors that affect the conversion rate
2.1 Dataset Overview
Before we start the analysis, we need to understand our data. The dataset consists of 12,330 sessions recorded over a one-year period. Each session represents a collection of user activity on the website/platform over a certain period of time. The dataset has 18 features: 10 numerical features, 7 categorical features, and 1 target feature.
The data dictionary below describes each feature:
- Administrative: Number of pages visited by the visitor about account management.
- Administrative Duration: Total amount of time (in seconds) spent by the visitor on account management-related pages.
- Informational: Number of pages visited by the visitor about the website, communication, and address information of the shopping site.
- Informational Duration: Total amount of time (in seconds) spent by the visitor on informational pages.
- Product Related: Number of product-related pages visited by the visitor.
- Product Related Duration: Total amount of time (in seconds) spent by the visitor on product-related pages.
- Bounce Rate: Average bounce rate value of the pages visited by the visitor.
- Exit Rate: Average exit rate value of the pages visited by the visitor.
- Page Value: Average page value of the pages visited by the visitor.
- Special Day: Closeness of the site visiting time to a special day.
- Operating Systems: Operating system of the visitor.
- Browser: Browser of the visitor.
- Region: Geographic region from which the session has been started by the visitor.
- Traffic Type: Traffic source by which the visitor has arrived at the Web site / Platform.
- Visitor Type: Visitor type as “New Visitor”, “Returning Visitor”, or “Other”.
- Weekend: Boolean value indicating whether the date of the visit is a weekend.
- Month: Month value of the visit date.
- Purchase: Class label indicating whether the visit has been finalized with a transaction.
2.2 Descriptive Analysis
We conducted some exploration to gain quick insights into the dataset. This section describes several of the insights gained.
From the chart above, we can see that direct visits and Google organic traffic dominate the 18 traffic types. This shows that visitor intent and recommendations from Google are still very relevant.
From the chart above, we can see that the majority of our visitors come from big cities, with the top 3 being South Tangerang, Depok, and Jakarta. Therefore, to increase our conversion rate, we can prioritize existing resources to improve, promote, and advertise in Kota Tangerang Selatan, Depok, and Jakarta.
In this section, we tidy our data before developing a machine learning model. The data preprocessing steps are described below:
- Handling Missing Data: This dataset is clean; there are no missing values.
- Handling Duplicate Data: At this stage, we drop 125 duplicate rows.
- Feature Encoding: At this stage, we use One Hot Encoder to encode 5 categorical features, producing 41 additional encoded features. This gives us a total of 52 features (after dropping the 5 original categorical features).
- Data Transformation: At this stage, we use a log transformation to make our skewed numerical features more normally distributed. After that, we use Min Max Scaler to normalize our numerical data to the range 0–1.
- Handling Outliers: At this stage, we use the z-score method to remove extreme outliers.
- Feature Selection: At this stage, we eliminate irrelevant features such as page values and special days. Then, we eliminate features that contribute little to model development using two kinds of methods: filter methods and an embedded method.
The filter methods we use are quasi-constant filtering, chi-square, univariate selection, and mutual information, while the embedded method we use is lasso. At this stage, we selected 16 features out of 51 to be used in the machine learning modeling stage.
- Handling Imbalanced Target: For the last preprocessing stage, we use the SMOTE method to handle the imbalanced target feature, with a default sampling strategy of 1:1.
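As a rough illustration of the preprocessing steps above, here is a minimal sketch on a toy table. The column names, the toy values, and the z-score threshold of 3 are assumptions made for illustration and are not taken from the project's source code. It uses pandas `get_dummies` and plain arithmetic in place of the One Hot Encoder and Min Max Scaler classes, and it leaves out feature selection and SMOTE.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the session data; column names and values are hypothetical.
df = pd.DataFrame({
    "ProductRelated_Duration": [0.0, 120.0, 64.5, 120.0, 980.0, 15.0, 300.0, 42.0],
    "ExitRates": [0.20, 0.02, 0.05, 0.02, 0.01, 0.10, 0.03, 0.07],
    "VisitorType": ["New", "Returning", "Returning", "Returning",
                    "Returning", "New", "Other", "Returning"],
    "Purchase": [0, 1, 0, 1, 1, 0, 0, 0],
})
numeric_cols = ["ProductRelated_Duration", "ExitRates"]

# 1. Drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 2. One-hot encode the categorical column (the original column is replaced).
df = pd.get_dummies(df, columns=["VisitorType"], dtype=int)

# 3. Log-transform skewed numeric features; log1p keeps zero values valid.
df[numeric_cols] = np.log1p(df[numeric_cols])

# 4. Min-max scale the numeric features to the 0-1 range.
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
df[numeric_cols] = (df[numeric_cols] - col_min) / (col_max - col_min)

# 5. Remove rows where any numeric feature has |z-score| above 3 (assumed threshold).
z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std(ddof=0)
df = df[(z.abs() <= 3).all(axis=1)].reset_index(drop=True)

print(df.shape)
```

On real data the order of steps and the threshold would need to follow the actual project; this only shows the shape of each operation.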
As explained earlier, one of our objectives is to increase the conversion rate by predicting whether visitors will purchase. We therefore trained 10 classification algorithms and compared them to find the best performer, using the F2 score as our main evaluation metric. We chose the F2 score because it weights recall more heavily than precision: we want to prioritize avoiding false negative predictions without completely ignoring false positives. In this case, we define the value 0 in the target feature as the positive label.
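To make the metric concrete, the sketch below computes the F-beta score from confusion counts, with beta = 2 and label 0 treated as the positive class, as described above. The function name and the toy labels are illustrative assumptions. In practice, a library routine such as scikit-learn's `fbeta_score` would normally be used instead.

```python
def fbeta_from_labels(y_true, y_pred, beta=2.0, positive=0):
    """F-beta score with a configurable positive label.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


# Toy labels (hypothetical): 0 = did not purchase (positive class), 1 = purchased.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1]

f2 = fbeta_from_labels(y_true, y_pred, beta=2.0, positive=0)
f1 = fbeta_from_labels(y_true, y_pred, beta=1.0, positive=0)
print(round(f2, 4), round(f1, 4))
```

In this toy case recall (0.75) is higher than precision (0.60), so the F2 score comes out above the F1 score. That reflects the extra weight F2 gives to avoiding false negatives.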
Based on the results above, we selected LGBM as the best algorithm and used SHAP values to interpret it and extract feature importance.
Based on the feature importance results above, the top 4 features that most influence a visitor’s intention to buy are exit rates, administrative, product-related duration, and product-related. We also gain insight into the direction of each feature’s effect on purchase intention: for example, the lower the exit rate, the more likely a visitor is to buy. The other features can be read in the same way.
We then re-trained the LGBM model using only the top 4 features and tuned its hyperparameters. Based on the results, we adopted it as the final model, with a train F2 score of 94.3% and a test F2 score of 90.6%.
Based on the analysis and modeling above, we have compiled several business recommendations for the company to consider.
5.1 Discount Offering to Real-Time Non-Purchased Predicted Visitors
Our first recommendation is to use the prediction model we developed to offer real-time discount promotions to visitors who are predicted not to buy, following the workflow below.
To illustrate the impact of this recommendation, we ran a simulation under the following assumptions:
- Average COGS: 100K / purchase
- Gross Profit Margin: 65%
- Discount Promotion Effectiveness Level: 10%
- Number of visitors/year: 100K visitors
Based on our simulation results, the company has the potential to increase its conversion rate from 15.63% to as much as 23.32% and to increase its gross profit by up to 17% by offering real-time discounts to visitors predicted not to purchase.
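The conversion-rate part of such a simulation can be sketched as below. This is only one possible reading of the assumptions: it treats the discount effectiveness level as the share of discounted visitors who end up purchasing, and it adds a hypothetical `non_buyer_recall` parameter for the share of non-buyers the model actually flags. The original workflow and the discount size are not shown here, so this sketch is not expected to reproduce the 23.32% figure, and it does not attempt the gross profit calculation.

```python
def simulate_conversion_rate(baseline_cr, visitors, effectiveness, non_buyer_recall):
    """Estimate the conversion rate after discounting predicted non-buyers.

    baseline_cr      -- current share of visitors who purchase
    visitors         -- number of visitors per year
    effectiveness    -- assumed share of discounted visitors who then purchase
    non_buyer_recall -- hypothetical share of true non-buyers flagged by the model
    """
    buyers = baseline_cr * visitors
    non_buyers = visitors - buyers
    discounted = non_buyers * non_buyer_recall   # non-buyers who receive the offer
    extra_buyers = discounted * effectiveness    # offers that turn into purchases
    return (buyers + extra_buyers) / visitors


new_cr = simulate_conversion_rate(
    baseline_cr=0.1563,       # from the article
    visitors=100_000,         # from the article
    effectiveness=0.10,       # from the article
    non_buyer_recall=1.0,     # hypothetical: every non-buyer is flagged
)
print(f"{new_cr:.2%}")
```

Lowering `non_buyer_recall` below 1.0 shows how missed non-buyers shrink the uplift, which is why the model's recall on that class matters for this recommendation.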
5.2 Root Cause Analysis
Before explaining further recommendations, we summarize our analysis results in a root cause analysis. This method helps us explain the causes of the low conversion rate based on the top 4 important features. For each feature, we describe the likely causes based on phenomena and problems that often occur in the industry.
There are several causes of the low conversion rate based on the top 4 important features, as follows:
Based on our root cause analysis results, we need to analyze each feature in depth to arrive at the most appropriate recommendations.
5.2.1 Exit Rates Feature Analysis
We conduct an exploration to gain quick insight into the exit rates feature for purchased and non-purchased visitors.
Based on SHAP values, Exit Rates is the most important feature influencing a visitor’s decision to make a purchase. Exit Rates is the average exit rate of the pages visited in a session. We use the median as the measure of the middle value: the median Exit Rates is 1.6% for visitors who make purchases, lower than the 2.8% for those who do not. The graph above also shows that many visitors who do not make purchases have higher exit rates.
Currently, the overall median exit rate is 2.5%. Reducing it from 2.5% to 1.6% would be a very large change. Therefore, we perform a sensitivity analysis to find an exit rate target that is effective and achievable.
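The group-wise and overall medians used in this comparison can be computed along these lines. The session values below are made-up toy numbers, not the project's data, and the helper name is an assumption for illustration.

```python
from statistics import median

# Hypothetical toy sessions: (exit_rate, purchased)
sessions = [
    (0.010, True), (0.018, True), (0.030, True),
    (0.015, False), (0.035, False), (0.060, False), (0.120, False),
]


def median_exit_rate(rows, purchased=None):
    """Median exit rate for one group, or for all sessions when purchased is None."""
    values = [rate for rate, bought in rows if purchased is None or bought == purchased]
    return median(values)


median_buyers = median_exit_rate(sessions, purchased=True)
median_non_buyers = median_exit_rate(sessions, purchased=False)
median_overall = median_exit_rate(sessions)
print(median_buyers, median_non_buyers, median_overall)
```

The same helper applies to the other features in this section by swapping the first element of each tuple for that feature's value.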
We perform a sensitivity analysis to determine the effect of increasing or decreasing exit rates on the conversion rate. The experiment decreases and increases the feature value in 5% steps, with the results shown in the graph on the left. Based on these results, we propose reducing Exit Rates by 15%, which is expected to increase the conversion rate by 0.6%. There are many ways to reduce exit rates. Based on our earlier root cause analysis, our recommendations are as follows:
Based on the root cause analysis, we match each recommendation to a problem that may be occurring, and we present solutions commonly used by practitioners together with their pros and cons to make decisions easier for stakeholders. For example, one cause of high exit rates is poor UI/UX, so we recommend improving the UI/UX and testing the results. This approach has its own advantages and disadvantages, as listed in the table.
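The sensitivity analysis described in this section can be sketched as follows: scale one feature up and down in 5% steps, re-run the classifier, and record the predicted conversion rate at each step. The `predict_purchase` function is a hypothetical stand-in for the trained model, and the exit-rate values are toy numbers, so the printed rates say nothing about the real results.

```python
import math


def predict_purchase(exit_rate):
    """Hypothetical stand-in for the trained classifier.

    Returns True (purchase) more often when the exit rate is low.
    """
    prob = 1.0 / (1.0 + math.exp(60.0 * (exit_rate - 0.03)))
    return prob >= 0.5


def sensitivity(exit_rates, steps=range(-25, 30, 5)):
    """Predicted conversion rate after scaling the feature by each percentage step."""
    results = {}
    for pct in steps:
        shifted = [rate * (1 + pct / 100) for rate in exit_rates]
        results[pct] = sum(predict_purchase(rate) for rate in shifted) / len(shifted)
    return results


# Toy exit rates for ten sessions (hypothetical values).
exit_rates = [0.010, 0.016, 0.022, 0.028, 0.032, 0.034, 0.042, 0.050, 0.080, 0.100]

results = sensitivity(exit_rates)
for pct, rate in results.items():
    print(f"{pct:+d}% exit rate -> predicted conversion rate {rate:.0%}")
```

With a real model, only the changed feature would be scaled while the other features stay fixed, and the step at which the gain flattens out helps in choosing a target that is worth its cost.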
5.2.2 Administrative Feature Analysis
We conduct an exploration to gain quick insight into the administrative feature for purchased and non-purchased visitors.
This feature describes how many administrative pages, such as account management pages, are visited in each session. Taking extreme values into account, the median is 2 pages for visitors who make purchases and 0 pages for visitors who do not (meaning they do not visit these pages at all). It can also be seen from the graph above that many visitors who do not make purchases have higher exit rates.
Currently, the overall median number of administrative pages visited is 1. Increasing it from 1 to 2 is not a large change. Even so, we perform a sensitivity analysis to find an administrative target that is effective and achievable.
We ran the same experiment for this feature. Considering effectiveness and cost, we set a rational target of a 5% increase in the average number of administrative page visits, so that the conversion rate can be expected to increase to 12.7%. There are many ways to increase the number of administrative pages visited. Based on our earlier root cause analysis, our recommendation is as follows:
One recommendation that can be implemented is to analyze the administrative pages’ user interface for deficiencies and improve the sections validated as deficient, so that visitors visit more administrative pages.
5.2.3 Product-Related Feature Analysis
We conduct an exploration to gain quick insight into the product-related feature for purchased and non-purchased visitors.
Product-related is the number of product-related pages visited by the visitor. The median is 16 pages for visitors who make purchases and 29 pages for those who do not. It can also be seen from the graph above that many visitors who do not make purchases have higher exit rates.
Currently, the overall median of product-related is 18 pages. Decreasing it from 18 pages to 16 pages is not a large change. Even so, we perform a sensitivity analysis to find a product-related target that is effective and achievable.
We ran the same experiment for this feature. We looked for the optimal reduction in the average number of product-related pages to increase the conversion rate, and we decided on a 10% decrease, which we hope will increase the conversion rate by 6.7%. There are many ways to reduce the number of product-related pages visited. Based on our earlier root cause analysis, our recommendation is as follows:
For this feature, one root cause is poor product photo quality, so we recommend improving photo quality by setting a standard for product displays. This recommendation is quite affordable, but higher-quality photos will affect page loading time.
5.2.4 Product-Related Duration Feature Analysis
We conduct an exploration to gain quick insight into the product-related duration feature for purchased and non-purchased visitors.
Product-related duration is the total amount of time in seconds spent by the visitor on product-related pages. The median is 18.4 minutes for visitors who make purchases and 8.5 minutes for those who do not. It can also be seen from the graph above that many visitors who do not make purchases have higher exit rates.
Currently, the overall median product-related duration is 9.9 minutes. Increasing it from 9.9 minutes to 18.4 minutes would be a very large change. Therefore, we perform a sensitivity analysis to find a product-related duration target that is effective and achievable.
We ran the same experiment for this feature, looking for the optimal increase in product-related duration to raise the conversion rate. We propose a 5% increase in product-related duration, which is expected to increase the conversion rate by 0.6%. There are many ways to increase the time spent on product-related pages. Based on our earlier root cause analysis, our recommendation is as follows:
For this recommendation, one root cause of low product-related duration is that visitors find it hard to understand the product features. We can address this by improving the platform’s layout, conducting research first and testing the changes. The advantage is that this step is relatively inexpensive, but it takes a long time to complete.
- We recommend prioritizing existing resources to improve, promote, and advertise in Kota Tangerang Selatan, Depok, and Jakarta for marketing purposes.
- Implementing the machine learning model to offer real-time discounts to visitors predicted not to purchase has the potential to increase the conversion rate from 15.63% to as much as 23.32% and the company’s gross profit by up to 17%.
- Among all the recommendations we propose based on the root cause analysis, we most strongly recommend improving the UI/UX of the platform/website, because it can address 3 of the identified causes: high exit rates, low administrative pages visited, and high product-related pages visited.
- The projects above could also be implemented at other e-commerce companies (e.g. HappyFresh, Bukalapak, Lazada, etc.) to increase their conversion rates and achieve higher profitability.
Source Code : Here