This article summarizes the work on a data set which contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Not all users receive the same offer, and that was the challenge to solve with this data set. Our task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type.
This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products. Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. Informational offers also have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, we can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
Rewards program users (17000 users x 5 fields)
- gender: (categorical) M, F, O, or null
- age: (numeric) missing value encoded as 118
- id: (string/hash)
Offers sent during 30-day test period (10 offers x 6 fields)
- reward: (numeric) money awarded for the amount spent
- channels: (list) web, email, mobile, social
- difficulty: (numeric) money required to be spent to receive reward
- duration: (numeric) time for offer to be open, in days
- offer_type: (string) bogo, discount, informational
- id: (string/hash)
Event log (306648 events x 4 fields)
- person: (string/hash)
- event: (string) offer received, offer viewed, transaction, offer completed
- value: (dictionary) different values depending on event type:
  - offer id: (string/hash) not associated with any “transaction”
  - amount: (numeric) money spent in “transaction”
  - reward: (numeric) money gained from “offer completed”
- time: (numeric) hours after start of test
There are three types of offers that can be sent: buy-one-get-one (BOGO), discount, and informational.
- In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount.
- In a discount, a user gains a reward equal to a fraction of the amount spent.
- In an informational offer, there is no reward, but neither is there a requisite amount that the user is expected to spend.
Offers can be delivered via multiple channels.
The goal is to clean, transform and merge the data, find insights about the customer base (age, income etc.), and analyze the final results.
Data cleaning was performed first to get the data into the right format, followed by data analysis. The cleaning steps are summarized below:
- Renamed “id” column to “offer_id”
- One-hot encoded the “channels” column into separate “web”, “mobile”, “email” and “social” columns
- Changed the format of the “became_member_on” column to YYYY-MM-DD format
- 2175 entries in the profile data were missing gender and income information, and in the rows where gender and income are nan the age is “118”, which is the encoding for a missing value rather than a real age. Therefore, replaced all age entries of “118” with nan
- Extracted “offer_id”, “amount” and “reward” information from the “value” column
- The “reward” column contained only one unique value (“2”), so it was dropped
- Renamed “person” column to “customer_id”
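The cleaning steps above can be sketched roughly as follows with pandas. This is a minimal illustration, not the notebook's actual code: the three small data frames are made-up stand-ins, and the integer YYYYMMDD format assumed for “became_member_on” is an assumption that the article does not state.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-ins for the three tables (values are illustrative only).
portfolio = pd.DataFrame({
    "id": ["o1", "o2"],
    "offer_type": ["bogo", "informational"],
    "channels": [["web", "email", "mobile"], ["email", "social"]],
})
profile = pd.DataFrame({
    "id": ["c1", "c2"],
    "gender": ["F", None],
    "age": [55, 118],
    "income": [72000.0, np.nan],
    "became_member_on": [20170715, 20180102],  # assumed integer YYYYMMDD
})
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c1"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "o1"}, {"amount": 12.5}, {"offer id": "o1", "reward": 5}],
    "time": [0, 30, 30],
})

# Portfolio: rename "id" and one-hot encode the list-valued "channels" column.
portfolio = portfolio.rename(columns={"id": "offer_id"})
for channel in ["web", "email", "mobile", "social"]:
    portfolio[channel] = portfolio["channels"].apply(lambda chans: int(channel in chans))
portfolio = portfolio.drop(columns="channels")

# Profile: parse the membership date and turn the placeholder age 118 into NaN.
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
)
profile["age"] = profile["age"].where(profile["age"] != 118)

# Transcript: rename "person" and unpack the "value" dictionary into columns.
transcript = transcript.rename(columns={"person": "customer_id"})
for key in ["offer id", "amount", "reward"]:
    column = key.replace(" ", "_")
    transcript[column] = transcript["value"].apply(lambda d: d.get(key))
transcript = transcript.drop(columns="value")

print(portfolio)
print(profile)
print(transcript)
```

Keys that are absent from a given event's dictionary simply come out as missing values in the new columns.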
After cleaning, the 3 datasets were merged and EDA was performed on the merged dataset. The results of the EDA are below:
Age: The average age of the customer base is ~54 years. The minimum age is 18 and the maximum age is 101.
Income: The average income of the customer base is ~$64,000.
Gender: Roughly 50% of the customers are male.
2. Analysis on Offer data
From the charts below it can be inferred that although bogo offers were viewed more than discount offers, discount offers were completed more often than bogo offers. One possible reason is that the average validity of the discount offers was longer than that of the bogo offers.
3. Breakdown of Offer Data with Offer type and channel, income group and age group
From the chart below it can be inferred that no offer type was sent noticeably more than any other through any particular channel. It was also observed that bogo completions were lower than discount completions even though bogo offers had been viewed more than discount offers.
Breaking down the above charts by income group and age group for each gender, we can infer the following from the data below:
- Males: For each income and age group, discount offers were sent more than bogo offers, but bogo offers were viewed more than discount offers. Even though bogo offers were viewed more, discount offers were completed more than bogo offers
- Females: For almost every income and age group, bogo offers were sent more than discount offers, and bogo offers were viewed more than discount offers. Even though bogo offers were viewed more, discount offers were completed more than bogo offers
- About 50% of the customers are male, the average age of a customer is 54, and the average income is $64k.
- Almost the same number of discount and bogo offers were sent to the customers, but bogo offers were viewed more and discount offers were completed more
- Male customers were sent more discount offers, but they viewed bogo offers more and completed more discount offers than bogo offers
- Female customers were sent more bogo offers and viewed bogo offers more, but completed more discount offers than bogo offers
- This trend of viewing more bogo offers but completing more discount offers was observed across almost all age and income groups
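Counts of sent, viewed and completed offers per gender and offer type, like the ones summarized above, could be computed with a grouped count along these lines. The data frame and its column names are assumptions made for illustration, and the numbers are made up, not taken from the data set.

```python
import pandas as pd

# Made-up merged offer events; column names are assumptions for illustration.
events = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F", "F"],
    "offer_type": ["bogo", "bogo", "discount", "bogo",
                   "discount", "discount", "discount"],
    "event": ["offer received", "offer viewed", "offer received",
              "offer received", "offer received", "offer viewed",
              "offer completed"],
})

# One row per (gender, offer_type), one column per event type.
counts = (
    events.groupby(["gender", "offer_type", "event"])
    .size()
    .unstack("event", fill_value=0)
)
print(counts)
```

The same pattern extends to income or age groups by adding a binned column (for example one built with `pd.cut`) to the `groupby` keys.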
We had 3 datasets (Portfolio, Profile and Transcript), so we needed a strategy to merge them and prepare the data for building a predictive model.
Problem Statement for model: Predict whether a user will complete an offer or not based on their transactions and offer viewing behaviour
Data Preparation Steps:
- After removing the rows with nan values for customer demographics (income, age, gender etc.), we created 2 datasets from the transcript data: (a) offer data and (b) transaction data
- We extracted the year in which the customer became a member from the “became_member_on” column and converted the “time” column from hours to days by dividing it by 24
- Every offer has a duration (validity) up to which it is valid, provided in the “duration” column of the portfolio data (in days)
- Once an offer is sent to a customer, the customer has until the end of the offer’s validity to spend the required amount of money and fulfill the offer
- Therefore, for each offer sent to a customer, we check whether that customer viewed the offer within its duration/validity (we built functions to perform this task)
- An offer is considered successful if the user both viewed and completed it within the offer’s duration
- After getting the data into the required format, we one-hot encoded the categorical features and created a train-test split (20% of the data for the test set)
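The labelling rule described in the steps above can be sketched as follows. This is only one possible reading of it: the function name `label_offers`, the column names and the toy events are hypothetical, and the notebook's own functions may differ.

```python
import pandas as pd

# Made-up offer events for one customer; "time" is in hours since test start.
offers = pd.DataFrame({
    "customer_id": ["c1"] * 5,
    "offer_id": ["o1", "o1", "o1", "o2", "o2"],
    "event": ["offer received", "offer viewed", "offer completed",
              "offer received", "offer completed"],
    "time": [0, 24, 72, 168, 192],
})
offers["time_days"] = offers["time"] / 24  # hours -> days

# Offer validity in days, as given by the portfolio's "duration" column.
durations = {"o1": 5, "o2": 7}


def label_offers(offers: pd.DataFrame, durations: dict) -> pd.DataFrame:
    """Return one row per received offer with a 0/1 `successful` flag."""
    rows = []
    received = offers[offers["event"] == "offer received"]
    for rec in received.itertuples(index=False):
        start = rec.time_days
        end = start + durations[rec.offer_id]
        # Events for the same customer and offer inside the validity window.
        window = offers[
            (offers["customer_id"] == rec.customer_id)
            & (offers["offer_id"] == rec.offer_id)
            & offers["time_days"].between(start, end)
        ]
        viewed = (window["event"] == "offer viewed").any()
        completed = (window["event"] == "offer completed").any()
        rows.append({
            "customer_id": rec.customer_id,
            "offer_id": rec.offer_id,
            "received_day": start,
            "successful": int(viewed and completed),
        })
    return pd.DataFrame(rows)


labels = label_offers(offers, durations)
print(labels)
```

In this toy example the first offer is viewed and completed within its 5-day window and is labelled successful, while the second is completed without ever being viewed and is not. The sketch does not check that the view happened before the completion, and it does not separate overlapping sends of the same offer to the same customer; both would need extra handling on real data.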
We first used a Random Forest Classifier with default parameters and then used RandomizedSearchCV to tune the hyperparameters. The best score was found for the following set of hyperparameters:
The model’s performance metrics were an accuracy of 0.72 and an F1-score of 0.72.
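A minimal sketch of this tuning setup with scikit-learn is shown below. The synthetic data, the search space, the number of iterations and the F1 scoring are all assumptions chosen for illustration; they are not the values used in the analysis, so the printed scores will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the prepared feature matrix and success labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search space; the article does not list the one it used.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    cv=3,              # 3-fold cross-validation on the training set
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 20% test set.
pred = search.predict(X_test)
accuracy = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)
print(search.best_params_)
print(accuracy, f1)
```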
The chart below shows the relative importance of all the features used to train the model. The top 4 features were:
1. Offer reward
2. Offer difficulty
3. Offer duration
4. Customer’s income
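A ranking like this one can be read from a fitted random forest's `feature_importances_` attribute. The sketch below uses synthetic data and borrows feature names only as labels, so the ordering it prints says nothing about the real data set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names are used only as labels for synthetic columns.
feature_names = ["reward", "difficulty", "duration", "income", "age"]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

These impurity-based importances sum to 1 and describe how much each feature contributed to the splits in the forest, not the direction of its effect.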
It can be concluded from the above analysis and modeling that an offer’s details (reward, difficulty, duration etc.) are the strongest drivers of whether a customer completes an offer.
It also makes sense that bogo offers were viewed more than discount offers, since customers are likely to find a “buy one get one” offer more attractive than a plain discount. The lower completion count for bogo offers may be related to their shorter average validity, as noted earlier.
Please find the Jupyter notebook and data for this analysis at the link below: