This post is an overview of a project proposed by Udacity for the Data Science Nanodegree. The goal is to predict which users of a simulated music streaming app will churn. In this project, a churned user is defined as one who downgraded from a paid to a free plan or canceled their subscription.
Predicting churn rates can be very valuable for a company. Knowing which users are more likely to cancel or downgrade, the company can take measures to prevent it; the exploratory analysis alone, done while modeling the problem, can also reveal weaknesses in the product.
To solve this problem, we have a variety of information about users: which pages they visited in the app, how much time they spent on each page, which songs they listened to, the time they spent watching ads, how long they have been registered, etc. This data will be cleaned and engineered to serve as input to machine learning models that will predict which users will churn.
In the image below we can see the first 20 rows of the dataset, where each row represents a user interaction in the app. Not all of the features were used in the final model, and some new ones were engineered; this is discussed in the following sections.
The dataset is composed of 528,005 rows, with data from 448 unique users. As we can see in the plot below, most rows have the value NextSong in the page feature; however, the time spent on some of the other pages is also very relevant. This will become clearer in the next sections.
To continue the analysis, the next step is to create the label column, which will have the value 1 for every row of a user who churned at any given moment, and 0 if they never churned.
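As a rough illustration of this labeling step (in plain Python over hypothetical event records, for brevity; the project itself does this with pyspark), every row of a user who ever reached a churn-defining page gets label 1:

```python
# Pages that define a churned user in this project.
CHURN_PAGES = {"Cancellation Confirmation", "Submit Downgrade", "Cancel"}

def add_labels(events):
    """Return a copy of the event rows with a 'label' key:
    1 for every row of a user who churned at any moment, else 0."""
    churned_users = {e["userId"] for e in events if e["page"] in CHURN_PAGES}
    return [dict(e, label=1 if e["userId"] in churned_users else 0)
            for e in events]

events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
]
labeled = add_labels(events)
# User 1 churned at some point, so both of their rows get label 1; user 2 stays 0.
```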
In the image below we can see the proportion of churned users for each page. The pages Cancellation Confirmation, Submit Downgrade and Cancel were used to define the churned users, so they always correspond to label 1 and will not be used as input features to the model. The Home and Settings features will also be removed to simplify the model, as they are the closest to a 50% label distribution.
The plot below shows a significant difference in label distribution across genders, hence gender will be used as a feature.
This dataset was already pretty clean; however, some rows were missing the userId feature. Since we are trying to predict which users will churn, a row without a user identification is useless, so those rows were removed from the dataset.
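The cleaning step itself is just a filter; sketched in plain Python over hypothetical rows (in pyspark the equivalent is a filter on the userId column):

```python
rows = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "",   "page": "Home"},   # logged-out interaction, no user id
    {"userId": None, "page": "Login"},
]

# Keep only rows with a usable user identification.
clean = [r for r in rows if r["userId"]]  # drops both None and empty strings
# Only the row with userId "10" survives.
```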
I also engineered some other features, such as how many songs per hour each user listens to, time spent watching ads, and the paid/free time ratio. All the other values of the page feature were turned into new columns. The image below shows a sample of the final dataframe used in the model, with a total of 21 features.
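One way to picture the page-pivoting step (plain Python with hypothetical events; the project does this with pyspark aggregations): each user's visits are counted per page, which is the per-user analogue of turning each page value into its own feature column.

```python
from collections import Counter

def page_counts(events):
    """Count page visits per user: the per-user analogue of
    pivoting each 'page' value into its own feature column."""
    per_user = {}
    for e in events:
        per_user.setdefault(e["userId"], Counter())[e["page"]] += 1
    return per_user

events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "Roll Advert"},
]
features = page_counts(events)
# features["1"]["NextSong"] == 2 and features["2"]["Roll Advert"] == 1
```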
To create and train the models I used the Python API for Spark (pyspark) in local mode, with its machine learning libraries. Spark is a framework for processing large datasets in clusters, usually on cloud services like AWS EMR. In this project I used a reduced version of the dataset, with 236 MB; the complete version has 12 GB and is only available on AWS.
After assembling the dataframe shown in the last image, a few more preprocessing steps are needed: transform the feature columns into vectors with VectorAssembler, scale all the features into the [0, 1] range with MinMaxScaler, choose a classifier from pyspark.ml.classification, and choose some parameters to build a parameter grid with ParamGridBuilder. All of this is combined into a pipeline with pyspark.ml.Pipeline. Pipelines make the code easier to work with and understand; they also help to parallelize tasks when using distributed computing and to avoid mistakes like leaking data from the training to the test set.
The image below shows part of the code that builds the models, with the 4 classifiers tested (Random Forest, Multilayer Perceptron, Logistic Regression and Gradient-Boosted Trees) and the parameters used in the grid search.
To evaluate the performance of the prediction model, we split the data into training and test sets with distinct users in each set: one group of users is used to build and train the model, and the other to test its performance. This way we simulate a real situation, where the model predicts churn for users it has never seen before. As the evaluation metric we use the F1 score, which combines two important metrics in classification problems, precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
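Written out in code, with a made-up confusion-matrix example:

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall),
    computed from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# E.g. 8 churned users caught, 2 false alarms, 4 churned users missed:
# precision = 8/10 = 0.8, recall = 8/12 ≈ 0.667, F1 ≈ 0.727
score = f1_score(tp=8, fp=2, fn=4)
```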
There was also a train/validation split (70/30) to train the models and choose the parameters.
Since the dataset has many rows per user, and the model makes a prediction for each row, we have to decide how to define the prediction for each user. I experimented with two methods: using the most recent prediction and using the most frequent prediction. The second one showed better results and is the one used to generate the results in the next section.
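The "most frequent prediction" rule can be sketched in plain Python over hypothetical (userId, prediction) pairs:

```python
from collections import Counter

def user_prediction(row_predictions):
    """Collapse row-level predictions into one prediction per user
    by majority vote (the most frequent predicted class)."""
    per_user = {}
    for user, pred in row_predictions:
        per_user.setdefault(user, Counter())[pred] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in per_user.items()}

rows = [("1", 1), ("1", 0), ("1", 1), ("2", 0), ("2", 0)]
final = user_prediction(rows)
# -> {"1": 1, "2": 0}
```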
The image below shows the F1 score and accuracy of the best model for each classifier. The scores are based on the prediction for each user, not for each row, as I believe that is a better way to measure the success of the model in a real application.
The logistic regression had the worst performance, since it is the simplest model tested and can only capture linear dependencies. The MLP was only slightly better, but could be improved with a deeper analysis: more iterations, different kinds of layers and many other deep learning techniques that would be out of the scope of this project. It was my first time experimenting with GBT, and I was impressed by its training time, which was much faster than the MLP's, with better results. The best classifier for this problem was the Random Forest, which had the highest scores from the first tests.
The results show that even simple models can perform far better than random chance at predicting which users will churn, giving companies valuable information that can be used to keep users subscribed.
Many improvements can still be made, such as more complex machine learning models, live training and prediction of churn, automated incentives for users more likely to churn, and more data from users to be used as input features, such as a periodic in-app rating.
To me, the most interesting aspect of this project was the objective: churn prediction seems like a tool that can be applied to a wide range of businesses and has a lot of potential to be further developed.
The repository of the project is available at: https://github.com/ttozatto/sparkify