The finance team at Rossmann Pharmaceuticals wanted to forecast sales in all of its stores, across several cities, six weeks ahead of time. Managers in individual stores rely on their years of experience and their personal judgement to forecast sales.
However, the data team identified factors such as promotions, competition, school and state holidays, seasonality and locality as necessary for predicting sales across the various stores. In this project, we built and served an end-to-end product that delivers predictions to analysts in the finance team.
The data sets with a sufficient description of the features can be found here.
Exploratory data analysis (EDA) is the most important step in understanding the data. A good EDA requires clean data, so the first task is to build pipelines that detect and handle outliers and missing values. This is particularly important because unhandled outliers and gaps would skew our analysis.
As shown in the pictures below, after reading the given datasets we computed the missing values in each column and their corresponding percentages, then filled them with the respective column's median value. After that we detected outliers and replaced them. Finally, we merged the store dataset, which contains important information about individual stores, with the training and testing datasets.
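The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`Store`, `CompetitionDistance`, `Sales`), the tiny data frames, and the choice of an IQR rule with median replacement for outliers are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def fill_with_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values in numeric columns with that column's median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


def replace_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Replace values outside the IQR fences with the column median.

    The IQR rule and the median replacement are assumptions for this sketch.
    """
    out = df.copy()
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (out[column] < q1 - k * iqr) | (out[column] > q3 + k * iqr)
    out.loc[is_outlier, column] = out[column].median()
    return out


# Tiny made-up frames standing in for the store and training datasets.
store = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "CompetitionDistance": [500.0, np.nan, 700.0, 600.0, 90000.0],
})
train = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5, 1],
    "Sales": [5000, 6000, 5500, 5200, 5800, 5100],
})

print(missing_percentages(store))
store_clean = replace_outliers(fill_with_median(store), "CompetitionDistance")

# A left join keeps every training row and attaches the store attributes.
merged = train.merge(store_clean, on="Store", how="left")
print(merged)
```

Filling before outlier detection keeps the quantiles from being computed on a column with gaps; whether that ordering matches the original notebooks is not something the text confirms.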
EDA — Customer purchasing behavior
Exploratory data analysis is the lifeblood of every meaningful machine learning project. It helps us unravel the nature of the data and sometimes informs how we go about modelling. A careful exploration of the data covers all available features, their interactions and correlations, and their variability with respect to the target.
In the data exploration phase, we checked the distribution of promotions, compared sales behavior and looked for seasonal purchasing patterns. We also found a high correlation between sales and the number of customers. The effects of assortment type and competition distance were analyzed as depicted in the pictures below.
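As a rough illustration of two of these checks, the snippet below computes the correlation between sales and customer counts and compares average sales with and without a promotion. The rows are made up, and the column names (`Sales`, `Customers`, `Promo`) are assumptions rather than something the text specifies.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Customers": [500, 600, 550, 700, 650, 800],
    "Sales": [5000, 6100, 5400, 7200, 6500, 8100],
    "Promo": [0, 1, 0, 1, 0, 1],
})

# Pearson correlation between daily sales and number of customers.
sales_customers_corr = df["Sales"].corr(df["Customers"])

# Average sales on promotion days versus non-promotion days.
mean_sales_by_promo = df.groupby("Promo")["Sales"].mean()

print(f"corr(Sales, Customers) = {sales_customers_corr:.3f}")
print(mean_sales_by_promo)
```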
Predicting store sales with machine learning and deep learning approaches was the central task of this project. We want to predict daily sales in various stores up to six weeks ahead of time. To do this effectively, we first preprocessed the data into a format that can be fed into a machine learning model. We also generated new essential features such as ‘season’ and ‘week day’, and scaled the data. The output of this preprocessing is shown below.
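A minimal sketch of this kind of feature generation is shown here. The `Date` and `Sales` column names, the month-to-season mapping (northern-hemisphere meteorological seasons) and the plain z-score scaling are assumptions for the example; the project may well have used a scikit-learn scaler instead.

```python
import pandas as pd


def month_to_season(month: int) -> str:
    """Map a month number to a season (assumed northern-hemisphere mapping)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "autumn"


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'WeekDay' (Monday=0) and 'Season' columns derived from 'Date'."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["WeekDay"] = out["Date"].dt.dayofweek
    out["Season"] = out["Date"].dt.month.map(month_to_season)
    return out


def standardize(series: pd.Series) -> pd.Series:
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)


# Made-up rows for illustration.
raw = pd.DataFrame({
    "Date": ["2015-01-05", "2015-04-15", "2015-07-31", "2015-10-10"],
    "Sales": [5000.0, 6000.0, 5500.0, 7000.0],
})

features = add_date_features(raw)
features["SalesScaled"] = standardize(features["Sales"])
print(features)
```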
Building models with sklearn pipelines
A reasonable starting point is a random forest regressor wrapped in an sklearn pipeline. This makes modeling modular and more reproducible. Working with pipelines will also significantly reduce the workload when moving a setup into files for the next part of the project.
The next step is choosing loss functions and evaluation metrics. These indicate how well a model is performing, so the choice affects how the sales predictions are judged and tuned. Different metrics have different use cases. For this project we used RMSE (root mean square error), MAE (mean absolute error) and R-squared scores.
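To make the idea concrete, here is a small sketch of a scikit-learn `Pipeline` that chains a scaler and a `RandomForestRegressor`. The synthetic data, the step names and the hyperparameters are assumptions for illustration only and do not reflect the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the preprocessed sales features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Preprocessing and model live in one object, so fit/predict always apply
# the same steps in the same order.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

pipeline.fit(X[:160], y[:160])
predictions = pipeline.predict(X[160:])
print(predictions[:5])
```

Because the whole pipeline is a single estimator, it can be serialized, cross-validated or moved into a script as one unit, which is the reproducibility benefit mentioned above.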
* An RMSE of 0.53 suggests the data points are fairly concentrated around the line of best fit, so the model predicts the data with reasonable accuracy.
* An adjusted R-squared of 0.72, which is above the value we treated as acceptable (0.4), means the model explains roughly 72% of the variance in sales after accounting for the number of features.
Finally, we ran some post-prediction analysis and saved the serialized model under a file name containing the timestamp.
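For reference, the metrics mentioned above can be computed as in the sketch below. The numbers are a toy example, not the project's results, and the function names are illustrative.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def adjusted_r_squared(r2: float, n_samples: int, n_features: int) -> float:
    """R-squared penalized for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

r2 = r_squared(y_true, y_pred)
print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R2:", r2)
print("Adjusted R2:", adjusted_r_squared(r2, n_samples=len(y_true), n_features=1))
```

RMSE and MAE are expressed in the units of the target, so their values are only meaningful relative to the scale of the sales figures being predicted.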
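One way to save a model under a timestamped name is sketched below using `pickle` from the standard library. The file name format, the `models` directory and the placeholder object are assumptions for the example; a fitted scikit-learn pipeline could be passed in the same way.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def save_model(model, directory: Path) -> Path:
    """Serialize `model` to `<directory>/<timestamp>.pkl` and return the path."""
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    path = directory / f"{stamp}.pkl"
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_model(path: Path):
    """Load a model previously written by save_model."""
    with path.open("rb") as fh:
        return pickle.load(fh)


# Placeholder object standing in for a trained model.
model = {"kind": "placeholder", "n_estimators": 50}

saved_path = save_model(model, Path(tempfile.mkdtemp()) / "models")
restored = load_model(saved_path)
print(saved_path.name)
```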
Deep learning techniques can be used to predict various outcomes, including but not limited to future sales. In this project we built a Long Short-Term Memory (LSTM) model, which is a type of recurrent neural network (RNN). According to Google, an RNN is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes.
To build an LSTM regression model and predict the next sale, the following tasks are necessary.
1. Isolation into time series data and checking whether the data is stationary or not. Using the ltsm_helper script prepared earlier, the data was prepared for modeling and scaled. A line plot of sales against date indicated that the data is stationary.
2. Transformation of the time series data into supervised learning data by creating a new y (target) column using a sliding window over the time series.
3. Preparing the model and predicting one step ahead (the next sale).
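The sliding-window transformation in step 2 can be sketched as follows. The function name `make_supervised` is hypothetical, and the toy series and window of 3 are chosen only so the output is easy to read.

```python
import numpy as np


def make_supervised(series, window: int):
    """Turn a 1-D series into (X, y) pairs for one-step-ahead prediction.

    Row i of X holds series[i : i + window]; y[i] is the value right after it.
    """
    series = np.asarray(series, dtype=float)
    if window < 1 or window >= len(series):
        raise ValueError("window must be between 1 and len(series) - 1")
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y


# Toy series 0, 1, ..., 9 so each window and its target are easy to check.
sales = np.arange(10.0)
X, y = make_supervised(sales, window=3)

print(X[0], "->", y[0])    # first window and its target
print(X[-1], "->", y[-1])  # last window and its target
```

Each row of `X` is the recent history and the matching entry of `y` is the next value, which is what lets a regression model treat the forecasting task as ordinary supervised learning.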
The LSTM model was initialized with a window size of 45 and an 80/20 train/test split of the data. It was then trained using the above EPOCH size, and we created a method to reset all of the weights in case we want to re-train with different parameters. The history object stores the model loss over the epochs, which can be plotted to evaluate whether the training process needs adjusting.
This is the final step, where we predicted the next sale, as presented in the following plot.
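The data preparation for this step might look like the sketch below, which windows a series with a size of 45 and splits it 80/20 in time order. The synthetic sine series and the helper name are assumptions; the model definition itself is omitted, and only the final reshape reflects the three-dimensional (samples, timesteps, features) input that Keras LSTM layers expect.

```python
import numpy as np

WINDOW_SIZE = 45       # from the text
TRAIN_FRACTION = 0.8   # 80/20 split, from the text


def chronological_split(X: np.ndarray, y: np.ndarray, train_fraction: float):
    """Split without shuffling so the test set lies strictly after the training set."""
    cut = int(len(X) * train_fraction)
    return X[:cut], X[cut:], y[:cut], y[cut:]


# Synthetic series standing in for the scaled daily sales (an assumption).
series = np.sin(np.linspace(0.0, 20.0, 245))

# Each row holds 45 consecutive values; the target is the value that follows.
X = np.stack([series[i:i + WINDOW_SIZE] for i in range(len(series) - WINDOW_SIZE)])
y = series[WINDOW_SIZE:]

X_train, X_test, y_train, y_test = chronological_split(X, y, TRAIN_FRACTION)

# Add a trailing feature axis: (samples, timesteps, features).
X_train_lstm = X_train[..., np.newaxis]
X_test_lstm = X_test[..., np.newaxis]

print(X_train_lstm.shape, X_test_lstm.shape)
```

Splitting by position rather than at random avoids leaking future observations into the training set, which matters for any time series forecast.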
git clone https://github.com/Amanuel3065/pharmaceutical_sales_prediction.git
sudo python3 setup.py install