A model comparison using XGBoost, Random Forest and Prophet

A time series is a sequence of data points recorded at successive, equally spaced points in time, for example hourly, daily or yearly measurements. Examples include the temperature measured at the same time each day, or the daily sales or price of a given stock over a certain period.

Time series analysis is done to identify patterns in the data and to forecast future values. With the advance of machine learning models and deep learning techniques, there are now many ways to perform a time series analysis:

- Classical statistical methods: e.g., moving averages, ARIMA, SARIMA
- Machine learning models: e.g., XGBoost, Random forest, Prophet
- Deep learning: e.g., LSTM, RNN

I would like to elaborate on a few machine learning models commonly used for regression in time series analysis, namely XGBoost, Random forest and Prophet. E**x**treme **G**radient **B**oosting, or **XGB**oost for short, is a gradient boosting algorithm (Ref: https://xgboost.readthedocs.io/en/stable/tutorials/model.htm). Random forest is an ensemble algorithm that combines the output of multiple decision trees to reach a single result. Both XGBoost and Random forest are supervised learning models built from decision trees; they differ in the way the trees are built and combined. Random forest uses bagging: each tree is trained independently on a bootstrap sample of the data, and the predictions of all trees are averaged. XGBoost uses boosting: the trees are built one after another, and each new tree is fitted to correct the errors of the trees built so far.

Prophet from Facebook (Ref: https://facebook.github.io/prophet/docs/quick_start.html) is a time series regression algorithm in which the series is decomposed into trend, seasonality and error components. In addition, Prophet can take holidays into account if they are given as an extra input.

**𝑋𝑡=𝑇𝑡+𝑆𝑡+𝐻𝑡+𝜖𝑡**

where,

- 𝑇𝑡 : trend component
- 𝑆𝑡 : seasonal component (weekly, yearly)
- 𝐻𝑡 : deterministic irregular component (holidays)
- 𝜖𝑡 : noise
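To make the additive form concrete, here is a minimal sketch that builds a synthetic series from the four components. The shapes chosen (linear trend, weekly sine wave, a single holiday bump on day 10, Gaussian noise) are illustrative assumptions, not what Prophet fits internally.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(28)                                # four weeks of daily time steps

trend = 0.5 * t                                  # T_t: steady linear growth (assumed shape)
seasonal = 2.0 * np.sin(2 * np.pi * t / 7)       # S_t: weekly cycle (assumed shape)
holiday = np.where(t == 10, 5.0, 0.0)            # H_t: one-off bump on a made-up holiday
noise = rng.normal(scale=0.3, size=t.size)       # eps_t: random noise

x = trend + seasonal + holiday + noise           # X_t = T_t + S_t + H_t + eps_t
print(x[:7].round(2))
```

Fitting a model like Prophet is the reverse task: given only x and the dates, estimate the separate components.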

XGBoost and Random forest can be used for both univariate (one X feature to predict target Y) and multivariate (many X features to predict target Y) datasets, whereas Prophet is built around a single series: it takes a date column and one target column (extra regressors can be added, but that is outside the scope of this comparison). However, Prophet is a very simple and easy-to-use time series model that often gives good results with little tuning.

To predict the Y value of a time series with supervised learning models such as XGBoost and Random forest, the series has to be reshaped into a feature matrix in which each prediction takes a fixed number of previous values (lags) as input. For example, with 2 lags and a time series with y values 1, 2, 3, 4, 5, 6, 7, 8 …, the inputs 1, 2 are used to predict 3, the inputs 2, 3 are used to predict 4, the inputs 3, 4 are used to predict 5, and so on.

So, our feature input matrix looks like: lag_2, lag_1 -> y value

1, 2 -> 3

2, 3 -> 4

3, 4 -> 5 …
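The same lag matrix can be built with a few lines of plain Python. This is a minimal sketch; the helper name `make_lag_matrix` is made up for illustration and does not come from the notebook.

```python
def make_lag_matrix(series, n_lags):
    """Return (X, y): each row of X holds the n_lags values that precede the target in y."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(list(series[i - n_lags:i]))  # oldest value first: lag_2, lag_1
        y.append(series[i])
    return X, y


X, y = make_lag_matrix([1, 2, 3, 4, 5, 6, 7, 8], n_lags=2)
for features, target in zip(X, y):
    print(features, "->", target)
```

Note that the first `n_lags` points of the series cannot be used as targets, because they do not have enough history behind them.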

Here is an example I have done using the Online Retail public dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Online+Retail). The dataset contains sales records for many products from December 2010 to December 2011. I have read the dataset and kept only the InvoiceDate, UnitPrice and Quantity columns, as shown below.

After reading the raw dataset, I aggregated the total sales per day: the total price per product is the quantity multiplied by the unit price, and these totals are summed for each day. This daily sales figure is the y value to be predicted. The date and the total sales (TotalPrice in the table below) are my x and y values for the univariate time series analysis. Unit price vs the date looks as below, which shows a slight trend and seasonality.

As the TotalPrice values are large, I converted them to log scale and took the first difference of log(TotalPrice) as log_tp_diff, which is treated as the stationary series. I then added log_tp_lag_2 and log_tp_lag_1 to the dataframe to create the supervised data matrix. The prepared dataset looks as follows:
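A minimal pandas sketch of this aggregation step is shown here. The three rows are made-up sample values, and the exact calls in the notebook may differ; only the column names InvoiceDate, UnitPrice, Quantity and TotalPrice come from the article.

```python
import pandas as pd

# Tiny made-up sample with the three columns kept from the dataset.
raw = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-01 09:01", "2010-12-02 10:15"]
    ),
    "UnitPrice": [2.5, 4.0, 1.5],
    "Quantity": [6, 2, 10],
})

# Total per invoice line, then summed per calendar day.
raw["TotalPrice"] = raw["UnitPrice"] * raw["Quantity"]
daily = (
    raw.groupby(raw["InvoiceDate"].dt.normalize())["TotalPrice"]
    .sum()
    .reset_index()
)
print(daily)
```

`dt.normalize()` strips the time of day, so all invoices from the same date fall into one group.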
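The transformation can be sketched with pandas as follows. The input values are made up, and the lag columns here are assumed to be lags of the differenced series, named with the log_tp_diff_lag_* spelling used for the features later in the article; the notebook may name or build them differently.

```python
import numpy as np
import pandas as pd

# Made-up daily totals that grow by 10% per day.
df = pd.DataFrame({"TotalPrice": [100.0, 110.0, 121.0, 133.1, 146.41]})

df["log_tp"] = np.log(df["TotalPrice"])                   # compress the scale
df["log_tp_diff"] = df["log_tp"].diff()                   # first difference of the log
df["log_tp_diff_lag_1"] = df["log_tp_diff"].shift(1)      # yesterday's difference
df["log_tp_diff_lag_2"] = df["log_tp_diff"].shift(2)      # difference two days back

# Rows without a full set of lags contain NaN and are dropped.
prepared = df.dropna().reset_index(drop=True)
print(prepared)
```

With 2 lags on top of one differencing step, the first three rows are lost, so only two of the five sample rows remain.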

**Train/test dataset:** I have removed the December 2011 data and taken the full month of November 2011 as the test data.
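A date-based split of this kind could look like the sketch below. The frame is synthetic (one row per calendar day with placeholder values), and treating everything before November 2011 as training data is an assumption.

```python
import pandas as pd

# Synthetic daily frame spanning the end of October to early December 2011.
daily = pd.DataFrame(
    {"InvoiceDate": pd.date_range("2011-10-30", "2011-12-02", freq="D")}
)
daily["TotalPrice"] = range(len(daily))  # placeholder values

test_start = pd.Timestamp("2011-11-01")
test_end = pd.Timestamp("2011-12-01")    # exclusive bound, so December is dropped

train = daily[daily["InvoiceDate"] < test_start]
test = daily[(daily["InvoiceDate"] >= test_start) & (daily["InvoiceDate"] < test_end)]
print(len(train), len(test))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which matters for time series.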

I use ‘log_tp_diff_lag_2’ and ‘log_tp_diff_lag_1’ as the X features and ‘log_tp_diff’ as the Y value for training XGBoost and Random Forest as supervised models. Because the prediction is the difference from the previous day's log sales (log_tp_diff), once the prediction is done I have to undo the differencing and then invert the log (take the exponential) to get the actual total sales value (TotalPrice).

For Prophet, I pass ‘InvoiceDate’ as ds and ‘TotalPrice’ as y; the model decomposes the series into trend and seasonality itself, without needing any of the conversions used in the previous case.
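The back-transformation can be sketched as below. It assumes one-step-ahead prediction, where the actual value of the previous day is known for every predicted day; the helper name `invert_log_diff` is hypothetical, and the notebook may instead accumulate the predicted differences.

```python
import numpy as np


def invert_log_diff(prev_total, predicted_log_diff):
    """Map predicted log-differences back to the original scale.

    prev_total[i] is the actual total of the day before prediction i.
    """
    prev_total = np.asarray(prev_total, dtype=float)
    predicted_log_diff = np.asarray(predicted_log_diff, dtype=float)
    # Undo the differencing by adding the previous log value, then undo the log.
    return np.exp(np.log(prev_total) + predicted_log_diff)


prev_day = [100.0, 110.0]                  # made-up previous-day totals
pred_diff = [np.log(1.1), np.log(0.5)]     # made-up model outputs
restored = invert_log_diff(prev_day, pred_diff)
print(restored)
```

A predicted difference of log(1.1) corresponds to a 10% increase over the previous day, and log(0.5) to a halving.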
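A minimal sketch of the Prophet side is given here, assuming the `prophet` package is installed. The helper names `to_prophet_frame` and `fit_and_forecast` are hypothetical, the sample values are made up, and the model is left at its default settings, which may not match the notebook.

```python
import pandas as pd


def to_prophet_frame(daily):
    """Rename the date and target columns to the ds / y names Prophet expects."""
    return daily.rename(columns={"InvoiceDate": "ds", "TotalPrice": "y"})[["ds", "y"]]


def fit_and_forecast(train, horizon_days):
    """Fit a default Prophet model and forecast horizon_days beyond the training data."""
    # Imported inside the function so the helper above works without Prophet installed.
    from prophet import Prophet

    model = Prophet()
    model.fit(to_prophet_frame(train))
    future = model.make_future_dataframe(periods=horizon_days)
    return model.predict(future)[["ds", "yhat"]]


sample = pd.DataFrame({
    "InvoiceDate": pd.date_range("2011-10-01", periods=3, freq="D"),
    "TotalPrice": [10.0, 12.0, 11.0],
})
prophet_input = to_prophet_frame(sample)
print(prophet_input)
```

Calling `fit_and_forecast(train, 30)` on a real training frame would return the history plus 30 future calendar days, with the point forecast in the `yhat` column.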

**Model comparison using error values:** The error values on the test data are a good basis for comparing model performance; however, further tuning could improve each model's performance.
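The article does not name the error metrics used, so as an assumption this sketch computes two common ones, mean absolute error (MAE) and root mean squared error (RMSE), on made-up values.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in the units of the data."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalises large misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [100.0, 200.0, 300.0]       # made-up test values
predicted = [110.0, 190.0, 330.0]    # made-up predictions
mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
print(round(mae_value, 2), round(rmse_value, 2))
```

For a fair comparison, all three models should be scored on the same scale, that is on TotalPrice after the back-transformation, not on the log-differences.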

Error comparison is as follows:

1. XGBoost

2. Random forest

3. Prophet

The model predictions on the test dataset compare as shown below:

**Conclusion:** Each model makes its predictions in a different way; however, the errors are in the same range in the above case. One can tune the models and find the one that best fits the dataset by comparing the errors on the test data.

The code used in this analysis can be found here -> https://github.com/tde2020/MachineLearning/blob/main/TimeSeriesForecasting/RetailData-ModelComparison.ipynb

I welcome your comments or any questions related to time series analysis.

Thank you for reading this long… 🙂