
## Improving ThymeBoost’s Efficiency

The ThymeBoost framework is, at its core, simply a gradient boosting algorithm wrapped around standard time series methods. This means the framework relies *heavily* on the underlying method's efficiency and speed. As we will see in this article, the boosting and additional logic on top add accuracy, but computation as well. Most of this heavy lifting was previously done through StatsModels for methods such as ETS and ARIMA, but utilizing Nixtla's statistical forecasting package, StatsForecast, can increase **both** speed and accuracy, making ThymeBoost and StatsForecast a perfect marriage for time series forecasting.

A good **TLDR** for this article is:

StatsForecast is **faster** than StatsModels; ThymeBoost brings **accuracy** gains.

## Introduction

First things first: if you have not heard of ThymeBoost, I encourage you to check out my previous article, which gives a decent overview. With the newest release I have added StatsForecast as an optional dependency. In order to run these examples you will need to install it:

`pip install statsforecast`

And go ahead and update ThymeBoost just to be safe:

`pip install ThymeBoost --upgrade`

Now that we have that out of the way, the main 'meat and potatoes' of this article is going to be benchmarking on the weekly M4 dataset to see how all of these models perform in both accuracy and speed. The datasets are all open source and live on the M-competitions GitHub. Each dataset is split into the standard train and test splits, so we will use the train CSV for fitting and the test CSV only for evaluation using SMAPE.

Feel free to test this out with other datasets and let me know how they perform!

The main goal here is to review how the new methods stack up in the boosting framework and, ultimately, to see whether adding them to ThymeBoost provides accuracy gains.

## Benchmarking The Methods

To start off, we will try out the most computationally heavy method in ThymeBoost: auto-ARIMA. This was previously done with PmdArima; now we can test with StatsForecast by simply passing `trend_estimator='fast_arima'` when fitting with ThymeBoost. Let's take a look at some code where we first build our dataset, then run ThymeBoost:

```python
import numpy as np
import pandas as pd
from tqdm import tqdm
from statsforecast.models import ETS, AutoARIMA
from ThymeBoost import ThymeBoost as tb

tqdm.pandas()

train_df = pd.read_csv(r'm4-weekly-train.csv')
test_df = pd.read_csv(r'm4-weekly-test.csv')
forecast_horizon = len(test_df.columns) - 1

train_df = train_df.rename({'V1': 'ID'}, axis=1)
train_long = pd.wide_to_long(train_df, ['V'], 'ID', 'Date')
test_df = test_df.rename({'V1': 'ID'}, axis=1)
test_df = pd.wide_to_long(test_df, ['V'], 'ID', 'Date')

train_long = train_long.dropna()
train_df = train_long.reset_index()
train_df.index = train_df['ID']
train_df = train_df.drop('ID', axis=1)

X = train_long
X = X.reset_index()
```

*Note: this code is probably very inefficient at data manipulation and I am sure there are better ways to do it; this was just something I threw together that works for the benchmark. The timing does not include the time it takes to run this code.*
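To see what the wide-to-long reshape is actually doing, here is a tiny self-contained example that mimics the M4 CSV layout (the `W1`/`W2` IDs and the values are made up for illustration):

```python
import pandas as pd

# Toy wide-format frame mimicking the M4 csv layout: one row per series,
# V1 holds the series ID and V2..Vn hold the observations.
wide = pd.DataFrame({'V1': ['W1', 'W2'],
                     'V2': [10.0, 5.0],
                     'V3': [11.0, 6.0]})
wide = wide.rename({'V1': 'ID'}, axis=1)

# wide_to_long melts the V2..Vn columns into one 'V' column,
# indexed by (ID, Date) where Date is the numeric column suffix.
long = (pd.wide_to_long(wide, ['V'], 'ID', 'Date')
          .dropna()
          .reset_index())
```

Each series ends up as a block of rows keyed by `ID`, which is exactly the shape the groupby-and-apply forecasting loop below expects.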

Either way, now that we have our training data to fit on, let's take a look at the fit function:

```python
def grouped_forecast(df):
    y = df['V'].values
    boosted_model = tb.ThymeBoost(verbose=0)
    output = boosted_model.fit(y,
                               seasonal_period=None,
                               trend_estimator=['fast_arima'])
    predicted_output = boosted_model.predict(output,
                                             forecast_horizon,
                                             trend_penalty=True)
    predictions = predicted_output['predictions']
    return predictions
```

Here we are just creating a function that will be passed to a groupby-and-apply:

```python
def counter(df):
    df['counter'] = np.arange(2, len(df) + 2)
    return df

predictions = X.groupby('ID').progress_apply(grouped_forecast)
predictions = predictions.reset_index()
predictions = predictions.groupby('ID').apply(counter)

test_df = test_df.reset_index()
benchmark_df = predictions.merge(test_df,
                                 left_on=['ID', 'counter'],
                                 right_on=['ID', 'Date'])

def smape(A, F):
    return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))

tqdm.pandas()

def grouped_smape(df):
    return smape(df['V'], df['predictions'])

test = benchmark_df.groupby('ID').progress_apply(grouped_smape)
print(np.mean(test))
```

Then we just take the average SMAPE over all series. Everything here should be good, but let me know if there are any errors which would muddy the benchmark.
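As a quick sanity check on the metric, the `smape` function above behaves as expected on a toy example: identical forecasts score zero, and a forecast that overshoots every point by 50% scores 40 regardless of scale.

```python
import numpy as np

def smape(A, F):
    # Symmetric mean absolute percentage error, in percent.
    return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))

actual = np.array([100.0, 200.0, 300.0])
perfect = smape(actual, actual)            # identical forecasts -> 0
off_by_half = smape(actual, actual * 1.5)  # each term: 2*0.5/2.5 = 0.4 -> 40
```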

Running this will give you an average SMAPE value of **8.61** and it should take roughly 10 minutes.

Next, let's run Nixtla's auto-ARIMA by itself and see how it performs. We will just change the grouped forecast function to:

```python
def grouped_forecast(df):
    y = df['V'].values
    ar_model = AutoARIMA().fit(y)
    predictions = pd.DataFrame(ar_model.predict(forecast_horizon)['mean'],
                               columns=['predictions'])
    return predictions
```

Re-running the SMAPE calculation chunk above will give you a SMAPE of **8.93** and a runtime of roughly 4 minutes.

Alright, great, so we have shown some accuracy gains just by boosting the auto-ARIMA procedure. This should come as no shock, since I showed very similar results in an article deep-diving gradient boosted ARIMA. But I do want to caveat that boosting is not a panacea and does not always improve upon ARIMA; it is still an interesting observation.
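For intuition on what "boosting" a trend method means in this context, the loop below is a toy sketch of the general idea (fit a weak learner to the current residuals, add a damped version of its fit, repeat). It uses a simple linear learner and is not ThymeBoost's actual internals:

```python
import numpy as np

def boosted_fit(y, n_rounds=3, learning_rate=0.5):
    """Toy gradient-boosting loop: each round fits a weak linear
    trend to the current residuals and adds a damped version of it."""
    t = np.arange(len(y))
    fitted = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        residuals = y - fitted
        coeffs = np.polyfit(t, residuals, deg=1)   # weak learner
        fitted += learning_rate * np.polyval(coeffs, t)
    return fitted

y = np.array([1.0, 2.1, 2.9, 4.2, 5.0])
approx = boosted_fit(y)
```

The learning rate damps each round's contribution, which is part of why a boosted base learner can behave differently from the same learner fit once at full strength.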

The next step should be obvious. We have taken a look at the 'fast' auto-ARIMA in ThymeBoost as well as StatsForecast's auto-ARIMA without boosting. Next we should see how these stack up against using PmdArima's auto-ARIMA in ThymeBoost.

If you have been running this code up until now, buckle up.

This next bit will take some time…

```python
def grouped_forecast(df):
    y = df['V'].values
    boosted_model = tb.ThymeBoost(verbose=0, n_rounds=None)
    output = boosted_model.fit(y,
                               seasonal_period=None,
                               trend_estimator=['arima'],
                               arima_order='auto')
    predicted_output = boosted_model.predict(output,
                                             forecast_horizon,
                                             trend_penalty=True)
    predictions = predicted_output['predictions']
    return predictions
```

And the results?

A SMAPE of **8.78**, but it took 90 minutes. It looks like boosting PmdArima outperforms Nixtla's StatsForecast out of the box, but it takes quite a while.

ARIMA is not the only offering in StatsForecast; another implementation is an ETS method. With these new methods we can actually utilize the faster implementations in ThymeBoost's `autofit` method. To do this we just need to pass `fast=True` when calling `autofit`. A new forecast function would then look like this:

```python
def grouped_forecast(df):
    y = df['V'].values
    boosted_model = tb.ThymeBoost(verbose=0, n_rounds=None)
    output = boosted_model.autofit(y,
                                   seasonal_period=[52],
                                   optimization_type='grid_search',
                                   optimization_strategy='holdout',
                                   lag=26,
                                   optimization_metric='smape',
                                   verbose=False,
                                   fast=True)
    predicted_output = boosted_model.predict(output,
                                             forecast_horizon,
                                             trend_penalty=True)
    predictions = predicted_output['predictions']
    return predictions
```

This results in a SMAPE of **7.88** and takes about 80 minutes. Definitely the best plug-and-play accuracy out of everything tested, but we are kind of cheating by doing model selection.

One thing to note: passing a seasonal length of 52 directly to StatsForecast's methods is not a great idea. For ETS it errors, and for auto-ARIMA it takes far too long. This is one area where taking advantage of how ThymeBoost works actually *increases* speed, since long seasonal periods take significantly longer in an ARIMA setup.
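To illustrate why handling seasonality outside the ARIMA step can be cheap, here is a toy sketch (not ThymeBoost's actual seasonal estimator) of a simple seasonal-average decomposition, which costs next to nothing even for long periods like 52:

```python
import numpy as np

def seasonal_average(y, seasonal_period):
    """Toy seasonal estimator: the mean of each position in the cycle,
    tiled back out over the series. Cheap even for long seasonal periods."""
    t = np.arange(len(y)) % seasonal_period
    means = np.array([y[t == s].mean() for s in range(seasonal_period)])
    return means[t]

y = np.tile([10.0, 20.0, 30.0, 40.0], 3)   # purely seasonal, period-4 signal
seasonal = seasonal_average(y, 4)
deseasonalized = y - seasonal               # what the trend learner would see
```

Once the seasonal component is stripped out like this, the (expensive) trend estimator only has to model a non-seasonal series, which is the speed advantage alluded to above.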

Several other methods were tested and you can view the benchmark results below:

In terms of the acronyms:

- TB: ThymeBoost
- SF: StatsForecast
- NS: Non-Seasonal
- Mult: Multiplicative Seasonality
- Fast: ThymeBoost utilizing StatsForecast under the hood

At a high level, the best performer is the fast autofit method from ThymeBoost. For some odd reason, fitting ThymeBoost with seasonality and fast ARIMA does not perform too well; in fact, it is significantly worse than using PmdArima's auto-ARIMA. Another observation is that boosting plain ETS methods from StatsForecast may hurt accuracy compared to normal, non-boosted fitting. This may change if we adjust the `global_cost` parameter in the fit function, as the default may not be optimal all of the time.

## Conclusion

The newest version of ThymeBoost adds the capability to bring in StatsForecast's methods. With this we can see increased speed, and potentially accuracy, over the previous implementation.

Like a good cake, ThymeBoost needs to have a good batter as a base. StatsForecast may be that superior batter over StatsModels. The gradient boosting is just the sprinkles on top.

If you enjoyed this article, you can check out some other time-series related posts I have written:
