Long-Term Forecasting using Transformers may not be the way to go
In recent years, Transformer-based solutions have gained incredible popularity. With the success of BERT, GPT, and other language transformers, researchers started applying this architecture to other sequence-modeling problems, particularly long-term time series forecasting (LTSF). The attention mechanism seemed like a perfect way to extract long-term correlations present in long sequences.
However, researchers from the Chinese University of Hong Kong and the International Digital Economy Academy recently decided to ask: Are Transformers Effective for Time Series Forecasting? They show that self-attention mechanisms (even with positional encoding) can result in temporal information loss. They then validate this claim with a set of one-layer linear models that outperform the transformer benchmarks in almost every experiment.
In simpler terms, Transformers may not be the ideal architecture for forecasting problems.
In this post, I aim to summarize the findings and experiments of Zeng et al. that led to this conclusion and discuss some potential implications of the work. All the experiments and models developed by the authors can be found in their GitHub repository. Additionally, I highly encourage everyone to read the original paper.
The Models and Data
In their work, the authors evaluated 5 different SOTA Transformer models on several widely used datasets, including the Electricity Transformer Dataset (ETDataset). These models and some of their main features are as follows:
- LogTrans: Proposes convolutional self-attention so that local context can be better incorporated into the attention mechanism. The model also encodes a sparsity bias into the attention scheme, which helps reduce memory complexity.
- Informer: Addresses the memory/time complexity and error-accumulation issues caused by an auto-regressive decoder by proposing a new architecture and a direct multi-step (DMS) forecasting strategy.
- Autoformer: Applies a seasonal-trend decomposition within each neural block to extract the trend-cyclical components. Additionally, Autoformer designs a series-wise auto-correlation mechanism to replace vanilla self-attention.
- Pyraformer: Implements a novel pyramidal attention mechanism that captures hierarchical multi-scale temporal dependencies. Like LogTrans, this model also explicitly encodes a sparsity bias into the attention scheme.
- FEDformer: Enhances the traditional transformer architecture by incorporating seasonal-trend decomposition methods, effectively developing a Frequency Enhanced Decomposed Transformer.
Each of these models modifies different pieces of the transformer architecture to address different shortcomings of the traditional transformer (a full summary can be found in figure 1).
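Since the direct multi-step (DMS) strategy comes up repeatedly, it may help to contrast it with the classic iterated, auto-regressive approach. Below is a minimal sketch with hypothetical one-step and multi-step model callables of my own, not code from the paper:

```python
import numpy as np

def ims_forecast(step_model, window, horizon):
    # Iterated multi-step (IMS): predict one step at a time and feed each
    # prediction back into the window, so errors can accumulate.
    window = list(window)
    preds = []
    for _ in range(horizon):
        y = step_model(np.asarray(window))
        preds.append(y)
        window = window[1:] + [y]
    return np.asarray(preds)

def dms_forecast(multi_model, window, horizon):
    # Direct multi-step (DMS): a single call emits the whole horizon at once.
    return multi_model(np.asarray(window), horizon)

# Toy stand-in models: a naive "repeat the last value" forecaster in both styles.
naive_step = lambda w: w[-1]
naive_multi = lambda w, h: np.full(h, w[-1])
```

With a real learned model, the IMS loop compounds each step's error into the next prediction, which is exactly the issue Informer's DMS strategy sidesteps.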
To compete against these transformer models, the authors proposed some “embarrassingly simple” models that perform DMS predictions.
These models and their properties are:
- Decomposed Linear (D-Linear): D-Linear uses a decomposition scheme to split raw data into a trend and seasonal component. Two single-layer linear networks are then applied to each component and the outputs are summed to get the final prediction.
- Normalized Linear (N-Linear): N-Linear first subtracts the input by the last value of the sequence. The input is then passed into a single linear layer and the subtracted part is added in before making a final prediction. This helps address distribution shifts in the data.
- Repeat: Just repeat the last value in the look-back window.
These are some very simple baselines. The Linear models both involve a small amount of data preprocessing and a single-layer network; Repeat is a trivial baseline.
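As a rough illustration, the three baselines can be sketched in a few lines of NumPy. The weight matrices here are placeholders for the learned single-layer networks, and the moving-average kernel size is my assumption, not the paper's setting:

```python
import numpy as np

def moving_average(x, kernel=25):
    # Trend component via a moving average; edges are padded by repeating
    # the boundary values so the output keeps the input's length.
    left = np.repeat(x[:1], (kernel - 1) // 2)
    right = np.repeat(x[-1:], kernel // 2)
    return np.convolve(np.concatenate([left, x, right]),
                       np.ones(kernel) / kernel, mode="valid")

def dlinear_forecast(x, w_trend, w_seasonal, kernel=25):
    # D-Linear: decompose into trend + seasonal, apply one linear map
    # (horizon x lookback) to each component, and sum the outputs.
    trend = moving_average(x, kernel)
    return w_trend @ trend + w_seasonal @ (x - trend)

def nlinear_forecast(x, w):
    # N-Linear: subtract the last value, apply a linear map, add it back.
    return w @ (x - x[-1]) + x[-1]

def repeat_forecast(x, horizon):
    # Repeat: the forecast is just the last observed value.
    return np.full(horizon, x[-1])
```

In practice the weight matrices would be trained with gradient descent on MSE, but the entire model really is this small.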
The experiments were performed with various widely-used datasets like the Electricity Transformer (ETDataset), Traffic, Electricity, Weather, ILI, and Exchange Rate datasets.
Using the 8 models above (5 Transformers and 3 simple baselines), the authors performed a series of experiments to evaluate the models’ performances and determine the impact of each model’s components on the final predictions.
The first experiment was straightforward: each model was trained and used to forecast the data, with look-back periods of varying lengths. The full testing results can be found in table 1, but in summary, FEDformer was the best-performing transformer in most cases yet was never the overall best performer.
This embarrassing performance of transformers can be seen in the predictions for the Electricity, Exchange-Rate, and ETDataset in figure 3.
Quoting the authors:
Transformers [28, 30, 31] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformer-based solutions for the LTSF task.
Many would argue, however, that this is unfair to transformers: attention mechanisms are usually good at preserving long-range information, so Transformers should perform better with longer input sequences. The authors test this hypothesis in their next experiment, varying the look-back period between 24 and 720 time steps and evaluating the MSE. They found that in many cases the performance of the transformers did not improve, and the error actually increased for a few models (see figure 4 for full results). In comparison, the performance of the Linear models improved significantly as more time steps were included.
There are still other factors to consider, however. Due to their complexity, transformers often require larger training datasets than other models to perform well, so the authors decided to test whether training data size is a limiting factor for these transformer architectures. They leveraged the Traffic data and trained Autoformer and FEDformer on the original set as well as a truncated set, with the expectation that errors would be higher on the smaller training set. Surprisingly, the models trained on the smaller training set performed marginally better. While this doesn’t mean one should use a smaller training set, it does mean that dataset size is not a limiting factor for LTSF Transformers.
Along with varying the training data size and look-back period size, the authors also experimented with where the look-back window starts. For example, when making a prediction for the period after t=196, instead of using t = 100, 101, …, 196 (the adjacent or “close” window), the authors tried using t = 4, 5, …, 100 (the “far” window). The idea is that forecasting should depend on whether the model can capture trend and periodicity well, and the farther the window is from the forecast horizon, the worse the prediction should be. The authors discovered that the performance of the transformers drops only slightly between the “close” and “far” windows. This implies that the transformers may be overfitting to the provided data, which could explain why the Linear models perform better.
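A sketch of how the two windows might be sliced out of a series follows; the half-open indexing and 96-step look-back are my illustration, not the paper's code:

```python
import numpy as np

def close_and_far_windows(series, t_pred=196, lookback=96):
    # "Close" window: the lookback steps immediately before the forecast start.
    close = series[t_pred - lookback : t_pred]
    # "Far" window: the same-length window shifted one full lookback earlier.
    far = series[t_pred - 2 * lookback : t_pred - lookback]
    return close, far
```

If a model has truly learned the series' trend and periodicity, forecasts from the far window should be noticeably worse; comparable errors on both windows suggest the model is not exploiting that temporal structure.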
After evaluating the transformer models as a whole, the authors also dived specifically into the effectiveness of the self-attention and embedding strategies used by these models. Their first experiment involved disassembling the existing transformers to analyze whether their complex designs were necessary. They replaced the attention layers with simple linear layers, then removed auxiliary pieces apart from the embedding mechanisms, and finally reduced the transformers to only linear layers. At each step, they recorded the MSE across various look-back period sizes and found that the transformers’ performance improved with each gradual simplification.
The authors also wanted to examine the ability of the transformers to preserve temporal order. They hypothesized that since self-attention is permutation-invariant (it ignores order) while time series are permutation-sensitive, positional encoding and self-attention would not be enough to capture temporal information. To test this, the authors modified the input sequences by randomly shuffling the data and by exchanging the first half of the input sequence with the second half. The more temporal information a model captures, the more its performance should drop on the modified sets. The authors observed that the linear models had a larger performance drop than any of the transformer models, suggesting that the transformers capture less temporal information than the linear models. The full results can be found in the table below.
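The two perturbations are easy to reproduce. Here is a minimal sketch of my own of a random shuffle and a half-exchange applied to an input window:

```python
import numpy as np

def shuffle_window(x, seed=0):
    # Randomly permute the entire input sequence, destroying all order.
    return np.random.default_rng(seed).permutation(x)

def exchange_halves(x):
    # Swap the first and second halves: global order is broken, but
    # local order within each half is preserved.
    mid = len(x) // 2
    return np.concatenate([x[mid:], x[:mid]])
```

Both perturbations keep the set of values identical and change only their order, so any drop in accuracy is attributable purely to lost temporal structure.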
To dive further into the information-capturing capabilities of transformers, the authors examined the effectiveness of different encoding strategies by removing positional and temporal encodings from the transformers. The results were mixed depending on the model. For FEDformer and Autoformer, removing positional encoding improved performance on the Traffic dataset for most look-back window sizes. However, Informer performed best when it had all of its positional encodings.
Discussion and Conclusion
There are a few points to be careful of when interpreting these results. Transformers are very sensitive to hyperparameters and often require a lot of tuning to model a problem effectively, yet the authors do not perform any kind of hyperparameter search, instead opting for the default parameters used by each model’s implementation. There is an argument that the comparison is still fair, since the linear models were not tuned either; additionally, tuning the linear models would take significantly less time than tuning the transformers due to the linear models’ simplicity. Despite this, there could be problems where transformers work incredibly well with the right hyperparameters, and where cost and time can be traded for accuracy.
Despite these critiques, the experiments done by the authors detail a clear breakdown of the flaws of transformers. These are large, very complex models that overfit easily on time series data. While they work well for language processing and other tasks, the permutation-invariant nature of self-attention does cause significant temporal information loss. Additionally, a linear model is incredibly interpretable and explainable compared to the complicated architecture of a Transformer. If modifications are made to these components of LTSF Transformers, we may eventually see them beat simple linear models or tackle problems that linear models are bad at modeling (for example, change-point identification). In the meantime, however, data scientists and decision-makers should not blindly throw Transformers at a time series forecasting problem without having very good reasons for leveraging this architecture.
Resources and References
 A. Zeng, M. Chen, L. Zhang, Q. Xu. Are Transformers Effective for Time Series Forecasting? (2022). Thirty-Seventh AAAI Conference on Artificial Intelligence.
 S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. Wang, X. Yan. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting (2019). Advances in Neural Information Processing Systems 32.
 H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (2021). The Thirty-Fifth AAAI Conference on Artificial Intelligence, Virtual Conference.
 H. Wu, J. Xu, J. Wang, M. Long. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting (2021). Advances in Neural Information Processing Systems 34.
 S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A.X. Liu, S. Dustdar. Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting (2022). International Conference on Learning Representations 2022.
 T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting (2022). 39th International Conference on Machine Learning.
 G. Lai, W-C. Chang, Y. Yang, and H. Liu. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks (2018). 41st International ACM SIGIR Conference on Research and Development in Information Retrieval.