Artificial intelligence is undoubtedly a hot topic. The success of tools such as ChatGPT, Midjourney, Stable Diffusion, and many others has left many people interested in studying and understanding how A.I. works. As a result, many beginners have started their own machine learning journey.
With its impressive performance in text, audio, image, and video tasks, deep learning and neural networks have captured the attention of enthusiasts and beginners alike. Many jump straight onto the deep learning bandwagon as soon as they start studying machine learning, trying to apply neural networks to even the simplest regression and classification tasks.
But is that really necessary?
The following papers, Tabular Data: Deep Learning is Not All You Need (2021) and Why do tree-based models still outperform deep learning on tabular data? (2022), tested a few deep learning models on tabular data to compare their performance against tree-ensemble models, like XGBoost and RandomForest.
Let’s see what they found out!
Before delving into the results achieved by the papers, we first need to understand what exactly tabular data is.
We may refer to any data structured in a table of rows and columns as tabular data. For instance, in a house price prediction dataset, each house is represented by a row – a sample – and its attributes are organized into columns, each containing information about that specific house.
A lot of data in finance, healthcare, housing, and other domains is organized in tables of rows and columns; hence, this is what we call tabular data.
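As an illustration (not taken from the papers), a tiny house-price table like the one described above could be sketched with pandas — the column names and values here are invented for the example:

```python
import pandas as pd

# A toy house-price table: each row is one house (a sample),
# each column an attribute of that house.
houses = pd.DataFrame({
    "area_m2": [120, 85, 200],
    "bedrooms": [3, 2, 4],
    "has_garage": [True, False, True],
    "price": [250_000, 180_000, 410_000],
})

print(houses.shape)  # 3 rows (samples), 4 columns (attributes)
```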
To deal with this kind of data, we have many algorithms and methods, such as decision trees, ensemble learners, logistic regression, linear regression, and support vector machines. Deep learning, on the other hand, is typically used for non-tabular data, such as images and audio.
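As a rough sketch of how one of these methods is applied to a tabular task — using scikit-learn with synthetic data, none of which comes from the papers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data: 1,000 samples (rows), 10 numerical features (columns).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A tree ensemble works on raw tabular features, with no feature
# scaling or network-architecture design required.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```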
Both academic papers mentioned above tested tree-ensemble models against deep learning models on different datasets. Tabular Data: Deep Learning is Not All You Need (2021) covered both classification and regression tasks, while Why do tree-based models still outperform deep learning on tabular data? (2022) covered classification tasks only.
In the Why do tree-based models still outperform deep learning on tabular data? (2022) paper, XGBoost, RandomForest, and GradientBoostingTrees were compared against the MLP, ResNet, FT-Transformer, and SAINT deep learning models.
The images above show the accuracy of the models on the validation set of different datasets across different random iterations. It’s possible to see that, for datasets with only numerical features, as well as those with both numerical and categorical features, the tree-ensemble models outperform the deep learning models.
The study also highlights that each random iteration is generally slower for the neural networks than for the tree-ensemble models, which is another disadvantage of this type of approach when dealing with tabular data.
In the Tabular Data: Deep Learning is Not All You Need (2021) paper, the authors pitted XGBoost against the TabNet, NODE, DNF-Net, and 1D-CNN deep learning models, highlighting 1D-CNN as the model that had achieved the best single-model performance in a Kaggle competition with tabular data.
The study also compares the performance of a simple ensemble model (SVM and CatBoost), an ensemble of the deep learning models alone, and an ensemble of the deep learning models with the XGBoost.
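A simple way to ensemble heterogeneous models, as compared in the paper, is to average their predicted class probabilities. The sketch below illustrates the idea with scikit-learn stand-ins on synthetic data — the paper's actual ensembles used XGBoost and deep networks, which are swapped out here to keep the example self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two different learners standing in for the paper's model mix.
models = [
    GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr),
    LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
]

# Ensemble by averaging each model's predicted probabilities,
# then taking the most probable class.
avg_proba = np.mean([m.predict_proba(X_te) for m in models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print(f"Ensemble accuracy: {accuracy_score(y_te, ensemble_pred):.2f}")
```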
The experiment was conducted on 11 tabular datasets covering both classification and regression tasks. These are the results.
The models were evaluated using the cross-entropy loss for binary classification tasks and the root-mean-square error (RMSE) for regression tasks.
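Both metrics are available in scikit-learn. The toy values below are made up for illustration and are not results from the paper:

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# Binary classification: cross-entropy (log loss) over predicted
# probabilities of the positive class.
y_true_cls = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]
ce = log_loss(y_true_cls, y_prob)

# Regression: root-mean-square error over predicted values.
y_true_reg = [2.0, 3.5, 5.0]
y_pred_reg = [2.2, 3.0, 5.1]
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

print(f"cross-entropy: {ce:.3f}, RMSE: {rmse:.3f}")
```

Lower is better for both: cross-entropy penalizes confident wrong probabilities heavily, while RMSE reports the typical prediction error in the target's own units.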
Overall, it was concluded that XGBoost outperforms the deep learning models on most datasets. Beyond that, no single deep learning model consistently outperformed the others, and each deep learning model was better only on the datasets tested in its own paper.
Even though the ensemble of deep learning models and XGBoost consistently outperformed the other models, including XGBoost alone, it was concluded that XGBoost alone would be the easiest to optimize and the fastest to converge, which is a relevant advantage under tight time constraints.
Without a doubt, neural networks are exciting, and deep learning has been allowing us to perform tasks that were unimaginable a few years ago on non-tabular data. However, when dealing with tabular data, it turns out that the more “traditional” machine learning models may be faster and achieve better results than deep learning.
It’s also important to mention that there is no such thing as a “holy grail” in this industry: no single model beats every other model on every task. Testing, fine-tuning, validating, and making changes to see what works best for the problem at hand, then repeating this process over and over again, is part of being a data scientist.
Even though the papers above suggest that, for now, there is no particular reason to jump straight into deep learning for tabular-data tasks, it’s essential to try different methodologies in your work to see which provides the best result.
It’s also relevant to note that studies on neural networks are advancing rapidly, and it may be a matter of time until we have a deep learning model that can consistently beat XGBoost on tabular data. The game is still on!
Thank you for reading,
Luís Fernando Torres