Testing a generalist model on a ridiculously hard problem
In a previous article, I tried to debunk the widespread idea that a bunch of models (each specialized on a subset of a dataset) should perform better than a single model.
To do that, I took one group of a dataset (e.g. only the American customers) and trained a model on that group alone, a.k.a. a specialized model. Then, I trained a second model on the whole dataset (i.e. on all the customers, regardless of their nationality), a.k.a. a general model. Finally, I compared the performance of the two models on a holdout set made up only of observations from that group.
I repeated this procedure on multiple datasets and on multiple groups of the same dataset, for a total of 600 comparisons. There was a clear winner: the general model.
However, my experiment did not convince some people, who argued that my approach was oversimplified. For instance, this was one of the most liked comments under my LinkedIn post about the article:
This comment intrigued me, so I decided to follow the suggestion. If you are curious to see how it turned out, bear with me.
In my previous article, I demonstrated that there is a clear benefit in using a general model over specialized models when there is some similarity across the groups composing the dataset.
However, as the groups become more and more different from each other, it is reasonable to expect that the benefit of using a general model gets smaller and smaller. In the most extreme case, i.e. when the groups are completely different from each other, that benefit should shrink to zero.
If my intuition is correct, we could sketch this relationship as follows:
But this is just my hypothesis. So let’s try to put it to the test.
Our goal is to answer this question:
What happens if the groups composing a dataset are completely different from each other, and we still use a general model?
So, the point becomes how to simulate such a scenario.
The most extreme idea is to “glue” different datasets together. And when I say “different”, I mean datasets that have not only different columns, but even different tasks, i.e. they aim to predict different things.
Let’s take three datasets for instance:
- “bank”: each row is a bank customer, and the task is to predict whether he/she will subscribe to a term deposit;
- “employee”: each row is an employee, and the task is to predict whether he/she will leave the company;
- “income”: each row is a person, and the task is to predict whether his/her income is above $50k.
Gluing together the target variables of these datasets is straightforward: they are all binary variables made of 0s and 1s. The situation becomes more complicated when we try to concatenate the features. Let me explain why.
Here is a sample (both rows and columns) of the three datasets.
As you can see, the datasets have different columns. So, how can we merge them together? The first, most naive idea is to use pd.concat:
pd.concat([X_bank, X_employee, X_income]).reset_index(drop=True)
But, if we did that, we would obtain a dataframe of the following form:
By default, Pandas aligns columns by name and fills the missing ones with null values. Since each dataset has different column names, the result has a block-diagonal structure. This is not satisfactory, because it would allow the model to cut corners: the model could implicitly tell one dataset from another based on which columns are not null.
To avoid that, we need a way to “force” the merge of the columns of the different datasets.
The only way I could think of is renaming the columns of each dataset with a progressive number: “feature_01”, “feature_02”, etc. But that alone wouldn’t work, because the columns have different types. So we need to distinguish between categorical and numerical features: “cat_feature_01”, “cat_feature_02”, etc. and “num_feature_01”, “num_feature_02”, etc. Moreover, I decided to sort the features by decreasing importance.
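The article does not show the code for this step, so here is a minimal sketch of how it could be done. The helper name rename_by_importance is mine, and the importance ranking comes from a quick CatBoost fit on the single dataset; this is an assumption, since any importance measure would do.

import pandas as pd
from catboost import CatBoostClassifier

def rename_by_importance(X, y):
    # Rename columns as "cat_feature_01", "num_feature_01", ..., sorted by
    # decreasing importance (estimated here with a quick CatBoost fit).
    cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
    num_cols = [c for c in X.columns if c not in cat_cols]

    quick_model = CatBoostClassifier(iterations=100, verbose=0)
    quick_model.fit(X, y, cat_features=cat_cols)
    importance = pd.Series(quick_model.get_feature_importance(), index=X.columns)

    mapping = {}
    for prefix, cols in [("cat", cat_cols), ("num", num_cols)]:
        ranked = importance[cols].sort_values(ascending=False).index
        mapping.update({c: f"{prefix}_feature_{i:02d}" for i, c in enumerate(ranked, 1)})
    return X.rename(columns=mapping)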
This is the resulting output:
Maybe you are thinking this is not enough. After all, the model may still recognize some categories that belong to a given dataset (for example, “married” in column “cat_feature_01” exists only in the “bank” dataset). The same goes for numerical features (for example, values between 0 and 1 in column “num_feature_02” exist only in the “employee” dataset). This can still be helpful for the model, and we want to avoid that.
Thus, I took two additional steps:
- mapped each value of each categorical feature to a different integer (ordinal encoding);
- standardized the numerical columns of each original dataset by subtracting their mean and dividing by their standard deviation (a quick sketch of both steps follows).
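Here is a minimal sketch of these two steps. The helper name encode_and_scale is mine, and pandas’ category codes are used for the ordinal encoding:

import pandas as pd

def encode_and_scale(X):
    # Ordinal-encode the categorical columns and standardize the numerical ones,
    # separately for each original dataset.
    X = X.copy()
    cat_cols = X.select_dtypes(include=["object", "category"]).columns
    for col in X.columns:
        if col in cat_cols:
            # map each category to a distinct integer (ordinal encoding)
            X[col] = X[col].astype("category").cat.codes
        else:
            # subtract the mean and divide by the standard deviation
            X[col] = (X[col] - X[col].mean()) / X[col].std()
    return X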
So, this is the ultimate result:
I know you may think that this procedure of artfully sticking together totally unrelated datasets is a bit odd. You are right: what we are doing would make no sense in a real-world setting.
But keep in mind that this is a didactic experiment, meant to push the capabilities of a general model to its limits and see whether it can still compete with specialized models.
In other words, this experiment is a sort of “stress test” of the capabilities of tree-based gradient boosting models.
Now that we have designed a strategy, it’s time to apply it to some real datasets. I used 7 binary classification datasets with more than 5,000 rows each, all available in PyCaret (a Python library released under the MIT license).
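For reference, the datasets can be loaded with PyCaret’s get_data. The three names below are the ones used as examples above; the remaining four are omitted here:

from pycaret.datasets import get_data

dataset_names = ["bank", "employee", "income"]  # ...plus the other four datasets
datasets = {name: get_data(name, verbose=False) for name in dataset_names}

for name, df in datasets.items():
    print(name, df.shape)  # number of rows and columns of each dataset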
These are the datasets, with the respective number of rows and columns:
Then, I applied the procedure described above, which means that I performed the following actions for each dataset separately:
- I renamed each categorical column (sorted in decreasing order of importance) as “cat_feature_01”, “cat_feature_02”, … and each numerical column (sorted in decreasing order of importance) as “num_feature_01”, “num_feature_02”, …;
- for each categorical column, I mapped every value to a distinct integer: 0, 1, 2, …;
- for each numerical column, I standardized the values by subtracting their mean and dividing by their standard deviation;
- I added a column containing the dataset name (see the end-to-end sketch below).
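Here is an end-to-end sketch of this per-dataset preprocessing and of the merge described next, reusing the datasets dictionary and the hypothetical helpers from the earlier sketches. The TARGETS mapping is a placeholder: the article does not list the target column names, so fill them in for your own run.

import pandas as pd

TARGETS = {"bank": "...", "employee": "...", "income": "..."}  # target column of each dataset

parts = []
for name, df in datasets.items():
    X, y = df.drop(columns=[TARGETS[name]]), df[TARGETS[name]]

    X = encode_and_scale(rename_by_importance(X, y))  # rename, encode, standardize
    X["dataset"] = name           # column containing the dataset name
    X["target"] = y.to_numpy()    # the binary 0/1 targets are glued as-is
    parts.append(X)

merged = pd.concat(parts).reset_index(drop=True)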
Then, I merged all the original datasets to obtain the final dataset. At this point, I proceeded with the experiment, which consisted of:
- training a general model (Catboost, with no parameter tuning) on the full merged dataset;
- training 7 specialized models (Catboost, with no parameter tuning), one on each original dataset;
- comparing the performance of the general model and the specialized models on each dataset (sketched in the code below).
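A rough sketch of the whole comparison, continuing from the sketches above. The 80/20 holdout split and the random seed are assumptions, since the article does not specify how the holdout sets were built:

from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    merged, test_size=0.2, stratify=merged["dataset"], random_state=0
)
features = [c for c in merged.columns if c != "target"]

# one general model on the full merged training set; the dataset-name column
# is passed as the only categorical feature
general = CatBoostClassifier(verbose=0)
general.fit(train[features], train["target"], cat_features=["dataset"])

scores, specialized_models = {}, {}
for name in merged["dataset"].unique():
    tr, te = train[train["dataset"] == name], test[test["dataset"] == name]

    # one specialized model per original dataset
    spec = CatBoostClassifier(verbose=0)
    spec.fit(tr[features], tr["target"], cat_features=["dataset"])
    specialized_models[name] = spec

    scores[name] = {
        "general": roc_auc_score(te["target"], general.predict_proba(te[features])[:, 1]),
        "specialized": roc_auc_score(te["target"], spec.predict_proba(te[features])[:, 1]),
    }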
The first thing I noticed in the results is that the correlation between the predictions made by the general model and those made by the specialized models is 98%, meaning that they produce very similar outputs.
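A correlation like this can be computed by collecting the specialized predictions dataset by dataset and correlating them with the general model’s predictions on the same rows (again reusing the objects from the sketch above):

import numpy as np
import pandas as pd

spec_preds = pd.Series(index=test.index, dtype=float)
for name, model in specialized_models.items():
    mask = test["dataset"] == name
    spec_preds[mask] = model.predict_proba(test.loc[mask, features])[:, 1]

gen_preds = general.predict_proba(test[features])[:, 1]
print(np.corrcoef(gen_preds, spec_preds)[0, 1])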
But what about performance?
Here is a comparison of the ROC AUC scores of the general model versus the specialized models:
The mean difference between the general model’s ROC AUC and the specialized models’ ROC AUC is -0.53%: on average, the specialized models slightly outperformed the general model.
However, I must say I am impressed by how tiny the difference is. We ran this test in a ridiculously hard setting, and still the general model achieved performance very close to that of the specialized models. This is evidence of how effective a general model can be, even on such a difficult problem.
Another concern that I have heard about general models is their alleged lack of explainability: some people claim that a single general model is less transparent than many specialized models.
I don’t agree with this point. Thanks to SHAP values, you can explain each group separately from the others, even if the model is a single one. We could call this process “specialized explainability”.
Let me give an example, using our previous experiment.
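The article does not show its code for this part, so here is a minimal sketch of how the per-group correlations between feature values and SHAP values could be computed, using CatBoost’s built-in SHAP values and the objects from the earlier sketches:

import numpy as np
import pandas as pd
from catboost import Pool

rows = {}
for name in test["dataset"].unique():
    grp = test[test["dataset"] == name]
    pool = Pool(grp[features], grp["target"], cat_features=["dataset"])

    # CatBoost returns one SHAP column per feature plus a final expected-value column
    shap_vals = general.get_feature_importance(pool, type="ShapValues")[:, :-1]

    rows[name] = {
        col: np.corrcoef(grp[col], shap_vals[:, i])[0, 1]
        for i, col in enumerate(features)
        if col != "dataset"
    }

corr_by_group = pd.DataFrame(rows).T  # one row per group, one column per feature

Note that the features which did not exist in a given original dataset are entirely null for that group, so their correlations come out as NaN and can simply be ignored.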
If we take each group separately and compute the correlation coefficient between the original feature values and the respective SHAP values, this is what we obtain:
As you can see, the correlation coefficients change a lot across the groups. For instance, if we take “num_feature_01”, the correlation is positive for the group “bank”, whereas it is negative for the group “employee”. This makes a lot of sense:
- For the group “bank”, “num_feature_01” corresponds to the feature “duration”, i.e. the duration of the last contact with the client. Since the target is whether the client subscribed to a term deposit, it is reasonable to expect a positive impact of this feature on the prediction.
- For the group “employee”, “num_feature_01” corresponds to the feature “satisfaction_level”. Since the target feature is whether the employee has left, the negative correlation is easily explained.
In this article, I simulated the most difficult scenario for a general model: a case in which the groups composing the dataset are completely different from each other.
To simulate this situation, I merged datasets that had nothing in common: neither the features nor even the prediction task! I used a trick to make sure that the columns of the different datasets were concatenated together even though they had different names.
Then, I trained a general model on the merged dataset and many specialized models: one for each original dataset.
This was a stress test, to see what would happen in a ridiculously hard situation for the general model. Nevertheless, I found out that the difference in performance is minimal: a 0.53% average loss in ROC AUC when using a general model instead of specialized models.
Moreover, I used the experiment to show why explainability should not be a concern either: even with a single general model, one can still explain each group separately through “specialized explainability”.