The Unreasonable Effectiveness of General Models | by Samuele Mazzanti | Jan, 2023

January 17, 2023


Testing a generalist model on a ridiculously hard problem

[Image by Author, made with Excalidraw]

In a previous article, I tried to debunk the fairly widespread idea that a bunch of models (each specialized on a subset of a dataset) should perform better than a single model trained on all the data.

To do that, I took a portion of a dataset (e.g. only the American customers) and trained a model on that group, a.k.a. a specialized model. Then, I trained a second model on the whole dataset (i.e. on all the customers, regardless of their nationality), a.k.a. a general model. Finally, I compared the performance of the two models on a holdout set made up only of observations belonging to the group.

I repeated this procedure on multiple datasets and on multiple groups of the same dataset, for a total of 600 comparisons. There was a clear winner: the general model.

However, my experiment did not convince some people, who argued that my approach was oversimplified. For instance, this was one of the most liked comments under my LinkedIn post about the article:

[Screenshot from the comment section of my LinkedIn post]

This comment intrigued me, so I decided to follow the suggestion. If you are curious to see how it turned out, bear with me.

In my previous article, I demonstrated that there is a clear benefit in using a general model over specialized models when there is some similarity across the groups composing the dataset.

However, as the groups become more and more different from each other, it is reasonable to expect that the benefit of using a general model gets smaller and smaller. In the most extreme case, i.e. when the groups are completely different from each other, the difference between the two approaches should equal zero.

If my intuition is correct, we could sketch this relationship as follows:

A sketch of my working hypothesis. [Image by Author, made with Excalidraw]

But this is just my hypothesis. So let’s try to put it to the test.

Our goal is to answer this question:

What happens if the groups composing a dataset are completely different from each other, and we still use a general model?

So, the point becomes how to simulate such a scenario.

The most extreme idea is to “glue” different datasets together. And when I say “different”, I mean datasets that have not only different columns, but even different tasks, i.e. they aim to predict different things.

Let’s take three datasets for instance:

  • “bank”: each row is a bank’s customer, and the task is to predict whether they will subscribe to a term deposit;
  • “employee”: each row is an employee, and the task is to predict whether they will leave the company;
  • “income”: each row is a person, and the task is to predict whether their income is above $50k.

Gluing together the target variables of these datasets is straightforward: they are all binary variables made up of 0s and 1s. But the situation becomes more complicated when we try to concatenate the features. Let me explain why.

Here is a sample (of both rows and columns) from the three datasets.

Three example datasets: “bank”, “employee” and “income”. [Image by Author]

As you can see, the datasets have different columns. So, how can we merge them together? The first, most naive idea is to use pd.concat:

import pandas as pd
pd.concat([X_bank, X_employee, X_income]).reset_index(drop=True)

But, if we did that, we would obtain a dataframe of the following form:

First attempt: naive concatenation. [Image by Author]

By default, Pandas aligns columns by name, so a column that exists in only one dataset is filled with NaNs for the rows coming from the other datasets. Since each dataset here has different column names, the result has a block-diagonal structure. This is not satisfactory, because it would allow the model to cut corners: the model could implicitly discern one dataset from the others based on which columns are not null.

To avoid that, we need a way to “force” the merge of the columns of the different datasets.

The only way I could think of is renaming the columns of each dataset with a progressive number: “feature_01”, “feature_02”, etc. But that alone wouldn’t work, because the columns have different types. So we need to distinguish between categorical and numerical features: “cat_feature_01”, “cat_feature_02”, etc. and “num_feature_01”, “num_feature_02”, etc. Moreover, I decided to sort the features within each dataset by decreasing importance, so that the most important feature of one dataset gets aligned with the most important feature of the others.
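Here is a minimal sketch of that renaming step. It is my reconstruction, not the article’s exact code: the rename_features helper is hypothetical, and importance is assumed to be a dictionary mapping each column name to an importance score (e.g. obtained from a quick preliminary model fit):

import pandas as pd

def rename_features(X: pd.DataFrame, importance: dict) -> pd.DataFrame:
    """Rename columns to cat_feature_XX / num_feature_XX, most important first."""
    cat_cols = [c for c in X.columns if X[c].dtype == "object"]
    num_cols = [c for c in X.columns if c not in cat_cols]
    cat_cols.sort(key=lambda c: -importance[c])  # decreasing importance
    num_cols.sort(key=lambda c: -importance[c])
    mapping = {c: f"cat_feature_{i:02d}" for i, c in enumerate(cat_cols, start=1)}
    mapping.update({c: f"num_feature_{i:02d}" for i, c in enumerate(num_cols, start=1)})
    return X.rename(columns=mapping)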

This is the resulting output:

Second attempt: renaming columns with a progressive number. [Image by Author]

Maybe you are thinking this is not enough. After all, the model may still recognize some categories that belong to a given dataset (for example, the value “married” in column “cat_feature_01” exists only in the “bank” dataset). The same goes for numerical features (for example, values between 0 and 1 in column “num_feature_02” exist only in the “employee” dataset). These clues could still help the model tell the datasets apart, and we want to avoid that.

Thus, as an additional step (sketched in code after this list), I:

  • mapped each value of each categorical feature to a different integer (ordinal encoding);
  • standardized the numerical columns of each original dataset by subtracting their mean and dividing by their standard deviation.
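Both transforms can be sketched as follows (again my own reconstruction, not the author’s code; the function is applied to each original dataset separately, before the merge):

import pandas as pd

def encode_and_standardize(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy()
    for col in X.columns:
        if X[col].dtype == "object":
            # ordinal encoding: each category becomes a distinct integer
            X[col] = X[col].astype("category").cat.codes
        else:
            # standardization: zero mean and unit variance within this dataset
            X[col] = (X[col] - X[col].mean()) / X[col].std()
    return X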

So, this is the final result:

Third and last attempt: ordinal encoding, standardization, and then renaming columns with a progressive number. [Image by Author]

I know you may think that this procedure (artfully sticking together some totally unrelated datasets) is a bit odd. You are right: what we are doing would make no sense in a real-world setting.

But keep in mind that this is a didactic experiment meant to push the capabilities of a general model to its limits, and to see whether it can still be competitive with specialized models.

This experiment should be read as a sort of “stress test” of the capabilities of tree-based gradient boosting models.

Now that we have designed a strategy, it’s time to apply it to some real datasets. I used 7 binary-classification datasets with more than 5,000 rows each, all available in Pycaret (a Python library under MIT license).

These are the datasets, with the respective number of rows and columns:

Pycaret datasets, with their shape. [Image by Author]
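All of them can be loaded with Pycaret’s get_data utility, for instance:

from pycaret.datasets import get_data

# three of the seven datasets; the others are loaded the same way
bank = get_data("bank")
employee = get_data("employee")
income = get_data("income")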

Then, I applied the procedure described above, performing the following actions on each dataset separately:

  • renamed each categorical column (sorted in decreasing order of importance) as “cat_feature_01”, “cat_feature_02”, …, and each numerical column (likewise sorted) as “num_feature_01”, “num_feature_02”, …;
  • mapped every value of each categorical column to a distinct integer: 0, 1, 2, …;
  • standardized each numerical column by subtracting its mean and dividing by its standard deviation;
  • added a column containing the dataset name.

Then, I merged all the original datasets to obtain the final dataset. At this point, I proceeded with the experiment, which consisted of:

  • training a general model (CatBoost, with no parameter tuning) on the full merged dataset;
  • training 7 specialized models (CatBoost, with no parameter tuning), one on each original dataset;
  • comparing the performance of the general model and the specialized models on each dataset (a sketch of this loop follows the figure below).
Procedure of comparing a general and a specialized model on a group of the dataset. [Image by Author]
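In outline, the comparison loop looks like this. This is a minimal sketch under assumptions not spelled out in the article: train/test splits X_train, X_test, y_train, y_test of the merged dataset already exist, the dataset_name column identifies each original dataset, and the categorical columns contain no NaNs (CatBoost does not accept float NaNs in categorical features):

from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

cat_cols = [c for c in X_train.columns if c.startswith("cat_feature")]
features = [c for c in X_train.columns if c != "dataset_name"]

# general model: a single CatBoost trained on the full merged training set
general = CatBoostClassifier(verbose=False)
general.fit(X_train[features], y_train, cat_features=cat_cols)

for name in X_train["dataset_name"].unique():
    train_rows = X_train["dataset_name"] == name
    test_rows = X_test["dataset_name"] == name

    # specialized model: trained only on the rows of one original dataset
    specialized = CatBoostClassifier(verbose=False)
    specialized.fit(X_train.loc[train_rows, features], y_train[train_rows], cat_features=cat_cols)

    # compare ROC AUC of the two models on the same group-specific holdout
    roc_general = roc_auc_score(y_test[test_rows], general.predict_proba(X_test.loc[test_rows, features])[:, 1])
    roc_specialized = roc_auc_score(y_test[test_rows], specialized.predict_proba(X_test.loc[test_rows, features])[:, 1])
    print(f"{name}: general={roc_general:.4f}, specialized={roc_specialized:.4f}")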

The first thing I noticed when looking at the results is that the predictions made by the general model and those made by the specialized models have a correlation of 98%, meaning the two approaches produce very similar outputs.
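That figure can be obtained with a one-liner (a sketch: preds_general and preds_specialized are hypothetical arrays collecting the out-of-sample predicted probabilities of the two approaches over the same rows):

import numpy as np

# Pearson correlation between the two vectors of predicted probabilities
correlation = np.corrcoef(preds_general, preds_specialized)[0, 1]  # ~98% in the article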

But what about performance?

Here is a comparison of the ROC scores of the general model versus the specialized models:

ROC scores compared. [Image by Author]

The mean difference between the general model’s ROC score and the specialized model’s ROC score is -0.53%. This means that, on average, the specialized models slightly outperformed the general model.

However, I must say I am impressed by how tiny the difference is. We ran the test in a ridiculously hard setting, and still the general model achieved performance very close to that of the specialized models. This is evidence of how effective a general model can be, even on an insanely difficult problem.

Another concern I have heard about general models is their alleged lack of explainability: some people claim that a single general model is less transparent than many specialized models.

I don’t agree with this point. Thanks to SHAP values, you can explain each group separately from the others, even though there is a single model. We could call this process “specialized explainability”.

Specialized explainability. [Image by Author]

Let me give an example, using our previous experiment.

If we take each group separately and compute the correlation coefficient between the original feature values and the respective SHAP values, this is what we obtain:

Correlation between each merged feature and the respective SHAP values. [Image by Author]

As you can see, the correlation coefficients change a lot across the groups. For instance, if we take “num_feature_01”, the correlation is positive for the group “bank”, whereas it is negative for the group “employee”. This makes a lot of sense:

  • For the group “bank”, “num_feature_01” corresponds to the feature “duration”, the duration of the last contact with the client. Since the target is whether the client subscribed to a term deposit, it is reasonable to expect a positive impact of this feature on the prediction: longer conversations suggest more interest.
  • For the group “employee”, “num_feature_01” corresponds to the feature “satisfaction_level”. Since the target is whether the employee has left the company, the negative correlation is easily explained: more satisfied employees are less likely to leave.
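For reference, here is a minimal sketch of how these per-group correlations can be computed with the shap library (my reconstruction, not the author’s code; it reuses the general model and test data from the sketch above, and shap’s TreeExplainer supports CatBoost models):

import pandas as pd
import shap

# SHAP values of the general model on the test features
explainer = shap.TreeExplainer(general)
shap_df = pd.DataFrame(explainer.shap_values(X_test[features]), columns=features, index=X_test.index)

# correlation between each feature's values and its SHAP values, one group at a time
for name in X_test["dataset_name"].unique():
    rows = X_test["dataset_name"] == name
    corr = {c: X_test.loc[rows, c].corr(shap_df.loc[rows, c]) for c in features}
    print(name, corr)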

In this article, I simulated the most difficult scenario for a general model: one in which the groups composing the dataset are completely different from each other.

To simulate this situation, I merged datasets that had nothing to do with each other: they shared neither the features nor even the prediction task! I used a trick to make sure that the columns of the different datasets were concatenated together even though they had different names.

Then, I trained a general model on the merged dataset and many specialized models: one for each original dataset.

This was a stress test, to see what would happen in a ridiculously hard situation for the general model. Even so, the difference in performance turned out to be minimal: a 0.53% average loss in ROC score when using a general model instead of specialized models.

Moreover, I used the experiment to show why explainability should not be a concern either: even with a single general model, one can still explain each group separately through “specialized explainability”.


