Clarifying the Confusion for a Deeper Understanding
Naming things can be tricky, and sometimes words and concepts don’t quite match up. There are many examples of misnomers in various fields of knowledge: strawberries are not actually berries, a starfish is not actually a fish, and a shooting star is not a star.
The machine learning field is no exception: it has a surprisingly large number of misnomers and confusing naming conventions. Although these misnomers do not necessarily make the usage of the terms incorrect, they can be particularly confusing for newcomers to the field. This post discusses some commonly used machine learning models, and for each term, I will also suggest some alternatives that describe the model more accurately.
You may be surprised to see this one here. Linear Regression? What is confusing about Linear Regression? It is so simple and straightforward. Well, it is actually VERY confusing, for several reasons. And we will see how many other machine learning models are related to this term, causing even more confusion.
Linear regression itself is not confusing, but the name is so simple that it could describe a much larger number of machine learning models. Let's see: it is composed of two words, LINEAR and REGRESSION:
- LINEAR means a linear model, y = wX + b, and it tells us that there is a linear relationship between the features and the target variable.
- REGRESSION means it is a model for a regression task, that is, the target variable is continuous.
However, behind this linear model, there can be various loss functions (absolute error, Huber, epsilon-insensitive, etc.), yet the term "Linear Regression" usually refers only to the squared error as the loss function. The scikit-learn estimator LinearRegression refers exactly to the linear model with squared error, and to this one only.
I think that historically, when the name was first used, the only loss function people had in mind was the squared error, so there was no need to specify it. Then the name linear regression stuck.
Now, if we use, for example, the absolute error as the loss function, the name "linear regression" would also be technically correct (a small sketch follows).
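As a minimal sketch (assuming a recent scikit-learn; the toy data here is made up), here are two ways to fit a linear model that minimizes the absolute error instead of the squared error; both would still be "linear regressions" in the broad sense of the name:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=200)

# The "default" linear regression: linear model + squared error (OLS)
ols = LinearRegression().fit(X, y)

# A linear model minimizing the absolute error:
# the median regressor (pinball loss with quantile=0.5, no penalty)
lad = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X, y)

# Another option: the epsilon-insensitive loss with epsilon=0,
# which amounts to the absolute error (a small default L2 penalty still applies)
sgd_lad = SGDRegressor(loss="epsilon_insensitive", epsilon=0.0).fit(X, y)
```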
A better name? It already exists, specified in the scikit learn documentation of the estimator LinearRegression:
Ordinary Least Squares (OLS) Linear Regression
- The term “Least Squares” means that this model minimizes the squared error.
- Why “Ordinary”? I thought for a while that less ordinary models would be the ones with regularization, but it seems that the historical reason is that there is a Weighted version: Weighted Least Squares Linear Regression.
To some extent, LinearRegression without regularization is not a proper machine learning model, since there is no hyperparameter to tune, so you cannot adjust it against overfitting or underfitting.
And penalized linear regression is still OLS Linear Regression… let’s move to the next confusing model names!
Wait, before finishing with this one, let me tell you another confusing truth: linear regression can also be used as a linear classifier! So the term "regression" in the name does not make it exclusive to regression tasks… We will explore this idea later in the article.
Ridge, LASSO (Least Absolute Shrinkage and Selection Operator), and Elastic Net are actually “Linear Regressions”, and even more specifically, “Ordinary Least Squares Linear Regression” because they all use the squared error as the loss function.
I told you that the term Linear Regression is actually VERY confusing! And it is not finished.
Their cost function, however, is different. So what is the difference between the loss function and the cost function? Here is an interesting answer that helps clarify the definitions:
Loss function is usually a function defined on a data point, prediction and label, and measures the penalty.
Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization).
For linear regression and its regularized versions, here is a recap:
So in the cost function, the coefficients can be regularized, and the loss function remains the same. Depending on the norm of the coefficient that is regularized, we have different versions of Penalized Ordinary Least Squares Linear Regression. Despite these seemingly very different names, they are all OLS Linear Regression. And what would be alternative names? Here are some suggestions:
- Ridge regression: L2 penalized OLS Linear Regression
- LASSO: L1 penalized OLS Linear Regression
- Elastic Net: L1 and L2 penalized OLS Linear Regression
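A quick sketch with toy data makes the point: all the estimators below share the same squared-error loss and differ only in the penalty added to the cost function (hyperparameter values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

models = {
    "OLS Linear Regression":             LinearRegression(),
    "L2 penalized OLS (Ridge)":          Ridge(alpha=1.0),
    "L1 penalized OLS (LASSO)":          Lasso(alpha=0.1),
    "L1+L2 penalized OLS (Elastic Net)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)  # same squared-error loss every time, different penalty
    print(name, np.round(model.coef_, 2))
```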
So there are many names for the regularized versions of linear regression. Now, did you ever wonder what the names of the regularized versions of other models are? For Logistic Regression, or maybe SVM? We will talk about them later in this article.
When the term Polynomial Regression is used, we have the impression that it is a standalone model, whereas it should be considered as Polynomial Features + Linear Regression. That is why in scikit-learn, we create a polynomial regression in the following way.
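Something along these lines (a minimal sketch with made-up data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.5, size=100)

# "Polynomial regression" = polynomial features + a plain linear regression
poly_reg = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_reg.fit(X, y)
```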
From a statistical perspective, the package numpy offers polyfit, but you will notice that it only handles one feature, so it is quite restrictive as a model. I wrote an article to explain this further: The Hidden Linearity in Polynomial Regression.
So an alternative for Polynomial Regression could be
Linear Regression with Polynomial Features
And in practice, you should penalize the coefficients and scale the features, since polynomial features can become huge (see the sketch below). You can read this article: Polynomial Regression with Scikit learn: What You Should Know.
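Following that advice, a hedged sketch of a more robust pipeline scales the polynomial features and penalizes the coefficients with Ridge (the degree and alpha are arbitrary here):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.5, size=100)

# Scale the (potentially huge) polynomial features, then penalize the coefficients
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=10),
    StandardScaler(),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X, y)
```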
I recently wrote an article, Is Logistic Regression A Regressor or A Classifier?, to discuss this question, and here is a summary:
From a statistical perspective, we can discuss the model, its output, and its place in the Generalized Linear Model framework:
- The logistic regression model is p = 1 / (1 + exp(-(wX + b))), and it models the probability of the positive class.
- Logistic regression is a regression because its output, a probability between 0 and 1, is continuous.
- And it is one of the models from the GLM (Generalized Linear Model) family: the link function is the logit (the inverse of the logistic function), and the output follows a Bernoulli distribution, which is used in MLE (Maximum Likelihood Estimation) to find the optimal coefficients.
So, the name Logistic Regression seems legit from this statistical viewpoint.
From a machine learning perspective, it is a linear classifier, and more specifically:
- The model is the linear model y = wX + b, and y here is called the decision function because, to make it a classifier, we decide with the sign of y: if y is positive, the predicted class is positive, and similarly when y < 0.
- Logistic Regression is a classifier since it is applied to a classification task
- Logistic Regression is a linear classifier with the log loss as loss function.
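A small sketch with toy data shows both faces at once in scikit-learn: the continuous decision function of the underlying linear model, and the probability obtained by passing it through the logistic function:

```python
import numpy as np
from scipy.special import expit  # the logistic (sigmoid) function
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

z = clf.decision_function(X)    # the linear model wX + b (continuous)
p = clf.predict_proba(X)[:, 1]  # the modeled probability

# The probability is just the logistic function applied to the linear output,
# and the predicted class is just the sign of the linear output
print(np.allclose(p, expit(z)))                              # True
print(np.array_equal(clf.predict(X), (z > 0).astype(int)))   # True
```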
So, from the machine learning perspective, Logistic Regression can cause confusion because of the “regression” term in the name. From the scikit learn documentation, we can see some alternative names:
Logistic Regression Classifier, Logit Classifier, or Maximum Entropy Classifier
These names actually reveal even more hidden facets of Logistic Regression; curious readers can Google them. I would add my own:
Log Loss Binary Linear Classifier
There are also some alternatives to this name, since log loss is also called logistic loss or cross-entropy loss.
And since we talked about the regularized versions of Linear Regression, did you ever wonder whether there are regularized versions of Logistic Regression and how to create them? The answer is that we could call them Ridge Logistic Regression, Lasso Logistic Regression, etc., but we do not see these names often. At least in scikit-learn, they are still Logistic Regression: in the estimator LogisticRegression, we can specify the penalty (l1, l2, or even elasticnet) and the hyperparameter C, which controls the strength of the regularization.
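As a sketch (hyperparameter values are arbitrary), the "Ridge", "Lasso", and "Elastic Net" versions of Logistic Regression are all still spelled LogisticRegression; only the penalty argument changes:

```python
from sklearn.linear_model import LogisticRegression

# All of these are still called "LogisticRegression";
# C is the inverse of the regularization strength (smaller C = stronger penalty)
ridge_logreg = LogisticRegression(penalty="l2", C=1.0)                      # "Ridge" Logistic Regression
lasso_logreg = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")  # "Lasso" Logistic Regression
enet_logreg = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, C=1.0,
                                 solver="saga")                             # "Elastic Net" Logistic Regression
```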
What is confusing about this name is that SGD (Stochastic Gradient Descent) is not a characteristic of the model, but the fitting algorithm used for the model. I wrote two articles to explain all the hidden models in SGDRegressor and SGDClassifier.
So my suggestion for these estimators is that we could name them LinearRegressorSGD and LinearClassifierSGD, and the full name would be
Linear Regressor / Classifier fitted with SGD
From a machine learning perspective, these estimators give quite a broad view that covers many different linear models:
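To make this more concrete, here is a hedged sketch (hyperparameters left at their defaults) of how different loss choices in SGDClassifier and SGDRegressor recover, at least approximately, several familiar linear models:

```python
from sklearn.linear_model import SGDClassifier, SGDRegressor

# The model is always linear; the loss decides which "named" model you get,
# and SGD is only the fitting algorithm
approx_linear_svm = SGDClassifier(loss="hinge")               # ~ linear SVM
approx_logreg     = SGDClassifier(loss="log_loss")            # ~ Logistic Regression
approx_ols        = SGDRegressor(loss="squared_error")        # ~ (penalized) OLS Linear Regression
approx_huber      = SGDRegressor(loss="huber", epsilon=1.35)  # ~ Huber (robust) regression
# (in older scikit-learn versions these losses are spelled "log" and "squared_loss")
```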
The name itself may not be confusing since it describes the fact that there are a few data points from the training dataset (called Support Vectors) that will be useful to define the model.
Now, SVM has another facet as a linear classifier. From this perspective, it is often described as using a hyperplane to separate the space into two parts (by the way, this is what all linear classifiers do, logistic regression for example; the difference lies in the loss function used to find the hyperplane) by maximizing a soft or hard margin.
Of course, the two approaches are equivalent and it is called the dual problem.
What is confusing? When you see something like (and you see it all the time):
SVM is a model that tries to maximize a hard/soft margin…
This statement is not false, but it mixes two different perspectives!
In the image below, I compare the two perspectives:
- From the linear classifier perspective, the “SVM” model is the linear model and it tries to maximize the margin. And this is also equivalent to minimizing the cost function with the hinge loss and L2 penalty. The result is that the model creates a separating hyperplane, and a new observation is classified according to its distance from this hyperplane (which is proportional to the value of the decision function).
- From the true SVM (or instance-based model) perspective, the model is written as a sum of weighted dot products between the new observation and the support vectors. The objective function is very different yet equivalent. There is also a distance-based interpretation, but this time, the distance is directly relative to the support vectors. That is why it is sometimes compared to KNN.
Now, you may say that the hyperplane is defined by the support vectors. Yes, I know. Although the interpretation seems very intuitive, it is the result of mixing up the two perspectives.
So, the name SVM is quite good from the true SVM perspective. But from the linear classifier viewpoint, it can be called:
L2 penalized Hinge Binary Linear Classifier
I had this question before about the penalized version of SVM. Well, by definition, it is L2 penalized. And actually, the “margin maximizer” part of SVM is just another description of L2 penalization!
So it is possible to have an "L1 penalized Linear Classifier with Hinge loss function", but it would still be called "SVM".
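As a sketch of the two variants just mentioned (constructor calls only; hyperparameters left at defaults):

```python
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# The classical case: hinge loss + L2 penalty, i.e. the "margin maximizer"
svm_like = SGDClassifier(loss="hinge", penalty="l2")
linear_svc = LinearSVC()  # scikit-learn's linear SVM (squared hinge loss by default, L2 penalty)

# Perfectly possible, but rarely given a name of its own:
# an L1 penalized linear classifier with hinge loss
l1_hinge = SGDClassifier(loss="hinge", penalty="l1")
```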
Another quite confusing statement is that SVM can be used for regression, and it is then called SVR (Support Vector Regression). But that is for another time…
When you put linear regressor and linear classifier together, this makes you think that they are MECE (Mutually Exclusive, Collectively Exhaustive), just like KNeighborsRegressor vs. KNeighborsClassifier, or DecisionTreeClassifier vs. DecisionTreeRegressor.
But this pair is actually different!
The scikit-learn documentation for SGDClassifier lists possible loss functions: hinge, log_loss, log, modified_huber, squared_hinge, perceptron, squared_error, huber, epsilon_insensitive, or squared_epsilon_insensitive. The documentation also includes this comment below:
‘squared_error’, ‘huber’, ‘epsilon_insensitive’ and ‘squared_epsilon_insensitive’ are designed for regression but can be useful in classification as well
So, we can confidently say:
A linear regressor is also a linear classifier
It may not be performant, but it is technically possible.
And this statement also reveals how a linear classifier is constructed; it can be seen as two steps (a short sketch follows the list):
- Building the linear model (which could be called a linear regression, since its output is continuous)
- Building the hyperplane separator (which only means that y = 0 is used as the decision boundary to classify)
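A minimal sketch of a "regression loss" used inside a classifier (toy data; the score will likely be mediocre, but it runs):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Fit the linear model with squared error (a regression loss),
# then classify by the sign of the decision function
reg_as_clf = SGDClassifier(loss="squared_error", random_state=0).fit(X, y)
print(reg_as_clf.score(X, y))  # probably not the best score, but it works
```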
We can call a linear classifier in this way:
Hyperplane Separator Classifier with a linear model
And a linear regression is then equivalent to a linear model since the output is always continuous.
Gamma regression is a model from the GLM (Generalized Linear Model) family. The name "Gamma regression" indicates that we consider that the dependent variable y, given x, follows a Gamma distribution. This fact alone is not confusing, but when we study the names of different regressors from the GLM family, we clearly find that the naming convention is not the same for all of them.
For a GLM, we have to specify the link function and the distribution. For the Gamma distribution, since its support is the strictly positive real numbers, the link function is usually the log or the inverse.
So when we say Gamma Regression, it is actually not clear which link function is used. Now, you may say that we don't have to specify the link function. But "Logistic Regression" does! From the GLM perspective, "Logistic Regression" gets its name from the fact that its link function is the logit, and the inverse of the logit is the logistic function.
You can see the image below from the GLM Wikipedia page:
So if we want to be specific, following the link function used in the model, we should say Gamma Exponential Regression or Gamma Inverse Regression.
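For illustration, scikit-learn's GammaRegressor is one concrete choice: it fixes the link function to the log, so in the naming convention above it would be a "Gamma Exponential Regression". A hedged sketch with made-up positive targets:

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Strictly positive target, as required by the Gamma distribution
y = np.exp(X @ np.array([0.3, -0.2, 0.1])) * rng.gamma(shape=2.0, scale=0.5, size=200)

# GammaRegressor uses the log link: E[y|X] = exp(wX + b)
gamma_exp_reg = GammaRegressor().fit(X, y)
print(gamma_exp_reg.coef_)
```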
And for Logistic Regression, if we adopt the same naming convention, we would also have to specify the distribution:
Logistic Regression should be named more specifically “Bernoulli Logistic Regression”
This is different from "Binomial Logistic Regression" because the input data is different, although the resulting model is identical. You can read this article to learn more: Logistic Regression: Bernoulli vs. Binomial Response Variables.
Now, let’s go back to our Linear Regression (again!) since minimizing the squared error is equivalent to maximizing the likelihood of a Gaussian distribution of the residuals. If we choose to name the OLS regressor with the distribution used in MLE (Maximum Likelihood Estimation),
Linear Regression (aka OLS Linear Regression) could also be named “Gaussian Linear Regression”
Now, imagine a model with the Exponential distribution and the log link function: it would be an Exponential Exponential Regression! You get the idea.
In practice, my name suggestions are too long. But it can be useful to bear them in mind for a better understanding.
Now, another question: Gamma regression belongs to the GLM family from a statistical perspective, but what is it from the machine learning perspective?
We can look at this table from scikit-learn to get an idea, and it is quite difficult to find a name for it.
Gradient Boosting is an ensemble method, and as an ensemble method, it can be used to "boost" all kinds of base models. However, in practice, it is almost always used with decision trees, so much so that the name Gradient Boosting has come to mean Gradient Boosted Decision Trees. In scikit-learn, GradientBoostingRegressor and GradientBoostingClassifier are all about boosting decision trees, and all the hyperparameters are related to decision trees.
But we have to bear in mind that it can be applied to other base models, such as linear regression. I wrote this article to implement a Gradient Boosted Linear Regression in Excel. Why? Because I found it simpler for understanding how the Gradient Boosting algorithm itself works, since fitting a linear regression is more intuitive than fitting a decision tree, and it can easily be done in Excel.
As with other ensemble methods, we can actually specify the base model. For example, in BaggingClassifier or BaggingRegressor, we can specify the base estimator, which is a decision tree by default. By the way, Bagging means Bootstrap aggregating, which is quite an accurate description of how that ensemble method works; the name is actually a good choice. Gradient Boosting, in comparison, is subtly more complicated, and I hope the following explanation helps you appreciate this subtlety.
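For example, here is a hedged sketch of bagging a non-default base model (note that, depending on your scikit-learn version, the argument is named estimator or base_estimator):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Bagging = Bootstrap aggregating: fit the base model on bootstrap samples
# and aggregate the predictions. The base model defaults to a decision tree,
# but it can be anything, here a logistic regression.
bagged_logreg = BaggingClassifier(estimator=LogisticRegression(), n_estimators=50)
```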
In the process of Gradient Boosting, "boosting" actually only means "adding", and more specifically, "adding small improvements". And the process of Gradient Descent applied in the parameter space (when fitting a linear regression, for example) can also be considered a process of boosting the parameters.
So if we had to adopt the same naming convention, then classic gradient descent would be either Gradient Descent for Parameter Optimization or
Gradient Parameter Boosting
Now, for Gradient Boosted Decision Trees, could we say Gradient Descent Trees? Well, it is not accurate enough, since adding trees together does not result in a single tree. This is different for Linear Regression, because a Gradient Boosted Linear Regression is still a Linear Regression, so Gradient Descent Linear Regression is fine. So we may use this term:
Gradient Boosted Decision Trees can also be called “Gradient Descent Forest”
This term emphasizes the fact that when Gradient Descent is used for the “boosting” process of decision trees, the final model is not a Tree, but a Forest. This also echoes the idea of adding trees to make a forest in Random Forest.
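To see that "boosting" really just means repeatedly adding small improvements, and that the result is indeed a forest rather than a tree, here is a bare-bones sketch of gradient boosting with squared error (toy data; hyperparameters are arbitrary), where each new tree simply fits the residuals of the current ensemble:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate, n_trees = 0.1, 100
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction                     # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # "boost" = add a small improvement

def predict(X_new, init=y.mean()):
    # The final model is the initial constant plus a whole forest of small trees
    return init + learning_rate * sum(t.predict(X_new) for t in trees)
```

The sum of all these small trees is exactly the "Gradient Descent Forest" described above.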
Finally, the term "Neural Network" can be considered the epitome of confusing terms. Let me quote François Chollet:
What's more, in scikit-learn's neural_network module, the estimators are called MLPClassifier and MLPRegressor, where MLP stands for Multi-Layer Perceptron. So that is another equivalent name for a neural network.
I wrote an article to explain neural networks as “simple” mathematical functions: How to define a neural network as a mathematical function
Here are some illustrations. The one below shows that the "neural network diagram", with its neurons and synaptic links, is only a representation of the mathematical function itself. Since a function with repeated compositions can get quite large, the neural network diagram is actually a very graphic and visual way to represent it. And the term "multi-layer" here only means "repeated function compositions".
For a one-layer neural network, the function is still simple enough that we can write it out in full.
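As a small NumPy sketch (random weights, purely to show the structure), a one-hidden-layer network is just a composition of a linear function, a simple non-linearity, and another linear function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes: 3 input features, 4 hidden units, 1 output
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def relu(z):
    return np.maximum(z, 0.0)

def one_hidden_layer_network(x):
    # Just repeated function composition: linear -> non-linear -> linear
    return W2 @ relu(W1 @ x + b1) + b2

print(one_hidden_layer_network(np.array([0.5, -1.0, 2.0])))
```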
So the "biological neurons" analogy is rather fanciful and imaginative. Now, this also reveals the importance of communication and marketing around new ideas. When researchers discovered the interesting characteristics of "neural networks", we could imagine a headline like this:
Math Researchers use “Chains of Differentiable Parameterized Geometric Functions” to Improve Model Performance
This would probably not interest many readers, but when we title it:
Artificial Intelligence Breakthrough with Neural Networks
Now we are talking about another level of click-through rates.
Another fancy term explained in Chollet's tweet is backpropagation:
Backpropagation is only gradient descent applied to "neural networks", using the chain rule.
To finish with neural networks, we can also talk about Deep Neural Networks. The name may make you think that you take a Multi-Layer Perceptron, add even more layers, and the final model becomes more performant.
Well, this is not what this term tries to represent.
First, about the number of layers in a neural network, we can read this article: How to choose the number of hidden layers and nodes in a feedforward neural network?
And I quote here:
One hidden layer is sufficient for the large majority of problems.
I also wrote this article to visually explain Why Deeper Isn’t Always Better.
So it seems that adding more layers is not, in itself, what improves the performance of the models.
As for Deep Neural Networks, it is better to think of multiple phases of "deep" feature engineering that allow hierarchical representation and feature extraction. And the term Deep Learning may be preferable, since it avoids introducing yet another fanciful idea about biological neurons.
Don’t be fooled by the names you see. Understanding the inner work of the model is key!
Naming is a difficult task because a name should accurately reflect the idea behind it while remaining short. That is sometimes impossible, which is why the alternatives we come up with are usually longer.
Naming is also difficult because a name is often tied to a whole field. A "peanut", for example, is not a nut from a botanical viewpoint, but from a culinary perspective, it is.
Moreover, the authors of the names may try to reflect some specific characteristics that appear important to them, and future readers may not grasp the main idea and can develop fanciful interpretations of the models. The best examples are neural networks and SVM.
But another truth is that math researchers may also need some "clickbait"-like names to grab the attention of peers and the general public.
So, the final conclusion is that we should not be fixated on labels when attempting to comprehend things. Instead, by adopting various perspectives, we can achieve a more comprehensive understanding.