The ability to automatically identify patterns in raw data and apply them to unseen domains is a key feature of machine learning algorithms. Just like a student at school, a machine shown a set of examples containing a certain (implicit) pattern is expected to grasp the underlying logic rather than simply memorize the facts. But what if the pattern is not as simple as it seems at first glance, or the student is lazy?
As a technical interviewer, I like to give a simple task: let's teach a machine to add two numbers.
This problem belongs to the supervised learning class and assumes two independent sets of numbers, A and B, and a target set C consisting of their sums A+B. As an example, we take two vectors of random integers and compute their sum:
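A minimal sketch of such a training set (the value range and the fixed seed are my assumptions; any random integers would do):

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed for reproducibility
A = rng.integers(0, 100, size=100)   # first addend
B = rng.integers(0, 100, size=100)   # second addend
C = A + B                            # target: element-wise sum
X = np.column_stack([A, B])          # feature matrix of shape (100, 2)
```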
Each pair Aⱼ, Bⱼ and their sum Cⱼ may be represented as a point in 3D space (the bottom plane is the A-B plane, and the vertical axis is C):
A simple problem should be solved with a simple method: let's initialize a Linear Regression model and train it to sum two numbers.
The test covers three scenarios: three known integer pairs, three unseen integer pairs, and three fractional pairs.
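A sketch of the experiment with scikit-learn (the concrete test pairs below are illustrative, not the exact ones from the original run):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
A = rng.integers(0, 100, size=100)
B = rng.integers(0, 100, size=100)
X = np.column_stack([A, B])
y = A + B

model = LinearRegression().fit(X, y)

known = X[:3]                                          # pairs seen during training
unseen = np.array([[150, 250], [300, 7], [999, 1]])    # integers outside the set
fractional = np.array([[0.5, 0.25], [1.2, 3.4], [2.7, 3.1]])

print(model.predict(known))       # matches A + B for those rows
print(model.predict(unseen))      # ≈ [400, 307, 1000]
print(model.predict(fractional))  # ≈ [0.75, 4.6, 5.8]
```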
The linear model lived up to its name: it correctly deduced the pattern of adding two numbers.
Previously we used a small training set of 100 random pairs. Now let's imagine that the next time we ran our code, we came across a very weird (but not impossible) set of input data:
or as a Python code:
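For instance (the exact values are an assumption; any pairs summing to the same constant would behave identically):

```python
import numpy as np

A = np.arange(0, 101)   # 0, 1, ..., 100
B = 100 - A             # 100, 99, ..., 0
C = A + B               # every target equals 100
```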
One can see that every Cⱼ in this sample equals 100. The probability of such a combination is vanishingly small, but non-zero. Such a combination can be represented as a straight line on our general “addition” surface in 3D space:
Let’s re-train our model once more on the sample set from above and do the same test:
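A sketch of the retraining step on the degenerate sample (assuming the same biased set as above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

A = np.arange(0, 101)
B = 100 - A
X = np.column_stack([A, B])
y = A + B                      # constant target: all 100

model = LinearRegression().fit(X, y)

print(model.predict([[10, 20]]))      # ≈ [100.] instead of 30
print(model.predict([[0.5, 0.2]]))    # ≈ [100.]
print(model.coef_, model.intercept_)  # weights ≈ 0, intercept = 100
```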
Whatever we use as an input pair, the result will always be 100. This is not as odd as it might seem at first sight: no matter how correct the training data is, the zero variance of the target value plays a trick on the model.
Let’s understand how the linear model works and why it doesn’t work in this particular case.
Linear Regression is an algorithm for modelling a linear dependency between independent variables and the target one. In general, the linear relationship might be written as:

y = w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b
While wᵢ represents the weight (or slope) of each variable xᵢ, the constant b describes the overall bias of the n-dimensional line (its intercept with the y-axis). In other words, b equals y at the point where all xᵢ equal 0, and wᵢ determines how much of xᵢ goes into the total sum.
Now let's imagine what happens when all values of y are equal and constant. In this case, the “lazy” algorithm, instead of adjusting the weights, may simply set them to zero. The equation then collapses to y = b, with b = 100. It is certainly the simplest way out.
To avoid this underfitting, we should force the intercept to zero instead. In this case, the learning algorithm has to search for proper weights (in our case both equal 1), and the resulting formula becomes (how unexpected!) y = x₁ + x₂.
In fact, no such tuning of a linear regression model is needed when it is trained on a purely random training set with a “healthy” target variable.
Coming back to the code: sklearn's LinearRegression object has a parameter fit_intercept, which is True by default. This means the model will always fit the intercept and, as a result, fall into the bias trap described above. Let's set it to False and check the result:
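A minimal sketch of the fix, trained on the same biased set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

A = np.arange(0, 101)
B = 100 - A
X = np.column_stack([A, B])
y = A + B  # the biased "all sums are 100" set

# force the intercept to zero so the model must use the weights
model = LinearRegression(fit_intercept=False).fit(X, y)

print(model.coef_)                # ≈ [1. 1.] — the formula of addition
print(model.predict([[10, 20]]))  # ≈ [30.]
```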
Basically, we just reduced the linear regression function to the formula of addition itself and asked the model to find the best weights (both equal 1).
The fact that the intercept = 0 hint helped us this time doesn't mean it should be used all the time. It just shows the necessity of fine-tuning the model's hyperparameters from task to task.
In this section, I would like to show how other architectures learn addition, with a brief explanation of each. I will use only plain sklearn implementations with default hyperparameters.
Test 1: Random training set
Test 2: Biased training set
Here we test four families of machine-learning algorithms:
The Linear Regression family
Ridge, Lasso, and plain Linear Regression.
Basically, Ridge and Lasso are types of regularization: complementary logic on top of the Linear Regression algorithm that protects a model from overfitting. Both tend to shrink the weights toward zero, introducing bias but reducing variance. Since bias is already the problem in our case, regularization cannot help much. Indeed, the test results show pretty much the same picture for all the Linear Regression siblings, and all of them can be fixed by fit_intercept=False.
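A quick side-by-side sketch of the three siblings on the random training set (default hyperparameters, as in the tests above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)
A = rng.integers(0, 100, size=100)
B = rng.integers(0, 100, size=100)
X = np.column_stack([A, B])
y = A + B

# LinearRegression recovers 10 + 20 = 30 exactly;
# Ridge and Lasso land close but not exact due to weight shrinkage
for Model in (LinearRegression, Ridge, Lasso):
    m = Model().fit(X, y)
    print(Model.__name__, m.predict([[10, 20]]))
```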
Notice the Lasso (as well as ElasticNet) results for the first test are quite close to the real ones (but not exact) for both known and unknown data. I expect the fine-tuning of the parameter alpha (L1-regularization parameter) may fix it.
Support Vector Regression (with a linear kernel)
While ordinary Linear Regression searches for the line with the minimum distances to the observations, SVR tries to fit the best line within a threshold margin.
One can see that the results of SVR's random test match the actual sums within some deviation. Similar to Lasso, this can be fixed by decreasing the epsilon parameter (the margin size).
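A sketch of the effect of narrowing the margin (the epsilon value 0.001 is an arbitrary small choice, not a tuned one):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
A = rng.integers(0, 100, size=100)
B = rng.integers(0, 100, size=100)
X = np.column_stack([A, B])
y = A + B

svr = SVR(kernel="linear").fit(X, y)                   # default epsilon = 0.1
tight = SVR(kernel="linear", epsilon=0.001).fit(X, y)  # narrower margin

print(svr.predict([[10, 20]]))    # close to 30, within the default margin
print(tight.predict([[10, 20]]))  # even closer to 30
```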
The biased test shows the same underfitted result: all 100s. The solution to this problem is beyond the scope of this post.
Decision Tree Regressors and their ensembles.
In contrast to linear regression and SVR with their single resulting equation, a Decision Tree searches for a set of if–else rules that best model the data. As a result, the decision tree divides the space of variables into segments, each of which is subject to a separate rule.
For example: if x is less than 3, then y equals 5; if x is more than 10, then y equals 15; and so on.
This set of rules might be complicated, especially if a tree is very deep.
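A toy illustration of such a rule set (the dataset is fabricated to match the example above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# a tiny 1-D dataset: y = 5 for small x, y = 15 for large x
X = np.array([[1.0], [2.0], [11.0], [12.0]])
y = np.array([5.0, 5.0, 15.0, 15.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x"]))  # prints the learned if-else rules
```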
On the other hand, such an approach is prone to overfitting (if the tree is very deep) or underfitting (if it’s too shallow), and has one unpleasant feature: decision trees and tree-based algorithms are very bad at extrapolating data outside the training range.
For example, if the maximum x-value in our linear dependency is 10 and it corresponds to the maximum y-value of 10 (the y = x case), the decision tree may learn the rule “if x ≥ 10, then y = 10”: no matter whether x = 10.1 or 1000, y will always be 10.
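The hard cap is easy to demonstrate on our addition task (training ranges as in the earlier sketches):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
A = rng.integers(0, 100, size=100)   # training sums never exceed 198
B = rng.integers(0, 100, size=100)
X = np.column_stack([A, B])
y = A + B

tree = DecisionTreeRegressor().fit(X, y)
print(tree.predict([[500, 500]]))  # capped near max(y), nowhere near 1000
print(tree.predict([[-50, -50]]))  # capped near min(y), nowhere near -100
```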
This behaviour shows up in our random test output: the predicted value for inputs outside the training range (A + B > 200) is something close to 200 (and ~0 for A + B < 0). While there are various techniques for overcoming this issue, involving normalization, feature engineering, etc., stock tree-based algorithms are not the best choice if you want to create an AI-based calculator…
Also, like all the previous algorithms, the tree-based ones end up with all 100s in the biased test.
Neural networks, by their nature, might be compared to decision trees: both models solve problems by deconstructing them piece by piece. But while trees take a more deterministic approach (strict if-else splits), neural networks use a probabilistic one. As with decision trees, the accuracy of a neural network depends on its depth; a model with too many neurons (similar to nodes in trees) overfits, and a model with too few doesn't perform at all.
But for our needs, we should note one of the main differences between trees and neural networks: the structural unit of a tree is a single if-else split, while a neural network consists of tiny regression units. This produces several consequences:
- Neural networks can extrapolate (and do it quite well).
- Each layer adds complexity to the learned function, making the network an increasingly flexible approximator. From a mathematical perspective, neural networks can approximate any continuous function arbitrarily closely (the universal approximation theorem).
For simplicity, I used the simplest form of a fully-connected neural network — a Multilayer Perceptron.
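A sketch of the MLP trained on the biased set (default architecture; the iteration budget is my assumption, and the exact outputs vary with the random initialization):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

A = np.arange(0, 101).astype(float)
B = 100.0 - A
X = np.column_stack([A, B])
y = A + B                      # biased set: all targets equal 100

mlp = MLPRegressor(max_iter=2000, random_state=0).fit(X, y)

# unlike the linear models, the MLP's answer off the training line
# is not pinned to a flat 100; it depends on the learned hidden features
print(mlp.predict([[10.0, 20.0]]))
```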
Returning to the review of our models' performance, MLP turned out to be the only model that avoided underfitting on the biased test out of the box. Moreover, the results of both tests can be considered similar within a certain error. Given its ability to find very deep patterns, I would give this architecture a chance by playing with hyperparameters and/or adding more “A+B=100” data.
As a final touch, let's add one outlier instance to the training set. Let it be 1 + 1 = 2.
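A sketch with the default linear model: one counter-example is enough to break the “everything sums to 100” degeneracy, so the ordinary least-squares fit recovers the true weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

A = np.append(np.arange(0, 101), 1)        # the biased set ...
B = np.append(100 - np.arange(0, 101), 1)  # ... plus one outlier: 1 + 1 = 2
X = np.column_stack([A, B])
y = A + B

model = LinearRegression().fit(X, y)   # intercept enabled, default settings

print(model.coef_, model.intercept_)   # ≈ [1. 1.], ≈ 0
print(model.predict([[10, 20]]))       # ≈ [30.]
```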
Such a dilution of the data (1:100) allows most of the models to identify the hidden pattern, even if the results are not very accurate.
- Even the simple task of finding an elementary pattern may become complicated if you use the wrong or unprepared tools.
- Different machine-learning techniques act differently. It is crucial to know the algorithm you work with, its pros and cons, scope, and parameters.
- A necessary (but not sufficient) condition for successful model training is the quality of the training data: its variability, the comparability of training and test samples, sufficient volume, etc.