A guide to developing deeper intuitions for these two concepts
Bias and variance are two of the most fundamental terms in statistical modeling, and therefore in machine learning as well. However, the understanding of bias and variance in the machine learning community is somewhat fuzzy, in part because many existing articles on the subject rely on shorthand analogies (“bias” = “underfit”, “variance” = “overfit”, the bullseye diagrams). While these analogies are fine if you want to quickly describe the performance of a model (“The model has high bias and low variance”), I found that they hide the underlying beauty and concreteness of the bias-variance tradeoff. I hope that with this article, the reader will find a deeper understanding of bias and variance in statistical modeling, and will be able to apply these two concepts to other situations.
Bias and variance originate from the field of statistical learning. Statistical learning is a field that tries to fit a model to collected data so that the data can be (1) predicted or (2) understood.
Example: The very popular Boston housing dataset is derived from U.S. Census Service data on housing in the Boston area and was first published by Harrison and Rubinfeld in 1978. It contains multiple variables such as per capita crime rate, number of rooms per dwelling, proportion of non-retail business acres, and median home value.
If you want to build a model of house prices, you would set the other variables to be the predictors of the model and the house price to be the response variable. If there are p different predictors X1, X2, …, Xp and one response variable Y, then we can assume a model of Y using X = (X1, X2, …, Xp) to be: Y = f(X) + ϵ
In this equation, note that besides the fixed function f(X), there is another term, ϵ. This is the error term. It is the difference between the modeled value and the actual value, and it can represent things like noise or random processes. The main takeaway is that ϵ is independent of X and has a mean of zero.
In the real world, if we set out to model this data, we likely would not get access to all data points (the collection of all data points is called the “population”). Instead, we would get access to only a selection of data points (this is called a “sample”). Here, we have a sample of 50 data points. This will become relevant in our discussion of variance, as repeated sampling from the same population will produce estimated models that are slightly different.
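To make the population/sample distinction concrete, here is a minimal simulation sketch. The cubic `f`, the X range of [-4, 4], and the population size are assumptions made for illustration (the article does not state the true function it used); the uniform noise on [-10, 10] is borrowed from the exercise later in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function (an assumption; not given in the article).
    return x**3 - 3 * x**2 + 2 * x + 5

# "Population": many (X, Y) pairs generated as Y = f(X) + eps.
N_POP = 10_000
x_pop = rng.uniform(-4, 4, size=N_POP)
eps = rng.uniform(-10, 10, size=N_POP)   # zero-mean noise, independent of X
y_pop = f(x_pop) + eps

# "Sample": 50 points drawn without replacement from the population.
idx = rng.choice(N_POP, size=50, replace=False)
x_s, y_s = x_pop[idx], y_pop[idx]

print(x_s.shape, y_s.shape)              # (50,) (50,)
print(f"mean of eps: {eps.mean():.3f}")  # should be close to zero
```

Re-running the last three lines with a different seed gives a different sample of the same population, which is exactly what drives the variance discussed below.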
To fit a model to this dataset, it is good to start as simple as possible and then move up in complexity. Here, I will use polynomial models of degree n, starting with n=1 (linear regression) and increasing n all the way up to 10. All models are fitted to minimize the mean squared error.
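A fitting loop of this kind could look like the sketch below. It uses `np.polyfit`, which performs a least-squares fit and therefore minimizes the mean squared error on the training sample. The data-generating function is a hypothetical stand-in, not the author's data.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical true function (assumption).
    return x**3 - 3 * x**2 + 2 * x + 5

# One sample of 50 points (assumed X range and noise distribution).
x = rng.uniform(-4, 4, size=50)
y = f(x) + rng.uniform(-10, 10, size=50)

train_mse = {}
for degree in range(1, 11):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)
    train_mse[degree] = np.mean(residuals**2)

for degree, mse in train_mse.items():
    print(f"n={degree:2d}  training MSE = {mse:8.2f}")
```

Because each polynomial family contains the previous one, the training MSE can only go down as n grows, which is why it cannot be used on its own to pick the best model.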
Now, let’s move on to the main part of this article: developing an understanding of variance and bias. We will talk about variance first because it is simpler to understand. Put simply, not all samples are the same. When two samples differ, the models fitted to them look slightly different.
Here we say that the parameters of the model have changed. In the figure above, all models are third-order polynomials with X³ as the highest power. Each model is trained on a slightly different sample of the population.
Let’s imagine that we repeat the process 100 times. Here is how the fitted models would look.
The goal of our model is to produce predictions. Using the models to predict the Y value of a new X yields the following:
Some models produce predictions that vary wildly, while others produce predictions that are very consistent. We quantify this consistency using variance. Statistically, the variance of a model’s prediction is the mean (or expectation) of the squared deviation of the predictions from their mean. As An Introduction to Statistical Learning puts it:
In general, more flexible statistical methods have higher variance.
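The repeated-sampling experiment can be sketched as follows: draw 100 samples, refit polynomials of degree 3 and 7 on each, and measure how much the predictions at a few new X values spread out. The true function, X range, and noise distribution are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Hypothetical true function (assumption).
    return x**3 - 3 * x**2 + 2 * x + 5

x_new = np.linspace(-4, 4, 9)     # new X values to predict at
degrees = (3, 7)
n_repeats = 100

preds = {d: np.empty((n_repeats, x_new.size)) for d in degrees}
for r in range(n_repeats):
    # A fresh sample of 50 points for every repetition.
    x = rng.uniform(-4, 4, size=50)
    y = f(x) + rng.uniform(-10, 10, size=50)
    for d in degrees:
        preds[d][r] = np.polyval(np.polyfit(x, y, d), x_new)

# Variance of the predictions across the 100 fitted models, per new X value.
variance = {d: preds[d].var(axis=0) for d in degrees}
for d in degrees:
    print(f"n={d}: mean prediction variance = {variance[d].mean():.2f}")
```

Under these assumptions, the more flexible degree-7 model should show the larger spread, in line with the quote above.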
After producing a variety of predictions, we also want to consider their mean, or expected value. Bias refers to how much this expected value of the predictions differs from the actual value.
With this figure, I hope you understand how bias and variance are measured and how they can be used to describe models that overfit or underfit.
- In our linear regression model (n=1), the mean of the predictions is very different from the actual Y value. The predictions also have a wide range. So we can say that the linear regression model has high bias and high variance.
- In the model using n=7, the mean of the predictions is very close to the actual Y value. However, the predictions have a very wide range. We can say that the n=7 model has low bias but high variance.
- In both the models using n=2 and n=3, the variance and the bias are both very small. The goal of statistical modelling is to produce and find models that have both low variance and low bias.
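The two quantities in the list above can be estimated directly from repeated fits. The sketch below does this for n=1, 3, and 7 at a single new point `x0`; the true function, `x0`, and the noise model are assumptions, so the numbers will not match the author's figure.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical true function (assumption).
    return x**3 - 3 * x**2 + 2 * x + 5

x0 = 3.5                 # hypothetical new X value
degrees = (1, 3, 7)
n_repeats = 100

preds = {d: np.empty(n_repeats) for d in degrees}
for r in range(n_repeats):
    x = rng.uniform(-4, 4, size=50)
    y = f(x) + rng.uniform(-10, 10, size=50)
    for d in degrees:
        preds[d][r] = np.polyval(np.polyfit(x, y, d), x0)

bias, variance = {}, {}
for d in degrees:
    bias[d] = preds[d].mean() - f(x0)   # expected prediction minus actual value
    variance[d] = preds[d].var()        # spread of predictions around their mean
    print(f"n={d}: bias = {bias[d]:7.2f}, variance = {variance[d]:7.2f}")
```

Since the assumed truth is a cubic, the linear model cannot represent it and shows a large bias, while degrees 3 and 7 are nearly unbiased.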
How do we know how “correct” our model is with respect to reality? Well, I did give you a hint in the earlier section. We can calculate the error between the predicted value and the actual value, sum the errors up, and take the average. This is, of course, how we arrive at the bias of the predictor. However, doing this can backfire: a bias of 0 could mean that the predictor predicts everything correctly (no variance), or that it predicts everything incorrectly but the errors cancel out (high variance). Instead, we average either the absolute value or the square of the error. The squared error is preferred in many cases because it makes finding a solution easier. We call it the mean squared error (MSE): MSE = E[(Y − Ŷ)²], where Ŷ is the predicted value.
If you have taken a statistics course before, you know that this is also called the expected value of the squared error. The expected value has some properties that one can use to decompose the MSE. You should work through the long-form decomposition that Wikipedia and this blog post have derived. In the end, you should get: MSE = Var(Ŷ) + Bias(Ŷ)² + Var(ϵ)
The Var(ϵ) term is called the irreducible error. It sets the minimum achievable value of the MSE: no predictor, however good, can do better.
As an exercise, can you calculate the irreducible error value knowing that the error comes from a uniform distribution [-10, 10]?
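If you want to check your answer (spoiler ahead), the decomposition can be verified numerically. The sketch below uses the standard formula (b − a)² / 12 for the variance of a uniform distribution on [a, b], and compares a directly measured MSE at one point against variance + bias² + irreducible error. The true function, `x0`, and the model degree are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # Hypothetical true function (assumption).
    return x**3 - 3 * x**2 + 2 * x + 5

a, b = -10.0, 10.0
irreducible = (b - a) ** 2 / 12      # variance of Uniform(a, b)

x0, degree, n_repeats = 3.5, 3, 5000
preds = np.empty(n_repeats)
for r in range(n_repeats):
    x = rng.uniform(-4, 4, size=50)
    y = f(x) + rng.uniform(a, b, size=50)
    preds[r] = np.polyval(np.polyfit(x, y, degree), x0)

# Fresh, independent observations of Y at x0 to score the predictions against.
y0 = f(x0) + rng.uniform(a, b, size=n_repeats)

mse = np.mean((y0 - preds) ** 2)
variance = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2

print(f"measured MSE                  = {mse:.2f}")
print(f"variance + bias^2 + Var(eps)  = {variance + bias_sq + irreducible:.2f}")
```

The two printed numbers agree only approximately, because both sides are Monte Carlo estimates; increasing `n_repeats` tightens the match.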
From this decomposition, it is clear that the MSE is determined by the sum of the variance and the squared bias of the predictor, plus the irreducible error. If we plot the MSE, the variance, and the bias terms together for models of increasing complexity, we observe a U-shaped curve for the MSE. Choosing the right model means choosing one with an appropriate balance of variance and bias.
That is the bias-variance tradeoff. In this case, the MSE is lowest at n=3. We can say that the third-order polynomial best models the actual population.
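The U-shaped curve can be reproduced with a small simulation, sketched below. Note that the true function here is assumed to be a cubic, so the minimum at n=3 is by construction and only mirrors the article's result; the X range, noise, and number of repetitions are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    # Hypothetical true function (assumption): a cubic.
    return x**3 - 3 * x**2 + 2 * x + 5

degrees = range(1, 11)
x_test = np.linspace(-4, 4, 41)
n_repeats = 200
irreducible = 20.0**2 / 12           # Var of Uniform(-10, 10)

preds = {d: np.empty((n_repeats, x_test.size)) for d in degrees}
for r in range(n_repeats):
    x = rng.uniform(-4, 4, size=50)
    y = f(x) + rng.uniform(-10, 10, size=50)
    for d in degrees:
        preds[d][r] = np.polyval(np.polyfit(x, y, d), x_test)

results = {}
for d in degrees:
    # Average squared bias and variance over the test points.
    bias_sq = np.mean((preds[d].mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds[d].var(axis=0))
    results[d] = (bias_sq, variance, bias_sq + variance + irreducible)
    print(f"n={d:2d}  bias^2={bias_sq:8.2f}  var={variance:8.2f}  MSE={results[d][2]:8.2f}")

best_degree = min(results, key=lambda d: results[d][2])
print("lowest MSE at n =", best_degree)
```

The squared bias drops sharply up to n=3 and then stays near zero, while the variance keeps growing with n, which is what bends the MSE curve back upward.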
Question: What will happen to the bias and variance of a predictor if I increase the number of training data?
Answer: Intuitively, sampling more data points makes the fitted model more stable, so the predictor’s variance decreases as the training set grows. The bias, however, stays roughly the same, because it reflects the gap between what the model can represent and the true function, and more samples cannot close that gap. In practice, if you have too few data points for training, the number of data points can affect the bias as well, because more data points let the estimator fit better.
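This answer can be checked with a quick experiment. The sketch below fits a deliberately too-simple linear model on samples of 50 and 500 points and compares bias and variance at one new point; the true function, `x0`, and the noise model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # Hypothetical true function (assumption).
    return x**3 - 3 * x**2 + 2 * x + 5

x0, degree, n_repeats = 3.5, 1, 100

bias, variance = {}, {}
for n_points in (50, 500):
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        x = rng.uniform(-4, 4, size=n_points)
        y = f(x) + rng.uniform(-10, 10, size=n_points)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias[n_points] = preds.mean() - f(x0)
    variance[n_points] = preds.var()
    print(f"{n_points:3d} points: bias = {bias[n_points]:6.2f}, "
          f"variance = {variance[n_points]:6.2f}")
```

With ten times more data, the variance should shrink markedly while the bias of the linear model remains large.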
Question: What will happen to the bias and variance if I add a constant term to my model?
Answer: The variance simply measures the spread of the predictions. If I add a constant term, the spread will still stay the same so the variance will stay the same. The bias, on the other hand, will change in the direction of the constant term.
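A tiny numerical check of this answer is sketched below, using made-up predictions (the true value, the spread, and the constant `c` are all hypothetical).

```python
import numpy as np

rng = np.random.default_rng(7)

true_value = 18.0                                      # hypothetical actual Y
preds = true_value + rng.normal(2.0, 3.0, size=100)    # hypothetical predictions
c = 5.0                                                # constant added to the model

shifted = preds + c

bias_before = preds.mean() - true_value
bias_after = shifted.mean() - true_value

print(f"variance before = {preds.var():.3f}, after = {shifted.var():.3f}")
print(f"bias before     = {bias_before:.3f}, after = {bias_after:.3f}")
```

The variance is unchanged and the bias moves by exactly `c`, up to floating-point rounding.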
By now, I hope you have developed an intuition for bias and variance and, more specifically, for why they are so often used to describe models. However, you might be wondering whether they are just words for describing models, with no real theoretical or practical importance. Here, I will guide you through applying bias and variance to cross-validation.
Cross-validation is one of the most common methods of assessing model performance. Usually, to measure model performance, you split the data into a training set and a validation set, update the model parameters using the training set, and then produce the assessment on the validation set. However, if the original dataset is not large enough, you can repeatedly train the model on different smaller subsets of the original dataset. The final assessment of the model is then a combination of the assessments on the different validation sets. By looking at the sampling process, you can immediately see how this relates to the variance aspect of the predictor. Our goal is to find the model with the lowest MSE on new test values, so can cross-validation do this?
First, let’s consider leave-one-out cross-validation (LOOCV). In LOOCV, the test set has only one data point, and the training set has the rest. The model is trained n times, each time on n−1 points, where n is the number of data points in the dataset. The final assessment is the average of the n test errors.
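A minimal LOOCV sketch for the polynomial models is shown below. The function name `loocv_mse` and the simulated data are illustrative assumptions, not the author's code.

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Average squared error over n fits, each leaving one point out."""
    n = len(x)
    sq_errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i                 # mask: every point except i
        coeffs = np.polyfit(x[train], y[train], degree)
        sq_errors[i] = (y[i] - np.polyval(coeffs, x[i])) ** 2
    return float(sq_errors.mean())

# Hypothetical sample of 50 points (assumed true function and noise).
rng = np.random.default_rng(8)
x = rng.uniform(-4, 4, size=50)
y = x**3 - 3 * x**2 + 2 * x + 5 + rng.uniform(-10, 10, size=50)

for degree in (1, 3):
    print(f"n={degree}: LOOCV MSE = {loocv_mse(x, y, degree):.2f}")
```

Each of the 50 fits uses 49 points, so any two training sets share 48 of them.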
Immediately, you can notice the advantage of LOOCV over a simple train-test split without cross-validation. If the existing data sample is good enough, the repeated sampling of the training set is similar to repeated sampling from the real population. There is one caveat, however. In LOOCV, the training sets overlap almost entirely, so the fitted models are highly correlated with one another. Averaging highly correlated errors gives an estimate of the test error that can itself have high variance.
To reduce the variance, another method called k-fold cross-validation can be used instead. In k-fold cross-validation, the sample is split into k equal parts. The test set is one part, and the training set is the rest. The model is trained k times, each time with a different part held out for testing. Compared to LOOCV, k-fold cross-validation can produce a test-error estimate with lower variance, but in turn with higher bias, because each model is trained on fewer data points.
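The k-fold procedure can be sketched in a few lines of NumPy. The function name `kfold_mse`, the shuffling seed, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def kfold_mse(x, y, degree, k=10, seed=0):
    """Average test MSE over k folds of a shuffled sample."""
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, k)              # k nearly equal parts
    fold_mse = []
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx) # everything not in this fold
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        errors = y[test_idx] - np.polyval(coeffs, x[test_idx])
        fold_mse.append(np.mean(errors**2))
    return float(np.mean(fold_mse))

# Hypothetical sample of 50 points (assumed true function and noise).
rng = np.random.default_rng(9)
x = rng.uniform(-4, 4, size=50)
y = x**3 - 3 * x**2 + 2 * x + 5 + rng.uniform(-10, 10, size=50)

for degree in (1, 3):
    print(f"n={degree}: 10-fold MSE = {kfold_mse(x, y, degree):.2f}")
```

Setting `k` equal to the number of data points turns this into LOOCV, since every fold then holds exactly one point.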
How good is cross-validation at choosing the best model? Here, we try to find the best polynomial degree using LOOCV and k-fold cross-validation (k=10). LOOCV produces a best value of 6 and k-fold produces a best value of 3. If we don’t want our model to overfit the training sample, we can use the simpler model with n=3.
I hope that this article helps you understand a little more about the intuitions behind the bias-variance decomposition. From now on, when thinking about the bias and variance of a model, you won’t have to rely on analogies such as the dartboard but can instead work directly from first principles.
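A model-selection loop of this kind could be sketched as follows. The helper `cv_mse` is a hypothetical name, and the data are simulated from an assumed cubic, so the selected degrees need not match the 6 and 3 reported above.

```python
import numpy as np

def cv_mse(x, y, degree, k, seed=0):
    """k-fold cross-validated MSE; k == len(x) gives LOOCV."""
    order = np.random.default_rng(seed).permutation(len(x))
    fold_mse = []
    for test_idx in np.array_split(order, k):
        train_idx = np.setdiff1d(order, test_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        errors = y[test_idx] - np.polyval(coeffs, x[test_idx])
        fold_mse.append(np.mean(errors**2))
    return float(np.mean(fold_mse))

# Hypothetical sample of 50 points (assumed true function and noise).
rng = np.random.default_rng(10)
x = rng.uniform(-4, 4, size=50)
y = x**3 - 3 * x**2 + 2 * x + 5 + rng.uniform(-10, 10, size=50)

degrees = range(1, 11)
kfold_scores = {d: cv_mse(x, y, d, k=10) for d in degrees}
loocv_scores = {d: cv_mse(x, y, d, k=len(x)) for d in degrees}

best_kfold = min(kfold_scores, key=kfold_scores.get)
best_loocv = min(loocv_scores, key=loocv_scores.get)
print("best degree by 10-fold CV:", best_kfold)
print("best degree by LOOCV:     ", best_loocv)
```

Both scores are noisy estimates of the test MSE, so on a single sample of 50 points they can disagree about the best degree, as they do in the article.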
If you want to play around with generating the figures, this is the source code. Have fun!
All images unless otherwise noted are by the author. All data are generated by the author.
James, Gareth, et al. An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.