Study gains in guaranteed, supported, and mean accuracy; discover biases
While people say that between 10% and 30% is a good test size, it is very problem-dependent. A good starting point is what I call the guaranteed accuracy: the mean accuracy minus two standard deviations. Using a representative but not fine-tuned dummy model, you can simulate many random seeds for each of several test sizes. Take the mean and standard deviation of the accuracy for every test size you tried, then choose the test size that maximizes the guaranteed accuracy. You can even do this without looking at the actual accuracy. When doing this for the processed Cleveland data from the UCI Heart Disease dataset with a default PCA + logistic regression pipeline from sklearn, I found the best test size to be 49%. For those strong of heart, the supported accuracy, the mean minus one standard deviation, can be used instead; that gives 38%.
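Here is a minimal sketch of such a scan (not the exact code from the repository linked at the end). It assumes the processed Cleveland features and binary labels are already loaded as X and y, and the seed count and test-size grid are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def guaranteed_accuracy(X, y, test_size, n_seeds=100):
    """Mean accuracy minus two standard deviations over many random splits."""
    accs = []
    for seed in range(n_seeds):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)
        dummy = make_pipeline(PCA(), LogisticRegression())  # representative, not fine-tuned
        accs.append(dummy.fit(X_tr, y_tr).score(X_te, y_te))
    accs = np.array(accs)
    return accs.mean() - 2 * accs.std()


# Scan a grid of test sizes and keep the one with the highest guaranteed accuracy.
test_sizes = np.arange(0.10, 0.80, 0.01)
best_test_size = max(test_sizes, key=lambda ts: guaranteed_accuracy(X, y, ts))
```

Replacing the two standard deviations by one gives the supported-accuracy version of the same scan.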
The next question that presents itself is how many folds to use for cross validation (cv). To make things simple, let’s talk about cross-validating to optimize the final accuracy. After all, the impact of hyperparameters can be very model-specific, whereas accuracy is a more general measure.
The same idea of guaranteed accuracy can also help in figuring out how many folds to use, but it requires a few more steps. First of all, we have to set aside k-fold cross validation and work with repeated hold-out: statistically speaking, J random models that are sampled identically are a lot easier to describe than k-fold cross validation. There are still two cases in which k-fold cross validation is easy to understand, and we can essentially read all the information we need from a well-chosen plot, but to get there we first have to understand repeated hold-out.
When you build many models from the same data and test them on partly overlapping test sets, correlations become an issue: the standard deviation of your models' individual accuracies no longer equals the error of the total accuracy. In fact, the correction can be more than a factor of 4! Have a look at the code below, or check out my blog post for the details.
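As a minimal sketch of what such a correction looks like for repeated hold-out (this is not the full implementation from the repository linked at the end; it uses the original Nadeau-Bengio factor introduced below, without the small adjustment for fewer than five models, and again assumes X and y are loaded):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def repeated_holdout(X, y, test_size, n_models):
    """Accuracies of J identically sampled hold-out models plus two error estimates."""
    accs = []
    for seed in range(n_models):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)
        model = make_pipeline(PCA(), LogisticRegression())
        accs.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    accs = np.array(accs)

    naive_error = accs.std(ddof=1) / np.sqrt(n_models)  # pretends the J runs are independent
    n_test, n_train = len(y_te), len(y_tr)               # split sizes (identical for all seeds)
    # Nadeau-Bengio corrected error of the total (mean) accuracy
    corrected_error = accs.std(ddof=1) * np.sqrt(1.0 / n_models + n_test / n_train)
    return accs.mean(), naive_error, corrected_error
```

With many models, the corrected error can easily be several times larger than the naive one.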
But when all is said and done, you again get the guaranteed accuracy as a function of the number of models J and of the test size. Maximizing over the test size for each J, we obtain the following plot for the UCI Heart Disease dataset (see the GitHub link at the end of the article):
[Plot: guaranteed and supported accuracy as a function of the number of models J, maximized over the test size, for the UCI Heart Disease dataset]
Here’s the definition of guaranteed and supported accuracy:
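In terms of the mean accuracy of the individual models, $\bar{a}$, and their standard deviation, $\sigma$:

$$\text{guaranteed accuracy} = \bar{a} - 2\sigma, \qquad \text{supported accuracy} = \bar{a} - \sigma$$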
This relationship between the variance of the J models in repeated hold-out and the variance of the total accuracy was derived in 2003 by Nadeau and Bengio (NB) under simplifying assumptions (we only had to make a small change for J < 5). We find their assumptions to be violated, but the relationship is still very meaningful.
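In its original form (the one used in the sketch above), the corrected variance of the total accuracy reads

$$\sigma_{\text{total}}^2 \approx \left(\frac{1}{J} + \frac{n_{\text{test}}}{n_{\text{train}}}\right)\sigma_{\text{models}}^2,$$

where $\sigma_{\text{models}}$ is the standard deviation of the J individual accuracies and $n_{\text{test}}/n_{\text{train}}$ is the test-to-train ratio; the small change for J < 5 mentioned above is not part of this original formula.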
Now, back to k-fold cross validation. We don't know the guaranteed accuracy here except in one simple case: 2-fold cv. 2-fold cv has no correlations between its training sets or between its test sets, so the standard deviation of the models' accuracies is the same as that of the total accuracy. It turns out that the guaranteed accuracy for 2-fold cv is 0.2% higher than for optimal (test-size-selected) repeated hold-out with two models. Beyond 2-fold cv, the best we can do is plot the mean accuracy of the different selection strategies:
[Plot: mean accuracy of stratified k-fold cv (blue) and repeated hold-out (green) as a function of the number of folds/models]
Comparing the green and the blue curve, we can see that for 2 folds, repeated hold-out is better because it has a smaller test size (see the plot above), while in the intermediate region, stratified k-fold cv is slightly better because it can afford a larger training set. For more than about 50 models/folds, repeated hold-out wins again because it is able to maintain a larger test size. In fact, we do expect some test set bias to appear at 300 folds, i.e. leave-one-out cv (LOOCV), because its accuracy is a simple average of N = 300 single observations. We therefore expect the estimate to be about one error of the mean away from the model's true accuracy p, i.e. √(p(1−p)/N); here that is 2.1%, four times more than what is actually observed. So k-fold cross validation fell prey to bias in the test set at large numbers of folds, whereas repeated hold-out can hold the test size at 5% for an arbitrary number of models and was therefore less affected by the bias. This bias is expected to decrease very rapidly with increasing test size, because both the number of possible models and the number of test data points grow as the test size increases. The fact that it persists up to six test data points, or 50 folds, is an indicator of strong correlations. Let's now turn to repeated hold-out, where all of this will become even clearer:
[Plot: mean accuracy for repeated hold-out as a function of test size, with the best-fit line]
This plot shows the mean accuracy for repeated hold-out. The nice thing is that we can clearly see the bias as the residuals from the best-fit line. There is bias in LOOCV and LTOCV (leave-two-out cv), and we see that it persists up to a test size of about 10%. On top of the bias, we see oscillations (variance) starting at a 1% test size (three test points) and lasting until about a 10% test size (30 test points), disappearing at the same point as the bias. These oscillations do not seem to decay at an appreciable rate as you build more models (shown in green). The NB algorithm does not account for either of these features, so we use a rolling window and exclude test sizes below 5%. We also see that LOOCV does not yield any improvement over a test size of 10%, which agrees well with previous analyses that found 10 folds to already produce optimal results. However, we do see a small improvement at a test size of roughly 5%, which is what we capitalized on in the previous plot.
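For the smoothing step, one possible rolling-window filter looks like this (the window length is an arbitrary choice, and mean_accuracies and test_sizes are assumed to hold one repeated hold-out result per test size from the scan above):

```python
import pandas as pd

# one mean accuracy per test size, e.g. from the repeated hold-out scan above
acc_by_test_size = pd.Series(mean_accuracies, index=test_sizes)

# drop the strongly biased region below 5% and smooth out the oscillations
smoothed = (acc_by_test_size[acc_by_test_size.index >= 0.05]
            .rolling(window=5, center=True, min_periods=1)
            .mean())
```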
If you were wondering: the reason the NB error is 2% even though the points all lie on the line is that this error represents the expected training set bias, so it does not show up as scatter around the line. It quantifies the chance that we got unusually lucky or unlucky with all of our data. On top of that comes the effect of any one model having less training data, which is what we are trying to estimate with many random seeds. The scatter around the line does increase a lot when you average fewer models.
Looking at the second plot, we see that repeated hold-out (in green) is a good indicator of, and lower bound on, k-fold cv (in blue) between 3 and 35 folds. Turning back to the first plot: to stay within 1% of the best possible guaranteed accuracy gain (3.34%), we need 20 folds; to stay within 0.5%, we need repeated hold-out with a test size of 5% and 70 models. For supported accuracy, we want to use k-fold cv between 6 and 16 folds: to stay within 1% of the best possible accuracy gain (2.39%), we need 9 folds; to stay within 0.5%, we need repeated hold-out with a test size of 5% and 31 models.
The same recipe generalizes to other datasets: we showed how to use repeated hold-out as a proxy to study the accuracy gain achievable by increasing the number of folds, and at which point (and why) to switch from k-fold cv to repeated hold-out. The test set bias of LOOCV can be estimated from its error of the mean; this is an order-of-magnitude estimate of the true bias (in this case, off by 75%).
Here is the link to the code on my GitHub.