The curse of dimensionality refers to the problems that arise when analyzing high-dimensional data. The dimensionality or dimension of a dataset refers to the number of linearly independent features in that dataset, so a high-dimensional dataset is a dataset with a large number of features. This term was first coined by Bellman in 1961 when he observed that the number of samples required to estimate an arbitrary function with a certain accuracy grows exponentially with respect to the number of parameters that the function takes.
In this article, we take a detailed look at the mathematical problems that arise when analyzing a high-dimensional set. Though these problems may look counterintuitive, it is possible to erxpalin them intuitively. Instead of a purely theoretical discussion, we use Python to create and analyze high-dimensional datasets and see how the curse of dimensionality manifests itself in practice. In this article, all images, unless otherwise noted, are by the author.
Dimension of a dataset
As mentioned before, the dimension of a dataset is defined as the number of linearly independent features that it has. A linearly independent feature cannot be written as a linear combination of the features in that dataset. Hence, if a feature or column in a dataset is a linear combination of some other features, it won’t add to the dimension of that dataset. For example, Figure 1 shows two datasets. The first one has two linearly independent columns and its dimension is 2. In the second dataset, one column is a multiple of another, hence we only have one independent feature. As the plot of this dataset shows, despite having two features, all the data points are along a 1-dimensional line. Hence the dimension of this dataset is one.
The effect of dimensionality on volume
The main reason for the curse of dimensionality is the effect of the dimension on volume. Here, we focus on the geometrical interpretation of a dataset. Generally, we…