Catch-64: The Curse of Dimensionality and Your Model

Guymonahan
4 min read · Feb 9, 2021

The curse of dimensionality describes what happens as the feature space grows: the volume of the space increases so fast that the available data become sparse, and error increases as a result. When we first learn about dimensions as kids we are told there are 3 dimensions. For those who take hard sciences or dig a little deeper, there are at least 4 dimensions, sometimes up to 11 or 26 (depending on your comfort with modern String Theory). But when it comes to approaching data in data science, interesting things pop up when more and more dimensions (read: features) are part of the model.

As dimensionality increases, the importance of any single dimension's values gets swallowed up by the sheer weight of the others.

The usefulness of the model collapsing in on itself under the weight of all the features

If we look at how volume scales as extra dimensions are added, we can start with a square. A square 4 units on a side has an area of 16, the corresponding cube has a volume of 64, and by the 9th dimension the hypervolume ramps up to 262,144. Each added dimension (feature) drastically increases the space the model has to account for.
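Here is a minimal sketch of that scaling, using the 4-unit side length from the example above:

```python
# A quick check of the volume figures above: a hypercube 4 units on a
# side, measured in 2, 3, and 9 dimensions.
side = 4
for d in (2, 3, 9):
    print(f"{d}D volume: {side**d:,}")
```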

A multi-dimensional model that is able to cleanly separate both classes
Too many features make the model over-fit, so it will not perform well outside of the training data

As the feature space grows, the model has to increase in complexity to cover that volume. This is especially true when the data is run through polynomial expansions of its features: each new combination is another direction the model has to account for when processing incoming data and storing it in a way that makes sense. Spread thinly across that space, the training data becomes less useful, and the fitted boundary grows 'spikes' of hyper-shapes that expect an 'x' but get a 'y' instead. As the number of dimensions grows, the number of training points needed to adequately occupy the space has to keep pace with the exponent. Infinite features require infinite training data.
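To make the "keep pace with the exponent" point concrete, here is a rough sketch. It assumes, purely for illustration, that each feature is split into 10 bins and we want at least one training point per cell on average; the required data grows exponentially with the number of features:

```python
# A rough sketch of why data requirements explode: split each feature
# into 10 bins (an illustrative choice) and count the cells that would
# each need at least one training point on average.
bins_per_feature = 10
for n_features in (1, 2, 3, 6, 9):
    cells = bins_per_feature ** n_features
    print(f"{n_features} feature(s): {cells:,} cells to cover")
```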

2D square / 3D cube / 8D hypercube (w 256 corners)

If you look at the hypercube on the right, you see that while many of the 'dogs' in the data are still relatively close to each other, the model has become so complex that the route through the space must run along the model itself, travelling all the way toward the middle just to come back out to the nearest point. Think of taking the subway from Brooklyn to Manhattan just to head back to Brooklyn (not very efficient, but sometimes necessary). And as the dimensional space fills with data, the variability between one data point and another becomes more and more minuscule, making the data that much harder to draw any cogent analysis from. Since methods like K-Nearest Neighbors and decision trees rely on the relative distances between data points, this distortion makes it difficult for the machine learning algorithm to make sense of the data it receives in the ocean of high-dimensional space, and it will produce outputs that don't make sense.
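A small numpy experiment makes this distance distortion visible. With the (arbitrary) sample sizes below, the gap between a query point's nearest and farthest neighbors shrinks relative to the nearest distance as dimensions are added, which is exactly what undermines distance-based methods like K-Nearest Neighbors:

```python
import numpy as np

# Relative contrast between the nearest and farthest neighbor of a
# single query point, for random data in the unit hypercube. The point
# counts and dimensions are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n_points = 1000

for d in (2, 10, 100, 1000):
    data = rng.random((n_points, d))   # training points
    query = rng.random(d)              # one query point
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```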

As the number of dimensions increases, the ability to garner useful data drops off very quickly

Neural networks are one way to engage this tsunami of data and warped space. They work by transforming the raw features into more cogent intermediate representations, rather than treating each data point as a fixed location in space the way more basic models do. Gradient descent helps in the fight to optimize the parameters and reduce error: by repeatedly stepping against the instantaneous rate of change (the gradient), it works its way down toward a minimum of the error surface. But an issue arises when high-dimensional error surfaces are about as smooth as the Himalayas, with many local minima forming a moat around the true global minimum. This is another instance where dimensionality shows up as a tricky foe: the bumpiness of high-dimensional space should help the true global minimum stand out, but as you get closer to it there is real potential for over-fitting. Another way to think of it is a catch-22: as you make it easier to find the optimal set of features to explain your model, you may simultaneously be creating an over-fit model that is not really usable. At the end of the day, gaining insight from many layers of featured dimensions is a trade-off once you get close to the edge of a perfect model and the data is very extensive.
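To see how the starting point decides which minimum gradient descent falls into, here is a toy sketch on a bumpy one-dimensional error surface. The surface, learning rate, and starting points are made-up illustrative choices, not anything tied to a particular network:

```python
import numpy as np

# A toy error surface with one deep (global) minimum and several shallow
# (local) ones.
def error(w):
    return w**2 + 3 * np.sin(3 * w)

def grad(w):
    return 2 * w + 9 * np.cos(3 * w)   # derivative of error(w)

def gradient_descent(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)              # step against the slope
    return w

# Different starting points land in different minima; only one of these
# runs finds the global minimum of the surface.
for start in (-4.0, -1.0, 2.5):
    w = gradient_descent(start)
    print(f"start {start:+.1f} -> w = {w:+.3f}, error = {error(w):+.3f}")
```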
