Deep Learning · Theory

The Curse of Dimensionality: Why Geometric Deep Learning Has to Exist

Before you can appreciate why GNNs, CNNs, and transformers work, you have to understand why naive high-dimensional learning is doomed — and why exploiting geometry is the only known way out.

June 202612 min readGDL, Learning Theory, Curse of Dimensionality

In the introduction to Geometric Deep Learning we said GDL is about baking symmetry and structure into neural networks. That's the what. This post is about the why — and the why is darker than it sounds.

The honest version is this: learning a generic function in high dimensions is mathematically impossible. Not hard. Impossible. No algorithm, no amount of compute, no clever trick gets around it. The number of examples you'd need grows so fast with dimension that the whole universe isn't big enough to hold your training set.

This is the curse of dimensionality, and it's the foundational problem behind Geometric Deep Learning. Everything GDL does — convolutions, message passing, equivariance — is ultimately a way of escaping this curse. So before the escape, let's understand the trap.

Deep learning doesn't beat the curse of dimensionality by being clever about optimization. It beats it by refusing to learn arbitrary functions in the first place.

Statistical learning, in four pieces

To talk about why learning fails, we first need a precise picture of what learning is. Strip any supervised learning setup down and you always find the same four ingredients.

Data distribution

Inputs x drawn from some ν over a domain X, with labels y = f*(x) from an unknown target function

Hypothesis class

The set F of candidate functions you allow — polynomials, or neural nets of a fixed architecture

Error metric

A loss ℓ(y, y′) measuring how wrong a prediction is — e.g. squared error

Estimation algorithm

A rule that picks one f̂ from F using only the finite data you actually have

We have data {(xᵢ, yᵢ)} where each xᵢ is drawn from a data distribution ν over the domain X, and yᵢ = f*(xᵢ) for some unknown target f*: X → ℝ. Maybe f*(x) is the excitation energy of a molecule x, or the probability P(y=1 | x) in a classifier. We never see f* — we only see samples of it.

The hypothesis class F is the menu of functions we're willing to consider. Crucially, it's organized by a complexity measure γ(f) — a non-negative number, usually a norm, that says how "big" or "wiggly" a candidate is. The number of neurons in a network is one example; a Sobolev norm ∫(1+ω²)ˢ |f̂(ω)|² dω (which penalises high-frequency wiggles) is another. Often γ isn't even written down explicitly — it's defined implicitly by the algorithm you run.

This organisation matters because of a hard fact: you cannot analyse learning without assumptions on both ν and f*. There is no free lunch. If the target can be any function at all, the data you've seen tells you nothing about the data you haven't.

The two losses, and the gap between them

What we want to minimise is the population loss — the average error over the entire true distribution:

Population loss

R(f) = 𝔼_ν[ ℓ( f(x), f*(x) ) ]

What we can actually compute is the empirical loss — the average over our finite sample:

Empirical loss

R̂(f) = (1/n) Σ_i ℓ( f(xᵢ), f*(xᵢ) )

For any fixed f, the empirical loss is an unbiased estimate of the population loss, and its variance shrinks like 1/n. So far so good. But there's a trap: in real learning, the f̂ we choose depends on the training set. It was picked precisely because it looks good on that data. So the naive point-wise variance bound is useless — we need bounds that hold uniformly over the whole class F at once (this is where Rademacher complexities and covering numbers enter). That word uniformly is where dimension will come back to bite us.

To control those fluctuations, we restrict attention to a ball in the hypothesis class — everything with complexity below some budget δ:

F_δ = { f ∈ F : γ(f) ≤ δ }

and then do Empirical Risk Minimisation (ERM) — pick the function in that ball that fits the data best:

ERM

f̂_δ = arg min_{f ∈ F_δ} R̂(f)

Equivalently written in penalised form (R̂(f) + λγ(f)) or interpolation form (minimise γ(f) subject to R̂(f) = 0).

The decomposition: three errors, not one

Here is the single most important equation in learning theory. Take the function f̂ your algorithm returned, and ask: how much worse is it than the best possible function? That gap — the excess risk — splits cleanly into three independent pieces. This is the classic decomposition of [Bottou and Bousquet]:

R(f̂) − inf_f∈F R(f) ≤ ε_opt + ε_stat + ε_appr

Every model that has ever worked is a balancing act between these three terms.

Read it slowly, because each term is a different kind of failure:

ε_appr — Approximation

Is the true function f* even reachable by my class? The gap between the best f in F_δ and the best f anywhere. Shrinks as δ grows.

ε_stat — Statistical

Does my finite data pin down the right function? The worst-case gap sup|R(f) − R̂(f)| over the ball. Grows as δ grows.

ε_opt — Optimization

Did my algorithm actually find the empirical minimum? The slack R̂(f̂) − inf R̂. Small when ERM is solvable.

The tension

δ pushes approximation down but statistical error up. Learning is finding the sweet spot — the classic bias–variance trade-off, renamed.

The geometry is worth picturing. Imagine two sets of nested contour lines on the same plane: one for the population loss (centred on the true optimum f*) and one, shifted and tilted, for the empirical loss (centred wherever your particular sample happens to pull it). ERM lands you at the bottom of the empirical bowl, f̂ — but you pay in population loss. The arrow from f̂ to f* decomposes into a horizontal hop (approximation — how far the restricted class sits from the truth) and a vertical hop (statistical — how far the empirical minimum sits from the population minimum within the class).

The whole game of supervised learning is choosing the complexity budget δ to balance ε_appr (which wants δ large, so the class is rich enough to contain f*) against ε_stat (which wants δ small, so finite data can still tell the functions apart). This is the bias–variance trade-off wearing a more precise costume.

So the question for the rest of this post is sharp: can we control all three terms at once when the dimension d is large? The optimization term, it turns out, is the friendly one. The other two are where the curse lives.

Bellman's curse: when "nearby" stops existing

The phrase curse of dimensionality is Richard Bellman's, from 1962. To feel it, start with the most basic learning principle there is: interpolation.

Nearest-neighbour methods, kernels, and similarity search all rest on one assumption — points that are close in input space have similar labels. If you've seen a training point near your query, you can just borrow its answer. In 1D this works beautifully: scatter a few dozen samples along a curve and any new point has a close neighbour to copy from.

The catch is the word near. How many samples do you need so that every point in the space has a neighbour within distance ε? In d dimensions, to tile the unit cube at resolution ε = 1/k you need one sample per cell, and the number of cells is:

Sample complexity

N = k^d = (1/ε)^d

This is exactly the covering number of a Lipschitz ball — and it is the formal statement of the curse.

That exponent d is the entire problem. Play with it below — drag the dimension and watch the sample count rocket past the grains of sand on Earth and then the stars in the observable universe within a couple dozen dimensions, while a million-point dataset collapses to almost no resolution per axis:

The Curse, in One Slider

Drag the dimension and watch every cost grow exponentially, not gradually

Dimension d3

Resolution — cells per axis (k)ε = 1/10

Samples to stay accurate

1,000

N = k^d = 10³

What 1,000,000 points buy you

100 /axis

n^1/d = 10^2.00

Cells you can resolve along each direction.

Volume in the outer 5% shell

27.1%

1 − 0.9^d

Fraction of points sitting near the cube’s surface.

Sample count vs. dimension (log scale)

A straight line on a log axis means the true curve is exploding. Doubling the dimension squares the number of samples you need.

Three things are worth doing in that widget:

Raise the dimension with k fixed at 10. By d = 20 you already need 10²⁰ samples — more grains of sand than exist on Earth — just to keep a crude one-decimal resolution.
Watch the middle panel. A generous million-example dataset gives you a fine grid in 2D, but in 20D it buys you barely two levels per axis. Your "big data" is sparse dust.
Watch the shell. In high d, almost all of a cube's volume sits in a thin skin near its surface. There is no "typical, central" point — every sample is an outlier on some boundary, and every pair of points drifts toward the same mutual distance. "Nearby" stops being a meaningful concept.

This is the formal result: to learn a 1-Lipschitz function f* (one where |f(x) − f(x′)| ≤ ‖x − x′‖, the natural "locality prior") under a Gaussian ν to accuracy ε, you need on the order of ε^(−d) samples. The statistical error of the Lipschitz class is cursed by dimension.

The curse attacks approximation too

Maybe the fix is to stop demanding the huge Lipschitz class and use neural networks instead? Consider the shallow-network class:

Shallow net

F = { f(x) = Σ_j≤m aⱼ ρ( wⱼᵀx + bⱼ ) }

γ(f) can be the number of neurons m, or a path norm Σ|aⱼ|(‖wⱼ‖ + |bⱼ|).

The Universal Approximation Theorem (Cybenko, Hornik, Pinkus, Barron, …) promises that as long as the activation ρ isn't a polynomial, this class is dense in continuous functions — given enough neurons you can approximate anything. Reassuring! But density says nothing about rates: how many neurons does a given accuracy cost?

And the rates are cursed again. For the Sobolev class Hˢ (functions with s bounded derivatives), the best achievable approximation error with m neurons is [DeVore, Maiorov]:

inf_g∈F sup_x∈K | f(x) − g(x) | = Θ( m^−s/d )

s = number of finite derivatives of f. To hit error ε you need m ≈ ε^(−d/s) neurons — exponential in d again.

That d in the exponent is Bellman, back for a second helping. Smoothness s helps, but unless your function is absurdly smooth, approximation cost still explodes with dimension.

There is one escape hatch. The Barron class — functions whose Fourier transform satisfies ∫ |f̂(ω)| ‖ω‖² dω < ∞ — enjoys a dimension-free rate of O(m^(−1)) [Barron]. No curse! But that integrability condition is a strong assumption that most real targets don't satisfy. It's a hint, not a solution.

The one curse we escape: optimization

Before the bleak summary, some good news. The third term, ε_opt, behaves.

Finding the global optimum of a generic high-dimensional non-convex function is NP-hard — the loss landscape can look like an endless jagged mountain range with exponentially many traps. By rights, training deep networks should be hopeless.

It isn't, for two reasons:

Many ML loss landscapes are "benign." They aren't arbitrary — they often have no bad local minima, so every valley you can fall into is roughly as good as the best one.
Gradient descent is efficient at finding local minima. A theorem of [Jin et al., 2017] shows noisy gradient descent reaches an ε-approximate second-order stationary point of a β-smooth function in Õ(β/ε²) iterations — with no dependence on dimension (up to log factors).

The takeaway: provided the model defines a benign landscape, optimization is not where the curse bites. This is a big part of why deep learning works at all — and it lets us focus the fight on the two terms that remain.

The squeeze: too big and too small at the same time

Put the pieces together and you see the bind precisely. We have two opposing pressures on the choice of function class:

Function class	Approximation (ε_appr)	Statistical (ε_stat)	Verdict
Lipschitz (locality only)	Fine — rich enough	Cursed: needs ε^(−d) samples	Too large
Sobolev / Barron (very smooth)	Cursed: needs ε^(−d/s) neurons	Fine — easy to estimate	Too small

The Lipschitz class is too large: it contains so many functions that finite data can't tell them apart, so the statistical error explodes. The smooth Sobolev/Barron classes are too small: they're easy to estimate, but real targets aren't smooth enough to live in them, so the approximation error explodes.

Classical regularity — "the function is smooth," "the function is Lipschitz" — is the mathematics of low dimensions. In high dimensions it forces you to choose between a class that's impossible to estimate and one that's impossible to fit. Neither works.

We need a different kind of assumption — a function class that is somehow rich and estimable at the same time. Where does it come from?

The escape: geometry

Here is the pivot of the entire story, and the reason Geometric Deep Learning exists.

The way out is to stop treating inputs as featureless vectors in ℝᵈ and start treating them as signals defined over a low-dimensional geometric domain Ω. An image isn't an abstract point in a million-dimensional pixel space — it's a signal x(u) over a 2D grid of pixel locations u. A molecule is a signal over a graph of atoms. A social network, a 3D shape, a sequence — each is a signal over a domain with its own structure.

x : Ω → ℝ^c, with domain Ω being a grid, graph, group, or manifold

Inputs are signals x(u) over a domain Ω — not arbitrary points in ℝᵈ.

That geometric domain hands us new notions of regularity — assumptions tied to the structure of Ω rather than to raw smoothness. Locality, stationarity (the same rule applies everywhere), and symmetry (the answer is unchanged under transformations like translation, rotation, or relabelling) are all low-dimensional priors that nonetheless carve out a function class rich enough to contain real targets. They thread the needle between "too large" and "too small."

These domains are the now-famous "5 Gs" of Geometric Deep Learning:

Grids

Images, audio, video → CNNs exploit translation symmetry

Graphs

Molecules, networks → GNNs exploit permutation symmetry

Groups

Spheres, global symmetries → equivariant networks

Geodesics & Gauges

Manifolds, 3D meshes → curvature-aware models

A convolution is exactly this idea in action: instead of a fully-connected layer with d² free weights (a generic high-dimensional function, fully cursed), it shares one small local filter across every position. Translation symmetry collapses the number of parameters from cursed to constant — and that's why it generalises from a realistic number of images. Message passing does the same thing for graphs, sharing one update rule across every node.

Takeaways

Four crisp conclusions are worth carrying around:

High-dimensional learning is impossible without assumptions — it's plagued by the curse of dimensionality at its core.
Classical regularity assumptions from low-dimensional mathematics don't transfer. Lipschitz is too large to estimate; Sobolev/Barron is too small to fit.
The escape is to view data as signals over low-dimensional geometric domains — grids, graphs, groups, manifolds.
Those domains supply new, structure-aware notions of regularity — locality, stationarity, symmetry — that are rich and estimable.

The curse of dimensionality isn't a footnote you can engineer around — it's the wall every high-dimensional learner runs into. Geometric Deep Learning is the realisation that the wall has a door, and the door is shaped like the geometry of your data. Next we'll start walking through it: the symmetries and structures that turn an impossible problem into a tractable one.

Share this post

Back to blog