Message Passing: How Graph Neural Networks Actually Think
Every GNN — from drug discovery to your recommendation feed — runs on one simple loop. Here you can step through it by hand.
In the adjacency matrix post we saw how a graph gets translated into numbers a neural network can read. But encoding the graph is only the setup. The real question is: once a network has the graph, how does it actually learn from it?
The answer is a single, beautifully simple idea called message passing. Almost every graph neural network you've heard of — the ones predicting protein structure, ranking your feed, or flagging fraud — is just this one loop, repeated.
A node, on its own, knows almost nothing. Message passing lets it borrow knowledge from its neighbors, one hop at a time.
The one-sentence version
Each node looks at its neighbors, gathers what they know, and updates its own understanding. Then everyone does it again.
That's it. Run that loop a few times and something remarkable happens: information ripples outward across the graph. After one round, a node knows about its immediate neighbors. After two, it knows about its neighbors' neighbors. After k rounds, it has heard from everything within k hops.
Three moves: message, aggregate, update
To make this precise, message passing splits each round into three steps. Let hᵥ be the feature vector of node v — a list of numbers describing it — and let N(v) be its set of neighbors.
The Geometric Deep Learning literature has a standard notation for the three functions, which I'll use here: ψ (psi) is the message function, ⊕ is a permutation-invariant aggregator, and φ (phi) is the update function.
1. Message — every neighbor u prepares something to send to v, using the message function ψ:
mu→v = ψ(hv, hu)
2. Aggregate — node v combines all incoming messages into one summary with ⊕. This step must be order-independent (your neighbors have no natural ranking), so ⊕ is a symmetric function like sum, mean, or max:
av = ⊕u∈N(v) mu→v
3. Update — node v mixes the summary with its own current state via the update function φ to produce its new state:
hv(k+1) = φ(hv(k), av)
Folded into a single line, that's the canonical message-passing equation you'll see across the GDL literature:
hv(k+1) = φ( hv(k), ⊕u∈N(v) ψ(hv(k), hu(k)) )
See it move
Reading the equations is one thing. Watching a signal physically spread across a graph is another. The playground below is a tiny network where every node holds a single number — its feature.
Node A starts "lit up" at 5; everyone else is cold at 0. Hit Pass messages to run one round, and watch the value diffuse outward. Click any node to inject a fresh signal there instead. Hover a node (or table row) to see the exact aggregation it's computing this round.
Message Passing Playground
Click a node to inject a signal, then step the rounds and watch it spread
Round k = 0
brighter = higher feature value
What node A computes
Feature values per round
| node | k0 |
|---|---|
| A | 5 |
| B | 0 |
| C | 0 |
| D | 0 |
| E | 0 |
| F | 0 |
Try mean + include self and keep stepping — every node drifts toward the same value. That smoothing is exactly the over-smoothing that limits how deep GNNs can go.
💡 Each round = one GNN layer. After k rounds, every node has heard from neighbors up to k hops away.
A few things worth doing:
- Step slowly with "mean". Notice how A's signal reaches B and C after one round, then D, E, F after two. That's the
k-hop reach in action. - Switch to "sum". Values explode instead of averaging — which is why real GNNs normalize (more on that below).
- Keep stepping. With mean + include self, every node eventually converges to nearly the same value. The graph forgets who started hot. This is over-smoothing, and it's the reason most GNNs are only 2–4 layers deep.
The math, connected back to the matrix
Here's the satisfying part. Remember from the adjacency-matrix post that A · X sums each node's neighbor features in one matrix multiply? That is the aggregate step.
A single layer of the most common GNN — the Graph Convolutional Network (GCN) — is just:
H(k+1) = σ( Â · H(k) · W )
Breaking that down:
H⁽ᵏ⁾— the stacked feature vectors of every node at roundk— the normalized adjacency matrix,D⁻½ A D⁻½, which does the neighbor aggregation without letting high-degree nodes dominateW— a learned weight matrix (this is the part training actually adjusts)σ— a nonlinearity like ReLU
So · W is the message function ψ, Â · is the aggregator ⊕, and σ(...) is the update φ. The three abstract moves collapse into one tidy line of linear algebra — which is exactly why GNNs run efficiently on GPUs.
The mean aggregation in the playground is essentially the  · H step. That smoothing you watched is a real, predictable property of repeatedly multiplying by the normalized adjacency matrix.
The same idea, in PyTorch
Frameworks like PyTorch Geometric give you message passing as a base class. You only fill in the three functions — the library handles routing messages along the edges:
import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree
class GCNLayer(MessagePassing):
def __init__(self, in_dim, out_dim):
super().__init__(aggr="add") # AGGREGATE = sum
self.lin = torch.nn.Linear(in_dim, out_dim)
def forward(self, x, edge_index):
# add self-loops so a node also keeps its own features
edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
x = self.lin(x) # MESSAGE transform (the W)
# symmetric normalization: D^-1/2 A D^-1/2
row, col = edge_index
deg = degree(col, x.size(0), dtype=x.dtype)
deg_inv_sqrt = deg.pow(-0.5)
norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
# propagate() runs message -> aggregate -> update for us
return self.propagate(edge_index, x=x, norm=norm)
def message(self, x_j, norm):
# x_j = features of neighbor j, scaled by the normalization
return norm.view(-1, 1) * x_j
def update(self, aggr_out):
# UPDATE: here we just pass the aggregated result through
return aggr_out
Notice the shape: message defines what a neighbor sends, the aggr="add" setting picks the aggregator, and update decides what to do with the result. That's the same message → aggregate → update loop you just stepped through by hand — only now it runs on millions of nodes and learns W by backprop.
Why the aggregator choice matters
The aggregate step has to be permutation-invariant — reorder a node's neighbors and the answer can't change. That rules out anything position-dependent and leaves a small menu, each with a personality:
Flip between sum, mean, and max in the playground and you can feel the difference: sum grows without bound, mean settles into consensus, max latches onto the brightest node nearby.
The honest limits
Message passing is powerful, but it isn't magic.
- Over-smoothing. Stack too many layers and every node converges to the same vector — exactly what you saw when you kept stepping. Real depth is usually 2–4 layers.
- Over-squashing. Information from far-away nodes has to funnel through bottleneck edges, getting compressed into a fixed-size vector. Long-range signals get lost.
- Limited expressiveness. Standard message passing can't always tell certain graph structures apart — it's provably no stronger than a classic graph-isomorphism test (the Weisfeiler–Leman algorithm).
These aren't dealbreakers; they're the active research frontier. Graph transformers, residual connections, and rewiring tricks all exist to push past these walls.
The takeaway
Strip away the jargon and every graph neural network is doing the same humble thing: nodes talking to their neighbors, one round at a time, until local conversations add up to a global understanding.
Message → aggregate → update. Run it k times. That's the whole engine.
The adjacency matrix tells a network who is connected. Message passing is how that network turns connection into knowledge — and now you've stepped through it yourself, one round at a time.