Mausam.
HomeAboutExperienceProjectsPublicationsSkillsBlogContact
Mausam.

Machine learning engineer & researcher based in Nepal. Exploring computer vision and medical imaging in fundus images.

Quick Links

  • About
  • Experience
  • Projects
  • Publications
  • Blog
  • Contact

Connect

hello@mausamgrg.com.np

© 2026 Mausam Gurung. All rights reserved.

Back to blog
Deep Learning · GNN

Message Passing: How Graph Neural Networks Actually Think

Every GNN — from drug discovery to your recommendation feed — runs on one simple loop. Here you can step through it by hand.

June 20268 min readGNN, Graphs, Message Passing

In the adjacency matrix post we saw how a graph gets translated into numbers a neural network can read. But encoding the graph is only the setup. The real question is: once a network has the graph, how does it actually learn from it?

The answer is a single, beautifully simple idea called message passing. Almost every graph neural network you've heard of — the ones predicting protein structure, ranking your feed, or flagging fraud — is just this one loop, repeated.

A node, on its own, knows almost nothing. Message passing lets it borrow knowledge from its neighbors, one hop at a time.

The one-sentence version

Each node looks at its neighbors, gathers what they know, and updates its own understanding. Then everyone does it again.

That's it. Run that loop a few times and something remarkable happens: information ripples outward across the graph. After one round, a node knows about its immediate neighbors. After two, it knows about its neighbors' neighbors. After k rounds, it has heard from everything within k hops.

Three moves: message, aggregate, update

To make this precise, message passing splits each round into three steps. Let hᵥ be the feature vector of node v — a list of numbers describing it — and let N(v) be its set of neighbors.

The Geometric Deep Learning literature has a standard notation for the three functions, which I'll use here: ψ (psi) is the message function, ⊕ is a permutation-invariant aggregator, and φ (phi) is the update function.

1. Message — every neighbor u prepares something to send to v, using the message function ψ:

Message

mu→v = ψ(hv, hu)

2. Aggregate — node v combines all incoming messages into one summary with ⊕. This step must be order-independent (your neighbors have no natural ranking), so ⊕ is a symmetric function like sum, mean, or max:

Aggregate

av = ⊕u∈N(v) mu→v

3. Update — node v mixes the summary with its own current state via the update function φ to produce its new state:

Update

hv(k+1) = φ(hv(k), av)

Folded into a single line, that's the canonical message-passing equation you'll see across the GDL literature:

hv(k+1) = φ( hv(k), ⊕u∈N(v) ψ(hv(k), hu(k)) )

Every GNN variant — GCN, GraphSAGE, GAT — is just a different choice of ψ, ⊕, and φ.

See it move

Reading the equations is one thing. Watching a signal physically spread across a graph is another. The playground below is a tiny network where every node holds a single number — its feature.

Node A starts "lit up" at 5; everyone else is cold at 0. Hit Pass messages to run one round, and watch the value diffuse outward. Click any node to inject a fresh signal there instead. Hover a node (or table row) to see the exact aggregation it's computing this round.

Message Passing Playground

Click a node to inject a signal, then step the rounds and watch it spread

Aggregate

Round k = 0

brighter = higher feature value

A5B0C0D0E0F0

What node A computes

Round 0 — the initial state. Hit Pass messages to run one round of aggregation.

Feature values per round

nodek0
A5
B0
C0
D0
E0
F0

Try mean + include self and keep stepping — every node drifts toward the same value. That smoothing is exactly the over-smoothing that limits how deep GNNs can go.

💡 Each round = one GNN layer. After k rounds, every node has heard from neighbors up to k hops away.

A few things worth doing:

  • Step slowly with "mean". Notice how A's signal reaches B and C after one round, then D, E, F after two. That's the k-hop reach in action.
  • Switch to "sum". Values explode instead of averaging — which is why real GNNs normalize (more on that below).
  • Keep stepping. With mean + include self, every node eventually converges to nearly the same value. The graph forgets who started hot. This is over-smoothing, and it's the reason most GNNs are only 2–4 layers deep.

The math, connected back to the matrix

Here's the satisfying part. Remember from the adjacency-matrix post that A · X sums each node's neighbor features in one matrix multiply? That is the aggregate step.

A single layer of the most common GNN — the Graph Convolutional Network (GCN) — is just:

GCN layer

H(k+1) = σ( Â · H(k) · W )

Breaking that down:

  • H⁽ᵏ⁾ — the stacked feature vectors of every node at round k
  • Â — the normalized adjacency matrix, D⁻½ A D⁻½, which does the neighbor aggregation without letting high-degree nodes dominate
  • W — a learned weight matrix (this is the part training actually adjusts)
  • σ — a nonlinearity like ReLU

So · W is the message function ψ, Â · is the aggregator ⊕, and σ(...) is the update φ. The three abstract moves collapse into one tidy line of linear algebra — which is exactly why GNNs run efficiently on GPUs.

The mean aggregation in the playground is essentially the  · H step. That smoothing you watched is a real, predictable property of repeatedly multiplying by the normalized adjacency matrix.

The same idea, in PyTorch

Frameworks like PyTorch Geometric give you message passing as a base class. You only fill in the three functions — the library handles routing messages along the edges:

Python
1
import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

class GCNLayer(MessagePassing):
    def __init__(self, in_dim, out_dim):
        super().__init__(aggr="add")          # AGGREGATE = sum
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index):
        # add self-loops so a node also keeps its own features
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))

        x = self.lin(x)                        # MESSAGE transform (the W)

        # symmetric normalization: D^-1/2 A D^-1/2
        row, col = edge_index
        deg = degree(col, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]

        # propagate() runs message -> aggregate -> update for us
        return self.propagate(edge_index, x=x, norm=norm)

    def message(self, x_j, norm):
        # x_j = features of neighbor j, scaled by the normalization
        return norm.view(-1, 1) * x_j

    def update(self, aggr_out):
        # UPDATE: here we just pass the aggregated result through
        return aggr_out

Notice the shape: message defines what a neighbor sends, the aggr="add" setting picks the aggregator, and update decides what to do with the result. That's the same message → aggregate → update loop you just stepped through by hand — only now it runs on millions of nodes and learns W by backprop.

Why the aggregator choice matters

The aggregate step has to be permutation-invariant — reorder a node's neighbors and the answer can't change. That rules out anything position-dependent and leaves a small menu, each with a personality:

Sum
Keeps track of how many neighbors — great for counting structure, but values can blow up
Mean
Stable and smooth, ignores neighbor count — the GCN default
Max
Picks out the single strongest signal — good for "is any neighbor X?"
Attention
Learns how much to weight each neighbor — the idea behind GATs

Flip between sum, mean, and max in the playground and you can feel the difference: sum grows without bound, mean settles into consensus, max latches onto the brightest node nearby.

The honest limits

Message passing is powerful, but it isn't magic.

  • Over-smoothing. Stack too many layers and every node converges to the same vector — exactly what you saw when you kept stepping. Real depth is usually 2–4 layers.
  • Over-squashing. Information from far-away nodes has to funnel through bottleneck edges, getting compressed into a fixed-size vector. Long-range signals get lost.
  • Limited expressiveness. Standard message passing can't always tell certain graph structures apart — it's provably no stronger than a classic graph-isomorphism test (the Weisfeiler–Leman algorithm).

These aren't dealbreakers; they're the active research frontier. Graph transformers, residual connections, and rewiring tricks all exist to push past these walls.

The takeaway

Strip away the jargon and every graph neural network is doing the same humble thing: nodes talking to their neighbors, one round at a time, until local conversations add up to a global understanding.

Message → aggregate → update. Run it k times. That's the whole engine.

The adjacency matrix tells a network who is connected. Message passing is how that network turns connection into knowledge — and now you've stepped through it yourself, one round at a time.

Share this post