ConvTranspose2d Explained: Learnable Upsampling in PyTorch
Prerequisites: Familiarity with Conv2d, feature maps, stride, and padding in CNNs.
Why Upsampling Matters
In many deep learning tasks (semantic segmentation, image generation, autoencoders) you need to increase spatial resolution, not reduce it. A strided or unpadded Conv2d shrinks your feature maps. So how do you go the other direction?
You could use nearest-neighbor or bilinear interpolation, but those are fixed operations with no learnable parameters. ConvTranspose2d solves this by making upsampling learnable — the network figures out the best way to expand spatial dimensions during training.
Andrew Ng's lecture on transposed convolutions (referenced below) is an excellent companion to this post — he walks through the kernel-multiplication intuition step by step, which maps directly to the zero-insertion mechanic explained here.
You'll find ConvTranspose2d in:
- U-Net decoder path (semantic segmentation)
- DCGAN generator (image synthesis from noise)
- VAE decoder (latent space → full image)
What ConvTranspose2d Actually Does
Despite the name "deconvolution" floating around in older papers, this is not a true mathematical inverse of convolution. The accurate mental model has two steps:
- Insert stride − 1 zeros between neighboring input elements along each spatial dimension (controlled by stride)
- Apply a standard convolution over the result
This zero-insertion is what allows the output to be spatially larger than the input. When stride=2, neighboring input elements land two pixels apart in the expanded grid, and each one spreads its influence across a kernel_size × kernel_size region of the output.
Input (2×2):    After zero-insertion (stride=2):
1 2             1 0 2
3 4             0 0 0
                3 0 4
The kernel then slides over this expanded tensor. Because its weights are learned, the network decides how to fill in the in-between values instead of relying on a fixed interpolation rule, which is the main advantage over fixed upsampling methods.
This is exactly the mechanic Andrew Ng demonstrates visually in his U-Net lecture. Each input value gets multiplied by the full kernel and placed into the output grid — overlapping regions are summed.
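The two-step picture can be checked numerically. The sketch below uses a single channel and no bias to keep the bookkeeping simple, and compares `F.conv_transpose2d` with a hand-built version that inserts zeros and then runs an ordinary stride-1 convolution. One detail matters here: PyTorch's `conv2d` computes a cross-correlation, so the kernel has to be flipped spatially and the expanded tensor zero-padded by `k - 1 - padding` for the two results to match. Variable names such as `expanded` and `manual` are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# The 2×2 input from the diagram above, shaped [batch, channels, H, W].
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)
w = torch.randn(1, 1, 3, 3)          # one 3×3 kernel, single channel
stride, padding, k = 2, 0, 3

# Reference: the built-in transposed convolution.
reference = F.conv_transpose2d(x, w, stride=stride, padding=padding)

# Step 1: insert (stride - 1) zeros between neighboring input elements.
h, w_in = x.shape[-2:]
expanded = torch.zeros(1, 1, (h - 1) * stride + 1, (w_in - 1) * stride + 1)
expanded[..., ::stride, ::stride] = x

# Step 2: stride-1 convolution with the spatially flipped kernel,
# zero-padded by (k - 1 - padding) on every side.
flipped = torch.flip(w, dims=(-2, -1))
manual = F.conv2d(expanded, flipped, padding=k - 1 - padding)

print(expanded[0, 0])                                  # zero-inserted grid
print(reference.shape)                                 # torch.Size([1, 1, 5, 5])
print(torch.allclose(reference, manual, atol=1e-6))    # True
```

With more than one channel, the manual version would also need the weight's first two dimensions swapped; that step is left out here to keep the example short.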
The Matrix Transposition Connection
The name "transposed convolution" comes from linear algebra. If a regular convolution can be expressed as a matrix multiplication:
Y = C · X
...where C is the convolution matrix derived from the kernel, then the transposed convolution computes:
X̂ = Cᵀ · Y
This is why it's called transposed — it uses the transpose of the convolution matrix. It does not invert the values, only the shape transformation. You can verify: a Conv2d and a ConvTranspose2d with identical parameters are inverses in terms of shape, but not in terms of values.
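One way to see the transpose relationship without building the matrix C explicitly is the adjoint identity ⟨C·x, g⟩ = ⟨x, Cᵀ·g⟩, which holds for any matrix and its transpose. The sketch below checks it with the functional API, reusing one weight tensor for both operations; the tensor sizes are arbitrary choices for illustration, and float64 is used only to keep rounding error small.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 3, 7, 7, dtype=torch.float64)    # input-side tensor
w = torch.randn(5, 3, 3, 3, dtype=torch.float64)    # conv2d reads this as 3 -> 5 channels

y = F.conv2d(x, w, stride=2, padding=1)              # C · x, shape [1, 5, 4, 4]
g = torch.randn_like(y)                              # arbitrary output-side tensor

# Same weight tensor, read by conv_transpose2d as 5 -> 3 channels: Cᵀ · g
x_hat = F.conv_transpose2d(g, w, stride=2, padding=1)

lhs = (y * g).sum()          # <C x, g>
rhs = (x * x_hat).sum()      # <x, Cᵀ g>
print(x_hat.shape)           # same shape as x
print(torch.allclose(lhs, rhs))

# The shape round-trips, but the values generally do not.
recon = F.conv_transpose2d(y, w, stride=2, padding=1)
print(recon.shape == x.shape, torch.allclose(recon, x))
```

The input height of 7 was picked so that no `output_padding` is needed for the shapes to line up.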
The Output Size Formula
This is where most confusion happens. The output height and width are:
H_out = (H_in − 1) × stride[0]
        − 2 × padding[0]
        + dilation[0] × (kernel_size[0] − 1)
        + output_padding[0]
        + 1
W_out follows the same formula with W_in and index [1].
Concrete example — the most common configuration: kernel_size=3, stride=2, padding=1
H_in = 6
H_out = (6 − 1) × 2 − 2×1 + 1×(3−1) + 0 + 1
      = 10 − 2 + 2 + 0 + 1
      = 11 ← not 12!
You wanted 12, not 11? That's output_padding=1:
H_out = (6 − 1) × 2 − 2×1 + 1×(3−1) + 1 + 1
      = 10 − 2 + 2 + 1 + 1
      = 12 ✓
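For quick sanity checks, the formula fits in a small pure-Python helper. The function name `conv_transpose2d_out_size` is made up for this post and is not part of PyTorch; it computes one spatial dimension at a time.

```python
def conv_transpose2d_out_size(size_in, kernel_size, stride=1, padding=0,
                              output_padding=0, dilation=1):
    """Output length along one spatial dimension of a transposed convolution."""
    return ((size_in - 1) * stride
            - 2 * padding
            + dilation * (kernel_size - 1)
            + output_padding
            + 1)


# The two worked examples from above.
print(conv_transpose2d_out_size(6, kernel_size=3, stride=2, padding=1))                    # 11
print(conv_transpose2d_out_size(6, kernel_size=3, stride=2, padding=1, output_padding=1))  # 12
```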
The padding Gotcha
In Conv2d, padding adds zeros around the input — making the output bigger.
In ConvTranspose2d, padding removes rows/columns from the output edges. This is counterintuitive but deliberate. Internally:
effective_zero_border = dilation × (kernel_size − 1) − padding
This design ensures that a Conv2d and ConvTranspose2d with the same parameters are shape-inverses of each other — which is exactly what you want in encoder-decoder architectures.
Common mistake: setting padding=1 expecting it to add border pixels.
It actually removes 1 pixel from each side of the output.
Always verify your output shape with the formula before training.
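The cropping behavior is easy to confirm. In the sketch below (random weights, no bias, sizes chosen only for illustration), the output computed with padding=1 matches the padding=0 output with one pixel trimmed from every edge.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 4, 6, 6)
w = torch.randn(4, 2, 3, 3)    # conv_transpose2d layout: [in_channels, out_channels, kH, kW]

full = F.conv_transpose2d(x, w, stride=2, padding=0)       # 13×13 output
cropped = F.conv_transpose2d(x, w, stride=2, padding=1)    # 11×11 output

print(full.shape, cropped.shape)
# padding=1 is the padding=0 result with a 1-pixel border removed
print(torch.allclose(full[..., 1:-1, 1:-1], cropped, atol=1e-5))
```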
Rule of thumb: For clean 2× upsampling, use kernel_size=4, stride=2, padding=1. This gives exactly 2 × H_in with no ambiguity and no output_padding needed.
The output_padding Parameter
When stride > 1, multiple input shapes can produce the same Conv2d output shape. So when reversing direction, there's ambiguity — which size were you originally at?
output_padding resolves this by adding extra rows/columns to one side of the output:
- Does not add learnable parameters
- Does not pad symmetrically
- Only affects the computed output shape
- Valid values: 0 to stride − 1
import torch.nn as nn

# stride=2 → output_padding can be 0 or 1 only
upsample = nn.ConvTranspose2d(
    16, 16,
    kernel_size=3, stride=2,
    padding=1, output_padding=1
)
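To make the ambiguity concrete, the sketch below runs inputs of size 11 and 12 through the same strided Conv2d, then shows how output_padding selects between those two sizes on the way back up. The dictionary names `down_sizes` and `up_sizes` are illustrative only.

```python
import torch
import torch.nn as nn

down = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)

# Two different input sizes collapse to the same downsampled size.
down_sizes = {}
for size in (11, 12):
    out = down(torch.randn(1, 16, size, size))
    down_sizes[size] = tuple(out.shape[-2:])
print(down_sizes)    # {11: (6, 6), 12: (6, 6)}

# Going back up, output_padding picks which of the two sizes to land on.
h = torch.randn(1, 16, 6, 6)
up_sizes = {}
for op in (0, 1):
    up = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2,
                            padding=1, output_padding=op)
    up_sizes[op] = tuple(up(h).shape[-2:])
print(up_sizes)      # {0: (11, 11), 1: (12, 12)}
```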
Code Walkthrough: Symmetric Encoder-Decoder
import torch
import torch.nn as nn
input = torch.randn(1, 16, 12, 12) # [batch, channels, H, W]
# Encoder: halves spatial dims (12 → 6)
downsample = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
# Decoder: should restore spatial dims (6 → 12)
upsample = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1)
h = downsample(input)
print(h.size()) # torch.Size([1, 16, 6, 6])
# Pass output_size to resolve shape ambiguity automatically
output = upsample(h, output_size=input.size())
print(output.size()) # torch.Size([1, 16, 12, 12])
Without output_size=input.size(), the upsample layer produces [1, 16, 11, 11], one pixel short in each spatial dimension. Passing the target size lets PyTorch compute the required output_padding automatically.
Key Parameters at a Glance
| Parameter | Role | Common values |
|---|---|---|
| kernel_size | Size of convolution kernel | 3, 4 |
| stride | Controls upsampling factor | 2 for 2× upsampling |
| padding | Crops output border | 1 with kernel_size=3 or 4 |
| output_padding | Resolves shape ambiguity | 0 or 1 (must be < stride) |
| dilation | Spacing between kernel weights | 1 (default, rarely changed here) |
| groups | Depthwise separability | 1 standard, in_channels for depthwise |
Common Configurations
import torch.nn as nn

C_in, C_out, C = 64, 32, 16  # placeholder channel counts for illustration

# 1) Exact 2× upsampling, the most common choice in practice
up2x = nn.ConvTranspose2d(C_in, C_out, kernel_size=4, stride=2, padding=1)
# H_out = (H_in − 1)*2 − 2 + 3 + 1 = 2*H_in ✓

# 2) Symmetric inverse of a Conv2d
encoder = nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1)
decoder = nn.ConvTranspose2d(C, C, kernel_size=3, stride=2, padding=1)  # pass output_size in forward()

# 3) Non-square upsampling
up_rect = nn.ConvTranspose2d(16, 33, kernel_size=(3, 5), stride=(2, 1), padding=(4, 2))
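A quick shape check for the three configurations might look like the following. The channel counts and input sizes are arbitrary picks for illustration; only the spatial arithmetic matters.

```python
import torch
import torch.nn as nn

# 1) kernel_size=4, stride=2, padding=1 doubles H and W: 6 -> 12
up2x = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
shape_2x = tuple(up2x(torch.randn(1, 16, 6, 6)).shape)

# 2) Conv2d / ConvTranspose2d pair, with output_size restoring 12×12
big = torch.randn(1, 16, 12, 12)
enc = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
dec = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1)
shape_roundtrip = tuple(dec(enc(big), output_size=big.size()).shape)

# 3) Non-square kernel, stride, and padding:
#    H: (50 - 1)*2 - 8 + 2 + 1 = 93,  W: (100 - 1)*1 - 4 + 4 + 1 = 100
rect = nn.ConvTranspose2d(16, 33, kernel_size=(3, 5), stride=(2, 1), padding=(4, 2))
shape_rect = tuple(rect(torch.randn(1, 16, 50, 100)).shape)

print(shape_2x)          # (1, 8, 12, 12)
print(shape_roundtrip)   # (1, 16, 12, 12)
print(shape_rect)        # (1, 33, 93, 100)
```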
Where It Appears in Real Architectures
U-Net (Ronneberger et al., 2015): The decoder path uses transposed convolutions to progressively restore spatial resolution from the bottleneck, concatenating skip connections from the encoder at each scale. Andrew Ng's deeplearning.ai course covers this architecture in detail, including the transposed convolution step.
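As a rough illustration of one decoder step, here is a minimal sketch of an upsample-concatenate-convolve block. The class name `UpBlock` and the channel counts are hypothetical, and the block uses padded 3×3 convolutions so the skip connection lines up without cropping, which is a simplification and not a faithful reproduction of the original paper.

```python
import torch
import torch.nn as nn


class UpBlock(nn.Module):
    """Hypothetical U-Net-style decoder step: upsample, concatenate skip, convolve."""

    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        # kernel_size=2, stride=2 doubles H and W: (H - 1)*2 + (2 - 1) + 1 = 2H
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_channels + skip_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                     # restore spatial resolution
        x = torch.cat([x, skip], dim=1)    # merge encoder features along channels
        return self.conv(x)


block = UpBlock(in_channels=128, skip_channels=64, out_channels=64)
bottleneck = torch.randn(1, 128, 16, 16)
skip = torch.randn(1, 64, 32, 32)
out = block(bottleneck, skip)
print(out.shape)    # torch.Size([1, 64, 32, 32])
```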
DCGAN (Radford et al., 2015): The generator transforms a 100-dim noise vector into a full image through successive ConvTranspose2d layers, most of which double the spatial resolution while reducing the channel count.
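A compact generator in this style could be sketched as below. The layer widths (512 down to 64), the 64×64 RGB output, and the helper name `up_block` are assumptions chosen for illustration, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn


def up_block(c_in, c_out):
    """Hypothetical helper: 2× upsampling with the kernel_size=4, stride=2, padding=1 recipe."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


generator = nn.Sequential(
    # Project the noise [N, 100, 1, 1] to a 4×4 map: (1 - 1)*1 + (4 - 1) + 1 = 4
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
    up_block(512, 256),    # 4 -> 8
    up_block(256, 128),    # 8 -> 16
    up_block(128, 64),     # 16 -> 32
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),    # 32 -> 64, RGB
    nn.Tanh(),             # squash pixel values into [-1, 1]
)

z = torch.randn(2, 100, 1, 1)    # batch of two noise vectors
images = generator(z)
print(images.shape)              # torch.Size([2, 3, 64, 64])
```

Note that the first layer uses stride=1 to go from 1×1 to 4×4, so it expands the map without following the doubling pattern of the later layers.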
Semantic Segmentation heads: FCN upsamples with transposed convolutions, while DeepLab variants and SegFormer decoder heads rely mainly on bilinear upsampling combined with convolution or MLP layers. Either way, the goal is to match the output resolution to the input resolution.
Summary
- ConvTranspose2d is learnable upsampling, not true deconvolution
- It works by inserting zeros between inputs then convolving, enabling spatial expansion
- padding removes output border pixels (opposite of Conv2d intuition)
- Use output_padding or output_size in forward() to resolve shape ambiguity when stride > 1
- For clean 2× upsampling: kernel_size=4, stride=2, padding=1 is the reliable recipe
References