ConvTranspose2d Explained: Learnable Upsampling in PyTorch
In deep learning, we spend a lot of time shrinking feature maps — pooling layers and strided convolutions reduce spatial dimensions to extract high-level features. But what happens when you need to go the other direction?
Tasks like semantic segmentation, image generation, and autoencoders all need to increase spatial resolution at some point. That is where ConvTranspose2d comes in.
This post breaks down exactly how it works, why the parameters behave differently than you might expect, and how to use it correctly in practice.
Who is this for? If you understand how a regular Conv2d works (kernel, stride, padding), you have everything you need. I will build up from there — no prior knowledge of transposed convolutions required.
The Problem: How Do You Upsample?
A standard Conv2d with stride=2 halves your spatial dimensions. Going from a 12x12 feature map to 6x6 is straightforward. But going from 6x6 back to 12x12? That is the upsampling problem.
There are a few options:
| Method | Learnable? | Quality |
|---|---|---|
| Nearest-neighbor interpolation | No | Blocky artifacts |
| Bilinear interpolation | No | Smooth but fixed |
| ConvTranspose2d | Yes | Network learns the best upsampling |
The key advantage of ConvTranspose2d is that it introduces learnable parameters into the upsampling step. Instead of using a fixed interpolation formula, the network figures out the best way to expand spatial dimensions during training.
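To make "learnable" concrete, the layer carries an ordinary weight tensor and bias, just like Conv2d. A quick sketch (channel counts chosen only for illustration) counts the parameters that gradient descent gets to tune:

```python
import torch.nn as nn

# A typical 2x upsampling layer
up = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1)

# Weight shape is [in_channels, out_channels, kH, kW], plus one bias per output channel
n_params = sum(p.numel() for p in up.parameters())
print(n_params)  # 16*16*3*3 + 16 = 2320
```

By contrast, nearest-neighbor or bilinear interpolation has zero parameters; its behavior is fixed forever.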
How ConvTranspose2d Actually Works
Despite the name "deconvolution" that floats around in older papers, this is not a true mathematical inverse of convolution. The accurate mental model has two steps [3]:
Step 1: Insert Zeros
Between every input element, insert zeros along both spatial dimensions. The number of zeros inserted is controlled by stride.
```
Input (2x2):        After zero-insertion (stride=2):

1 2                 1 0 2
3 4                 0 0 0
                    3 0 4
```
Step 2: Apply a Standard Convolution
A regular convolution kernel slides over this expanded tensor. The kernel learns how to fill in the gaps — which is what makes it superior to fixed upsampling methods.
The result: the output is spatially larger than the input. When stride=2, each input element "spreads" its influence across a 2x2 region in the output.
Intuition: Think of it as each input pixel getting "multiplied" by the full kernel and placed into the output grid. Where regions overlap, the values are summed. This is exactly the mechanic Andrew Ng demonstrates in his U-Net lecture [2].
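The two-step model can be verified directly against PyTorch. Below is a minimal sketch (random input and weights, shapes chosen only for illustration) that reproduces F.conv_transpose2d by hand. One subtlety the two-step picture glosses over: to match PyTorch's weights exactly, the kernel must be spatially flipped, because conv_transpose2d is the adjoint of cross-correlation. Note also that the implicit zero border is kernel_size - 1 - padding, the same expression that appears in the padding discussion later.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 2, 2)   # 2x2 input
w = torch.randn(1, 1, 3, 3)   # 3x3 kernel

# Reference result: stride=2, padding=1
ref = F.conv_transpose2d(x, w, stride=2, padding=1)

# Step 1: insert stride - 1 zeros between elements (2x2 -> 3x3)
z = torch.zeros(1, 1, 3, 3)
z[:, :, ::2, ::2] = x
# Implicit zero border: kernel_size - 1 - padding = 1 on each side
z = F.pad(z, (1, 1, 1, 1))

# Step 2: standard convolution over the expanded tensor,
# using the spatially flipped kernel
manual = F.conv2d(z, w.flip(-1).flip(-2))

print(torch.allclose(ref, manual, atol=1e-6))  # True
```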
Why Is It Called "Transposed" Convolution?
The name comes from linear algebra [3]. If we express a regular convolution as matrix multiplication:
Y = C · X (forward convolution)
where C is the convolution matrix derived from the kernel, then the transposed convolution computes:
X̂ = Cᵀ · Y (transposed convolution)
It uses the transpose of the convolution matrix. This is important to understand:
- It reverses the shape transformation (if Conv2d maps 12x12 to 6x6, ConvTranspose2d maps 6x6 back to 12x12)
- It does NOT reverse the values (you do not get the original input back)
A Conv2d and a ConvTranspose2d with identical parameters are inverses in terms of shape, not content. For a thorough visual walkthrough of this relationship, see [4].
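This shapes-not-values relationship can be checked numerically. The sketch below (a 2x2 kernel on a 3x3 input, sizes chosen only to keep the matrix small) builds the convolution matrix C explicitly and confirms that F.conv_transpose2d multiplies by Cᵀ:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = torch.randn(2, 2)   # 2x2 kernel
x = torch.randn(3, 3)   # 3x3 input -> conv output is 2x2

# Build C (4x9) so that C @ x.flatten() == conv2d(x, K).flatten()
C = torch.zeros(4, 9)
for i in range(2):              # output row
    for j in range(2):          # output col
        for a in range(2):      # kernel row
            for b in range(2):  # kernel col
                C[i * 2 + j, (i + a) * 3 + (j + b)] = K[a, b]

# Forward convolution as matrix multiplication: Y = C @ X
y = (C @ x.reshape(9)).reshape(2, 2)
assert torch.allclose(y, F.conv2d(x[None, None], K[None, None]).squeeze(), atol=1e-5)

# Transposed convolution multiplies by C.T: shape goes 2x2 -> 3x3 ...
xt = (C.T @ y.reshape(4)).reshape(3, 3)
xt2 = F.conv_transpose2d(y[None, None], K[None, None]).squeeze()
assert torch.allclose(xt, xt2, atol=1e-5)

# ... but the values are NOT the original input
assert not torch.allclose(xt, x)
```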
The Output Size Formula
This is where most people get tripped up. The output height (and similarly width) is [1]:
H_out = (H_in - 1) x stride - 2 x padding + dilation x (kernel_size - 1) + output_padding + 1
That looks intimidating, so let's walk through a concrete example.
Example: The Most Common Configuration
kernel_size=3, stride=2, padding=1 with an input of height 6:
```
H_out = (6 - 1) x 2 - 2 x 1 + 1 x (3 - 1) + 0 + 1
      = 10 - 2 + 2 + 0 + 1
      = 11
```
Wait — we wanted 12, not 11. This is the shape ambiguity problem, and it is solved by output_padding:
```
H_out = (6 - 1) x 2 - 2 x 1 + 1 x (3 - 1) + 1 + 1
      = 10 - 2 + 2 + 1 + 1
      = 12 ✓
```
Common trap: Multiple input shapes can produce the same Conv2d output, so when reversing direction there is inherent ambiguity. Always verify your output shape with the formula or use output_size in the forward pass (shown below).
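One way to make that verification routine is to compute the formula in code before building the layer. Here is a small helper (a hypothetical convenience function, just a transcription of the formula above) cross-checked against the real layer:

```python
import torch
import torch.nn as nn

def convtranspose2d_out(h_in, kernel_size, stride=1, padding=0,
                        output_padding=0, dilation=1):
    # Direct transcription of the output-size formula from the PyTorch docs [1]
    return ((h_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# The worked example: 11 without output_padding, 12 with it
assert convtranspose2d_out(6, 3, stride=2, padding=1) == 11
assert convtranspose2d_out(6, 3, stride=2, padding=1, output_padding=1) == 12

# Cross-check against the actual layer
layer = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                           padding=1, output_padding=1)
assert layer(torch.zeros(1, 1, 6, 6)).shape[-1] == 12
```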
The Padding Gotcha
This catches almost everyone the first time.
In Conv2d, padding adds zeros around the input, making the output larger.
In ConvTranspose2d, padding removes rows and columns from the output edges, making the output smaller.
Yes, it is the opposite behavior. Here is why: internally, the operation computes
effective_zero_border = dilation x (kernel_size - 1) - padding
This deliberate design ensures that a Conv2d and a ConvTranspose2d with the same parameters produce inverse shapes — exactly what you want for encoder-decoder architectures like U-Net [5].
Mistake to avoid: Setting padding=1 expecting it to add border pixels. It actually removes 1 pixel from each side of the output. Always verify your output shape before training.
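The shrinking effect is easy to see empirically. A quick sketch (a 6x6 input, sizes chosen only for illustration) sweeps padding and prints the resulting output height, which drops by 2 (1 per side) for each unit of padding:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 6, 6)
for p in (0, 1, 2):
    h = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=p)(x).shape[-1]
    print(f"padding={p} -> H_out={h}")
# padding=0 -> H_out=13
# padding=1 -> H_out=11
# padding=2 -> H_out=9
```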
The output_padding Parameter
When stride > 1, multiple input sizes can map to the same Conv2d output size. For example, both an 11x11 and a 12x12 input produce a 6x6 output with kernel_size=3, stride=2, padding=1. So when going in reverse, which size should we produce?
output_padding resolves this ambiguity [1]:
- It adds extra rows/columns to one side of the output
- It does not add learnable parameters
- It does not pad symmetrically
- Valid values range from 0 to stride - 1
```python
# stride=2 means output_padding can be 0 or 1
upsample = nn.ConvTranspose2d(
    in_channels=16,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
    output_padding=1,  # adds 1 extra row and column
)
```
Code Walkthrough: Symmetric Encoder-Decoder
Here is a complete, runnable example showing how Conv2d and ConvTranspose2d work as shape inverses:
```python
import torch
import torch.nn as nn

# Start with a 12x12 feature map
input = torch.randn(1, 16, 12, 12)  # [batch, channels, H, W]

# Encoder: halves spatial dims (12 -> 6)
downsample = nn.Conv2d(
    in_channels=16,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

# Decoder: restores spatial dims (6 -> 12)
upsample = nn.ConvTranspose2d(
    in_channels=16,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

h = downsample(input)
print(h.size())  # torch.Size([1, 16, 6, 6])

# Pass output_size to resolve shape ambiguity automatically
output = upsample(h, output_size=input.size())
print(output.size())  # torch.Size([1, 16, 12, 12])
```
Pro tip: Passing output_size=input.size() in the forward call lets PyTorch compute output_padding automatically [1]. This is cleaner than hardcoding output_padding=1 because it adapts if your input size changes.
Without output_size, the upsample would produce [1, 16, 11, 11] — one pixel short in each dimension.
Quick Reference: Key Parameters
| Parameter | What it does | Common values |
|---|---|---|
| kernel_size | Size of the convolution kernel | 3 or 4 |
| stride | Controls the upsampling factor | 2 for 2x upsampling |
| padding | Removes output border pixels | 1 with kernel 3 or 4 |
| output_padding | Resolves shape ambiguity | 0 or 1 (must be < stride) |
| dilation | Spacing between kernel weights | 1 (rarely changed for upsampling) |
| groups | Grouped convolutions (depthwise when groups = channels) | 1 (standard) |
Common Configurations You Will Use
```python
# Configuration 1: Clean 2x upsampling (most common in practice)
# Produces exactly 2 * H_in output: no ambiguity, no output_padding needed
nn.ConvTranspose2d(C_in, C_out, kernel_size=4, stride=2, padding=1)

# Configuration 2: Symmetric inverse of Conv2d (use with output_size)
nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1)           # encoder
nn.ConvTranspose2d(C, C, kernel_size=3, stride=2, padding=1)  # decoder

# Configuration 3: Non-square upsampling
nn.ConvTranspose2d(16, 33, kernel_size=(3, 5), stride=(2, 1), padding=(4, 2))
```
Rule of thumb: For clean 2x upsampling with no shape ambiguity, use kernel_size=4, stride=2, padding=1. This gives exactly 2 * H_in every time.
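The rule of thumb is easy to verify: plugging kernel_size=4, stride=2, padding=1 into the formula gives (H - 1) x 2 - 2 + 3 + 1 = 2H for any H. A quick sketch (channel count arbitrary) checks it across several input sizes, odd and even:

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(8, 8, kernel_size=4, stride=2, padding=1)
for h in (5, 6, 7, 32):
    out = up(torch.zeros(1, 8, h, h))
    assert out.shape[-2:] == (2 * h, 2 * h)  # exactly 2x, every time
```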
Where You Will See It in Real Architectures
U-Net
The decoder path uses transposed convolutions to progressively restore spatial resolution from the bottleneck [5]. At each scale, the upsampled features are concatenated with skip connections from the corresponding encoder level. This is the architecture behind most modern medical image segmentation. Andrew Ng's deeplearning.ai course covers this architecture in detail, including the transposed convolution step [2].
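A single decoder step can be sketched as follows. This is a simplified illustration, not the paper's exact configuration: channel counts and spatial sizes are invented for the example, and normalization and activations are omitted. The U-Net paper's "up-convolution" corresponds to kernel_size=2, stride=2, which exactly doubles resolution:

```python
import torch
import torch.nn as nn

# One decoder step: upsample, concatenate the encoder skip, then convolve
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # halves channels, doubles H/W
conv = nn.Conv2d(128, 64, kernel_size=3, padding=1)

bottleneck = torch.randn(1, 128, 16, 16)
skip = torch.randn(1, 64, 32, 32)   # from the matching encoder level

x = up(bottleneck)                  # -> (1, 64, 32, 32)
x = torch.cat([x, skip], dim=1)     # -> (1, 128, 32, 32)
x = conv(x)                         # -> (1, 64, 32, 32)
assert x.shape == (1, 64, 32, 32)
```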
DCGAN
The generator transforms a 100-dimensional noise vector into a full image through successive ConvTranspose2d layers. The first layer projects the (100, 1, 1) noise tensor to a 4x4 feature map; each subsequent layer doubles the spatial resolution and halves the channel count, reaching (3, 64, 64). The convolution arithmetic guide [3] provides detailed visualizations of how each layer expands the spatial dimensions.
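The spatial progression can be sketched as a bare stack of layers. Channel counts here are illustrative, and the batch norms and activations a real DCGAN generator needs are omitted to keep the shape arithmetic in focus:

```python
import torch
import torch.nn as nn

# Shape-only sketch of a DCGAN-style generator (norms/activations omitted)
gen = nn.Sequential(
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0),  # 1x1  -> 4x4
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),  # 4x4  -> 8x8
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 8x8  -> 16x16
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 16x16 -> 32x32
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 32x32 -> 64x64
)

z = torch.randn(1, 100, 1, 1)   # noise vector as a 1x1 "image"
assert gen(z).shape == (1, 3, 64, 64)
```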
VAE Decoders
Variational autoencoders use transposed convolutions in their decoder to map from a compact latent space back to full image resolution. The learnable upsampling is important here because the decoder needs to reconstruct fine-grained spatial details from a compressed representation.
Semantic Segmentation Heads
FCN, DeepLab variants, and SegFormer decoder heads all use some form of learned upsampling (either transposed convolutions or bilinear upsampling followed by a 1x1 conv) to match the output resolution to the input image resolution.
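The second variant, interpolation followed by a 1x1 convolution, can be sketched like this. All shapes and the class count are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Classify per pixel at low resolution, then interpolate up to image size
feat = torch.randn(1, 256, 32, 32)            # decoder features (illustrative)
classify = nn.Conv2d(256, 21, kernel_size=1)  # e.g. 21 classes

logits = classify(feat)                       # (1, 21, 32, 32)
logits = F.interpolate(logits, size=(128, 128),
                       mode="bilinear", align_corners=False)
assert logits.shape == (1, 21, 128, 128)
```

Here the upsampling itself is fixed bilinear interpolation, but the 1x1 conv before it is learned, a common middle ground when full transposed convolutions cause checkerboard-style artifacts.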
Key Takeaways
- ConvTranspose2d is learnable upsampling, not a true deconvolution or mathematical inverse
- It works by inserting zeros between input elements, then applying a standard convolution [3]
- padding behaves opposite to Conv2d: it removes output pixels instead of adding them
- Use output_padding or pass output_size in the forward call to resolve shape ambiguity when stride > 1 [1]
- For clean 2x upsampling with no surprises: kernel_size=4, stride=2, padding=1
References
[1] PyTorch Contributors. ConvTranspose2d — PyTorch Documentation. pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html
[2] Andrew Ng. Transposed Convolutions. deeplearning.ai. youtube.com/watch?v=qb4nRoEAASA
[3] Dumoulin, V. & Visin, F. A Guide to Convolution Arithmetic for Deep Learning. arXiv:1603.07285, 2016. arxiv.org/abs/1603.07285
[4] Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. Dive into Deep Learning — Transposed Convolution. d2l.ai/chapter_computer-vision/transposed-conv.html
[5] Ronneberger, O., Fischer, P., & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015. arxiv.org/abs/1505.04597