© 2026 Mausam Gurung. All rights reserved.

ConvTranspose2d Explained: Learnable Upsampling in PyTorch

March 21, 2026 · 7 min read
Tags: deep-learning, pytorch, computer-vision, cnn

Prerequisites: Familiarity with Conv2d, feature maps, stride, and padding in CNNs.


Why Upsampling Matters

In many deep learning tasks (semantic segmentation, image generation, autoencoders) you need to increase spatial resolution, not reduce it. A strided or unpadded Conv2d shrinks your feature maps. So how do you go the other direction?

You could use nearest-neighbor or bilinear interpolation, but those are fixed operations with no learnable parameters. ConvTranspose2d solves this by making upsampling learnable — the network figures out the best way to expand spatial dimensions during training.
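
To make the fixed-versus-learnable distinction concrete, here is a minimal sketch that counts trainable parameters in a bilinear nn.Upsample and in a ConvTranspose2d. The channel count (16) and the 8×8 input are arbitrary values chosen for illustration, and count_params is a helper written for this example.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)  # [batch, channels, H, W]; sizes are illustrative

# Fixed upsampling: bilinear interpolation has nothing to train.
fixed = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

# Learnable upsampling: kernel weights and bias are updated during training.
learned = nn.ConvTranspose2d(16, 16, kernel_size=4, stride=2, padding=1)


def count_params(module: nn.Module) -> int:
    """Number of trainable scalars in a module (helper for this example)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


fixed_shape = tuple(fixed(x).shape)
learned_shape = tuple(learned(x).shape)
n_fixed = count_params(fixed)
n_learned = count_params(learned)

print(fixed_shape, n_fixed)      # (1, 16, 16, 16) 0
print(learned_shape, n_learned)  # (1, 16, 16, 16) 4112
```

Both layers double the spatial size, but only the transposed convolution carries weights (16 × 16 × 4 × 4 kernel entries plus 16 biases) that training can adjust.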

Andrew Ng's lecture on transposed convolutions (referenced below) is an excellent companion to this post — he walks through the kernel-multiplication intuition step by step, which maps directly to the zero-insertion mechanic explained here.

You'll find ConvTranspose2d in:

  • U-Net decoder path (semantic segmentation)
  • DCGAN generator (image synthesis from noise)
  • VAE decoder (latent space → full image)

What ConvTranspose2d Actually Does

Despite the name "deconvolution" floating around in older papers, this is not a true mathematical inverse of convolution. The accurate mental model has two steps:

  1. Insert stride − 1 zeros between neighboring input elements along each spatial dimension
  2. Apply a standard stride-1 convolution over the result, which is first bordered with zeros

This zero-insertion is what allows the output to be spatially larger than the input. Each input element "spreads" its influence across a kernel_size × kernel_size region of the output, and stride sets how far apart neighboring regions land (2 pixels apart when stride=2).

Input (2×2):          After zero-insertion (stride=2):

  1  2                  1  0  2
  3  4                  0  0  0
                        3  0  4

The kernel then slides over this expanded tensor, learning how to fill in the interpolated values. That learned fill-in is what makes it more flexible than fixed upsampling methods.

This is exactly the mechanic Andrew Ng demonstrates visually in his U-Net lecture. Each input value gets multiplied by the full kernel and placed into the output grid — overlapping regions are summed.
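
The two descriptions (zero-insertion followed by convolution, and "multiply each input by the kernel and sum the overlaps") can be checked against each other in plain Python. This is a single-channel sketch that assumes padding=0 and dilation=1; the function names are made up for this example and are not PyTorch APIs.

```python
def transposed_conv2d_scatter(x, kernel, stride):
    """Multiply each input value by the whole kernel and sum the overlaps."""
    h_in, w_in = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    h_out = (h_in - 1) * stride + kh
    w_out = (w_in - 1) * stride + kw
    out = [[0] * w_out for _ in range(h_out)]
    for i in range(h_in):
        for j in range(w_in):
            for a in range(kh):
                for b in range(kw):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out


def transposed_conv2d_zero_insert(x, kernel, stride):
    """Insert zeros, add a (k - 1)-wide zero border, then slide with stride 1."""
    h_in, w_in = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    # Zero-inserted input plus a border of kh - 1 / kw - 1 zeros on every side.
    h_pad = (h_in - 1) * stride + 1 + 2 * (kh - 1)
    w_pad = (w_in - 1) * stride + 1 + 2 * (kw - 1)
    padded = [[0] * w_pad for _ in range(h_pad)]
    for i in range(h_in):
        for j in range(w_in):
            padded[i * stride + kh - 1][j * stride + kw - 1] = x[i][j]
    # Flip the kernel, because Conv2d-style sliding is cross-correlation.
    flipped = [row[::-1] for row in kernel[::-1]]
    h_out, w_out = h_pad - kh + 1, w_pad - kw + 1
    return [
        [
            sum(padded[r + a][c + b] * flipped[a][b]
                for a in range(kh) for b in range(kw))
            for c in range(w_out)
        ]
        for r in range(h_out)
    ]


x = [[1, 2], [3, 4]]
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # asymmetric, so the flip matters

out_scatter = transposed_conv2d_scatter(x, kernel, stride=2)
out_zero_insert = transposed_conv2d_zero_insert(x, kernel, stride=2)

print(out_scatter == out_zero_insert)         # True
print(len(out_scatter), len(out_scatter[0]))  # 5 5
```

The center output pixel receives a contribution from all four inputs, which is the "overlapping regions are summed" behavior in action.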


The Matrix Transposition Connection

The name "transposed convolution" comes from linear algebra. If a regular convolution can be expressed as a matrix multiplication:

Y = C · X

...where C is the convolution matrix derived from the kernel, then the transposed convolution computes:

X̂ = Cᵀ · Y

This is why it's called transposed — it uses the transpose of the convolution matrix. It does not invert the values, only the shape transformation. You can verify: a Conv2d and a ConvTranspose2d with identical parameters are inverses in terms of shape, but not in terms of values.
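
A 1-D toy version makes the "shape, not values" point easy to check by hand. The sketch below builds the matrix C for a length-3 kernel with stride 2 and no padding, then applies Cᵀ; the helper names and the numbers are illustrative choices.

```python
def conv_matrix(kernel, n_in, stride):
    """Matrix C such that C @ x is a 1-D strided convolution without padding."""
    k = len(kernel)
    n_out = (n_in - k) // stride + 1
    C = [[0] * n_in for _ in range(n_out)]
    for r in range(n_out):
        for t in range(k):
            C[r][r * stride + t] = kernel[t]
    return C


def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*M)]


x = [1, 2, 3, 4, 5]
C = conv_matrix([1, 2, 3], n_in=len(x), stride=2)  # shape 2 x 5

y = matvec(C, x)                 # convolution: length 5 -> length 2
x_hat = matvec(transpose(C), y)  # transposed convolution: length 2 -> length 5

print(y)      # [14, 26]
print(x_hat)  # [14, 28, 68, 52, 78]
```

x_hat has the same length as x, but its values are different: Cᵀ restores the shape, not the signal.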


The Output Size Formula

This is where most confusion happens. The output height and width are:

H_out = (H_in − 1) × stride[0]
        − 2 × padding[0]
        + dilation[0] × (kernel_size[0] − 1)
        + output_padding[0]
        + 1

Concrete example, using a common configuration: kernel_size=3, stride=2, padding=1

H_in = 6

H_out = (6 − 1) × 2 − 2×1 + 1×(3−1) + 0 + 1
      = 10 − 2 + 2 + 0 + 1
      = 11   ← not 12!

You wanted 12, not 11? That's output_padding=1:

H_out = (6 − 1) × 2 − 2×1 + 1×(3−1) + 1 + 1
      = 10 − 2 + 2 + 1 + 1
      = 12  ✓
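
If you would rather not do this arithmetic by hand, the formula fits in a few lines of Python. The function name below is a helper invented for this post, not something PyTorch provides.

```python
def conv_transpose2d_out_size(h_in, kernel_size, stride=1, padding=0,
                              output_padding=0, dilation=1):
    """Output size along one spatial dimension, following the formula above."""
    return ((h_in - 1) * stride
            - 2 * padding
            + dilation * (kernel_size - 1)
            + output_padding
            + 1)


size_default = conv_transpose2d_out_size(6, kernel_size=3, stride=2, padding=1)
size_padded = conv_transpose2d_out_size(6, kernel_size=3, stride=2, padding=1,
                                        output_padding=1)

print(size_default)  # 11
print(size_padded)   # 12
```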

The padding Gotcha

In Conv2d, padding adds zeros around the input — making the output bigger.

In ConvTranspose2d, padding removes rows/columns from the output edges. This is counterintuitive but deliberate. Internally:

effective_zero_border = dilation × (kernel_size − 1) − padding

This design ensures that a Conv2d and ConvTranspose2d with the same parameters are shape-inverses of each other — which is exactly what you want in encoder-decoder architectures.

Common mistake: setting padding=1 expecting it to add border pixels. It actually removes 1 pixel from each side of the output. Always verify your output shape with the formula before training.

Rule of thumb: For clean 2× upsampling, use kernel_size=4, stride=2, padding=1. This gives exactly 2 × H_in with no ambiguity and no output_padding needed.
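
One way to convince yourself that padding crops the output is to run the same weights with padding=0 and padding=1 and compare. This is a small sketch using torch.nn.functional.conv_transpose2d with random single-channel tensors; the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6)
w = torch.randn(1, 1, 3, 3)  # (in_channels, out_channels, kH, kW)

full = F.conv_transpose2d(x, w, stride=2, padding=0)
cropped = F.conv_transpose2d(x, w, stride=2, padding=1)

full_hw = tuple(full.shape[-2:])
cropped_hw = tuple(cropped.shape[-2:])
print(full_hw)     # (13, 13)
print(cropped_hw)  # (11, 11)

# padding=1 gives the padding=0 result with one pixel removed from every edge.
same_interior = torch.allclose(cropped, full[..., 1:-1, 1:-1], atol=1e-5)
print(same_interior)  # True
```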


The output_padding Parameter

When stride > 1, multiple input shapes can produce the same Conv2d output shape. So when reversing direction, there's ambiguity — which size were you originally at?

output_padding resolves this by adding extra rows/columns to one side of the output:

  • Does not add learnable parameters
  • Does not pad symmetrically
  • Only affects the computed output shape
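  • Valid values: must be smaller than either stride or dilation (with the default dilation=1, that means 0 to stride − 1)

# stride=2, dilation=1 → output_padding can be 0 or 1 only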
upsample = nn.ConvTranspose2d(
    16, 16,
    kernel_size=3, stride=2,
    padding=1, output_padding=1
)
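
The ambiguity is easy to see with the two size formulas side by side. In this sketch, both helper functions are written for illustration (they mirror the size formulas in the PyTorch docs but are not PyTorch functions), and the sizes 11 and 12 are example values.

```python
def conv2d_out_size(h_in, kernel_size, stride, padding, dilation=1):
    """Conv2d output size along one dimension (floor division drops the remainder)."""
    return (h_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv_transpose2d_out_size(h_in, kernel_size, stride, padding,
                              output_padding=0, dilation=1):
    """ConvTranspose2d output size along one dimension."""
    return ((h_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


k, s, p = 3, 2, 1
results = []
for original in (11, 12):
    encoded = conv2d_out_size(original, k, s, p)        # both inputs collapse to 6
    base = conv_transpose2d_out_size(encoded, k, s, p)  # size with output_padding=0
    needed = original - base                            # output_padding to get back
    results.append((original, encoded, base, needed))
    print(original, encoded, base, needed)
# 11 6 11 0
# 12 6 11 1
```

Inputs of height 11 and 12 both encode to 6, so the decoder needs output_padding (0 or 1) to know which one to return to.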

Code Walkthrough: Symmetric Encoder-Decoder

import torch
import torch.nn as nn

input = torch.randn(1, 16, 12, 12)   # [batch, channels, H, W]

# Encoder: halves spatial dims (12 → 6)
downsample = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)

# Decoder: should restore spatial dims (6 → 12)
upsample = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1)

h = downsample(input)
print(h.size())   # torch.Size([1, 16, 6, 6])

# Pass output_size to resolve shape ambiguity automatically
output = upsample(h, output_size=input.size())
print(output.size())   # torch.Size([1, 16, 12, 12])

Without output_size=input.size(), the upsample layer produces [1, 16, 11, 11], one pixel short in each spatial dimension. Passing the target size lets PyTorch compute the required output_padding automatically.


Key Parameters at a Glance

Parameter        Role                                    Common values
kernel_size      Size of convolution kernel              3, 4
stride           Controls upsampling factor              2 for 2× upsampling
padding          Crops output border                     1 with kernel_size=3 or 4
output_padding   Resolves shape ambiguity                0 or 1 with stride=2 (must be < stride or dilation)
dilation         Spacing between kernel weights          1 (default, rarely changed here)
groups           Splits channels into separate groups    1 standard, in_channels for depthwise

Common Configurations

# 1) Exact 2× upsampling — most common in practice
nn.ConvTranspose2d(C_in, C_out, kernel_size=4, stride=2, padding=1)
# H_out = (H_in − 1)*2 − 2 + 3 + 1 = 2*H_in  ✓

# 2) Symmetric inverse of a Conv2d
nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1)          # encoder
nn.ConvTranspose2d(C, C, kernel_size=3, stride=2, padding=1) # decoder (use output_size)

# 3) Non-square kernel, per-axis stride and padding: H_out = 2*H_in − 7, W_out = W_in
nn.ConvTranspose2d(16, 33, kernel_size=(3, 5), stride=(2, 1), padding=(4, 2))

Where It Appears in Real Architectures

U-Net (Ronneberger et al., 2015): The decoder path uses transposed convolutions to progressively restore spatial resolution from the bottleneck, concatenating skip connections from the encoder at each scale. Andrew Ng's deeplearning.ai course covers this architecture in detail, including the transposed convolution step.
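
As a rough illustration of one decoder step, here is a U-Net-style block: upsample with a transposed convolution, concatenate the encoder skip tensor, then convolve. The class name UpBlock, the channel counts, and the use of padding=1 (so the skip and upsampled tensors line up without cropping) are assumptions made for this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn


class UpBlock(nn.Module):
    """Hypothetical U-Net-style decoder step: upsample, concat skip, convolve."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # kernel_size=2, stride=2 doubles H and W exactly.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                   # e.g. 16x16 -> 32x32
        x = torch.cat([skip, x], dim=1)  # stack along the channel dimension
        return self.conv(x)


block = UpBlock(in_ch=128, skip_ch=64, out_ch=64)
bottleneck = torch.randn(1, 128, 16, 16)  # deeper, lower-resolution features
skip = torch.randn(1, 64, 32, 32)         # encoder features at the target scale

out = block(bottleneck, skip)
out_shape = tuple(out.shape)
print(out_shape)  # (1, 64, 32, 32)
```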

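DCGAN (Radford et al., 2015): The generator transforms a 100-dim noise vector into a full image through successive ConvTranspose2d layers (the paper calls them fractionally-strided convolutions). After an initial projection to a 4×4 feature map, each layer doubles spatial resolution, and the intermediate layers halve the channel count.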
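
A DCGAN-style generator can be sketched in a few lines. The channel widths (512 down to 64), the 64×64 RGB output, and the up_block helper are illustrative assumptions; treat this as a shape walkthrough, not a reproduction of the paper's exact model.

```python
import torch
import torch.nn as nn


def up_block(c_in: int, c_out: int) -> nn.Sequential:
    """Double H and W with the kernel_size=4, stride=2, padding=1 recipe."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


generator = nn.Sequential(
    # Treat the noise vector as a 1x1 "image" and project it to 4x4.
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
    up_block(512, 256),  # 4x4   -> 8x8
    up_block(256, 128),  # 8x8   -> 16x16
    up_block(128, 64),   # 16x16 -> 32x32
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1, bias=False),  # -> 64x64
    nn.Tanh(),           # squash pixel values into [-1, 1]
)

z = torch.randn(2, 100, 1, 1)  # batch of 2 noise vectors
images = generator(z)
images_shape = tuple(images.shape)
print(images_shape)  # (2, 3, 64, 64)
```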

Semantic segmentation heads: FCN upsamples with transposed convolutions, while DeepLab variants and SegFormer decoder heads mostly rely on bilinear upsampling combined with convolution or MLP layers to match output resolution to input resolution.


Summary

  • ConvTranspose2d is learnable upsampling, not true deconvolution
  • It works by inserting zeros between inputs then convolving — enabling spatial expansion
  • padding removes output border pixels (opposite of Conv2d intuition)
  • Use output_padding or output_size in forward() to resolve shape ambiguity when stride > 1
  • For clean 2× upsampling: kernel_size=4, stride=2, padding=1 is the reliable recipe

References

  • PyTorch ConvTranspose2d Documentation
  • Andrew Ng — Transposed Convolutions (deeplearning.ai)
  • Dumoulin & Visin — A Guide to Convolution Arithmetic (arXiv:1603.07285)
  • Dive into Deep Learning — Transposed Convolution
  • Ronneberger et al. — U-Net (arXiv:1505.04597)
