ConvTranspose2d Explained: Learnable Upsampling in PyTorch
In deep learning, we spend a lot of time shrinking feature maps — pooling layers and strided convolutions reduce spatial dimensions to extract high-level features. But what happens when you need to go the other direction?
Tasks like semantic segmentation, image generation, and autoencoders all need to increase spatial resolution at some point. That is where ConvTranspose2d comes in.
This post breaks down exactly how it works, why the parameters behave differently than you might expect, and how to use it correctly in practice.
Who is this for? If you understand how a regular Conv2d works (kernel, stride, padding), you have everything you need. I will build up from there — no prior knowledge of transposed convolutions required.
The Problem: How Do You Upsample?
A standard Conv2d with stride=2 halves your spatial dimensions. Going from a 12x12 feature map to 6x6 is straightforward. But going from 6x6 back to 12x12? That is the upsampling problem.
There are a few options:
| Method | Learnable? | Quality |
|---|---|---|
| Nearest-neighbor interpolation | No | Blocky artifacts |
| Bilinear interpolation | No | Smooth but fixed |
| ConvTranspose2d | Yes | Network learns the best upsampling |
The key advantage of ConvTranspose2d is that it introduces learnable parameters into the upsampling step. Instead of using a fixed interpolation formula, the network figures out the best way to expand spatial dimensions during training.
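To make "learnable" concrete, the layer carries an ordinary weight tensor and bias, just like Conv2d. A quick sketch (channel counts chosen only for illustration) counts the parameters that gradient descent gets to tune:

```python
import torch.nn as nn

# A typical 2x upsampling layer
up = nn.ConvTranspose2d(16, 16, kernel_size=3, stride=2, padding=1)

# Weight shape is [in_channels, out_channels, kH, kW], plus one bias per output channel
n_params = sum(p.numel() for p in up.parameters())
print(n_params)  # 16*16*3*3 + 16 = 2320
```

By contrast, nearest-neighbor or bilinear interpolation has zero parameters; its behavior is fixed forever.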
How ConvTranspose2d Actually Works
Despite the name "deconvolution" that floats around in older papers, this is not a true mathematical inverse of convolution. The accurate mental model has two steps [3]:
Step 1: Insert Zeros
Between every input element, insert zeros along both spatial dimensions. The number of zeros inserted is controlled by stride.
```
Input (2x2):        After zero-insertion (stride=2):

1 2                 1 0 2
3 4                 0 0 0
                    3 0 4
```
Step 2: Apply a Standard Convolution
A regular convolution kernel slides over this expanded tensor. The kernel learns how to fill in the gaps — which is what makes it superior to fixed upsampling methods.
The result: the output is spatially larger than the input. When stride=2, each input element "spreads" its influence across a 2x2 region in the output.
Intuition: Think of it as each input pixel getting "multiplied" by the full kernel and placed into the output grid. Where regions overlap, the values are summed. This is exactly the mechanic Andrew Ng demonstrates in his U-Net lecture [2].
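The two-step model can be verified directly against PyTorch. Below is a minimal sketch (random input and weights, shapes chosen only for illustration) that reproduces F.conv_transpose2d by hand. One subtlety the two-step picture glosses over: to match PyTorch's weights exactly, the kernel must be spatially flipped, because conv_transpose2d is the adjoint of cross-correlation. Note also that the implicit zero border is kernel_size - 1 - padding, the same expression that appears in the padding discussion later.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 2, 2)   # 2x2 input
w = torch.randn(1, 1, 3, 3)   # 3x3 kernel

# Reference result: stride=2, padding=1
ref = F.conv_transpose2d(x, w, stride=2, padding=1)

# Step 1: insert stride - 1 zeros between elements (2x2 -> 3x3)
z = torch.zeros(1, 1, 3, 3)
z[:, :, ::2, ::2] = x
# Implicit zero border: kernel_size - 1 - padding = 1 on each side
z = F.pad(z, (1, 1, 1, 1))

# Step 2: standard convolution over the expanded tensor,
# using the spatially flipped kernel
manual = F.conv2d(z, w.flip(-1).flip(-2))

print(torch.allclose(ref, manual, atol=1e-6))  # True
```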
Why Is It Called "Transposed" Convolution?
The name comes from linear algebra [3]. If we express a regular convolution as matrix multiplication:
Y = C · X (forward convolution)
where C is the convolution matrix derived from the kernel, then the transposed convolution computes:
X̂ = Cᵀ · Y (transposed convolution)
It uses the transpose of the convolution matrix. This is important to understand:
- It reverses the shape transformation (if Conv2d maps 12x12 to 6x6, ConvTranspose2d maps 6x6 back to 12x12)
- It does NOT reverse the values (you do not get the original input back)
A Conv2d and a ConvTranspose2d with identical parameters are inverses in terms of shape, not content. For a thorough visual walkthrough of this relationship, see [4].
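This shapes-not-values relationship can be checked numerically. The sketch below (a 2x2 kernel on a 3x3 input, sizes chosen only to keep the matrix small) builds the convolution matrix C explicitly and confirms that F.conv_transpose2d multiplies by Cᵀ:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K = torch.randn(2, 2)   # 2x2 kernel
x = torch.randn(3, 3)   # 3x3 input -> conv output is 2x2

# Build C (4x9) so that C @ x.flatten() == conv2d(x, K).flatten()
C = torch.zeros(4, 9)
for i in range(2):              # output row
    for j in range(2):          # output col
        for a in range(2):      # kernel row
            for b in range(2):  # kernel col
                C[i * 2 + j, (i + a) * 3 + (j + b)] = K[a, b]

# Forward convolution as matrix multiplication: Y = C @ X
y = (C @ x.reshape(9)).reshape(2, 2)
assert torch.allclose(y, F.conv2d(x[None, None], K[None, None]).squeeze(), atol=1e-5)

# Transposed convolution multiplies by C.T: shape goes 2x2 -> 3x3 ...
xt = (C.T @ y.reshape(4)).reshape(3, 3)
xt2 = F.conv_transpose2d(y[None, None], K[None, None]).squeeze()
assert torch.allclose(xt, xt2, atol=1e-5)

# ... but the values are NOT the original input
assert not torch.allclose(xt, x)
```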
The Output Size Formula
This is where most people get tripped up. The output height (and similarly width) is [1]:
H_out = (H_in - 1) x stride - 2 x padding + dilation x (kernel_size - 1) + output_padding + 1
That looks intimidating, so let's walk through a concrete example.
Example: The Most Common Configuration
kernel_size=3, stride=2, padding=1 with an input of height 6:
```
H_out = (6 - 1) x 2 - 2 x 1 + 1 x (3 - 1) + 0 + 1
      = 10 - 2 + 2 + 0 + 1
      = 11
```
Wait — we wanted 12, not 11. This is the shape ambiguity problem, and it is solved by output_padding:
```
H_out = (6 - 1) x 2 - 2 x 1 + 1 x (3 - 1) + 1 + 1
      = 10 - 2 + 2 + 1 + 1
      = 12 ✓
```
Common trap: Multiple input shapes can produce the same Conv2d output, so when reversing direction there is inherent ambiguity. Always verify your output shape with the formula or use output_size in the forward pass (shown below).
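One way to make that verification routine is to compute the formula in code before building the layer. Here is a small helper (a hypothetical convenience function, just a transcription of the formula above) cross-checked against the real layer:

```python
import torch
import torch.nn as nn

def convtranspose2d_out(h_in, kernel_size, stride=1, padding=0,
                        output_padding=0, dilation=1):
    # Direct transcription of the output-size formula from the PyTorch docs [1]
    return ((h_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# The worked example: 11 without output_padding, 12 with it
assert convtranspose2d_out(6, 3, stride=2, padding=1) == 11
assert convtranspose2d_out(6, 3, stride=2, padding=1, output_padding=1) == 12

# Cross-check against the actual layer
layer = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                           padding=1, output_padding=1)
assert layer(torch.zeros(1, 1, 6, 6)).shape[-1] == 12
```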
The Padding Gotcha
This catches almost everyone the first time.
In Conv2d, padding adds zeros around the input, making the output larger.
In ConvTranspose2d, padding removes rows and columns from the output edges, making the output smaller.
Yes, it is the opposite behavior. Here is why: internally, the operation computes
effective_zero_border = dilation x (kernel_size - 1) - padding
This deliberate design ensures that a Conv2d and a ConvTranspose2d with the same parameters produce inverse shapes — exactly what you want for encoder-decoder architectures like U-Net [5].
Mistake to avoid: Setting padding=1 expecting it to add border pixels. It actually removes 1 pixel from each side of the output. Always verify your output shape before training.
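The shrinking effect is easy to see empirically. A quick sketch (a 6x6 input, sizes chosen only for illustration) sweeps padding and prints the resulting output height, which drops by 2 (1 per side) for each unit of padding:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 6, 6)
for p in (0, 1, 2):
    h = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=p)(x).shape[-1]
    print(f"padding={p} -> H_out={h}")
# padding=0 -> H_out=13
# padding=1 -> H_out=11
# padding=2 -> H_out=9
```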
The output_padding Parameter
When stride > 1, multiple input sizes can map to the same Conv2d output size. For example, both an 11x11 and a 12x12 input produce a 6x6 output with kernel_size=3, stride=2, padding=1. So when going in reverse, which size should we produce?
output_padding resolves this ambiguity [1]:
- It adds extra rows/columns to one side of the output
- It does not add learnable parameters
- It does not pad symmetrically
- Valid values range from 0 to stride - 1
```python
# stride=2 means output_padding can be 0 or 1
upsample = nn.ConvTranspose2d(
    in_channels=16,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
    output_padding=1,  # adds 1 extra row and column
)
```
Code Walkthrough: Symmetric Encoder-Decoder
Here is a complete, runnable example showing how Conv2d and ConvTranspose2d work as shape inverses:
```python
import torch
import torch.nn as nn

# Start with a 12x12 feature map
input = torch.randn(1, 16, 12, 12)  # [batch, channels, H, W]

# Encoder: halves spatial dims (12 -> 6)
downsample = nn.Conv2d(
    in_channels=16,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

# Decoder: restores spatial dims (6 -> 12)
upsample = nn.ConvTranspose2d(
    in_channels=16,
    out_channels=16,
    kernel_size=3,
    stride=2,
    padding=1,
)

h = downsample(input)
print(h.size())  # torch.Size([1, 16, 6, 6])

# Pass output_size to resolve shape ambiguity automatically
output = upsample(h, output_size=input.size())
print(output.size())  # torch.Size([1, 16, 12, 12])
```
Pro tip: Passing output_size=input.size() in the forward call lets PyTorch compute output_padding automatically [1]. This is cleaner than hardcoding output_padding=1 because it adapts if your input size changes.
Without output_size, the upsample would produce [1, 16, 11, 11] — one pixel short in each dimension.
Quick Reference: Key Parameters
| Parameter | What it does | Common values |
|---|---|---|
| kernel_size | Size of the convolution kernel | 3 or 4 |
| stride | Controls the upsampling factor | 2 for 2x upsampling |
| padding | Removes output border pixels | 1 with kernel 3 or 4 |
| output_padding | Resolves shape ambiguity | 0 or 1 (must be < stride) |
| dilation | Spacing between kernel weights | 1 (rarely changed for upsampling) |
| groups | Grouped convolutions (depthwise when groups = channels) | 1 (standard) |
Common Configurations You Will Use
```python
# Configuration 1: Clean 2x upsampling (most common in practice)
# Produces exactly 2 * H_in output: no ambiguity, no output_padding needed
nn.ConvTranspose2d(C_in, C_out, kernel_size=4, stride=2, padding=1)

# Configuration 2: Symmetric inverse of Conv2d (use with output_size)
nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1)           # encoder
nn.ConvTranspose2d(C, C, kernel_size=3, stride=2, padding=1)  # decoder

# Configuration 3: Non-square upsampling
nn.ConvTranspose2d(16, 33, kernel_size=(3, 5), stride=(2, 1), padding=(4, 2))
```
Rule of thumb: For clean 2x upsampling with no shape ambiguity, use kernel_size=4, stride=2, padding=1. This gives exactly 2 * H_in every time.
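The rule of thumb is easy to verify: plugging kernel_size=4, stride=2, padding=1 into the formula gives (H - 1) x 2 - 2 + 3 + 1 = 2H for any H. A quick sketch (channel count arbitrary) checks it across several input sizes, odd and even:

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(8, 8, kernel_size=4, stride=2, padding=1)
for h in (5, 6, 7, 32):
    out = up(torch.zeros(1, 8, h, h))
    assert out.shape[-2:] == (2 * h, 2 * h)  # exactly 2x, every time
```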
Where You Will See It in Real Architectures
U-Net
The decoder path uses transposed convolutions to progressively restore spatial resolution from the bottleneck [5]. At each scale, the upsampled features are concatenated with skip connections from the corresponding encoder level. This is the architecture behind most modern medical image segmentation. Andrew Ng's deeplearning.ai course covers this architecture in detail, including the transposed convolution step [2].
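A single decoder step can be sketched as follows. This is a simplified illustration, not the paper's exact configuration: channel counts and spatial sizes are invented for the example, and normalization and activations are omitted. The U-Net paper's "up-convolution" corresponds to kernel_size=2, stride=2, which exactly doubles resolution:

```python
import torch
import torch.nn as nn

# One decoder step: upsample, concatenate the encoder skip, then convolve
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # halves channels, doubles H/W
conv = nn.Conv2d(128, 64, kernel_size=3, padding=1)

bottleneck = torch.randn(1, 128, 16, 16)
skip = torch.randn(1, 64, 32, 32)   # from the matching encoder level

x = up(bottleneck)                  # -> (1, 64, 32, 32)
x = torch.cat([x, skip], dim=1)     # -> (1, 128, 32, 32)
x = conv(x)                         # -> (1, 64, 32, 32)
assert x.shape == (1, 64, 32, 32)
```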
DCGAN
The generator transforms a 100-dimensional noise vector into a full image through successive ConvTranspose2d layers. The first layer projects the (100, 1, 1) noise tensor to a 4x4 feature map; each subsequent layer doubles the spatial resolution and halves the channel count, reaching (3, 64, 64). The convolution arithmetic guide [3] provides detailed visualizations of how each layer expands the spatial dimensions.
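The spatial progression can be sketched as a bare stack of layers. Channel counts here are illustrative, and the batch norms and activations a real DCGAN generator needs are omitted to keep the shape arithmetic in focus:

```python
import torch
import torch.nn as nn

# Shape-only sketch of a DCGAN-style generator (norms/activations omitted)
gen = nn.Sequential(
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0),  # 1x1  -> 4x4
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),  # 4x4  -> 8x8
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 8x8  -> 16x16
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 16x16 -> 32x32
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 32x32 -> 64x64
)

z = torch.randn(1, 100, 1, 1)   # noise vector as a 1x1 "image"
assert gen(z).shape == (1, 3, 64, 64)
```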
VAE Decoders
Variational autoencoders use transposed convolutions in their decoder to map from a compact latent space back to full image resolution. The learnable upsampling is important here because the decoder needs to reconstruct fine-grained spatial details from a compressed representation.
Semantic Segmentation Heads
FCN, DeepLab variants, and SegFormer decoder heads all use some form of learned upsampling (either transposed convolutions or bilinear upsampling followed by a 1x1 conv) to match the output resolution to the input image resolution.
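The second variant, interpolation followed by a 1x1 convolution, can be sketched like this. All shapes and the class count are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Classify per pixel at low resolution, then interpolate up to image size
feat = torch.randn(1, 256, 32, 32)            # decoder features (illustrative)
classify = nn.Conv2d(256, 21, kernel_size=1)  # e.g. 21 classes

logits = classify(feat)                       # (1, 21, 32, 32)
logits = F.interpolate(logits, size=(128, 128),
                       mode="bilinear", align_corners=False)
assert logits.shape == (1, 21, 128, 128)
```

Here the upsampling itself is fixed bilinear interpolation, but the 1x1 conv before it is learned, a common middle ground when full transposed convolutions cause checkerboard-style artifacts.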
Key Takeaways
- ConvTranspose2d is learnable upsampling, not a true deconvolution or mathematical inverse
- It works by inserting zeros between input elements, then applying a standard convolution [3]
- padding behaves opposite to Conv2d: it removes output pixels instead of adding them
- Use output_padding or pass output_size in the forward call to resolve shape ambiguity when stride > 1 [1]
- For clean 2x upsampling with no surprises: kernel_size=4, stride=2, padding=1
References
[1] PyTorch Contributors. ConvTranspose2d — PyTorch Documentation. pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html
[2] Andrew Ng. Transposed Convolutions. deeplearning.ai. youtube.com/watch?v=qb4nRoEAASA
[3] Dumoulin, V. & Visin, F. A Guide to Convolution Arithmetic for Deep Learning. arXiv:1603.07285, 2016. arxiv.org/abs/1603.07285
[4] Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. Dive into Deep Learning — Transposed Convolution. d2l.ai/chapter_computer-vision/transposed-conv.html
[5] Ronneberger, O., Fischer, P., & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015. arxiv.org/abs/1505.04597