Module 154
Swin Transformer Window Attention – Deep, Intuitive & Mathematical Explanation
Why it exists, how it works, and why it destroyed the quadratic bottleneck of ViT
The Core Problem Swin Solves
| Model | Self-Attention Complexity | Can handle 1024×1024 image? | Memory (224×224) | Memory (512×512) |
|---|---|---|---|---|
| Original ViT | O((HW)²) = O(N²) | No, explodes | ~1 GB | ~20+ GB (dead) |
| Swin | O(HW) ≈ linear | Yes, easily | ~200 MB | ~800 MB |
ViT computes attention between all pairs of patches. At 224×224 with patch size 16 that is (224/16)² = 196 patches, which is manageable, but at 1920×1920 it becomes (1920/16)² = 14,400 patches → 14,400² ≈ 207 million attention scores → dead on high-res images.
Swin’s genius idea:
“Don’t do global attention. Do attention only inside small local windows.”
→ Complexity drops from O(N²) to O(N)
How Swin Window Attention Works – Step by Step
Step 1: Divide Image into Non-Overlapping Windows
- Default window size M = 7 → each window is 7×7 = 49 patches
- Example: 224×224 image, patch_size=4 → feature map 56×56
- → 8×8 = 64 windows of size 7×7 each
Image → Patches → H×W feature map
↓
Divide into M×M windows (non-overlapping)
↓
Each window does self-attention independently
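A minimal sketch of this partition step, assuming a (B, H, W, C) token layout (the helper name window_partition matches the official implementation; the sizes below follow the example above):
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into (B * H/M * W/M, M, M, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

x = torch.randn(1, 56, 56, 96)   # 224×224 image, patch_size=4 → 56×56 tokens
windows = window_partition(x, 7)
print(windows.shape)             # torch.Size([64, 7, 7, 96]) → 64 windows of 7×7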
Step 2: Regular Window Attention (Like Mini-ViT per Window)
Inside each 7×7 window:
- 49 patches → 49 tokens
- Compute Q, K, V → attention scores (49×49 matrix)
- Apply relative position bias (very important!)
- Output same 49 tokens
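To make this concrete, here is a minimal sketch of multi-head self-attention over a batch of windows (dimensions are example values; the relative position bias is left out here and covered in its own section below):
import torch
import torch.nn as nn

num_win, M, C, heads = 64, 7, 96, 3       # windows, window size, dim, heads (example values)
x = torch.randn(num_win, M * M, C)        # 49 tokens per window

qkv = nn.Linear(C, 3 * C)(x).reshape(num_win, M * M, 3, heads, C // heads)
q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (num_win, heads, 49, C // heads)

attn = (q @ k.transpose(-2, -1)) * (C // heads) ** -0.5  # (num_win, heads, 49, 49) logits
attn = attn.softmax(dim=-1)               # each token attends only within its own window
out = (attn @ v).transpose(1, 2).reshape(num_win, M * M, C)
print(out.shape)                          # torch.Size([64, 49, 96]) → same 49 tokens out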
Total complexity per layer:
64 windows × 49² = 64 × 2,401 = 153,664 attention scores
vs global attention over the 56×56 map: (56×56)² = 3,136² ≈ 9.8 million
→ 64× cheaper!
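A quick sanity check of this arithmetic:
H = W = 56; M = 7
num_windows = (H // M) * (W // M)         # 64
window_ops = num_windows * (M * M) ** 2   # 64 × 49² = 153,664
global_ops = (H * W) ** 2                 # 3,136² = 9,834,496
print(global_ops / window_ops)            # 64.0 → windowed attention is 64× cheaper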
Step 3: The Magic – Shifted Windows in Next Block
Problem: Regular windows have no communication between windows → no global context!
Swin’s breakthrough: In every second block, shift the window grid by (⌊M/2⌋, ⌊M/2⌋) tokens
→ Now windows overlap across boundaries → information flows!
Layer 1: Regular windows
┌─────┬─────┬─────┐
│ A │ B │ C │
├─────┼─────┼─────┤
│ D │ E │ F │
└─────┴─────┴─────┘
Layer 2: Shifted windows (grid offset by ⌊M/2⌋ = 3 tokens); each cell lists which original windows contribute:
┌────┬──────┬──────┬────┐
│ A  │ A·B  │ B·C  │ C  │
├────┼──────┼──────┼────┤
│A·D │ ABDE │ BCEF │C·F │
├────┼──────┼──────┼────┤
│ D  │ D·E  │ E·F  │ F  │
└────┴──────┴──────┴────┘
Now a patch in window A can attend to a patch in window B through the shifted window!
Step 4: Cyclic Shift Trick (Efficient Implementation)
Instead of actually materializing the smaller, irregular windows at the image border (padding + wasted compute), Swin cyclically rolls the whole feature map:
# x: (B, H, W, C); shift_size = window_size // 2
# Before attention in a shifted block
x_shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
# ... windowed attention on x_shifted ...
# After attention, roll back
x = torch.roll(x_shifted, shifts=(shift_size, shift_size), dims=(1, 2))
→ Just two tensor rolls; the shift is exact and cheap!
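A quick check, with assumed example shapes, that the roll round-trip is exact:
import torch

x = torch.randn(1, 56, 56, 96)            # (B, H, W, C) feature map, example shape
s = 3                                     # shift_size = window_size // 2 = 7 // 2
x_shifted = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
x_back = torch.roll(x_shifted, shifts=(s, s), dims=(1, 2))
assert torch.equal(x, x_back)             # the cyclic shift is exactly invertible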
Step 5: Masking in Shifted Windows
After the cyclic shift, a single window can contain patches from up to 4 different original windows
→ If we don’t mask, they would illegally attend to each other.
Solution: Create attention mask
- Patches from different original windows → mask value = -100
- Same window → 0
→ After softmax → zero attention across original window boundaries
→ Preserves locality!
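Here is a minimal sketch of the mask construction, following the region-labeling scheme of the official implementation (H and W are small example values):
import torch

H = W = 14; M = 7; s = M // 2             # example map size, window size, shift

# 1. Label every token by which shifted region it belongs to
img_mask = torch.zeros(1, H, W, 1)
cnt = 0
for h in (slice(0, -M), slice(-M, -s), slice(-s, None)):
    for w in (slice(0, -M), slice(-M, -s), slice(-s, None)):
        img_mask[:, h, w, :] = cnt
        cnt += 1

# 2. Partition the label map into windows (same reshape as window_partition)
mask_windows = img_mask.view(1, H // M, M, W // M, M, 1)
mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)

# 3. Token pairs with different labels came from different original windows
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)   # (nW, M*M, M*M)
attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)
# attn_mask is added to the logits before softmax → cross-window attention vanishes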
Mathematical Complexity Analysis
| Method | Attention Complexity per Layer | Scaling in N = HW tokens |
|---|---|---|
| Global (ViT) | O((HW)²) | Quadratic: O(N²) |
| Swin (Window=7) | O(HW × M²) = O(HW × 49) | Linear: O(N) |
| Swin (with shift) | still O(HW × M²) | Linear: O(N) |
Since M is fixed (7, or 12 in the larger 384² configs), complexity is linear in the number of tokens → scales to 4K images!
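For reference, the exact per-layer costs from the Swin paper, for an h×w token map with channel dimension C and window size M (the 4hwC² projection term is shared; only the attention term differs):
Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC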
Relative Position Bias (The Secret Sauce)
Swin doesn’t attach an absolute positional embedding to every patch.
Instead: Learn a small bias table B of size (2M−1)×(2M−1) × num_heads
Example: M=7 → 13×13 = 169 biases per head
For any relative offset (Δy, Δx) with Δy, Δx ∈ [−(M−1), M−1], add B[Δy+M−1, Δx+M−1] to the attention logit
→ Translation invariant + very few parameters!
This is why Swin generalizes so well across resolutions.
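A minimal sketch of how the table is stored and indexed, mirroring the official implementation (M and num_heads are example values):
import torch
import torch.nn as nn

M, num_heads = 7, 3   # window size and head count (example values)

# Learnable table: one bias per relative offset per head → (2M−1)² rows
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# Precompute which table row each (query, key) pair inside a window uses
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                            # (2, M*M) token coordinates
rel = coords[:, :, None] - coords[:, None, :]         # (2, M*M, M*M): Δy, Δx per pair
rel = rel.permute(1, 2, 0) + (M - 1)                  # shift range [−6, 6] → [0, 12]
rel_index = rel[..., 0] * (2 * M - 1) + rel[..., 1]   # flatten (Δy, Δx) to one index

# In the attention forward pass: look up and add to the (M*M, M*M) logits per head
bias = bias_table[rel_index.view(-1)].view(M * M, M * M, num_heads)
# attn = attn + bias.permute(2, 0, 1)                 # broadcast over all windows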
Visual Summary – How Information Flows
Layer 1 (Regular Windows) → Local only
Layer 2 (Shifted Windows) → Connects adjacent windows
Layer 3 (Regular) → Local again
Layer 4 (Shifted) → Connects further
...
After a few alternating blocks (plus patch merging between stages) → Global receptive field!
Just like CNNs build hierarchy, but with attention!
Comparison Table (Memorize This!)
| Feature | ViT (Global) | Swin (Window + Shifted) |
|---|---|---|
| Attention Scope | Global | Local → Global via hierarchy |
| Complexity | Quadratic O(N²) | Linear O(N) |
| Max Resolution (reasonable) | 384–512px | 1536px+ (Swin-V2) |
| Translation Invariance | Learned | Built-in (relative bias + shift) |
| Inductive Bias | None | Locality + hierarchy |
| Best For | Large data | Detection, segmentation, video |
| ImageNet-1K Top-1 | 88.55% (ViT-H, JFT-300M pretrain) | 87.3% (Swin-L, IN-22K pretrain) + much faster |
Code Snippet – The Heart (Just 10 lines!)
# x: (B, H, W, C) tokens; in a shifted block, shift_size = window_size // 2
if self.shift_size > 0:
    x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
# Partition into windows → attention → merge back
x_windows = window_partition(x, self.window_size)         # (num_windows*B, M, M, C)
x_windows = x_windows.view(-1, self.window_size ** 2, C)  # (num_windows*B, M*M, C)
attn_windows = self.attn(x_windows, mask=attn_mask)       # ← only inside each window
x = window_reverse(attn_windows, self.window_size, H, W)  # back to (B, H, W, C)
# Unshift
if self.shift_size > 0:
    x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
This 10-line trick made transformers practical for vision.
Why Swin Won Everything After 2021
- 2021: ICCV Best Paper (Marr Prize); new SOTA on COCO detection and ADE20K segmentation
- 2021–2022: Standard backbone for Mask R-CNN, Cascade R-CNN, HTC++ → COCO SOTA
- 2022: Swin-V2 scales to 3B parameters and 1536×1536 inputs → SOTA on multiple benchmarks
- 2023+: Swin backbones remain standard in detection/segmentation frameworks (e.g., Mask2Former, the MMDetection model zoo), and windowed attention lives on in newer hierarchical backbones
Final Summary – Why Window Attention is Genius
| Problem | ViT Solution | Swin Solution |
|---|---|---|
| Quadratic complexity | Accept it | Fixed windows → linear |
| No locality bias | Add pos embed | Windows + relative bias → strong |
| Poor at high resolution | Downsample early | Hierarchical stages |
| Cross-window info flow | N/A (attention is already global) | Shifted windows → fast flow |
Swin Transformer proved that you can have the best of both worlds:
Transformer flexibility + CNN efficiency and inductive bias.
This is why, in 2025, Swin (and its successors: Swin-V2, Swin-MoE, FocalNet, etc.) remains one of the most widely used vision backbone families in the world.
You now fully understand why Swin’s window attention is one of the most influential ideas in modern computer vision.