
Module 154

Swin Transformer Window Attention – Deep, Intuitive & Mathematical Explanation

Why it exists, how it works, and why it destroyed the quadratic bottleneck of ViT

The Core Problem Swin Solves

Model          Self-Attention Complexity   Handles 1024×1024?   Memory (224×224)   Memory (512×512)
Original ViT   O((HW)²) = O(N²)            No, explodes         ~1 GB              ~20+ GB (dead)
Swin           O(HW) ≈ linear              Yes, easily          ~200 MB            ~800 MB

ViT computes attention between all pairs of patches. At 224×224 with patch size 16 that is only (224/16)² = 196 patches, but at roughly 1920×1920 it becomes (1920/16)² = 14,400 patches → ~200 million attention scores per head → dead on high-res images.

Swin’s genius idea:
“Don’t do global attention. Do attention only inside small local windows.”
→ Complexity drops from O(N²) to O(N)

How Swin Window Attention Works – Step by Step

Step 1: Divide Image into Non-Overlapping Windows

  • Default window size M = 7 → each window is 7×7 = 49 patches
  • Example: 224×224 image, patch_size=4 → feature map 56×56
  • → 8×8 = 64 windows of size 7×7 each
Image → Patches → H×W feature map
      ↓
Divide into M×M windows (non-overlapping)
      ↓
Each window does self-attention independently
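
Here is a minimal sketch of the partitioning step, assuming a (B, H, W, C) feature map as in the example above; the helper name window_partition matches the one used in the code snippets later in this article.

import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping window_size×window_size windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # (B, H/M, M, W/M, M, C) → (B·num_windows, M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

x = torch.randn(1, 56, 56, 96)        # 224×224 image, patch_size=4, embed dim 96
print(window_partition(x, 7).shape)   # torch.Size([64, 7, 7, 96]) → 64 windows of 49 patches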

Step 2: Regular Window Attention (Like Mini-ViT per Window)

Inside each 7×7 window:

  • 49 patches → 49 tokens
  • Compute Q, K, V → attention scores (49×49 matrix)
  • Apply relative position bias (very important!)
  • Output same 49 tokens

Attention-score count per layer: 64 windows × 49² = 64 × 2,401 = 153,664
vs ViT's global (56×56)² = 3,136² ≈ 9.8 million
→ 64× fewer attention scores!
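
As a toy illustration of what happens inside one window (single head, random weights, no relative position bias yet; M and C are just the example values from above), per-window attention is an ordinary self-attention over 49 tokens:

import torch
import torch.nn as nn

M, C = 7, 96
tokens = torch.randn(64, M * M, C)              # 64 windows × 49 tokens × C channels
q, k, v = nn.Linear(C, 3 * C)(tokens).chunk(3, dim=-1)
attn = (q @ k.transpose(-2, -1)) / C ** 0.5     # (64, 49, 49) score matrices — tiny vs. 3136×3136
out = attn.softmax(dim=-1) @ v                  # (64, 49, C) — same 49 tokens come out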

Step 3: The Magic – Shifted Windows in Next Block

Problem: Regular windows have no communication between windows → no global context!

Swin’s breakthrough: In every second block, shift the window grid by (⌊M/2⌋, ⌊M/2⌋) patches (3 for M = 7)
→ The new windows straddle the boundaries of the previous layer’s windows → information flows across them!

Layer 1: Regular windows
┌─────┬─────┬─────┐
│  A  │  B  │  C  │
├─────┼─────┼─────┤
│  D  │  E  │  F  │
└─────┴─────┴─────┘

Layer 2: Shifted windows (grid offset by ⌊M/2⌋ = 3 patches)
    ┌───────┬───────┐
    │ A B   │ B C   │
    │ D E   │ E F   │   ← each interior shifted window covers
    └───────┴───────┘     parts of up to 4 original windows

Now a patch that was in window A can attend to patches that were in windows B, D, or E through the shifted window that covers their shared corner!

Step 4: Cyclic Shift Trick (Efficient Implementation)

A naive shift creates extra, smaller windows along the borders (and padding them back to M×M would increase the window count and the compute). Instead, Swin cyclically rolls the feature map:

# Before attention in shifted block
x_shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1,2))

# After attention
x = torch.roll(x_shifted, shifts=(shift_size, shift_size), dims=(1,2))

→ Zero overhead, perfect shift!
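
To see that the roll is lossless and needs no padding, here is a tiny toy demo (a hypothetical 6×6 "feature map" with shift 3, not part of the model; the real roll is applied to the (B, H, W, C) tensor as above):

import torch

x = torch.arange(36).view(1, 6, 6)
x_shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))               # wrap rows/cols around
assert torch.equal(torch.roll(x_shifted, shifts=(3, 3), dims=(1, 2)), x)  # perfectly reversible
print(x_shifted[0, 0])  # original row 3, cyclically shifted: tensor([21, 22, 23, 18, 19, 20])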

Step 5: Masking in Shifted Windows

After the cyclic shift, a single window can contain patches that came from up to 4 different, non-adjacent regions of the original feature map (some wrapped around from the opposite edge)
→ If we don’t mask, they would attend to each other even though they are not neighbours in the image.

Solution: Create attention mask

  • Patches from different original windows → mask value = -100
  • Same window → 0

→ After softmax → zero attention across original window boundaries
→ Preserves locality!
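
A sketch of how that mask can be built, mirroring the approach of the reference Swin implementation (it reuses the window_partition helper sketched in Step 1, with the example values H = W = 56, window size 7, shift 3):

import torch

H, W, window_size, shift_size = 56, 56, 7, 3
img_mask = torch.zeros(1, H, W, 1)
# Label each region of the (rolled) feature map by which group of original windows it came from.
slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in slices:
    for w in slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

mask_windows = window_partition(img_mask, window_size).view(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)   # (num_windows, 49, 49)
attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0)           # different region → -100
attn_mask = attn_mask.masked_fill(attn_mask == 0, 0.0)              # same region → 0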

Mathematical Complexity Proof

Method              Attention Complexity per Layer   Total for 4 stages
Global (ViT)        O((HW)²)                         O(N²)
Swin (window = 7)   O(HW × M²) = O(HW × 49)          ~O(N)
Swin (with shift)   Still O(HW × M²)                 Linear!

Since M is fixed (7 or 12), complexity is linear in image size → scales to 4K images!
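To make the scaling concrete, here is a small back-of-the-envelope calculation; the two formulas are the FLOP counts given in the Swin paper (softmax omitted), applied to the 56×56, C = 96 example. Note the QKV/output projections are unchanged, which is why the total FLOP ratio is smaller than the 64× drop in attention scores.

# Rough FLOP comparison for one attention layer on a 56×56 feature map, C=96, window M=7
#   global MSA:   4·h·w·C² + 2·(h·w)²·C
#   window MSA:   4·h·w·C² + 2·M²·h·w·C
h = w = 56; C = 96; M = 7
global_msa = 4 * h * w * C**2 + 2 * (h * w)**2 * C
window_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C
print(f"{global_msa/1e9:.2f} vs {window_msa/1e9:.3f} GFLOPs")  # ≈ 2.00 vs ≈ 0.145 (~14× fewer)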

Relative Position Bias (The Secret Sauce)

Swin doesn’t add an absolute positional embedding to each patch.

Instead: Learn a small bias table B of size (2M−1)×(2M−1) × num_heads
Example: M=7 → 13×13 = 169 biases per head

For any relative position (Δx, Δy) between two patches in a window, add B[Δx + M−1, Δy + M−1] to the attention logit
→ Translation invariant + very few parameters!

This also helps Swin transfer across window sizes and resolutions: the bias table can simply be interpolated when the window size changes.
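
A sketch of how the bias table is built and indexed (shapes follow the numbers above; num_heads = 3 is just an example value, and the reference implementation additionally registers the index as a buffer and uses a truncated-normal init for the table):

import torch
import torch.nn as nn

M, num_heads = 7, 3
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))   # 169 biases per head

# For every pair of the 49 positions in a window, precompute the index into the table.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                                 # (2, 49)
rel = coords[:, :, None] - coords[:, None, :]              # (2, 49, 49): Δy, Δx in [-(M-1), M-1]
rel = rel + (M - 1)                                        # shift to [0, 2M-2]
index = rel[0] * (2 * M - 1) + rel[1]                      # (49, 49) flat index into the table

bias = bias_table[index.view(-1)].view(M * M, M * M, num_heads)   # (49, 49, num_heads)
# In attention: logits = q @ k.T / sqrt(d) + bias.permute(2, 0, 1)  (added before softmax)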

Visual Summary – How Information Flows

Layer 1 (Regular Windows)     → Local only
Layer 2 (Shifted Windows)     → Connects adjacent windows
Layer 3 (Regular)             → Local again
Layer 4 (Shifted)             → Connects further
...
After several alternating block pairs (plus downsampling between the 4 stages) → global receptive field!

Just like CNNs build hierarchy, but with attention!

Comparison Table (Memorize This!)

Feature                       ViT (Global)        Swin (Window + Shifted)
Attention scope               Global              Local → global via hierarchy
Complexity                    Quadratic O(N²)     Linear O(N)
Max resolution (reasonable)   384–512 px          2048 px+ (high-res detection & segmentation)
Translation invariance        Learned             Built-in (relative bias + shift)
Inductive bias                None                Locality + hierarchy
Best for                      Large data          Detection, segmentation, video
ImageNet-1K Top-1             88.5% (ViT-L)       87.3% (Swin-L), much faster

Code Snippet – The Heart (just a dozen lines!)

# In the shifted block: cyclically shift the (B, H, W, C) feature map
if self.shift_size > 0:
    x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))

# Partition into windows → attention inside each window → merge back
x_windows = window_partition(x, self.window_size)                      # (num_windows*B, M, M, C)
x_windows = x_windows.view(-1, self.window_size * self.window_size, C)
attn_windows = self.attn(x_windows, mask=self.attn_mask)               # ← only inside each window
attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
x = window_reverse(attn_windows, self.window_size, H, W)               # back to (B, H, W, C)

# Unshift
if self.shift_size > 0:
    x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))

This dozen-line trick made transformers practical for vision.

Why Swin Won Everything After 2021

  • 2021: Matched the best CNNs and ViTs on ImageNet while setting new SOTA on COCO detection and ADE20K segmentation
  • 2022: Standard backbone for Mask R-CNN, Cascade R-CNN, Mask2Former → COCO SOTA
  • 2022: Swin-V2 scaled to 3B parameters and higher-resolution windows → new records on several benchmarks
  • Since then: windowed and hierarchical attention have become standard ingredients in detection, segmentation, and video backbones

Final Summary – Why Window Attention is Genius

Problem                         ViT Solution                 Swin Solution
Quadratic complexity            Accept it                    Fixed windows → linear
No locality bias                Add positional embeddings    Windows + relative bias → strong locality
Poor at high resolution         Resize the input down        Hierarchical stages
Cross-window information flow   N/A (already global)         Shifted windows → fast flow

Swin Transformer proved that you can have the best of both worlds:
Transformer flexibility + CNN efficiency and inductive bias.

This is why, in 2025, Swin (and its descendants: Swin-V2, Swin-MoE, FocalNet, etc.) remains one of the most widely used vision backbones in the world.

You now fully understand why Swin’s window attention is one of the most important ideas in deep learning since ReLU.