Module 154
Swin Transformer Window Attention – Deep, Intuitive & Mathematical Explanation
Why it exists, how it works, and why it destroyed the quadratic bottleneck of ViT
The Core Problem Swin Solves
| Model | Self-Attention Complexity | Can handle 1024×1024 image? | Memory (224×224) | Memory (512×512) |
|---|---|---|---|---|
| Original ViT | O((HW)²) = O(N²) | No, explodes | ~1 GB | ~20+ GB (dead) |
| Swin | O(HW) ≈ linear | Yes, easily | ~200 MB | ~800 MB |
ViT computes attention between all pairs of patches. At 224×224 with patch size 16 that is (224/16)² = 196 patches, which is manageable, but at 1920×1920 it becomes (1920/16)² = 14,400 patches → 14,400² ≈ 207 million attention scores → dead on high-res images.
Swin’s genius idea:
“Don’t do global attention. Do attention only inside small local windows.”
→ Complexity drops from O(N²) to O(N)
How Swin Window Attention Works – Step by Step
Step 1: Divide Image into Non-Overlapping Windows
- Default window size M = 7 → each window is 7×7 = 49 patches
- Example: 224×224 image, patch_size=4 → feature map 56×56
- → 8×8 = 64 windows of size 7×7 each
Image → Patches → H×W feature map
↓
Divide into M×M windows (non-overlapping)
↓
Each window does self-attention independently
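A minimal sketch of this partition step, assuming a (B, H, W, C) token layout (the helper name window_partition matches the official implementation; the sizes below follow the example above):
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split a (B, H, W, C) map into (B * H/M * W/M, M, M, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

x = torch.randn(1, 56, 56, 96)   # 224×224 image, patch_size=4 → 56×56 tokens
windows = window_partition(x, 7)
print(windows.shape)             # torch.Size([64, 7, 7, 96]) → 64 windows of 7×7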
Step 2: Regular Window Attention (Like Mini-ViT per Window)
Inside each 7×7 window:
- 49 patches → 49 tokens
- Compute Q, K, V → attention scores (49×49 matrix)
- Apply relative position bias (very important!)
- Output same 49 tokens
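To make this concrete, here is a minimal sketch of multi-head self-attention over a batch of windows (dimensions are example values; the relative position bias is left out here and covered in its own section below):
import torch
import torch.nn as nn

num_win, M, C, heads = 64, 7, 96, 3       # windows, window size, dim, heads (example values)
x = torch.randn(num_win, M * M, C)        # 49 tokens per window

qkv = nn.Linear(C, 3 * C)(x).reshape(num_win, M * M, 3, heads, C // heads)
q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (num_win, heads, 49, C // heads)

attn = (q @ k.transpose(-2, -1)) * (C // heads) ** -0.5  # (num_win, heads, 49, 49) logits
attn = attn.softmax(dim=-1)               # each token attends only within its own window
out = (attn @ v).transpose(1, 2).reshape(num_win, M * M, C)
print(out.shape)                          # torch.Size([64, 49, 96]) → same 49 tokens out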
Total complexity per layer:
64 windows × 49² = 64 × 2,401 = 153,664 attention scores
vs global attention over the 56×56 map: (56×56)² = 3,136² ≈ 9.8 million
→ 64× cheaper!
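A quick sanity check of this arithmetic:
H = W = 56; M = 7
num_windows = (H // M) * (W // M)         # 64
window_ops = num_windows * (M * M) ** 2   # 64 × 49² = 153,664
global_ops = (H * W) ** 2                 # 3,136² = 9,834,496
print(global_ops / window_ops)            # 64.0 → windowed attention is 64× cheaper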
Step 3: The Magic – Shifted Windows in Next Block
Problem: Regular windows have no communication between windows → no global context!
Swin’s breakthrough: In every second block, shift the window grid by (⌊M/2⌋, ⌊M/2⌋) tokens
→ Now windows overlap across boundaries → information flows!
Layer 1: Regular windows
┌─────┬─────┬─────┐
│ A │ B │ C │
├─────┼─────┼─────┤
│ D │ E │ F │
└─────┴─────┴─────┘
Layer 2: Shifted windows (grid offset by ⌊M/2⌋ = 3 tokens); each cell lists which original windows contribute:
┌────┬──────┬──────┬────┐
│ A  │ A·B  │ B·C  │ C  │
├────┼──────┼──────┼────┤
│A·D │ ABDE │ BCEF │C·F │
├────┼──────┼──────┼────┤
│ D  │ D·E  │ E·F  │ F  │
└────┴──────┴──────┴────┘
Now a patch in window A can attend to a patch in window B through the shifted window!
Step 4: Cyclic Shift Trick (Efficient Implementation)
Instead of actually materializing the smaller, irregular windows at the image border (padding + wasted compute), Swin cyclically rolls the whole feature map:
# x: (B, H, W, C); shift_size = window_size // 2
# Before attention in a shifted block
x_shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
# ... windowed attention on x_shifted ...
# After attention, roll back
x = torch.roll(x_shifted, shifts=(shift_size, shift_size), dims=(1, 2))
→ Just two tensor rolls; the shift is exact and cheap!
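A quick check, with assumed example shapes, that the roll round-trip is exact:
import torch

x = torch.randn(1, 56, 56, 96)            # (B, H, W, C) feature map, example shape
s = 3                                     # shift_size = window_size // 2 = 7 // 2
x_shifted = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
x_back = torch.roll(x_shifted, shifts=(s, s), dims=(1, 2))
assert torch.equal(x, x_back)             # the cyclic shift is exactly invertible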
Step 5: Masking in Shifted Windows
After the cyclic shift, a single window can contain patches from up to 4 different original windows
→ If we don’t mask, they would illegally attend to each other.
Solution: Create attention mask
- Patches from different original windows → mask value = -100
- Same window → 0
→ After softmax → zero attention across original window boundaries
→ Preserves locality!
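Here is a minimal sketch of the mask construction, following the region-labeling scheme of the official implementation (H and W are small example values):
import torch

H = W = 14; M = 7; s = M // 2             # example map size, window size, shift

# 1. Label every token by which shifted region it belongs to
img_mask = torch.zeros(1, H, W, 1)
cnt = 0
for h in (slice(0, -M), slice(-M, -s), slice(-s, None)):
    for w in (slice(0, -M), slice(-M, -s), slice(-s, None)):
        img_mask[:, h, w, :] = cnt
        cnt += 1

# 2. Partition the label map into windows (same reshape as window_partition)
mask_windows = img_mask.view(1, H // M, M, W // M, M, 1)
mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)

# 3. Token pairs with different labels came from different original windows
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)   # (nW, M*M, M*M)
attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)
# attn_mask is added to the logits before softmax → cross-window attention vanishes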
Mathematical Complexity Analysis
| Method | Attention Complexity per Layer | Scaling in N = HW tokens |
|---|---|---|
| Global (ViT) | O((HW)²) | Quadratic: O(N²) |
| Swin (Window=7) | O(HW × M²) = O(HW × 49) | Linear: O(N) |
| Swin (with shift) | still O(HW × M²) | Linear: O(N) |
Since M is fixed (7, or 12 in the larger 384² configs), complexity is linear in the number of tokens → scales to 4K images!
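For reference, the exact per-layer costs from the Swin paper, for an h×w token map with channel dimension C and window size M (the 4hwC² projection term is shared; only the attention term differs):
Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC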
Relative Position Bias (The Secret Sauce)
Swin doesn’t attach an absolute positional embedding to every patch.
Instead: Learn a small bias table B of size (2M−1)×(2M−1) × num_heads
Example: M=7 → 13×13 = 169 biases per head
For any relative offset (Δy, Δx) with Δy, Δx ∈ [−(M−1), M−1], add B[Δy+M−1, Δx+M−1] to the attention logit
→ Translation invariant + very few parameters!
This is why Swin generalizes so well across resolutions.
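A minimal sketch of how the table is stored and indexed, mirroring the official implementation (M and num_heads are example values):
import torch
import torch.nn as nn

M, num_heads = 7, 3   # window size and head count (example values)

# Learnable table: one bias per relative offset per head → (2M−1)² rows
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# Precompute which table row each (query, key) pair inside a window uses
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                            # (2, M*M) token coordinates
rel = coords[:, :, None] - coords[:, None, :]         # (2, M*M, M*M): Δy, Δx per pair
rel = rel.permute(1, 2, 0) + (M - 1)                  # shift range [−6, 6] → [0, 12]
rel_index = rel[..., 0] * (2 * M - 1) + rel[..., 1]   # flatten (Δy, Δx) to one index

# In the attention forward pass: look up and add to the (M*M, M*M) logits per head
bias = bias_table[rel_index.view(-1)].view(M * M, M * M, num_heads)
# attn = attn + bias.permute(2, 0, 1)                 # broadcast over all windows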
Visual Summary – How Information Flows
Layer 1 (Regular Windows) → Local only
Layer 2 (Shifted Windows) → Connects adjacent windows
Layer 3 (Regular) → Local again
Layer 4 (Shifted) → Connects further
...
After a few alternating blocks (plus patch merging between stages) → Global receptive field!
Just like CNNs build hierarchy, but with attention!
Comparison Table (Memorize This!)
| Feature | ViT (Global) | Swin (Window + Shifted) |
|---|---|---|
| Attention Scope | Global | Local → Global via hierarchy |
| Complexity | Quadratic O(N²) | Linear O(N) |
| Max Resolution (reasonable) | 384–512px | 1536px+ (Swin-V2) |
| Translation Invariance | Learned | Built-in (relative bias + shift) |
| Inductive Bias | None | Locality + hierarchy |
| Best For | Large data | Detection, segmentation, video |
| ImageNet-1K Top-1 | 88.55% (ViT-H, JFT-300M pretrain) | 87.3% (Swin-L, IN-22K pretrain) + much faster |
Code Snippet – The Heart (Just 10 lines!)
# x: (B, H, W, C) tokens; in a shifted block, shift_size = window_size // 2
if self.shift_size > 0:
    x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
# Partition into windows → attention → merge back
x_windows = window_partition(x, self.window_size)         # (num_windows*B, M, M, C)
x_windows = x_windows.view(-1, self.window_size ** 2, C)  # (num_windows*B, M*M, C)
attn_windows = self.attn(x_windows, mask=attn_mask)       # ← only inside each window
x = window_reverse(attn_windows, self.window_size, H, W)  # back to (B, H, W, C)
# Unshift
if self.shift_size > 0:
    x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
This 10-line trick made transformers practical for vision.
Why Swin Won Everything After 2021
- 2021: ICCV Best Paper (Marr Prize); new SOTA on COCO detection and ADE20K segmentation
- 2021–2022: Standard backbone for Mask R-CNN, Cascade R-CNN, HTC++ → COCO SOTA
- 2022: Swin-V2 scales to 3B parameters and 1536×1536 inputs → SOTA on multiple benchmarks
- 2023+: Swin backbones remain standard in detection/segmentation frameworks (e.g., Mask2Former, the MMDetection model zoo), and windowed attention lives on in newer hierarchical backbones
Final Summary – Why Window Attention is Genius
| Problem | ViT Solution | Swin Solution |
|---|---|---|
| Quadratic complexity | Accept it | Fixed windows → linear |
| No locality bias | Add pos embed | Windows + relative bias → strong |
| Poor at high resolution | Downsample early | Hierarchical stages |
| Cross-window info flow | N/A (attention is already global) | Shifted windows → fast flow |
Swin Transformer proved that you can have the best of both worlds:
Transformer flexibility + CNN efficiency and inductive bias.
This is why, in 2025, Swin (and its successors: Swin-V2, Swin-MoE, FocalNet, etc.) remains one of the most widely used vision backbone families in the world.
You now fully understand why Swin’s window attention is one of the most influential ideas in modern computer vision.