
"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE


Module Objective

Deep dive into Positional Encoding: signal processing, hashing, Fourier theory, and Sinusoidal vs Learned PE, with math, code, visualizations, and an ablation study.


1. The Problem: Attention is Permutation-Invariant

X = ["the", "cat", "sat"]
Attention(X) == Attention(["sat", "cat", "the"])

No order → no meaning
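A minimal sanity check (a sketch, not part of the original module): run nn.MultiheadAttention with no positional information on a sequence and on its reversal; the outputs come back permuted the same way.

import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
attn.eval()

x = torch.randn(1, 3, 16)        # three tokens, e.g. "the cat sat"
perm = torch.tensor([2, 1, 0])   # reversed: "sat cat the"

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# The permuted input produces exactly the permuted output
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True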


2. Two Solutions

| Type | Mechanism | Learnable? | Max Length |
| --- | --- | --- | --- |
| Sinusoidal (Fixed) | Wave functions | No | Infinite |
| Learned (Trainable) | Embedding table | Yes | Fixed |

3. Sinusoidal PE — Signal Processing View

Formula (Original Paper)

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $$

Each pair of dimensions $(2i, 2i+1)$ traces a sine/cosine wave with its own frequency $\omega_i = 10000^{-2i/d}$
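A quick way to see the frequency schedule (a small sketch, not from the paper): the per-pair frequencies form a geometric progression from $1$ down to $1/10000$.

import torch

d = 8
i = torch.arange(0, d, 2).float()
omega = 1.0 / (10000 ** (i / d))   # frequency of dimension pair (2i, 2i+1)
print(omega)                        # 1.0, 0.1, 0.01, 0.001 for d=8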


4. Signal Processing Interpretation

import torch
import matplotlib.pyplot as plt
import numpy as np

def plot_sinusoidal_pe(d_model=16, max_pos=20):
    # Build the PE table: even dims get sin, odd dims get cos
    pos = torch.arange(max_pos).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)

    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)
    pe[:, 1::2] = torch.cos(pos * div_term)

    plt.figure(figsize=(12, 6))
    for dim in range(0, d_model, 2):
        # Label only the first few curves to keep the legend readable
        label = f"dim {dim}" if dim < 6 else "_nolegend_"
        plt.plot(pos.squeeze(), pe[:, dim], label=label)
    plt.legend()
    plt.xlabel("Position")
    plt.ylabel("PE Value")
    plt.title("Sinusoidal PE: Different Frequencies per Dimension")
    plt.grid(True, alpha=0.3)
    plt.show()

plot_sinusoidal_pe()

Low dims → slow waves → long-range patterns
High dims → fast waves → fine-grained local patterns
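To check this reading, take the FFT of a single dimension across positions; each curve is (nearly) a pure tone. A small NumPy sketch (the peak bin depends on max_pos):

# Each PE dimension, viewed as a signal over positions, has one dominant frequency
max_pos, d_model = 256, 16
pos = np.arange(max_pos)[:, None]
omega = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
pe_even = np.sin(pos * omega)                   # even dimensions only

spectrum = np.abs(np.fft.rfft(pe_even[:, 0]))   # dim 0: omega = 1 rad/step
print("dominant bin for dim 0:", np.argmax(spectrum))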


5. Fourier Basis: Why It Works

Any smooth function can be represented as a sum of sines and cosines
The PE wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$, spanning a rich frequency space

# Build a PE table to probe (same construction as above, more positions)
max_pos, d_model = 200, 16
pos = torch.arange(max_pos).unsqueeze(1).float()
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * -torch.log(torch.tensor(10000.0)) / d_model
)
pe = torch.zeros(max_pos, d_model)
pe[:, 0::2] = torch.sin(pos * div_term)
pe[:, 1::2] = torch.cos(pos * div_term)

# The dot product between PE vectors depends on the relative offset,
# not on the absolute position, and peaks at offset 0
pos_i = 100
correlations = []
for offset in range(-10, 11):
    if 0 <= pos_i + offset < max_pos:
        corr = torch.dot(pe[pos_i], pe[pos_i + offset])
        correlations.append((offset, corr.item()))

offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()

The model can read off relative position from a dot product alone!
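Why? For each frequency, $\sin(i\,\omega_k)\sin(j\,\omega_k) + \cos(i\,\omega_k)\cos(j\,\omega_k) = \cos((i-j)\,\omega_k)$, so

$$ PE_i \cdot PE_j = \sum_{k} \cos\big((i-j)\,\omega_k\big) $$

The dot product depends only on the offset $i-j$, not on the absolute positions.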


6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash

Idea: Similar positions → similar PE vectors

from sklearn.metrics.pairwise import cosine_similarity

pos1, pos2 = 100, 105
pe1 = pe[pos1].unsqueeze(0)
pe2 = pe[pos2].unsqueeze(0)
sim = cosine_similarity(pe1.numpy(), pe2.numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {sim:.3f}")  # ~0.999

LSH property: nearby positions hash to similar vectors. Since all PE vectors have the same norm, $\text{sim}(PE_i, PE_j)$ is a function of the offset $|i-j|$ alone, falling off (with mild oscillation) as the offset grows.
Attention can infer distance without explicit position IDs
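A quick check of the decay, reusing the 200-position pe table built in section 5 (offsets chosen arbitrarily):

# Cosine similarity falls off as the positional offset grows
anchor = 100
for d_off in (1, 5, 20, 50):
    a, b = pe[anchor], pe[anchor + d_off]
    sim = torch.dot(a, b) / (a.norm() * b.norm())
    print(f"|i-j| = {d_off:2d} -> cos sim = {sim:.3f}")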


7. Learned Positional Encoding

import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        # One trainable vector per position, like a token embedding table
        self.pe = nn.Embedding(max_len, d_model)
        nn.init.normal_(self.pe.weight, std=0.02)
        
    def forward(self, x):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        return x + self.pe(pos)
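Usage is a drop-in addition to the token embeddings; the sizes below are arbitrary:

pe = LearnedPositionalEncoding(d_model=16, max_len=128)
x = torch.randn(2, 10, 16)   # (batch, seq_len, d_model)
print(pe(x).shape)           # torch.Size([2, 10, 16])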

8. Sinusoidal vs Learned: Ablation Study

import torch.optim as optim

def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
    # model_cls is assumed to be the TransformerBlock from the attention
    # module, which applies PE internally and returns (output, attn_weights)
    model = nn.Sequential(
        nn.Embedding(10, 16),
        model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
        nn.Linear(16, 10)
    )
    opt = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    
    losses = []
    for epoch in range(300):
        src = torch.randint(0, 5, (32, max_len))
        tgt = src.clone()
        
        # Call each stage manually: the transformer block returns a tuple
        logits = model[0](src)           # token embeddings
        logits = model[1](logits)[0]     # transformer block output
        logits = model[2](logits)        # per-token logits
        
        loss = criterion(logits.view(-1, 10), tgt.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    
    return losses

# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)

plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()

Result:

  • Sinusoidal: Faster convergence, better generalization
  • Learned: Can overfit to training length

9. Extrapolation Test: Can It Handle Longer Sequences?

# Train on max_len=20
model_sine = ...  # trained with sinusoidal
model_learned = ...  # trained with learned (max_len=20)

# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
    out_sine = model_sine(long_seq)
    # out_learned → IndexError! (Embedding size = 20)

Sinusoidal: Works for any length
Learned: Limited to training length
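The same failure mode, self-contained (a sketch using the two classes from the Full Code section at the end; sizes chosen for illustration):

sine = SinusoidalPositionalEncoding(d_model=16, max_len=5000)
learned = LearnedPositionalEncoding(d_model=16, max_len=20)

x_long = torch.randn(1, 50, 16)   # length 50 > the learned table's 20 rows
print(sine(x_long).shape)          # works: torch.Size([1, 50, 16])
try:
    learned(x_long)
except IndexError as e:
    print("Learned PE fails beyond max_len:", e)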


10. Hashing Analogy: PE as Embedding Hash

| Concept | Sinusoidal PE | Learned PE |
| --- | --- | --- |
| Hash function | $\sin(pos \cdot \omega_i)$ | $E[pos]$ (table lookup) |
| Collisions | Smooth | Discrete |
| Range | $\mathbb{R}$ | $\mathbb{R}^d$ |
| Collision probability | Decays with $\lvert i-j \rvert$ | Zero unless $i = j$ |

Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)


11. Advanced: Rotary Positional Embedding (RoPE)

Used in LLaMA and PaLM; encodes relative position through rotations

def rotate_half(x):
    # (x1, x2) -> (-x2, x1) on the last dimension
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_emb(q, k, cos, sin):
    # Standard rotate-half formulation (GPT-NeoX / LLaMA style)
    # q, k: (B, H, N, d_k); cos, sin: (N, d_k), broadcast over B and H
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

Encodes absolute position as a rotation angle in the complex plane; after rotation, $q \cdot k$ depends only on the relative offset between the two positions
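To drive it, a precomputed angle table is needed; rope_tables below is a hypothetical helper, not part of the original snippet:

def rope_tables(n, d_k, base=10000.0):
    # One angle per (position, frequency); duplicated to cover both halves
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))
    angles = torch.outer(torch.arange(n).float(), inv_freq)  # (N, d_k/2)
    angles = torch.cat([angles, angles], dim=-1)             # (N, d_k)
    return angles.cos(), angles.sin()

q = torch.randn(1, 4, 8, 16)   # (B, H, N, d_k)
k = torch.randn(1, 4, 8, 16)
cos, sin = rope_tables(n=8, d_k=16)
q_rot, k_rot = apply_rotary_emb(q, k, cos, sin)
print(q_rot.shape)  # torch.Size([1, 4, 8, 16])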


12. Summary Table

| Feature | Sinusoidal | Learned | RoPE |
| --- | --- | --- | --- |
| Learnable | No | Yes | No |
| Max Length | Infinite | Fixed | Infinite |
| Relative Pos | Yes (via dot product) | No | Yes (explicit) |
| Signal Theory | Fourier basis | Arbitrary | Rotation |
| Hashing | LSH | Perfect | Geometric |
| Used In | Original Transformer | GPT-2, BERT | LLaMA, PaLM |

13. Visualization: PE Heatmap

import seaborn as sns

# Instantiate the two modules from the Full Code section below
pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()

14. Practice Exercises

  1. Fourier Analysis: Compute FFT of PE across positions.
  2. Hash Collision: Measure cosine sim for $ |i-j| = 1, 5, 10 $.
  3. Ablation: Train without PE → accuracy drops to ~10%.
  4. Hybrid: Use sinusoidal + learned (T5-style).
  5. RoPE: Implement and compare with sinusoidal.

15. Key Takeaways

✅ Sinusoidal PE = Fourier basis + LSH
✅ Learned PE = flexible but length-limited
✅ Relative position emerges from the dot product
✅ Sinusoidal generalizes to any length
✅ RoPE = modern geometric alternative

Full Code: Sinusoidal vs Learned

import torch
import torch.nn as nn

# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)   # even dims
        pe[:, 1::2] = torch.cos(pos * div)   # odd dims
        # Buffer, not Parameter: moves with the module but is never trained
        self.register_buffer('pe', pe.unsqueeze(0))
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

# === Learned ===
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)
    def forward(self, x):
        # Raises IndexError if seq_len > max_len
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.pe(pos)
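A quick shape check for both modules (arbitrary sizes):

x = torch.randn(2, 12, 32)  # (batch, seq_len, d_model)
print(SinusoidalPositionalEncoding(32)(x).shape)  # torch.Size([2, 12, 32])
print(LearnedPositionalEncoding(32)(x).shape)     # torch.Size([2, 12, 32])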

Final Words

Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.

You now understand:

  • Why sinusoidal works
  • Why learned fails to extrapolate
  • How relative position emerges
  • Modern RoPE alternative

End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!