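Positional encoding cures the "order amnesia" of Transformer self-attention by injecting sequence position information into the model. From absolute encodings to RoPE, relative position modeling has become the mainstream approach.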

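Why positional encoding is needed

Self-attention is permutation-equivariant (often loosely called permutation-invariant): $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. Reordering the input sequence only reorders the attention weights and outputs in the same way; nothing in the computation depends on where a token sits.

Positional encoding breaks this symmetry: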

$$\text{Input}_{final} = \text{TokenEmbedding} + \text{PositionalEncoding}$$
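The claim can be checked on a toy example. The following is a minimal sketch, assuming a single attention head with no learned projections (Q = K = V = the inputs); the helper names and the toy position vectors are illustrative and not taken from any library.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Single-head attention with Q = K = V = xs (assumption: no projections)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def add_positions(xs, pos):
    # The position vector is chosen by slot index, not by token identity.
    return [[a + b for a, b in zip(x, p)] for x, p in zip(xs, pos)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three toy token embeddings
perm = [2, 0, 1]                                  # a reordering of the sequence
shuffled = [tokens[i] for i in perm]

# Without positions: the outputs of the shuffled sequence are the shuffled outputs.
out = self_attention(tokens)
equivariant = close(self_attention(shuffled), [out[i] for i in perm])

# With positions added: the same comparison no longer holds.
pos = [[0.0, 0.0], [0.5, -0.5], [1.0, -1.0]]      # hypothetical position vectors
out_pe = self_attention(add_positions(tokens, pos))
out_pe_shuffled = self_attention(add_positions(shuffled, pos))
still_equivariant = close(out_pe_shuffled, [out_pe[i] for i in perm])

print(equivariant, still_equivariant)
```

Once a position vector is added, each token's representation depends on its slot, so the model can tell the two orderings apart.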

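Comparison of positional encoding types

Type	Mechanism	Strengths	Weaknesses	Representative models
Learned absolute encoding	One learnable vector per position	Simple and effective	Cannot extrapolate beyond the trained length	BERT, GPT-2, ViT
Sinusoidal encoding	Generated by fixed trigonometric functions	Extrapolates in principle; needs no training	Quality degrades on very long sequences	Transformer (original)
Relative position encoding	Encodes the relative distance between elements	Performs well on long sequences	More complex to implement	T5, DeBERTa
RoPE	Rotates query/key vectors by a position-dependent angle	Elegant; good extrapolation	-	LLaMA, ChatGLM
ALiBi	Adds a linear bias to attention scores	Very simple; strong extrapolation	-	BLOOM, MPT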
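To make the ALiBi row concrete, here is a minimal sketch of the bias that is added to the attention scores before the softmax. It assumes the geometric slope schedule commonly used when the number of heads is a power of two, and it folds the causal mask into the same matrix for convenience; `alibi_slopes` and `alibi_bias` are illustrative names, not a library API.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Entry (i, j) is -slope * (i - j) for j <= i; future positions get -inf."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)          # one slope per attention head
bias = alibi_bias(4, slopes[0])   # bias matrix for the first head
for row in bias:
    print(row)
```

The penalty grows linearly with distance, and no position vector is ever added to the token embeddings.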

Sinusoidal positional encoding

Formula

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

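Key properties

Uniqueness: the combination of sinusoids at different frequencies gives every position a distinct encoding
Linear expression of relative position: $PE_{pos+k}$ can be obtained from $PE_{pos}$ by a rotation matrix that depends only on the offset $k$
Multi-scale representation: high-frequency dimensions encode local position, low-frequency dimensions encode global position

Derivation of the linear transformation (angle-sum identities), with $\phi = k / 10000^{2i/d_{model}}$: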

$$\begin{pmatrix} PE_{(pos+k, 2i)} \\ PE_{(pos+k, 2i+1)} \end{pmatrix} = \begin{pmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{pmatrix} \begin{pmatrix} PE_{(pos, 2i)} \\ PE_{(pos, 2i+1)} \end{pmatrix}$$
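The identity can be verified numerically. This is a minimal sketch using only the standard library; `sinusoidal_pe` and `shift` are illustrative helper names, and the values of `d_model`, `pos` and `k` are arbitrary.

```python
import math

def sinusoidal_pe(pos, d_model):
    """PE vector for one position, following the sin/cos formulas above."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Apply [[cos phi, sin phi], [-sin phi, cos phi]] to each (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        phi = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([math.cos(phi) * s + math.sin(phi) * c,
                    -math.sin(phi) * s + math.cos(phi) * c])
    return out

d_model, pos, k = 8, 5, 3
shifted = shift(sinusoidal_pe(pos, d_model), k, d_model)   # transform PE_pos
direct = sinusoidal_pe(pos + k, d_model)                   # compute PE_{pos+k} directly
max_err = max(abs(a - b) for a, b in zip(shifted, direct))
print(max_err)
```

The matrix inside `shift` depends on `k` and the dimension index only, never on `pos`, which is what lets a linear layer express a fixed offset.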

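RoPE: rotary position embedding

RoPE embeds position through a rotation in the complex plane: each pair of embedding dimensions is treated as a complex number and multiplied by a position-dependent phase: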

$$\mathbf{P}_m = \mathbf{E}_m \cdot e^{im\theta}$$

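Key advantage: the attention score depends on position only through the relative offset $(m-n)$. Treating $q$ and $k$ as complex numbers:

$$\langle q'_m, k'_n \rangle = \mathrm{Re}\left[ q\,\bar{k}\, e^{i(m-n)\theta} \right]$$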
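A quick numerical check of the relative-offset property, as a minimal sketch: one query/key pair is represented as Python complex numbers, and `rope_score` is a hypothetical helper rather than part of any library. The values of `q`, `k` and `theta` are arbitrary.

```python
import cmath

def rope_score(q, k, m, n, theta):
    """Real inner product of q rotated to position m and k rotated to position n."""
    q_m = q * cmath.exp(1j * m * theta)
    k_n = k * cmath.exp(1j * n * theta)
    return (q_m * k_n.conjugate()).real

q, k, theta = complex(1.0, 2.0), complex(0.5, -1.5), 0.3

near = rope_score(q, k, m=5, n=2, theta=theta)       # offset 3, early in the sequence
far = rope_score(q, k, m=105, n=102, theta=theta)    # offset 3, much later
closed_form = (q * k.conjugate() * cmath.exp(1j * 3 * theta)).real
print(near, far, closed_form)
```

All three values agree up to floating-point error, so shifting both positions by the same amount leaves the score unchanged.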

2D rotation

$$Rot_\theta(x) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}$$
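For a full vector, the same 2D rotation is applied to every pair of dimensions, each with its own frequency. The sketch below assumes adjacent dimensions are paired and uses the $10000^{-2i/d}$ frequency schedule; implementations differ in how they pair dimensions, and `rotate_pairs` is an illustrative name.

```python
import math

def rotate_pairs(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle pos * base^(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.extend([c * x0 - s * x1, s * x0 + c * x1])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 2.0, -0.5, 0.3]   # toy query vector
k = [0.2, -1.0, 0.7, 1.5]   # toy key vector

score_a = dot(rotate_pairs(q, 7), rotate_pairs(k, 4))     # positions 7 and 4
score_b = dot(rotate_pairs(q, 50), rotate_pairs(k, 47))   # positions 50 and 47
norm_before = dot(q, q)
norm_after = dot(rotate_pairs(q, 7), rotate_pairs(q, 7))
print(score_a, score_b, norm_before, norm_after)
```

Rotation preserves vector norms, and the two scores match because both pairs of positions are three steps apart.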

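Design principles

Uniqueness: every position gets a distinct encoding
Extrapolation: the encoding can handle sequences longer than those seen in training
Relative-position consistency: the encoding of a relative distance does not change sharply with absolute position
Efficiency: computing the encoding does not become a bottleneck

Code implementation

Sinusoidal positional encoding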

import torch
import torch.nn as nn
import math

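class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding added to batch-first inputs."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Precompute the table once: pe[pos, 2i] = sin(...), pe[pos, 2i+1] = cos(...).
        # d_model is assumed to be even so the sin and cos slices have equal width.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model), computed via exp/log.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), broadcasts over the batch

        # A buffer follows .to(device) and is saved in state_dict, but is not trained.
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len must not exceed max_len.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)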

RoPE implementation

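class RotaryEmbedding(nn.Module):
    """Computes RoPE rotation angles; it does not rotate the input itself."""

    def __init__(self, dim):
        super().__init__()
        # One frequency per 2-D pair: 1 / 10000^(2i / dim).
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        # Registering a buffer lets inv_freq follow the module across devices.
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, x):
        # x: (batch, seq_len, ...); only its sequence length and device are used.
        seq_len = x.shape[1]
        t = torch.arange(seq_len, device=x.device).type_as(self.inv_freq)
        freqs = torch.einsum('i,j->ij', t, self.inv_freq)  # (seq_len, dim / 2)
        # Duplicate the angles to width dim; callers take cos() and sin() of the
        # result and apply them to the query and key vectors.
        return torch.cat((freqs, freqs), dim=-1)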

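The debate over fusion methods

The traditional choice is additive fusion: $x_k + p_k$

Alternatives:

Concatenation: may break the continuity of the semantic space
Multiplication: $x_k \otimes p_k$, promising in theory but not yet widely validated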
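The three fusion options differ mainly in what they do to the shape and content of the vector. The sketch below uses plain Python lists with made-up values; reading the multiplicative variant as an element-wise product is an assumption, and the function names are illustrative.

```python
def fuse_add(x, p):
    """Additive fusion: same width, token and position share one space."""
    return [a + b for a, b in zip(x, p)]

def fuse_concat(x, p):
    """Concatenation: width grows, position gets its own dimensions."""
    return x + p

def fuse_multiply(x, p):
    """Element-wise product (assumed reading of the multiplicative variant)."""
    return [a * b for a, b in zip(x, p)]

x_k = [0.2, -1.0, 0.7, 1.5]    # toy token embedding
p_k = [0.0, 1.0, 0.84, 0.54]   # toy position vector

added = fuse_add(x_k, p_k)
concatenated = fuse_concat(x_k, p_k)
multiplied = fuse_multiply(x_k, p_k)
print(len(added), len(concatenated), len(multiplied))
```

Addition and multiplication keep the model width unchanged, while concatenation doubles it here; multiplication also zeroes any dimension where the position vector is zero.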
