旋转位置编码（Rotary Position Embedding, RoPE）

旋转位置编码（Rotary Position Embedding, RoPE）是一种通过旋转矩阵将位置信息融入词向量的方法，其核心在于利用几何旋转操作使注意力分数自然携带相对位置信息。以下是其技术细节的完整解析：

核心思想：
将词向量视为复数空间中的向量，通过旋转操作改变其方向但不改变模长（正交变换），使位置信息编码在向量夹角中。注意力分数（点积）仅与相对位置相关： $$ \langle \text{RoPE}(\mathbf{q}_m, m), \text{RoPE}(\mathbf{k}_n, n) \rangle = g(\mathbf{q}, \mathbf{k}, n-m) $$
复数与旋转等价性：
复数乘法 $e^{i\theta}$ 等价于二维旋转矩阵： $$ e^{i m\theta} = \cos(m\theta) + i \sin(m\theta) \quad \Leftrightarrow \quad \mathbf{R}_m = \begin{bmatrix} \cos m\theta & -\sin m\theta \ \sin m\theta & \cos m\theta \end{bmatrix} $$ 旋转保持向量模长不变（$|\mathbf{R}_m \mathbf{v}| = |\mathbf{v}|$），避免语义信息被干扰。

对维度 $d$ 的词向量 $\mathbf{x}$，划分为 $d/2$ 个二维子向量： $$ \mathbf{x} = [x_0, x_1, \dots, x_{d-1}] \quad \rightarrow \quad \mathbf{x}^{(i)} = [x_{2i}, x_{2i+1}], \quad i \in [0, d/2-1] $$ 每个子空间独立旋转。

频率参数 $\theta_i$（指数衰减设计）： $$ \theta_i = 10000^{-2i/d}, \quad i \in [0, d/2-1] $$ 低频（高 $i$）捕获长周期位置信息，高频（低 $i$）捕获短周期信息。

对位置 $m$，第 $i$ 个子空间的旋转矩阵： $$ \mathbf{R}_m^{(i)} = \begin{bmatrix} \cos m\theta_i & -\sin m\theta_i \ \sin m\theta_i & \cos m\theta_i \end{bmatrix} $$
高维旋转矩阵 $\mathbf{R}_m$ 为块对角矩阵： $$ \mathbf{R}m = \bigoplus{i=0}^{d/2-1} \mathbf{R}_m^{(i)} = \begin{bmatrix} \mathbf{R}_m^{(0)} & \mathbf{0} & \cdots \ \mathbf{0} & \mathbf{R}_m^{(1)} & \cdots \ \vdots & \vdots & \ddots \end{bmatrix} $$ 矩阵维度：$d \times d$。

对查询向量 $\mathbf{q}$ 和键向量 $\mathbf{k}$ 按位置旋转： $$ \mathbf{q}_m = \mathbf{R}_m \mathbf{q}, \quad \mathbf{k}_n = \mathbf{R}_n \mathbf{k} $$
子空间计算示例（第 $i$ 维）： $$ \begin{bmatrix} q_{m,2i}’ \ q_{m,2i+1}’ \end{bmatrix} = \begin{bmatrix} q_{2i} \cos m\theta_i - q_{2i+1} \sin m\theta_i \ q_{2i} \sin m\theta_i + q_{2i+1} \cos m\theta_i \end{bmatrix} $$ 实际代码通过复数乘法优化效率（见后文）。

旋转后的点积仅依赖相对位置 $n-m$： $$ \mathbf{q}_m^\top \mathbf{k}_n = \mathbf{q}^\top \mathbf{R}_m^\top \mathbf{R}n \mathbf{k} = \mathbf{q}^\top \mathbf{R}{n-m} \mathbf{k} $$ 因 $\mathbf{R}_m^\top \mathbf{R}n = \mathbf{R}{n-m}$（旋转矩阵正交性与乘法性质）。

由旋转矩阵性质： $$ \mathbf{R}m^\top \mathbf{R}n = \mathbf{R}{-m} \mathbf{R}n = \mathbf{R}{n-m} $$ 点积转化为： $$ \mathbf{q}^\top \mathbf{R}{n-m} \mathbf{k} = \sum_{i=0}^{d/2-1} \mathbf{q}^{(i)\top} \mathbf{R}_{n-m}^{(i)} \mathbf{k}^{(i)} $$ 每个子空间贡献与 $n-m$ 相关的旋转量。

词向量对 $\mathbf{x}^{(i)} = [x_{2i}, x_{2i+1}]$ 可表示为复数 $z_i = x_{2i} + i x_{2i+1}$。
旋转操作等价于复数乘法： $$ z_i’ = z_i \cdot e^{i m\theta_i} = (x_{2i} + i x_{2i+1}) (\cos m\theta_i + i \sin m\theta_i) $$ 实部与虚部分解后与矩阵形式一致。

设 $\mathbf{q} = [1, 2]$, $\theta_i=0.5$, $m=1$：

旋转矩阵： $$ \mathbf{R}_1 = \begin{bmatrix} \cos 0.5 & -\sin 0.5 \ \sin 0.5 & \cos 0.5 \end{bmatrix} \approx \begin{bmatrix} 0.8776 & -0.4794 \ 0.4794 & 0.8776 \end{bmatrix} $$
旋转结果： $$ \mathbf{q}_m = \begin{bmatrix} 1 \times 0.8776 - 2 \times 0.4794 \ 1 \times 0.4794 + 2 \times 0.8776 \end{bmatrix} = \begin{bmatrix} -0.0812 \ 2.2366 \end{bmatrix} $$

RoPE通过几何变换将位置信息编码为向量方向，在数学优雅性与实际效果间取得平衡，已成为LLaMA、ChatGLM等主流模型的核心组件。