watch: Transf - Nick 10 | Self-attention Mechanism

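A value is a d-dimensional vector; consider n such values:

The goal of self-attention is for each value to carry information from all values. Each value becomes a weighted sum of all values, with larger weights on similar parts and smaller weights on unrelated parts.

The weights are the similarities between values, obtained by taking the inner product of each value with every value. A large inner product means the angle between the two values is small (for vectors of comparable length), so their meanings are similar. If the values are mutually orthogonal, as with one-hot encodings, only the product of a value with itself is 1 and all others are 0.

In matrix form this is "the matrix times its own transpose", (n,d)⋅(n,d)ᵀ→(n,n), which gives an n×n square matrix of attention scores.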
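The pairwise inner products can be checked by hand on a tiny case. The following is a minimal sketch in plain Python; the helper names `dot` and `similarity_matrix` and the dense vectors `a`, `b`, `c` are illustrative assumptions, not part of the original notes.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_matrix(values):
    """Return the (n, n) matrix of pairwise inner products, i.e. X times X transposed."""
    return [[dot(u, v) for v in values] for u in values]

# Three one-hot (mutually orthogonal) vectors: only the diagonal is non-zero.
one_hot = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(similarity_matrix(one_hot))  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Hypothetical dense vectors: a and b point in similar directions, c is orthogonal to a.
a, b, c = [1.0, 2.0], [2.0, 3.0], [-2.0, 1.0]
scores = similarity_matrix([a, b, c])
print(scores[0])  # similarities of a with a, b, c
```

The score matrix is symmetric because the inner product of u with v equals that of v with u.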

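Placing this mechanism into the attention operation, the queries and keys are both equal to the values.

~~But these weights never change, because the queries always equal the keys, so the distribution obtained by applying softmax to the covariance matrix is always the same?~~ ~~doubt: experiment: multiply each of two matrices by its own transpose, apply softmax, and compare the two results~~

If the values could only adjust themselves, reaching the vector representation the task needs might be too slow, so a linear transformation is added to each of q, k, and v. Backpropagation also updates the weights of these linear layers, so that the new word vector moves to an accurate position faster, "thereby improving the model's fitting capacity" ¹.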
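One way to see what the extra linear layers add: without them the score matrix is X·Xᵀ, which is always symmetric and fully determined by X, whereas with separate query and key projections the scores depend on learnable weights and need not be symmetric. A small sketch under assumed values follows (the matrices `W_q` and `W_k` are hypothetical, and the helpers `matmul` and `transpose` are written here only for illustration).

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

X = [[1.0, 0.0], [0.0, 1.0]]      # two hypothetical input vectors
W_q = [[1.0, 2.0], [0.0, 1.0]]    # hypothetical query projection
W_k = [[1.0, 0.0], [3.0, 1.0]]    # hypothetical key projection

raw = matmul(X, transpose(X))          # X X^T: symmetric, fixed by X alone
Q, K = matmul(X, W_q), matmul(X, W_k)
projected = matmul(Q, transpose(K))    # (X W_q)(X W_k)^T: depends on the weights

print(raw)        # [[1.0, 0.0], [0.0, 1.0]]
print(projected)  # [[1.0, 5.0], [0.0, 1.0]]
```

Here token 1 attends strongly to token 2 (score 5) while token 2 does not attend back (score 0), something the unprojected scores cannot express.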

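In self-attention every input takes an inner product with all inputs, without regard to the order of the inputs, so the order information of the original text is lost. This is why positional encoding is needed ¹.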
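The loss of order can be observed directly: shuffling the input tokens only shuffles the output rows in the same way, so each token receives the same vector wherever it sits. The sketch below assumes the simplest setting, Q = K = V = X with no learned projections; the sequence `X` and the permutation `perm` are made up for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Plain self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # hypothetical 3-token sequence
perm = [2, 0, 1]                           # a reordering of the tokens
X_shuffled = [X[i] for i in perm]

out = self_attention(X)
out_shuffled = self_attention(X_shuffled)

# Row i of the shuffled output should match row perm[i] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)  # True
```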

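Example:

Take two words, thinking and machines. Multiplying each by Wq, Wk, Wv gives the linearly transformed queries (q1, q2), keys (k1, k2), and values (v1, v2).

s11 = q1 · k1, s12 = q1 · k2. Then s11 and s12 are scaled and passed through softmax to get two weights s11' (0.88) and s12' (0.12), and z1 = s11' × v1 + s12' × v2 is the new vector representation of thinking.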
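The arithmetic of this example can be reproduced in a few lines. The notes give only the final weights, so the raw scores (112 and 96), the key dimension (64), and the value vectors below are assumptions chosen so that the weights come out as 0.88 and 0.12.

```python
import math

d_k = 64                      # assumed key dimension
s11, s12 = 112.0, 96.0        # assumed raw scores q1·k1 and q1·k2

# Scale by sqrt(d_k), then normalize with softmax.
scaled = [s / math.sqrt(d_k) for s in (s11, s12)]      # [14.0, 12.0]
exps = [math.exp(s - max(scaled)) for s in scaled]
w11, w12 = (e / sum(exps) for e in exps)
print(round(w11, 2), round(w12, 2))                    # 0.88 0.12

# Hypothetical value vectors; z1 is their weighted sum.
v1, v2 = [1.0, 0.0], [0.0, 1.0]
z1 = [w11 * a + w12 * b for a, b in zip(v1, v2)]
print([round(x, 2) for x in z1])                       # [0.88, 0.12]
```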

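For thinking, the initial word vector (e.g. one-hot or ELMo) is x1. With a one-hot encoding, x1 is orthogonal to the vectors of the other words: it is unrelated to them and contains none of their information. After self-attention over the two words thinking machines, the new word vector for thinking carries information from machines. How much? The part of machines that is similar to thinking. Put differently, the new word vector encodes which word in the sentence thinking machines matters more for thinking.

The new word vector contains information from every word in the sentence, with more important words contributing a larger share.

Difference between attention and self-attention:

Multiplying Q, K, and V is attention. It is just an operation and does not specify where Q, K, and V come from. A query Q is used to find the important parts of V: Q times K gives the similarity S, and S times V gives a new vector representation of V (for word vectors, one that contains syntactic/semantic features). Q can be anything and V can be anything; K usually comes from the same source as V.

Self-attention requires Q, K, and V to come from the same source: they are the same X multiplied by different weight matrices, i.e. different linear transformations that separate them in space. X is a word vector, but the information it expresses may not be accurate enough. Backpropagation adjusts the weight matrices and moves the projections to suitable positions, where the word vectors can accurately assign importance to each part for the task at hand.

Cross-attention: Q and V come from different sources, but K and V come from the same source.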
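The distinction is only about where the arguments come from; the operation itself is the same. As a rough sketch, assuming scaled dot-product attention and omitting the learned projections for brevity (the function `attention` and the sequences `X` and `Y` are illustrative, not from the original notes):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical source sequence, n = 3
Y = [[0.5, 0.5], [2.0, 0.0]]               # hypothetical other sequence, m = 2

self_out = attention(X, X, X)     # self-attention: Q, K, V all come from X
cross_out = attention(Y, X, X)    # cross-attention: Q from Y; K and V from X
print(len(self_out), len(cross_out))   # 3 2
```

The output has one row per query, so cross-attention returns as many vectors as there are queries, each one a mixture of the values from the other source.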

Self-Attention implementation in PyTorch ¹

from math import sqrt
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
  def __init__(self, input_dim, dim_k, dim_v):
    '''
    Inputs:
        input_dim: dim of input feature vector
        dim_k: dim of key is same as query, because they do dot product
        dim_v: dim of value, suitable for task
    '''
    super(SelfAttentionLayer, self).__init__()
    self.q = nn.Linear(input_dim, dim_k)
    self.k = nn.Linear(input_dim, dim_k)
    self.v = nn.Linear(input_dim, dim_v)
    self._norm_fact = 1/sqrt(dim_k)

  def forward(self, x):
    '''
    Input:
        x: (batch_size, seq_len, input_dim)
    '''
    Q = self.q(x)   # (batch_size, seq_len, dim_k)
    K = self.k(x)   # (batch_size, seq_len, dim_k)
    V = self.v(x)   # (batch_size, seq_len, dim_v)
    
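    # Scale the scores by 1/sqrt(dim_k) before the softmax, not after it
    scores = torch.bmm(Q, K.permute(0, 2, 1)) * self._norm_fact  # (bs, seq_len, seq_len)
    atten = torch.softmax(scores, dim=-1)   # each row sums to 1
    output = torch.bmm(atten, V)    # softmax(Q K^T / sqrt(dim_k)) V, (bs, seq_len, dim_v)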

    return output

if __name__ == '__main__':
  X = torch.randn(4,3,2)
  self_atten = SelfAttentionLayer(input_dim=2, dim_k=4, dim_v=5)
  res = self_atten(X)
  print(res.size())

Multi-head self-attention:

class SelfAttentionMultiHead(nn.Module):
  def __init__(self, input_dim, dim_k, dim_v, num_heads):
    super(SelfAttentionMultiHead, self).__init__()
    assert dim_k % num_heads == 0   # dim_k is divided by num_heads
    assert dim_v % num_heads == 0
    self.q = nn.Linear(input_dim, dim_k)
    self.k = nn.Linear(input_dim, dim_k)
    self.v = nn.Linear(input_dim, dim_v)

    self.num_heads = num_heads
    self.dim_k = dim_k
    self.dim_v = dim_v
    self._norm_fact = 1/sqrt(dim_k // num_heads)   # scale by the per-head key dim

  def forward(self, x):
    '''
    Input:
        x: (batch_size, seq_len, input_dim)
    '''
    bs, seq_len, _ = x.shape
    # After the linear layers, split the last dim into heads and move the head
    # axis forward: (bs, num_heads, seq_len, dim_k // num_heads)
    Q = self.q(x).reshape(bs, seq_len, self.num_heads, -1).transpose(1, 2)
    K = self.k(x).reshape(bs, seq_len, self.num_heads, -1).transpose(1, 2)
    V = self.v(x).reshape(bs, seq_len, self.num_heads, -1).transpose(1, 2)

    # softmax(Q K^T / sqrt(head_dim)), computed independently for each head
    scores = torch.matmul(Q, K.transpose(-2, -1)) * self._norm_fact
    atten = torch.softmax(scores, dim=-1)   # (bs, num_heads, seq_len, seq_len)

    # Weighted sum of V, then merge the heads back into the last dim
    output = torch.matmul(atten, V).transpose(1, 2).reshape(bs, seq_len, -1)   # (bs, seq_len, dim_v)

    return output

Self-Attention - TF ²

import tensorflow as tf

def single_head_attention(z, dk, dv):   # z: [None, n, dm]; dk, dv: projection sizes

  Q = tf.keras.layers.Dense(units = dk)(z)  # Q: [None, n, dk]
  K = tf.keras.layers.Dense(units = dk)(z)
  V = tf.keras.layers.Dense(units = dv)(z)

  score = tf.matmul(Q, K, transpose_b = True)/ tf.sqrt(dk * 1.0)  # score : [None, n, n]

  W = tf.nn.softmax(score, axis = -1)  # W: [None, n, n]

  H = tf.matmul(W, V)   # H: [None, n, dv]
  return H

Multi-head self-attention ³

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, dk, dv, num_heads):
    '''
    Inputs:
        dk: dim of queries and keys after the linear layer
        dv: dim of values after the linear layer
        num_heads: dk and dv are each split into num_heads parts
    '''
    super(MultiHeadAttention, self).__init__()
    assert dk % num_heads == 0
    assert dv % num_heads == 0

    self.dk = dk
    self.dv = dv
    self.num_heads = num_heads

    self.wq = tf.keras.layers.Dense(dk)
    self.wk = tf.keras.layers.Dense(dk)
    self.wv = tf.keras.layers.Dense(dv)

  def call(self, x):
    '''
    x: a batch of sequences that need to do self-attention (batch_size, seq_len, d_input)
    '''
    seq_len = tf.shape(x)[1]

    q = self.wq(x)  # (batch_size, seq_len, dk)
    k = self.wk(x)  # (batch_size, seq_len, dk)
    v = self.wv(x)  # (batch_size, seq_len, dv)
    
    # Split the last dimension and transpose to (bs, num_heads, seq_len, dk // num_heads)
    Q = tf.transpose(tf.reshape(q, (-1, seq_len, self.num_heads, self.dk//self.num_heads)), perm=[0,2,1,3])
    K = tf.transpose(tf.reshape(k, (-1, seq_len, self.num_heads, self.dk//self.num_heads)), perm=[0,2,1,3])
    V = tf.transpose(tf.reshape(v, (-1, seq_len, self.num_heads, self.dv//self.num_heads)), perm=[0,2,1,3])

    # Attention
    score = tf.matmul(Q, K, transpose_b=True)/ tf.sqrt(self.dk // self.num_heads * 1.0) # scale by the "dk" of one head
    attention_weights = tf.keras.activations.softmax(score)
    scaled_attention = tf.matmul(attention_weights, V)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, d_subspace)

    concat_attention = tf.reshape(scaled_attention, (-1, seq_len, self.dv))  # (batch_size, seq_len_q, dv)

    return concat_attention, attention_weights

Refer:
