watch: RetNet

Yannic

Source video: Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained)

Remove the softmax over the attention scores; then not all the intermediate results have to be held while waiting for the softmax.
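
A quick NumPy sketch of that point (mine, not from the video): once the softmax is dropped, the product can be re-associated, so the attention scores never have to be held.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                              # sequence length, head dimension
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# With softmax, every output row needs its scores against all keys to be
# normalized together, so the scores (and all keys/values) must be kept.
scores = np.exp(Q @ K.T)
out_softmax = (scores / scores.sum(-1, keepdims=True)) @ V

# Without softmax, matrix multiplication is associative:
#   (Q K^T) V  ==  Q (K^T V)
# so the T x T score matrix never has to be materialized; a d x d
# summary (K^T V) is enough, which is what makes a recurrent form possible.
out_scores_first = (Q @ K.T) @ V
out_state_first = Q @ (K.T @ V)
print(np.allclose(out_scores_first, out_state_first))   # True
```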

(Figure: Training Parallelism, Low-Cost Inference, and Strong Performance, compared across Transformer, Linear Transformer, Recurrent Network, and RetNet.)
  • RetNet is a kind of linear transformer, like RWKV.

  • A recurrent network trains on only one token at a time, because once the next word has been predicted, backpropagation has to run to optimize the previous hidden states.

    (Sketch: words, hidden states, backprop.)
  • A recurrent network cannot be trained in parallel because of the non-linear activation function (see the first sketch after this list):

    $$ G\bigl(c\,G\bigl(b\,G(ax+\gamma)+\gamma\bigr)+\gamma\bigr) $$

  • The hidden state is a shared buffer. It contains all the previous information, so the memory cost stays constant during training.

    $$ \begin{aligned} \gamma &\leftarrow ax+\gamma \\ \gamma &\leftarrow by + \gamma \\ \gamma &\leftarrow cz + \gamma \end{aligned} $$

  • The Transformer can’t be made recurrent because of the softmax, which requires all the attention scores (its “hidden states”) to be kept rather than discarded.

  • RetNet achieves training parallelism through matrix multiplication, like a Linear layer.

  • A time-decay mask replaces the causal mask (the mask that blocks the subsequent words when doing attention in parallel).

  • RetNet's chunkwise form is a trade-off between the recurrent and the parallel forms (see the retention sketch after this list).
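
Picking up the points above about the non-linearity and the shared buffer, here is a small sketch (mine, not from the video) contrasting a nonlinear recurrence, which has to run step by step, with a linear one, which unrolls into a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6
x = rng.normal(size=T)
gamma = 0.9

# Nonlinear recurrence: h_t = G(a*x_t + h_{t-1}).
# Each step needs the previous step's output, so the loop cannot be
# reordered into one big matrix product; it has to run sequentially.
h = 0.0
for t in range(T):
    h = np.tanh(0.5 * x[t] + h)

# Linear recurrence: s_t = gamma * s_{t-1} + x_t.
# Unrolling gives s_t = sum_m gamma^(t-m) * x_m, i.e. a single product
# with a lower-triangular decay matrix, so all steps run in parallel.
s_seq = np.zeros(T)
s = 0.0
for t in range(T):
    s = gamma * s + x[t]
    s_seq[t] = s

n = np.arange(T)
D = np.tril(gamma ** (n[:, None] - n[None, :]))   # D[t, m] = gamma^(t-m) for m <= t
s_par = D @ x
print(np.allclose(s_seq, s_par))                  # True
```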
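
And a minimal single-head sketch of the three retention forms the last bullets describe. It is a simplification: the paper's complex rotation of Q and K, per-head decay values, group norm, and gating are all left out; the point is only that the parallel, recurrent, and chunkwise computations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, gamma, B = 8, 4, 0.9, 4            # seq len, head dim, decay, chunk size (B divides T)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# Parallel form: Out = (Q K^T * D) V, where D is the decay mask,
# D[n, m] = gamma^(n-m) for n >= m and 0 otherwise (it plays the role
# of the causal mask, with an exponential time decay on top).
n = np.arange(T)
D = np.tril(gamma ** (n[:, None] - n[None, :]))
out_parallel = (Q @ K.T * D) @ V

# Recurrent form: one d x d state per head, updated token by token.
out_recurrent = np.zeros((T, d))
S = np.zeros((d, d))                      # the constant-size shared buffer
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

# Chunkwise form: parallel inside each chunk, recurrent across chunks.
out_chunk = np.zeros((T, d))
S = np.zeros((d, d))
j = np.arange(B)
D_inner = np.tril(gamma ** (j[:, None] - j[None, :]))
for c in range(0, T, B):
    Qc, Kc, Vc = Q[c:c+B], K[c:c+B], V[c:c+B]
    inner = (Qc @ Kc.T * D_inner) @ Vc                 # within the chunk
    cross = (gamma ** (j + 1))[:, None] * (Qc @ S)     # contribution of earlier chunks
    out_chunk[c:c+B] = inner + cross
    S = gamma ** B * S + Kc.T @ ((gamma ** (B - 1 - j))[:, None] * Vc)

print(np.allclose(out_parallel, out_recurrent),
      np.allclose(out_parallel, out_chunk))            # True True
```

The chunkwise loop is the trade-off in the last bullet: full parallel work inside each chunk, while only a d x d state is carried across chunks.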


秋刀鱼

Source video: [Paper Quick Overview] RetNet: A Successor to Transformer for Large Language Models (arXiv:2307.08621)

Equation explanations and a code walkthrough.

  1. A global state is maintained, like in a recurrent network.
  2. With that, expand the attention-style equation over Q, K, V.
  3. Apply singular value decomposition (the derivation chain is written out after this list).
  4. ….