Yannic
Source video: Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained)
Removing the softmax over the attention scores means that not all intermediate results have to be held in memory while waiting for the softmax normalization.
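A minimal numpy sketch of this point (names and shapes are my own, not from the video): without the softmax, the three matrix products can be regrouped so only a small d x d buffer is needed instead of the full score matrix.

```python
import numpy as np

# Without softmax, attention is just three matrix products, so it can be
# regrouped: (Q @ K.T) @ V == Q @ (K.T @ V).  K.T @ V is a small d x d
# running sum, so the T x T score matrix never has to be held for a later
# softmax normalization.
T, d = 6, 4
Q, K, V = (np.random.randn(T, d) for _ in range(3))
scores_first = (Q @ K.T) @ V   # builds a T x T score matrix
state_first  = Q @ (K.T @ V)   # only ever builds a d x d buffer
assert np.allclose(scores_first, state_first)
```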
-
RetNet is a kind of linear transformer, like RWKV.
-
A recurrent network is trained one token at a time: once the next word has been predicted, backpropagation has to run back through the previous hidden states to optimize them.
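A toy PyTorch sketch (my own, not from the video) of why this is sequential: each hidden state depends on the previous one, so the time loop cannot be parallelized and the loss gradient flows back through every earlier hidden state.

```python
import torch

# toy RNN step: h_t depends on h_{t-1}, so the forward pass is sequential
U = torch.randn(4, 8, requires_grad=True)   # input projection
W = torch.randn(8, 8, requires_grad=True)   # recurrent weights
x = torch.randn(16, 4)                      # 16 time steps, feature dim 4
h = torch.zeros(8)
for t in range(x.shape[0]):                 # cannot be run in parallel over t
    h = torch.tanh(x[t] @ U + h @ W)
h.sum().backward()                          # gradients reach U and W through
                                            # all previous hidden states (BPTT)
```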
-
A recurrent network cannot be trained in parallel because of the non-linear activation function: the nested composition

$$ G(c \cdot G(b \cdot G(a x + \gamma) + \gamma) + \gamma) $$

cannot be collapsed into a single matrix multiplication.
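A small numpy check (my own illustration) that the non-linearity is what blocks the collapse: with G the chain must be evaluated step by step, while the same chain without G folds into a single affine map.

```python
import numpy as np

a, b, c, gamma = 2.0, 3.0, 0.5, 0.1
x = np.linspace(-1.0, 1.0, 5)
G = np.tanh

# with the non-linearity: has to be evaluated step by step
nested = G(c * G(b * G(a * x + gamma) + gamma) + gamma)

# without it, the three steps collapse into one affine map w * x + bias
chained = c * (b * (a * x + gamma) + gamma) + gamma
w, bias = c * b * a, c * b * gamma + c * gamma + gamma
assert np.allclose(chained, w * x + bias)
```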
-
The hidden state is a shared buffer: it contains all the previous information, so the memory cost stays constant during training. Each step overwrites the buffer in place:

$$
\begin{aligned}
\gamma &\leftarrow ax + \gamma \\
\gamma &\leftarrow by + \gamma \\
\gamma &\leftarrow cz + \gamma
\end{aligned}
$$
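A tiny sketch of the constant-memory property (names are mine): one fixed-size buffer is updated in place, so memory does not grow with sequence length.

```python
import numpy as np

state = np.zeros(8)                       # the shared buffer ("gamma" above)
steps = np.random.randn(1000, 8)          # a long sequence of inputs
coeffs = np.random.randn(1000)            # one scalar weight per step (a, b, c, ...)
for w, x in zip(coeffs, steps):
    state = w * x + state                 # new value replaces the old buffer
print(state.shape)                        # (8,) -- independent of sequence length
```

-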
The Transformer cannot be made recurrent because of the softmax, which requires all the attention scores (its “hidden states”) to be kept rather than discarded.
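In symbols (a sketch that ignores scaling and the decay term): the softmax denominator couples every score for position n, whereas the softmax-free sum folds into a fixed-size state.

$$
\begin{aligned}
\text{with softmax: } o_n &= \sum_{m \le n} \frac{\exp(q_n \cdot k_m)}{\sum_{j \le n} \exp(q_n \cdot k_j)}\, v_m \qquad \text{(every } k_m, v_m \text{ must be kept)} \\
\text{without softmax: } o_n &= q_n \sum_{m \le n} k_m^\top v_m = q_n S_n, \qquad S_n = S_{n-1} + k_n^\top v_n \quad \text{(fixed-size state)}
\end{aligned}
$$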
-
RetNet achieves training parallelism through matrix multiplication, like a Linear layer.
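A minimal sketch of that parallel form (single head, no normalization; `D` is the decay mask described in the next note):

```python
import numpy as np

def parallel_retention(Q, K, V, D):
    # whole sequence at once: two matrix products plus an element-wise mask,
    # so training parallelizes over tokens just like a Linear layer would
    return (Q @ K.T * D) @ V
```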
-
A time-scaling (decay) mask replaces the causal mask (which blocks subsequent tokens when doing attention in parallel).
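A sketch of such a mask, assuming a single scalar decay `gamma`: it is lower-triangular like a causal mask, but filled with powers of the decay instead of ones.

```python
import numpy as np

def decay_mask(T, gamma):
    # D[n, m] = gamma ** (n - m) for n >= m (older tokens decayed), 0 otherwise
    n = np.arange(T)
    return np.tril(gamma ** (n[:, None] - n[None, :]))

print(decay_mask(4, 0.9))
```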
-
RetNet's chunkwise form is a trade-off between the recurrent and parallel forms.
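A sketch of that trade-off (single head, no normalization, scalar decay; function and variable names are mine): each chunk is processed in parallel with the decay mask, while information from earlier chunks enters through a small recurrent state, so memory is bounded by the chunk size.

```python
import numpy as np

def chunkwise_retention(Q, K, V, gamma, chunk=4):
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))                     # cross-chunk recurrent state
    out = np.zeros_like(V, dtype=float)
    j = np.arange(chunk)
    D = np.tril(gamma ** (j[:, None] - j[None, :]))   # within-chunk decay mask
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        B = q.shape[0]
        inner = (q @ k.T * D[:B, :B]) @ v                    # parallel within the chunk
        cross = (gamma ** (j[:B] + 1))[:, None] * (q @ S)    # contribution of past chunks
        out[start:start+B] = inner + cross
        # fold the chunk into the state with the right decay per position
        S = gamma ** B * S + k.T @ ((gamma ** (B - 1 - j[:B]))[:, None] * v)
    return out
```

With `chunk=1` this reduces to the purely recurrent form; with `chunk=T` it is the fully parallel form.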
秋刀鱼
Source video: 【论文速览】 RetNet: A Successor to Transformer for Large Language Models (arXiv 2307.08621)
Equation explanations and a code walkthrough.
- A global state is maintained, like in a recurrent network.
- With that, the attention-like equation over Q, K, V is expanded (see the sketch after this list).
- Apply a singular value decomposition.
- ….
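A hedged sketch of the derivation those bullets refer to, written in the RetNet paper's notation ($s_n$ is the global state, $A$ the state-transition matrix): unrolling the recurrence expands the output into an attention-like sum over Q, K, V, and decomposing $A$ (the paper diagonalizes it; the video calls this step a singular decomposition) reduces $A^{n-m}$ to a scalar decay $\gamma^{n-m}$ times a rotation that can be absorbed into the projections.

$$
\begin{aligned}
s_n &= A\, s_{n-1} + K_n^\top v_n, \qquad o_n = Q_n s_n \\
o_n &= \sum_{m=1}^{n} Q_n A^{\,n-m} K_m^\top v_m \\
A &= \Lambda\, \mathrm{diag}(\gamma e^{i\theta})\, \Lambda^{-1}
\;\Longrightarrow\;
o_n = \sum_{m=1}^{n} \gamma^{\,n-m} \big(Q_n e^{in\theta}\big)\big(K_m e^{im\theta}\big)^{\dagger} v_m
\end{aligned}
$$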