watch: RetNet

Yannic

Source video: Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained)

Remove the softmax over the attention scores; then not all the intermediate results have to be held while waiting for the softmax.
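
A quick NumPy sketch of that point (mine, not from the video): once the softmax is dropped, the product can be re-associated, so the attention scores never have to be held.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                              # sequence length, head dimension
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# With softmax, every output row needs its scores against all keys to be
# normalized together, so the scores (and all keys/values) must be kept.
scores = np.exp(Q @ K.T)
out_softmax = (scores / scores.sum(-1, keepdims=True)) @ V

# Without softmax, matrix multiplication is associative:
#   (Q K^T) V  ==  Q (K^T V)
# so the T x T score matrix never has to be materialized; a d x d
# summary (K^T V) is enough, which is what makes a recurrent form possible.
out_scores_first = (Q @ K.T) @ V
out_state_first = Q @ (K.T @ V)
print(np.allclose(out_scores_first, out_state_first))   # True
```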

(Figure: Training Parallelism, Low-Cost Inference, and Strong Performance, compared across Transformer, Linear Transformer, Recurrent Network, and RetNet.)
  • RetNet is a kind of linear transformer, like RWKV.

  • A recurrent network trains on only one token at a time, because once the next word has been predicted, backpropagation has to run to optimize the previous hidden states.

    (Sketch: words, hidden states, backprop.)
  • A recurrent network cannot be trained in parallel because of the non-linear activation function (see the first sketch after this list):

    $$ G\bigl(c\,G\bigl(b\,G(ax+\gamma)+\gamma\bigr)+\gamma\bigr) $$

  • The hidden state is a shared buffer. It contains all the previous information, so the memory cost stays constant during training.

    $$ \begin{aligned} \gamma &\leftarrow ax+\gamma \\ \gamma &\leftarrow by + \gamma \\ \gamma &\leftarrow cz + \gamma \end{aligned} $$

  • The Transformer can’t be made recurrent because of the softmax, which requires all the attention scores (its “hidden states”) to be kept rather than discarded.

  • RetNet achieves training parallelism through matrix multiplication, like a Linear layer.

  • A time-decay mask replaces the causal mask (the mask that blocks the subsequent words when doing attention in parallel).

  • RetNet's chunkwise form is a trade-off between the recurrent and the parallel forms (see the retention sketch after this list).
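
Picking up the points above about the non-linearity and the shared buffer, here is a small sketch (mine, not from the video) contrasting a nonlinear recurrence, which has to run step by step, with a linear one, which unrolls into a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6
x = rng.normal(size=T)
gamma = 0.9

# Nonlinear recurrence: h_t = G(a*x_t + h_{t-1}).
# Each step needs the previous step's output, so the loop cannot be
# reordered into one big matrix product; it has to run sequentially.
h = 0.0
for t in range(T):
    h = np.tanh(0.5 * x[t] + h)

# Linear recurrence: s_t = gamma * s_{t-1} + x_t.
# Unrolling gives s_t = sum_m gamma^(t-m) * x_m, i.e. a single product
# with a lower-triangular decay matrix, so all steps run in parallel.
s_seq = np.zeros(T)
s = 0.0
for t in range(T):
    s = gamma * s + x[t]
    s_seq[t] = s

n = np.arange(T)
D = np.tril(gamma ** (n[:, None] - n[None, :]))   # D[t, m] = gamma^(t-m) for m <= t
s_par = D @ x
print(np.allclose(s_seq, s_par))                  # True
```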
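
And a minimal single-head sketch of the three retention forms the last bullets describe. It is a simplification: the paper's complex rotation of Q and K, per-head decay values, group norm, and gating are all left out; the point is only that the parallel, recurrent, and chunkwise computations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, gamma, B = 8, 4, 0.9, 4            # seq len, head dim, decay, chunk size (B divides T)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# Parallel form: Out = (Q K^T * D) V, where D is the decay mask,
# D[n, m] = gamma^(n-m) for n >= m and 0 otherwise (it plays the role
# of the causal mask, with an exponential time decay on top).
n = np.arange(T)
D = np.tril(gamma ** (n[:, None] - n[None, :]))
out_parallel = (Q @ K.T * D) @ V

# Recurrent form: one d x d state per head, updated token by token.
out_recurrent = np.zeros((T, d))
S = np.zeros((d, d))                      # the constant-size shared buffer
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

# Chunkwise form: parallel inside each chunk, recurrent across chunks.
out_chunk = np.zeros((T, d))
S = np.zeros((d, d))
j = np.arange(B)
D_inner = np.tril(gamma ** (j[:, None] - j[None, :]))
for c in range(0, T, B):
    Qc, Kc, Vc = Q[c:c+B], K[c:c+B], V[c:c+B]
    inner = (Qc @ Kc.T * D_inner) @ Vc                 # within the chunk
    cross = (gamma ** (j + 1))[:, None] * (Qc @ S)     # contribution of earlier chunks
    out_chunk[c:c+B] = inner + cross
    S = gamma ** B * S + Kc.T @ ((gamma ** (B - 1 - j))[:, None] * Vc)

print(np.allclose(out_parallel, out_recurrent),
      np.allclose(out_parallel, out_chunk))            # True True
```

The chunkwise loop is the trade-off in the last bullet: full parallel work inside each chunk, while only a d x d state is carried across chunks.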


秋刀鱼

Source video: [Paper Quick Overview] RetNet: A Successor to Transformer for Large Language Models (arXiv:2307.08621)

Equation explanations and a code walkthrough.

  1. A global state is maintained, like in a recurrent network.
  2. With that, expand the attention-style equation over Q, K, V.
  3. Apply singular value decomposition (the derivation chain is written out after this list).
  4. ….