FNet: Mixing Tokens with Fourier Transforms - NAACL 2022
Code: arXiv appendix | official-jax | keras code
(2023-07-05) Other implementations found by asking Bing Chat: “Could you give its PyTorch code?”
(2023-06-16)
Video Intro
Source video: FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained) - Yannic Kilcher
(2023-07-07)
Abstract
- Replacing the self-attention sublayers with linear transformations speeds up training.
- Replacing the self-attention sublayers with an unparameterized Fourier Transform still achieves over 90% of the accuracy of the BERT counterparts.
- FNet has a light memory footprint (because the mixing sublayer has no parameters?).
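A minimal sketch (my own illustration, not the paper's reference code) of the parameter-free mixing described above: the token-mixing sublayer is just a 2D FFT over the hidden and sequence dimensions, keeping the real part, so it contributes no learnable weights.

```python
import torch

def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, hidden)
    # 2D DFT: FFT along the hidden dim, then along the sequence dim; keep the real part.
    # No learnable parameters are involved, hence the light memory footprint.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 128, 768)     # (batch, seq_len, hidden)
print(fourier_mix(x).shape)      # torch.Size([2, 128, 768])
```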
Introduction
- Attention connects each token to every other token in the input through relevance weights.
- More complex mixing helps capture the relationships between tokens.
- Can attention, the relevance-based “token-mixer”, be replaced by a simpler linear transformation?
- Decent results are obtained by replacing attention with two parameterized (learnable) matrix multiplications, one mixing along the sequence dimension and then one mixing along the hidden dimension (see the sketch after this list).
(Figure: a sequence containing 5 tokens, each of which is 4-dimensional.)
- FNet instead uses the FFT, a faster, structured, parameter-free linear transformation, which yields performance similar to dense-layer mixing and scales well.
- Contributions:
  - Attention may not be a necessary component; hence, seeking new mixing mechanisms is valuable.
  - FNet uses the FFT to mix tokens, speeding up training while losing some accuracy.
  - Attention does help increase accuracy to some extent.
  - FNet scales well to long inputs.
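For contrast with FNet's parameter-free mixing, here is a rough sketch of the “two parameterized matrix multiplications” baseline mentioned above (the module name and initialization are my own illustration, not the paper's code): one learnable matrix mixes along the sequence dimension, another along the hidden dimension.

```python
import torch
import torch.nn as nn

class LinearMixing(nn.Module):
    """Token mixing with two learnable matrices instead of attention (a sketch)."""
    def __init__(self, seq_len: int, hidden: int):
        super().__init__()
        self.W_seq = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len ** 0.5)
        self.W_hidden = nn.Parameter(torch.randn(hidden, hidden) / hidden ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        x = torch.einsum("st,bth->bsh", self.W_seq, x)     # mix along the sequence dim
        x = torch.einsum("hd,bsd->bsh", self.W_hidden, x)  # mix along the hidden dim
        return x

x = torch.randn(2, 5, 4)  # the 5-token, 4-dimensional example from the figure
print(LinearMixing(seq_len=5, hidden=4)(x).shape)  # torch.Size([2, 5, 4])
```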
Code from: rishikksh20/FNet-pytorch:
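The embed above did not survive; as a stand-in, below is a minimal FNet encoder block written from the paper's description, in the spirit of rishikksh20/FNet-pytorch but not a verbatim copy: Fourier token mixing, then a feed-forward sublayer, each followed by a residual connection and LayerNorm.

```python
import torch
import torch.nn as nn

class FNetBlock(nn.Module):
    """One FNet encoder block: FFT token mixing + feed-forward, each with residual + LayerNorm."""
    def __init__(self, hidden: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.ff = nn.Sequential(
            nn.Linear(hidden, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, hidden),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parameter-free token mixing: 2D FFT over hidden and sequence dims, keep the real part.
        mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
        x = self.norm1(x + mixed)          # residual + LayerNorm around the mixing sublayer
        x = self.norm2(x + self.ff(x))     # residual + LayerNorm around the feed-forward sublayer
        return x

x = torch.randn(2, 128, 768)
print(FNetBlock(hidden=768, ff_dim=3072)(x).shape)  # torch.Size([2, 128, 768])
```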