- Wikipedia-ELM
- Controversy: RBF networks (1980s) already raised a similar idea to ELM.
- Dispute about the originality of ELM: Origins of ELM
- Portal of ELM
- python toolbox: hpelm
Facts
ELM is ā½Ā¹ā¾
- a type of single hidden layer feedforward neural network (SLFN).
- The parameters (š°, b) between the input layer and the hidden layer are set randomly.
  Thus, for N n-dimensional input samples and L hidden nodes, the output of the hidden layer is $š_{NĆL} = g(š_{NĆn} š_{nĆL} + š_{NĆL})$.
- Only the number of hidden nodes needs to be set manually; there are no other hyper-parameters.
- The output weights are not trained iteratively; they are solved in one shot via the pseudo-inverse of the hidden-layer output matrix.
- For an n-dimensional sample š±ā±¼ and its target šā±¼ = [tā±¼ā, tā±¼ā, …, tā±¼ā‚˜]įµ ā āįµ,
  the output of an ELM with L hidden nodes is šØā±¼ = āįµ¢āāᓸ šįµ¢ g(š°įµ¢ā š±ā±¼ + bįµ¢), where
  - g(ā ) is the activation function;
  - šįµ¢ is the output weight vector of the i-th hidden node: šįµ¢ = [βᵢā, βᵢā, …, βᵢₘ]įµ;
  - š°įµ¢ is the input weight vector of the i-th hidden node: š°įµ¢ = [wįµ¢ā, wįµ¢ā, …, wᵢₙ]įµ;
  - š±ā±¼ is an n-dimensional input: š±ā±¼ = [xā±¼ā, xā±¼ā, …, xā±¼ā‚™]įµ ā āāæ;
  - bįµ¢ is the bias of the i-th hidden node;
  - šØā±¼ is the m-dimensional output: šØā±¼ = [oā±¼ā, oā±¼ā, …, oā±¼ā‚˜]įµ ā āįµ.
- The ideal parameters (š°, b, š) should satisfy
  āįµ¢āāᓸ šįµ¢ g(š°įµ¢ā š±ā±¼ + bįµ¢) = šā±¼,  j = 1, …, N.
  For all N samples, this mapping can be reformulated with matrices: $š_{NĆL} \pmb\beta_{LĆm} = š_{NĆm}$, where
  - š is the output of the hidden layer for the N samples:
    $$š(š°ā,…,š°_L, bā,…,b_L, š±ā,…,š±_N) = \begin{bmatrix} g(š°āā š±ā+bā) & \dots & g(š°_Lā š±ā+b_L)\\ \vdots & \ddots & \vdots\\ g(š°āā š±_N+bā) & \dots & g(š°_Lā š±_N+b_L) \end{bmatrix}_{NĆL}$$
  - š is the output weight matrix: š = $\begin{bmatrix}šāįµ\\ \vdots \\š_Lįµ\end{bmatrix}_{LĆm}$
  - š is the target data matrix: š = $\begin{bmatrix}šāįµ\\ \vdots \\š_Nįµ\end{bmatrix}_{NĆm}$
- Generally, $š_{NĆL}$ is not a square matrix, so it is not invertible and š = šā»Ā¹š cannot be applied. Instead, the optimal š is obtained by minimizing the training error āā±¼āāį“ŗāšØā±¼ - šā±¼ā, a linear least-squares problem that ELM solves in closed form rather than iteratively.
- Best estimation: $\hat{š°}įµ¢, \hat{b}įµ¢, \hat{š}$ satisfy
  $\|š(\hat{š°}ā,…,\hat{š°}_L, \hat{b}ā,…,\hat{b}_L)\,\hat{š} - š\| = \min_{š°įµ¢,\,bįµ¢,\,š} \|š(š°ā,…,š°_L, bā,…,b_L)\,š - š\|$, where i = 1, …, L.
- Loss function: J = āā±¼āāį“ŗāāįµ¢āāᓸ šįµ¢ g(š°įµ¢ā š±ā±¼ + bįµ¢) - šā±¼āĀ²
- Solving āJ/āš = 0 gives the optimal output weights:
  $\hat{š} = š^ā š = (šįµš)ā»Ā¹šįµ š$ (the second form when šįµš is invertible),
  where $š^ā $ is the Moore-Penrose inverse (pseudo-inverse) of š.
  It can be proved that $\hat{š}$ is the unique minimum-norm least-squares solution (for a given set of random (š°įµ¢, bįµ¢)).
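Below is a minimal NumPy sketch of this training procedure, assuming a sigmoid activation and the row-wise layout $š_{NĆL} š_{LĆm} = š_{NĆm}$; the function names (`elm_fit`, `elm_predict`) and the toy data are illustrative only.

```python
import numpy as np

def elm_fit(X, T, L, rng=None):
    """Fit an ELM: random (W, b), then solve beta = pinv(H) @ T in one shot.
    X: (N, n) inputs, T: (N, m) targets, L: number of hidden nodes."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights, fixed (never trained)
    b = rng.uniform(-1.0, 1.0, size=(1, L))   # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # hidden-layer output H_{NxL}, sigmoid activation
    beta = np.linalg.pinv(H) @ T              # minimum-norm least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                           # O_{Nxm}

# toy regression check
X = np.linspace(-1, 1, 200).reshape(-1, 1)
T = np.sin(3 * X)
W, b, beta = elm_fit(X, T, L=50, rng=0)
print(np.mean((elm_predict(X, W, b, beta) - T) ** 2))  # small training MSE
```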
Moore-Penrose inverse
Also called pseudoinverse or generalized inverse ā½Ā²ā¾.
(bilibili search: "ä¼Ŗéē©éµ" [pseudo-inverse]) video: Deep Learning ("Flower Book") 0103: pseudo-inverse and least squares
(DDG search: "ä¼Ŗéē©éµ" [pseudo-inverse])
"What is the meaning of the pseudo-inverse and how is it computed?" - ē„ä¹ (Zhihu)
numpy.linalg.pinv()
- pinv(š) = (šįµ š)ā»Ā¹ šįµ (when š has full column rank, i.e. šįµš is invertible)
- pinv(š) š = š (the identity matrix), again for full-column-rank š
  Python/NumPy pseudo-inverse numpy.linalg.pinv - åč”ē¾č” - CSDN
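A quick NumPy check of these two identities on a random tall matrix (which almost surely has full column rank); purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                  # tall matrix, full column rank almost surely

P = np.linalg.pinv(X)                        # Moore-Penrose pseudo-inverse
P_explicit = np.linalg.inv(X.T @ X) @ X.T    # (X^T X)^{-1} X^T, valid since X^T X is invertible

print(np.allclose(P, P_explicit))            # True
print(np.allclose(P @ X, np.eye(3)))         # pinv(X) X = I for full-column-rank X
```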
Example Code
This MATLAB code ā½Ā¹ā¾ trains and tests an ELM on the NIR spectra dataset (regression) and the Iris dataset (classification).
- Note that each column is a sample, and each row is an attribute/feature.
Notations:
- Q: number of samples
- R: number of input features
- S: number of output features
- $P_{RĆQ}$: input pattern matrix
- $T_{SĆQ}$: target data matrix
- N: number of hidden nodes
- TF: transfer function
- $IW_{NĆR}$: input weights matrix
- $B_{NĆQ}$: bias matrix
- $LW_{SĆN}$: output weight matrix (the transpose of š above, i.e. šįµ)
Train (calculate the LW):
- $tempH_{NĆQ} = IW_{NĆR}ā P_{RĆQ} + B_{NĆQ}$
- $H_{NĆQ} = TF(tempH)$
- $LW_{SĆN} = T_{SĆQ} ā  \mathrm{pinv}(H_{NĆQ})$, based on: $LW_{SĆN}\, H_{NĆQ} = T_{SĆQ}$ (the transpose of šš = š above)
Test:
- $tempH_{NĆQ} = IW_{NĆR}ā P_{RĆQ} + B_{NĆQ}$
- $H_{NĆQ} = TF(tempH)$
- $Y_{SĆQ} = LW_{SĆN}ā H_{NĆQ}$
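For reference, a NumPy transcription of these train/test steps in the same column-wise convention (each column is a sample); the sigmoid transfer function and the uniform random initialization are assumptions, as the original MATLAB source is not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(P, T, N, rng=None):
    """P: (R, Q) inputs, T: (S, Q) targets, N: hidden nodes. Returns IW, B0, LW."""
    rng = np.random.default_rng(rng)
    R, Q = P.shape
    IW = rng.uniform(-1.0, 1.0, size=(N, R))   # input weights, set once at random
    B0 = rng.uniform(-1.0, 1.0, size=(N, 1))   # one bias per hidden node
    H = sigmoid(IW @ P + B0)                   # (N, Q); B0 broadcasts over the Q columns
    LW = T @ np.linalg.pinv(H)                 # (S, N), from LW . H = T
    return IW, B0, LW

def elm_test(P, IW, B0, LW):
    H = sigmoid(IW @ P + B0)
    return LW @ H                              # (S, Q) predictions
```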
Example code (py)
Build an Extreme Learning Machine in Python | by Glenn Paul Gara … searched by DDG: “incremental elm python”
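A hedged usage sketch of the hpelm toolbox linked above; the `ELM(inputs, outputs)`, `add_neurons`, `train`, and `predict` calls are quoted from memory of hpelm's documentation and should be verified against the current docs.

```python
# Sketch only: hpelm API names are assumed from its documentation, not verified here.
import numpy as np
from hpelm import ELM

X = np.random.rand(100, 4)           # 100 samples, 4 input features
T = np.random.rand(100, 1)           # regression target

model = ELM(X.shape[1], T.shape[1])  # number of inputs, number of outputs
model.add_neurons(20, "sigm")        # 20 sigmoid hidden nodes (the only real hyper-parameter)
model.train(X, T)                    # solves the output weights in one shot
Y = model.predict(X)
```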
I-ELM
Incremental ELM: hidden nodes are added one at a time (the "incremental" part); each new node's output weight is computed analytically from the current residual error, and existing weights are not retrained.
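A rough sketch of that idea (a simplified I-ELM-style loop, not necessarily Huang's exact algorithm): each new random node's output weight is the least-squares projection of the current residual onto that node's activations.

```python
import numpy as np

def ielm_fit(X, T, max_nodes, rng=None):
    """Grow an ELM one hidden node at a time (simplified I-ELM-style sketch).
    X: (N, n), T: (N, m). Returns stacked weights, biases, and per-node betas."""
    rng = np.random.default_rng(rng)
    E = T.copy()                                  # residual error, starts as the target
    Ws, bs, betas = [], [], []
    for _ in range(max_nodes):
        w = rng.uniform(-1.0, 1.0, size=X.shape[1])
        b = rng.uniform(-1.0, 1.0)
        h = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # (N,) activations of the new node
        beta = (h @ E) / (h @ h)                  # (m,) least-squares weight for this node only
        E = E - np.outer(h, beta)                 # update the residual
        Ws.append(w); bs.append(b); betas.append(beta)
    return np.array(Ws), np.array(bs), np.array(betas)

# prediction with the grown network:
# O = sigmoid(X @ Ws.T + bs) @ betas
```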
OS-ELM
Online Sequential ELM: the output weights š are updated recursively (recursive least squares) as new samples or chunks of samples arrive, without retraining on the old data.
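A hedged sketch of the recursive update usually associated with OS-ELM (chunk-wise recursive least squares); the initialization chunk should contain at least as many samples as hidden nodes so that $H_0įµ H_0$ is invertible.

```python
import numpy as np

def hidden(X, W, b):
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))     # (chunk_size, L)

def oselm_init(X0, T0, W, b):
    """Initial batch: needs at least L samples so that H0^T H0 is invertible."""
    H0 = hidden(X0, W, b)
    P = np.linalg.inv(H0.T @ H0)                  # (L, L)
    beta = P @ H0.T @ T0                          # (L, m)
    return P, beta

def oselm_update(P, beta, Xk, Tk, W, b):
    """Recursive least-squares update for one new chunk (Xk, Tk); no old data needed."""
    Hk = hidden(Xk, W, b)
    K = np.linalg.inv(np.eye(Hk.shape[0]) + Hk @ P @ Hk.T)
    P = P - P @ Hk.T @ K @ Hk @ P
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)
    return P, beta
```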