Table of contents
- Wikipedia-ELM
- Controversy: RBF networks (1980s) raised a similar idea to ELM.
- Dispute about the originality of ELM: Origins of ELM
- Portal of ELM
- python toolbox: hpelm
Facts
ELM is ⁽¹⁾
- a type of single-hidden-layer feedforward neural network (SLFN).
- The parameters $(\mathbf{w}, b)$ between the input layer and the hidden layer are set randomly.
  Thus, for $N$ input $n$-dimensional samples and $L$ hidden nodes, the output of the hidden layer is $\mathbf{H}_{NƗL} = g(\mathbf{X}_{NƗn}\,\mathbf{W}_{nƗL} + \mathbf{B}_{NƗL})$.
- Only the number of hidden nodes needs to be chosen manually; there are no other hyper-parameters.
- The output weights are not trained iteratively; they are solved in one shot via the pseudo-inverse of the hidden-layer output matrix.
- For an $n$-dimensional sample $\mathbf{x}_j$ and its target $\mathbf{t}_j = [t_{j1}, t_{j2}, \dots, t_{jm}]^T \in \mathbb{R}^m$,
  the output of an ELM with $L$ hidden nodes is $\mathbf{o}_j = \sum_{i=1}^{L} \pmb\beta_i\, g(\mathbf{w}_i^T \mathbf{x}_j + b_i)$, where
  - $g(\cdot)$ is the activation function;
  - $\pmb\beta_i = [\beta_{i1}, \beta_{i2}, \dots, \beta_{im}]^T$ is the output weight vector of the $i$-th hidden node;
  - $\mathbf{w}_i = [w_{i1}, w_{i2}, \dots, w_{in}]^T$ is the input weight vector of the $i$-th hidden node;
  - $\mathbf{x}_j = [x_{j1}, x_{j2}, \dots, x_{jn}]^T \in \mathbb{R}^n$ is an $n$-dimensional input;
  - $b_i$ is the bias of the $i$-th hidden node;
  - $\mathbf{o}_j = [o_{j1}, o_{j2}, \dots, o_{jm}]^T \in \mathbb{R}^m$ is the $m$-dimensional output.
- The ideal parameters $(\mathbf{w}, b, \pmb\beta)$ should satisfy, for every sample $j$:
  $$\sum_{i=1}^{L} \pmb\beta_i\, g(\mathbf{w}_i^T \mathbf{x}_j + b_i) = \mathbf{t}_j$$
  For all $N$ samples, this mapping can be reformulated with matrices as $\mathbf{H}_{NƗL}\, \pmb\beta_{LƗm} = \mathbf{T}_{NƗm}$, where
  - $\mathbf{H}$ is the output of the hidden layer for the $N$ samples:
    $$\mathbf{H}(\mathbf{w}_1, \dots, \mathbf{w}_L, b_1, \dots, b_L, \mathbf{x}_1, \dots, \mathbf{x}_N) =
    \begin{bmatrix} g(\mathbf{w}_1^T \mathbf{x}_1 + b_1) & \dots & g(\mathbf{w}_L^T \mathbf{x}_1 + b_L)\\ \vdots & \ddots & \vdots\\ g(\mathbf{w}_1^T \mathbf{x}_N + b_1) & \dots & g(\mathbf{w}_L^T \mathbf{x}_N + b_L) \end{bmatrix}_{NƗL}$$
  - $\pmb\beta$ is the output weight matrix: $\pmb\beta = \begin{bmatrix} \pmb\beta_1^T \\ \vdots \\ \pmb\beta_L^T \end{bmatrix}_{LƗm}$
  - $\mathbf{T}$ is the target data matrix: $\mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{NƗm}$
- Generally, $\mathbf{H}_{NƗL}$ is not a square matrix and hence not invertible, so $\pmb\beta = \mathbf{H}^{-1}\mathbf{T}$ cannot be applied. Instead, the optimal $\pmb\beta$ is found by minimizing the training error $\sum_{j=1}^{N} \lVert \mathbf{o}_j - \mathbf{t}_j \rVert$.
- Best estimates $\hat{\mathbf{w}}_i$, $\hat{b}_i$, $\hat{\pmb\beta}$ satisfy:
  $\lVert \mathbf{H}(\hat{\mathbf{w}}_i, \hat{b}_i)\, \hat{\pmb\beta} - \mathbf{T} \rVert = \min_{\mathbf{w}_i,\, b_i,\, \pmb\beta} \lVert \mathbf{H}(\mathbf{w}_i, b_i)\, \pmb\beta - \mathbf{T} \rVert$, where $i = 1, \dots, L$
- Loss function: $J = \sum_{j=1}^{N} \Big\lVert \sum_{i=1}^{L} \pmb\beta_i\, g(\mathbf{w}_i^T \mathbf{x}_j + b_i) - \mathbf{t}_j \Big\rVert^2$
- Setting $\partial J / \partial \pmb\beta = 0$ and solving for $\pmb\beta$ gives the optimal output weights:
  $$\hat{\pmb\beta} = \mathbf{H}^{\dagger} \mathbf{T} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{T},$$
  where $\mathbf{H}^{\dagger}$ is the Moore-Penrose inverse (pseudo-inverse) of $\mathbf{H}$, and the explicit form holds when $\mathbf{H}$ has full column rank.
  It can be proved that $\hat{\pmb\beta}$ is the unique minimum-norm least-squares solution (for a given set of random $(\mathbf{w}_i, b_i)$). See the numpy sketch below.
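As a concrete illustration of the derivation above, here is a minimal numpy sketch of ELM training and prediction (a sketch only; the function names, the sigmoid activation, and the Gaussian initialization are my own choices, not prescribed by ELM):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, L, seed=0):
    """X: N x n inputs, T: N x m targets, L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.standard_normal((n, L))   # random input weights, never trained
    b = rng.standard_normal(L)        # random hidden biases, never trained
    H = sigmoid(X @ W + b)            # hidden-layer output matrix, N x L
    beta = np.linalg.pinv(H) @ T      # one-shot minimum-norm least-squares solution, L x m
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta  # o = H beta

# Toy usage: fit y = sin(x) on [0, 2*pi].
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_train(X, T, L=50)
print(np.mean((elm_predict(X, W, b, beta) - T) ** 2))  # small training MSE
```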
Moore-Penrose inverse
Also called pseudoinverse or generalized inverse ⁽²⁾.
(bilibili search: "伪逆ēŸ©é˜µ", i.e. "pseudo-inverse matrix") video: 深度学习-花书0103 伪逆ēŸ©é˜µ 最小二乘 (Deep Learning "flower book" 01-03: pseudo-inverse matrix, least squares)
(DDG search: "伪逆ēŸ©é˜µ")
伪逆ēŸ©é˜µēš„意义和求法? - 知乎 (The meaning of the pseudo-inverse matrix and how to compute it? - Zhihu)
numpy.linalg.pinv()
- pinv($\mathbf{A}$) $= (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ (when $\mathbf{A}$ has full column rank)
- pinv($\mathbf{A}$)$\,\mathbf{A} = \mathbf{I}$ (the identity matrix) in that case; see: python之numpy之伪逆 numpy.linalg.pinv - CSDN
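A quick numpy check of the two identities above (illustrative only; both hold when the matrix has full column rank, which a random tall matrix has almost surely):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))          # tall matrix, full column rank almost surely

P1 = np.linalg.pinv(A)                   # Moore-Penrose inverse (computed via SVD)
P2 = np.linalg.inv(A.T @ A) @ A.T        # (A^T A)^{-1} A^T

print(np.allclose(P1, P2))               # True
print(np.allclose(P1 @ A, np.eye(3)))    # True: pinv(A) A = I for full column rank
```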
Example Code
This MATLAB code ⁽¹⁾ trains and tests an ELM on the NIR spectra dataset (regression) and the Iris dataset (classification).
- Note that each column is a sample, and each row is an attribute/feature.
Notations:
- Q: number of samples
- R: number of input features
- S: number of output features
- $P_{RƗQ}$: input pattern matrix
- $T_{SƗQ}$: target data matrix
- N: number of hidden nodes
- TF: transfer function
- $IW_{NƗR}$: input weights matrix
- $B_{NƗQ}$: bias matrix (the NƗ1 bias vector repeated for each of the Q samples)
- $LW_{SƗN}$: output weights matrix (the transpose of $\pmb\beta$ above)
Train (calculate the LW):
- $tempH_{NƗQ} = IW_{NƗR} \cdot P_{RƗQ} + B_{NƗQ}$
- $H_{NƗQ} = TF(tempH)$
- $LW_{SƗN} = T_{SƗQ} \cdot \mathrm{pinv}(H)$, based on $LW_{SƗN}\, H_{NƗQ} = T_{SƗQ}$
Test:
- $tempH_{NƗQ} = IW_{NƗR} \cdot P_{RƗQ} + B_{NƗQ}$
- $H_{NƗQ} = TF(tempH)$
- $Y_{SƗQ} = LW_{SƗN} \cdot H_{NƗQ}$
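The MATLAB file itself is not reproduced here; the following is a rough numpy transcription of the train/test steps above, keeping the column-major convention (samples in columns). Function and variable names mirror the notation list and are otherwise my own.

```python
import numpy as np

def TF(x):
    """Transfer function; a sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-x))

def elm_train_cols(P, T, N, seed=0):
    """P: R x Q input patterns, T: S x Q targets, N: number of hidden nodes."""
    R, Q = P.shape
    rng = np.random.default_rng(seed)
    IW = rng.uniform(-1.0, 1.0, (N, R))   # input weights, N x R
    b = rng.uniform(-1.0, 1.0, (N, 1))    # hidden biases, N x 1
    B = np.tile(b, (1, Q))                # bias matrix, N x Q
    H = TF(IW @ P + B)                    # hidden output, N x Q
    LW = T @ np.linalg.pinv(H)            # S x N, from LW * H = T
    return IW, b, LW

def elm_test_cols(P, IW, b, LW):
    H = TF(IW @ P + b)                    # bias column broadcasts across test columns
    return LW @ H                         # Y: S x Q predictions
```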
Example code (py)
Build an Extreme Learning Machine in Python | by Glenn Paul Gara … searched by DDG: “incremental elm python”
I-ELM
Does "incremental" just mean adding hidden neurons one at a time? In I-ELM, yes: random hidden nodes are added one by one, each new node's output weight is fitted to the current residual error, and the previously added nodes are left unchanged.
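A rough sketch of that incremental scheme as I understand it (each new node's output weight is a one-dimensional least-squares fit to the residual; not taken from any particular reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ielm_train(X, T, L_max, seed=0):
    """Incremental ELM sketch. X: N x n inputs, T: N x m targets."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    E = np.array(T, dtype=float)            # residual error, starts as the targets
    W, b, beta = [], [], []
    for _ in range(L_max):
        w_i = rng.standard_normal(n)        # new random hidden node
        b_i = rng.standard_normal()
        h = sigmoid(X @ w_i + b_i)          # this node's output on all N samples
        beta_i = (h @ E) / (h @ h)          # least-squares fit of h to the residual, length m
        E = E - np.outer(h, beta_i)         # shrink residual; earlier nodes stay frozen
        W.append(w_i); b.append(b_i); beta.append(beta_i)
    return np.array(W).T, np.array(b), np.array(beta)   # shapes: n x L, L, L x m

def ielm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta
```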
OS-ELM
Online Sequential ELM: the output weights are updated sequentially (recursive least squares) as data arrive one by one or chunk by chunk, instead of recomputing the pseudo-inverse from scratch.