IEEE Cybernetics (2015-11-02) | Google Drive | G.Scholar
- This is the first paper of ELM with subnetwork nodes by Yimin Yang.
- The second paper in the series is MltLyr ELM with subnetwork nodes
- The outline of Yang’s research works: Yang-WeChatLive-20221020;
Abstract
- The learning effectiveness and learning speed of SLFNs are bottlenecks.
- ELM is fast.
- Grow subnetwork hidden nodes by pulling the network residual error back to the hidden layer.
- Better generalization performance with fewer hidden nodes.
Ⅰ. Introduction
- Bring out the subject: FNN (universal approximator) ➔ SLFN
- What is an SLFN? Input layer + hidden layer + output layer
- Math description: For N arbitrary distinct samples {(𝐱ᵢ,𝐭ᵢ)}ᵢ₌₁ᴺ, where 𝐱ᵢ∈ 𝐑ⁿ and 𝐭ᵢ∈ 𝐑ᵐ, the network output is:
$𝑓_L(𝐱ⱼ)$ = ∑ᵢ₌₁ᴸ 𝛃ᵢ⋅h(𝐚ᵢ⋅𝐱ⱼ + bᵢ) = ∑ᵢ₌₁ᴸ 𝐇ᵢ⋅𝛃ᵢ, j=1,…,N (1)
- The SLFN output is the weighted sum of the 𝐿 hidden nodes (perceptrons), with weights 𝛃.
- The 𝑖th perceptron receives the weighted sum of the 𝑛 input components through its parameters (𝐚ᵢ, bᵢ), 𝐚ᵢ∈ 𝐑ⁿ, bᵢ∈ 𝐑, and applies the activation function h. Its contribution to all output nodes is weighted by 𝛃ᵢ.
- ELM traits: NN (all parameters adjustable) ➔ partially random networks ➔ ELM, a fully random learning method in which the input weights and biases (𝐚, b) are generated randomly and independently of the training data. (Would Glorot initialization then have no effect?)
- ELM advantages: a unification of FNNs and SVM/LS-SVM.
- ELM applications: CV, da,…, online learning.
Problems:
- The choice of the regularization parameter C, which affects the generalization performance of ELM, mainly relies on trial and error.
- How many neurons should be used in ELM? Although Huang suggested using more than 1000 hidden nodes, it is unclear whether the number of hidden nodes can be further reduced without affecting learning effectiveness on large-size / high-dimensional data sets.
- Several improved ELM methods exist: B-ELM pulls the network residual error back to the hidden layer, but it only works for regression tasks; other methods bring higher computational complexity than standard ELM.
Solution: grow subnetwork hidden nodes in the existing network by pulling the network residual error back to the hidden layer. A hidden node itself can be formed by several hidden nodes.
Contributions:
- Faster than BP, SVM, and other ELM variants, and applicable to both regression and classification problems.
- The regularization parameter C does not affect the generalization performance of this method.
- This method with m hidden nodes (m = the desired output dimensionality) can achieve better training accuracy than the original ELM with a large number of hidden nodes.
Ⅱ. Definitions and Basic-ELM
A. Notations and Definitions
- 𝐑 : set of real numbers
- {(𝐱ᵢ,𝐭ᵢ)}ᵢ₌₁ᴺ : N arbitrary distinct samples
- 𝐱ᵢ = [xᵢ₁,xᵢ₂,…,xᵢₙ]ᵀ : n-dim input data, 𝐱ᵢ∈ 𝐑ⁿ is a column vector
- 𝐭ᵢ : m-dim desired output data, 𝐭ᵢ∈ 𝐑ᵐ
- 𝐱 : input data matrix, 𝐱=[𝐱₁,…,𝐱_N], 𝐱∈ 𝐑ⁿᕁᴺ
- 𝐭 : desired output data matrix, 𝐭=[𝐭₁,…,𝐭_N], 𝐭∈ 𝐑ᵐᕁᴺ
- (^𝐚ₙ, ^bₙ) : parameters of the 𝑛th subnetwork hidden node, ^𝐚ₙ∈ 𝐑ⁿᕁᵐ, ^bₙ∈ 𝐑 (suppose the number of hidden nodes inside a subnetwork node equals the output dimension m; the mapping is thus from n to m)
- ^𝐚ₙ = [𝐚ₙ₁,…,𝐚ₙₘ] : n×m weight matrix (for an n-dimensional sample 𝐱ᵢ), 𝐚ₙₘ∈ 𝐑ⁿ
- 𝐞ₙ : residual error of the current network output 𝑓ₙ with 𝑛 hidden nodes (over the N samples), i.e., 𝐞ₙ = 𝐭 - 𝑓ₙ, 𝐞ₙ∈ 𝐑ᵐᕁᴺ
- 𝐇 : output matrix of the hidden layer (of the SLFN) for the training set {(𝐱ᵢ,𝐭ᵢ)}ᵢ₌₁ᴺ, 𝐇 = [h(𝐱₁),…,h(𝐱_N)]ᵀ, 𝐇∈ 𝐑ᴺᕁᵐ (𝐇 = g(𝐱ᵀ𝐚+𝐛)?)
- h(𝐱) : activation function; the ELM feature mapping (or Huang's transform)
- 𝐇ᵢ : the 𝑖th hidden node's output w.r.t. the inputs, i.e., the 𝑖th column of 𝐇
- 𝐈 : identity matrix
- sum(𝐞) : the sum of all elements of the matrix 𝐞
B. Basic-ELM
ELM is proposed for single-hidden-layer feedforward networks (SLFNs).
- The output function of ELM with L hidden nodes for SLFNs is:
$𝑓_L(𝐱ⱼ)$ = ∑ᵢ₌₁ᴸ 𝛃ᵢ⋅h(𝐚ᵢ⋅𝐱ⱼ + bᵢ) = ∑ᵢ₌₁ᴸ 𝐇ᵢ⋅𝛃ᵢ, j=1,…,N.
where h(⋅) denotes the activation function, (𝐚ᵢ, bᵢ), 𝐚ᵢ∈ 𝐑ⁿ, bᵢ∈ 𝐑, are the 𝑖th hidden node's parameters, and 𝛃ᵢ is the output weight between the 𝑖th hidden node and the output nodes.
- Based on Bartlett's theory, ELM aims to reach not only the smallest training error but also the smallest norm of output weights (the least-squares, least-norm solution; the regularization makes the matrix invertible so that a unique solution can be determined):
Minimize: ‖𝛃‖² + C⋅‖𝐇𝛃 - 𝐭‖²
“then the generalization performance depends on the size of weights rather than the number of nodes.”
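As a concrete anchor, here is a minimal NumPy sketch of this basic regularized ELM (a sketch only: a sigmoid activation and the common closed-form solution 𝛃 = (𝐈/C + 𝐇ᵀ𝐇)⁻¹𝐇ᵀ𝐭 are assumed; all function and variable names are illustrative, not from the paper):

```python
import numpy as np

def elm_train(X, T, L=100, C=1.0, seed=0):
    """Basic ELM: random hidden layer + regularized least-squares output weights.
    X: (N, n) inputs, T: (N, m) targets, L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.standard_normal((n, L))            # random input weights a_i
    b = rng.standard_normal(L)                 # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))     # hidden-layer output matrix H, shape (N, L)
    # Least-squares, least-norm solution: beta = (I/C + H^T H)^{-1} H^T T
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                            # f_L(x) = sum_i H_i * beta_i
```

Note that (𝐚, b) are never trained; only 𝛃 is solved for, which is what makes ELM fast.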
Lemma 1 (proved by Huang):
Given an SLFN with nonconstant piecewise continuous hidden nodes 𝐇(𝐱, 𝐚, b), then for any continuous target function 𝑓 and any function sequence 𝐇ₙʳ(𝐱) = 𝐇(𝐱, 𝐚ₙ, bₙ) randomly generated based on any continuous sampling distribution,
lim$_{n➝∞}$ ‖𝑓 - (𝑓ₙ₋₁ + 𝐇ₙʳ⋅𝛃ₙ)‖ = 0
holds with probability 1 if:
𝛃ₙ = ⟨𝐞ₙ₋₁, 𝐇ₙʳ⟩ / ‖𝐇ₙʳ‖² (the output weight of the 𝑛th node)
where
- “⟨ , ⟩” stands for “dot product” (Frobenius inner product) of two matrices and is a scalar.
- n is the number of hidden nodes in the hidden layer.
- 𝐞ₙ₋₁ is the residual error of the last iteration, i.e., when there were n-1 hidden nodes.
- 𝐇ₙʳ is the output of the newly added (𝑛th) random hidden node over all samples (activated but not yet scaled by 𝛃ₙ).
Intuitively, as the residual error reduces, the weight of the newer node gets smaller.
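A minimal sketch of the incremental construction Lemma 1 describes, adding one random node at a time and setting its output weight to 𝛃ₙ = ⟨𝐞ₙ₋₁, 𝐇ₙʳ⟩/‖𝐇ₙʳ‖² (the sigmoid activation and the per-output-dimension handling of the inner product are my assumptions; names are illustrative):

```python
import numpy as np

def incremental_elm(X, T, n_nodes=50, seed=0):
    """Grow random hidden nodes one by one; each new node's output weight is
    beta_n = <e_{n-1}, H_n^r> / ||H_n^r||^2, then f_n = f_{n-1} + H_n^r * beta_n."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    f = np.zeros(T.shape)                        # current network output f_n
    nodes = []
    for _ in range(n_nodes):
        a = rng.standard_normal(n)               # random parameters of the new node
        b = rng.standard_normal()
        H = 1.0 / (1.0 + np.exp(-(X @ a + b)))   # (N,) output of the new node on all samples
        e = T - f                                # residual error e_{n-1}
        beta = H @ e / (H @ H)                   # (m,) inner product with the residual / ||H||^2
        f += np.outer(H, beta)                   # add the new node's contribution
        nodes.append((a, b, beta))
    return nodes, f
```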
Ⅲ. Proposed ELM Method With Subnetwork Hidden Nodes
A. Structure of the Proposed Method
Motivations:
- Selecting an appropriate number of neurons typically resorts to optimization algorithms.
- The generalization performance depends on the size of the weights rather than the number of weights.
Inspiration:
“A hidden node itself can be a subnetwork formed by several nodes. And these subnetwork hidden nodes and output weights itself should be the smallest norm, and also aim to reach the smallest training error.”
Objectives:
Given N training samples {(𝐱ᵢ,𝐭ᵢ)}ᵢ₌₁ᴺ, 𝐱ᵢ∈ 𝐑ⁿ, 𝐭ᵢ∈ 𝐑ᵐ, generated from the same continuous system, if activation function h is invertible, the objectives are:
- ‖𝐞ₙ₋₁‖ ≥ ‖𝐞ₙ₋₁ - h(^𝐚ₙ,𝐱)‖ ≥ ‖𝐞ₙ₋₁ - 𝐇ₙ‖ ≥ ‖𝐞ₙ₋₁ - 𝐇ₙ⋅𝛃ₙ‖ (the residual error is non-increasing)
- ‖h(^𝐚ₙ,𝐱) - 𝐞ₙ₋₁‖ = min_{𝐚ₙ₁,…,𝐚ₙₘ} ‖h([𝐚ₙ₁,…,𝐚ₙₘ], 𝐱) - 𝐞ₙ₋₁‖ (optimize the weights inside the node)
- ‖𝐇ₙ⋅^𝛃ₙ - 𝐞ₙ₋₁‖ = min_{𝛃} ‖𝐇ₙ⋅𝛃 - 𝐞ₙ₋₁‖ (optimize the weights outside the node)
where
- ^𝐚ₙ and ^𝛃ₙ are the optimal parameters (the ultimate status), with the smallest norm among all least-squares solutions.
- 𝐇ₙ = h(^𝐚ₙ, ^bₙ, 𝐱) is the output of the nth hidden node with the optimal parameters.
If the activation function h is invertible, subnetwork hidden nodes in an SLFN can be calculated by pulling the network residual error back to the hidden layer.
- For example, with the sine function as the activation function, training a subnetwork hidden node (^𝐚ₙ) is equivalent to finding the least-squares solution ^𝐚ₙ (by setting the derivative of the MSE to 0) with the least norm for the linear system:
[𝐚ₙ₁,…,𝐚ₙₘ]⋅𝐱 = arcsin(𝐞ₙ₋₁), 𝐞ₙ₋₁∈ (0,1],
such that the optimal ^𝐚ₙ satisfies:
‖sin(^𝐚ₙ, 𝐱) - 𝐞ₙ₋₁‖ = min_{𝐚ₙ₁,…,𝐚ₙₘ} ‖sin([𝐚ₙ₁,…,𝐚ₙₘ], 𝐱) - 𝐞ₙ₋₁‖.
That means the output of the 𝑛th "subnetwork hidden node" ^𝐚ₙ approximates the residual error 𝐞ₙ₋₁ of the previous step.
The input weights ^𝐚ₙ of a node in this model form a matrix (instead of a vector), because each "subnetwork (general) hidden node" internally contains a standard SLFN (several hidden nodes).
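A minimal sketch of this pullback step for the sine case (assumptions on my part: u(⋅) is a simple min-max rescaling into (0, 1], 𝐱⁻¹ is the Moore-Penrose pseudo-inverse, and the result is stored with the m inner weight vectors as rows; none of these details are fixed by the note):

```python
import numpy as np

def u_rescale(E, eps=1e-6):
    """u(.): squash every entry of the residual into (0, 1] so arcsin is defined."""
    lo, hi = E.min(), E.max()
    return (E - lo) / (hi - lo + eps) * (1 - eps) + eps

def subnetwork_node_sine(X, E_prev):
    """Solve [a_n1, ..., a_nm] . x = arcsin(u(e_{n-1})) in the least-squares sense.
    X: (n, N) input matrix x, E_prev: (m, N) residual error e_{n-1}."""
    Lam = np.arcsin(u_rescale(E_prev))     # arcsin(u(e_{n-1})), shape (m, N)
    A_hat = Lam @ np.linalg.pinv(X)        # a_hat_n = arcsin(u(e_{n-1})) . x^{-1}, shape (m, n)
    return A_hat                           # rows are the m inner weight vectors a_nj^T
```

The pseudo-inverse gives the least-squares, least-norm solution of the linear system, which is exactly the optimality condition stated above.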
Differences with standard ELM
| | ELM with subnetwork nodes | standard ELM |
|---|---|---|
| hidden node | m neurons: $𝐚_f∈ 𝐑ⁿᕁᵐ, 𝐛_f∈ 𝐑ᵐ$ | single neuron: 𝐚∈ 𝐑ⁿ, b∈ 𝐑 |
| construction | calculated | generated randomly |
| # hidden nodes | L × m (m ⟂ L, m = output dimension) | L |
B. Proposed Method
Lemma 2:
Given a bounded nonconstant piecewise continuous activation function h, we have:
lim$_{(𝐚,b)→(𝐚₀,b₀)}$ ‖h(𝐚⋅𝐱+b) - h(𝐚₀⋅𝐱+b₀)‖ = 0 (continuity)
Theorem 1:
Given N arbitrary distinct samples {(𝐱ᵢ,𝐭ᵢ)}ᵢ₌₁ᴺ, 𝐱ᵢ∈ 𝐑ⁿ, 𝐭ᵢ∈ 𝐑ᵐ, and a sigmoid or sine activation function h, then for any continuous desired output 𝐭 the error converges to 0:
lim$_{n➝∞}$ ‖𝐭 - u⁻¹(h(^𝐚⋅𝐱 + b))‖ = 0
Proof:
- Step 1: prove that the sequence ‖𝐞ₙ‖ is decreasing with 0 as a lower bound, so it converges.
- For the 𝑛th subnetwork hidden node containing m hidden nodes, the linear mapping is:
𝛌ₙ = [𝐚ₙ₁,…,𝐚ₙₘ]⋅𝐱, 𝛌ₙ∈ 𝐑ᵐ
- Then 𝛌ₙ passes through the activation function. Because the target is the error, which should eventually become 0, the current error is treated as the output of the activation function:
𝐞ₙ₋₁ = h(𝛌ₙ) ∈ 𝐑ᵐ
- The inverse function of h is h⁻¹, and its input must lie in (0,1]. Therefore, to solve for 𝛌ₙ from 𝐞ₙ₋₁, every element of 𝐞ₙ₋₁ is scaled into (0,1] by the normalization function u(⋅). Then 𝛌ₙ can be calculated as:
𝛌ₙ = h⁻¹(u(𝐞ₙ₋₁))
- Further, the input weights of this subnetwork hidden node can be solved:
$$\^𝐚ₙ = [𝐚ₙ₁,…,𝐚ₙₘ] = h⁻¹(u(𝐞ₙ₋₁))⋅𝐱⁻¹$$
- For different activation functions, this gives:
$$\^𝐚ₙ = \begin{cases} \arcsin(u(𝐞ₙ₋₁))⋅𝐱⁻¹, & \text{sine} \\ -\log\big(1/u(𝐞ₙ₋₁) - 1\big)⋅𝐱⁻¹, & \text{sigmoid} \end{cases}$$
(This work is a continuation of ELM: once 𝛃 is calculated from the target 𝐭 and 𝐇⁻¹, the residual error is also fixed, so it serves as the target for (𝐚, b). Applying least squares again, the optimal 𝐚 can be calculated from the "target" 𝐞 and 𝐱⁻¹.)
- ^bₙ is the mean of the hidden nodes.
- The error can be reduced further by adding this bias.
- Feed forward with the calculated ^𝐚ₙ and ^bₙ; the output of the hidden layer this time is:
$$\^𝐇ₙᵉ = u⁻¹(h(\^𝐚ₙ⋅𝐱+\^bₙ))$$
Because 𝐞ₙ₋₁ was treated as the output of the activation function, 𝐞ₙ₋₁ and $\^𝐇ₙᵉ$ lie in the same space and can be subtracted from each other. The reduction of the squared residual error at this step is:
Δ = ‖𝐞ₙ₋₁‖² - ‖𝐞ₙ₋₁ - ^𝐇ₙᵉ‖²
= ‖𝐞ₙ₋₁‖² - (‖𝐞ₙ₋₁‖² - 2⟨𝐞ₙ₋₁, ^𝐇ₙᵉ⟩ + ‖^𝐇ₙᵉ‖²)
= 2⟨𝐞ₙ₋₁, ^𝐇ₙᵉ⟩ - ‖^𝐇ₙᵉ‖²
= ‖^𝐇ₙᵉ‖² ( 2⟨𝐞ₙ₋₁, ^𝐇ₙᵉ⟩/‖^𝐇ₙᵉ‖² - 1 )
- Δ ≥ 0, so ‖𝐞ₙ‖ ≤ ‖𝐞ₙ₋₁‖ and the sequence of errors is non-increasing (a numerical sketch follows below).
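A toy end-to-end sketch of one such iteration with the sigmoid activation, checking Δ numerically (the concrete u / u⁻¹ rescaling pair and taking ^bₙ as the mean of the fitting gap 𝛌ₙ - ^𝐚ₙ⋅𝐱 are my assumptions; the note only says the bias is a mean):

```python
import numpy as np

def u(E, eps=1e-3):
    """Rescale residual entries into (eps, 1 - eps) so the sigmoid inverse is defined."""
    lo, hi = E.min(), E.max()
    return (E - lo) / (hi - lo) * (1 - 2 * eps) + eps, (lo, hi, eps)

def u_inv(Y, params):
    lo, hi, eps = params
    return (Y - eps) / (1 - 2 * eps) * (hi - lo) + lo

def subnetwork_node_step(X, E_prev):
    """One subnetwork-hidden-node update: lambda = h^{-1}(u(e)), a_hat = lambda . pinv(x),
    then feed forward H_hat = u^{-1}(h(a_hat x + b_hat))."""
    U, params = u(E_prev)
    Lam = -np.log(1.0 / U - 1.0)                    # lambda_n = h^{-1}(u(e_{n-1})), sigmoid inverse
    A_hat = Lam @ np.linalg.pinv(X)                 # a_hat_n = h^{-1}(u(e_{n-1})) . x^{-1}
    b_hat = (Lam - A_hat @ X).mean(axis=1)          # assumed bias: mean of the fitting gap
    Z = A_hat @ X + b_hat[:, None]
    return u_inv(1.0 / (1.0 + np.exp(-Z)), params)  # H_n^e = u^{-1}(h(a_hat x + b_hat))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))    # x in R^{n x N}, n = 5, N = 200
E = rng.standard_normal((3, 200))    # e_{n-1} in R^{m x N}, m = 3
H_hat = subnetwork_node_step(X, E)
delta = np.linalg.norm(E) ** 2 - np.linalg.norm(E - H_hat) ** 2
print("Delta =", delta)              # expected non-negative when the pullback fits the residual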
- Step 2: prove that the limit converges to 0 as n tends to infinity.
The target value is approximated while the error decreases.
The final estimate is the summation of the outputs of the d subnetwork hidden nodes (see the loop sketch below).
The VC dimension is lower than that of standard ELM, i.e., the feature-space dimension m ≪ L, so the generalization ability of this method is better.
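Putting the pieces together, a sketch of the full loop (reusing the hypothetical subnetwork_node_step from the sketch above): the network output is the running sum of the d subnetwork hidden nodes' outputs, and the residual shrinks from node to node.

```python
import numpy as np

def train_subnetwork_elm(X, T, d=3):
    """Grow d subnetwork hidden nodes; the final estimate is the sum of their outputs.
    X: (n, N) inputs, T: (m, N) desired outputs."""
    F = np.zeros(T.shape)                       # accumulated network output f_n
    E = T - F                                   # residual e_0 = t
    for _ in range(d):
        H_hat = subnetwork_node_step(X, E)      # n-th subnetwork node output H_n^e
        F += H_hat                              # f_n = f_{n-1} + H_n^e
        E = T - F                               # e_n = t - f_n; ||e_n|| should be non-increasing
    return F, E
```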