read: Multilayer Subnetwork Nodes

Multilayer Extreme Learning Machine With Subnetwork Nodes for Representation Learning (Cybernet 2015)

Authors: Yimin Yang and Q. M. Jonathan Wu

IEEE Transactions on Cybernetics; published online 2015-10-09.

This is the second paper in the authors' series; the first paper is this.

(The introduction is well written, logical, and informative, but the method section that follows has many typos.)

Abstract

Representation learning with a multilayer ELM built from subnetwork nodes outperforms conventional feature learning methods.

I. Introduction

model performance ➔ data representation/features ➔ processing pipeline design and data transformations ➔ data representation ➔ effective learning

  1. Feature reduction and extraction techniques can be conducted in a supervised, unsupervised or semi-supervised manner.

  2. ELMs learn representations of data to extract useful information when building classifiers or predictors.

  3. ELMs provide a unified learning framework for “generalized” single-hidden layer feedforward NNs (SLFNs).

    • In ELM methods, the hidden-layer parameters of the NN need not be tuned during training; they are generated randomly.
  4. ML-ELM adds (multiple) general hidden nodes (subnetwork nodes) to existing single-hidden-layer ELM networks.

    1. A versatile platform with faster speed and better generalization performance on feature extraction.
    2. Its generalization performance is not sensitive to the parameters of the networks in the learning process.
    3. ML-ELM has universal approximation capability and representation learning.

II. Preliminaries and Basic-ELM

A. Notations

  • 𝐑 : set of real numbers
  • {(𝐱ᵢ,𝐲ᵢ)ᵢ₌₁ᴹ} (𝐱ᵢ∈ 𝐑ⁿ,𝐲ᵢ∈ 𝐑ᵐ) : M arbitrary distinct samples,
  • 𝐱 : an input data matrix 𝐱∈ 𝐑ⁿᕁᴹ
  • 𝐲 : desired output data matrix 𝐲∈ 𝐑ᵐᕁᴹ
  • 𝛂ᵢ : weight vector connecting the 𝑖th hidden node and the input nodes
  • ♭ᵢ : bias of the 𝑖th hidden node
  • βᵢ : output weight between the 𝑖th hidden node and the output node
  • 𝐞 : residual error of current network output, i.e., 𝐞=𝐲-𝐟
  • 𝐈 : unit matrix
  • sum(𝐞) : the sum of all elements of the matrix 𝐞
  • g : sigmoid or sine activation function

(TABLE 1)

  • (𝛂,♭) : a hidden node (in basic ELM)
  • (𝐚,𝑏) : a general hidden node (or subnetwork node)
  • ^𝐚ʲ_f : input weight of the 𝑗th general hidden node in the feature mapping layer, ^𝐚ʲ_f∈ 𝐑ᵈᕁⁿ
  • ^bʲ_f : bias of the 𝑗th general hidden node in the feature mapping layer, ^bʲ_f∈ 𝐑
  • (𝛂ᵢʲ_f,♭ʲ_f) : the 𝑖th hidden node in the 𝑗th general hidden node
  • (^𝐚ₕ,^𝑏ₕ) : hidden nodes in the ELM learning layer, ^𝐚ₕ∈ 𝐑ᵐᕁᵈ
  • uⱼ : normalization function in the 𝑗th general node, uⱼ(⋅):𝐑 ➔ (0,1]; uⱼ⁻¹ represents its inverse function
  • 𝐇ʲ_f : feature data generated by 𝑗 general nodes in a feature mapping layer, i.e., 𝐇ʲ_f = ∑ᵢ₌₁ʲ uᵢ⁻¹ ⋅ g(𝐱, ^𝐚ⁱ_f, ^bⁱ_f), 𝐇ʲ_f∈ 𝐑ᵈᕁᴹ
  • 𝐇ʲⁱ_f : feature data generated by the 𝑖th feature mapping layer
  • M : number of training samples
  • n : input data dimension
  • m : output data dimension
  • d : feature data dimension
  • 𝐞_L : the residual error of current two-layer network (L general nodes in the first layer and (𝐚ₕ,𝑏ₕ) in the second layer)
  • 𝐞ʲ_L : the residual error of the current two-layer network (L general nodes in the first layer and 𝑗 general nodes in the second layer)
  • L : the number of general hidden nodes

B. Basic-ELM

The output function of ELM for SLFNs fed with input matrix 𝐱 is:
fₙ(𝐱)=∑ᵢ₌₁ⁿ βᵢ g(𝐱, 𝛂ᵢ, ♭ᵢ).

“ELM theory aims to reach the smallest training error but also the smallest norm of output weights” (a regularization term?), so the objective is to minimize:
$\|\beta\|_p^{\mu_1} + C\,\big\|\sum_{i=1}^{n} \beta_i\, g(\mathbf{x}, \boldsymbol{\alpha}_i, b_i) - \mathbf{y}\big\|_q^{\mu_2}$, i=1,…,n.

where μ₁>0, μ₂>0, p,q = 0, ½, 1, 2, …, +∞, C is a positive value, and g(𝐱, 𝛂, ♭) is referred to as the ELM feature mapping (linear projection + activation) or Huang’s transform.
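
For the common ridge-regression setting p=q=μ₁=μ₂=2, this objective has the standard closed-form solution from the regularized ELM literature (not restated in these notes; written here in the row-wise convention with hidden-layer output 𝐇 ∈ 𝐑ᴹᕁᴸ and target matrix 𝐓 ∈ 𝐑ᴹᕁᵐ):

$\beta = \mathbf{H}^{T}\Big(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\Big)^{-1}\mathbf{T} = \Big(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\Big)^{-1}\mathbf{H}^{T}\mathbf{T}$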

(Convergence proved by Huang et al.)

Lemma 1: Given M arbitrary distinct samples {(𝐱, 𝐲)}, 𝐱∈ 𝐑ⁿᕁᴹ, 𝐲∈ 𝐑ᵐᕁᴹ, sampled from a continuous system, and an activation function g, then for any continuous target function 𝐲 and any function sequence g(𝐱, 𝛂ₙʳ, ♭ₙʳ) randomly generated based on any continuous sampling distribution, lim_{n➝∞} ‖𝐲-(fₙ₋₁ + βₙ⋅g(𝐱, 𝛂ₙʳ, ♭ₙʳ))‖=0 holds with probability one if
βₙ = ⟨𝐞ₙ₋₁, g(𝐱, 𝛂ₙʳ, ♭ₙʳ)⟩ / ‖g(𝐱, 𝛂ₙʳ, ♭ₙʳ)‖²,

where (𝛂ₙʳ, ♭ₙʳ) represents the 𝑛th random hidden node and 𝐞ₙ₋₁ = 𝐲-fₙ₋₁.
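
A minimal numpy sketch of basic ELM training under the regularized least-squares objective above (my own code, not the paper's; samples are stored as rows here, unlike the column-wise notation used in these notes):

```python
import numpy as np

def elm_train(X, Y, n_hidden=100, C=1.0, rng=None):
    """Basic ELM: a random, untuned hidden layer plus ridge-regularized output weights.

    X: (M, n) inputs, Y: (M, m) targets (samples as rows in this sketch).
    Returns (A, b, beta) so that predictions are sigmoid(X @ A + b) @ beta.
    """
    rng = np.random.default_rng(rng)
    M, n = X.shape
    # Random hidden-layer parameters (the (alpha_i, b_i) of the notes), never tuned.
    A = rng.uniform(-1.0, 1.0, size=(n, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=(1, n_hidden))
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))          # sigmoid ELM feature mapping
    # beta = (I/C + H^T H)^{-1} H^T Y  (regularized least squares)
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ Y)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```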

III. Proposed Method

A. ELM With Subnetwork Nodes

A hidden node can be a subnetwork formed by several hidden nodes. Hence, a single mapping layer can contain multiple networks.

Comparison of the feature mapping layers:

flowchart LR subgraph A[basic ELM] direction BT x1["x₁"]--> h1(("𝛂₁,♭₁, β₁")) & he1((...)) & hL(("𝛂L,♭L, βL")) xe[x...]--> h1(("𝛂₁,♭₁, β₁")) & he1((...)) & hL(("𝛂L,♭L, βL")) xn["xₙ"]--> h1(("𝛂₁,♭₁, β₁")) & he1((...)) & hL(("𝛂L,♭L, βL")) h1 --> y1["y₁"] & ye[...] & ym["yₘ"] he1--> y1["y₁"] & ye[...] & ym["yₘ"] hL --> y1["y₁"] & ye[...] & ym["yₘ"] subgraph A1["ELM feature mapping layer"] h1 & he1 & hL end end subgraph A1["ELM feature mapping layer"] h1 & he1 & hL end subgraph B[ELM with subnetwork nodes] direction BT x1_["x₁"] --> ghn1 & ghne((...)) & ghnL xe_[x...] --> ghn1 & ghne((...)) & ghnL xn_["xₙ"] --> ghn1 & ghne((...)) & ghnL ghn1--> y1_["y₁"] & ye_[...] & ym_["yₘ"] ghne--> y1_["y₁"] & ye_[...] & ym_["yₘ"] ghnL--> y1_["y₁"] & ye_[...] & ym_["yₘ"] end subgraph ghn1["^𝛂¹_f, ^♭¹_f, with weight u₁⁻¹"] direction TB n11(("𝛂¹_f1,\n ♭¹_f1")) & n1e((...)) & n1m(("𝛂¹_fm,\n ♭¹_fm")) end subgraph ghne["general hidden nodes"] direction TB ne1((1)) & nee((...)) & nem((m)) end subgraph ghnL["^𝛂ᴸ_f, ^♭_f, with weight u\_L⁻¹"] direction TB nL1(("𝛂ᴸ_f1,\n ♭ᴸ_f1")) & nLe((...)) & nLm(("𝛂ᴸ_fm,\n ♭ᴸ_fm")) end

Three differences between the standard ELM feature mapping layer and this feature mapping layer (a small sketch follows the table):

| Difference | Standard ELM | ELM with subnetwork nodes |
| --- | --- | --- |
| hidden node | single hidden node, generated one by one | general hidden node containing a subnetwork |
| # hidden nodes | independent of the output dimension 𝑚 | within a subnetwork, equal to the output dimension 𝑚 |
| relation | a special case of the subnetwork case | |
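
To make the structural difference concrete, a minimal sketch (my own function names; the subnetwork output dimension is written as d to match the feature-data notation 𝐇_f, and could just as well be m as in the table):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def basic_hidden_node(x, alpha, b):
    """Standard ELM hidden node: one scalar activation per sample."""
    return sigmoid(alpha @ x + b)      # alpha: (n,), x: (n, M) -> output (M,)

def subnetwork_node(x, A_f, b_f):
    """General hidden node: a small subnetwork producing a whole vector per sample."""
    return sigmoid(A_f @ x + b_f)      # A_f: (d, n), x: (n, M) -> output (d, M)
```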

B. Proposed Method for Representation Learning

1) Optimal Projecting Parameters and Optimal Feature Data

flowchart LR x1["x₁"]--> h1(("^𝛂¹_f,^♭¹_f, β¹")) & h2(("^𝛂²_f,^♭²_f, β²")) & he1(("⋮")) & hL(("^𝛂ᴸ_f,^♭ᴸ_f, ^βᴸ")) xe[x...]--> h1(("^𝛂¹_f,^♭¹_f, β¹")) & h2(("^𝛂²_f,^♭²_f, β²")) & he1(("⋮")) & hL(("^𝛂ᴸ_f,^♭ᴸ_f, ^βᴸ")) xn["xₙ"]--> h1(("^𝛂¹_f,^♭¹_f, β¹")) & h2(("^𝛂²_f,^♭²_f, β²")) & he1(("⋮")) & hL(("^𝛂ᴸ_f,^♭ᴸ_f, ^βᴸ")) h1 & h2 & he1 & hL --> feat["d-dimension\n Feature data"] feat --> n1 & n2 & ne & nm --> y1_["y₁"] & ye_["⋮"] & ym_["yₘ"] subgraph A1["ELM feature mapping layer"] h1 & h2 & he1 & hL end subgraph elm["ELM-learning layer"] n1(("𝛂ₕ₁,^♭ₕ")) & n2(("𝛂ₕ₂,^♭ₕ")) & ne(("⋮")) & nm(("𝛂ₕₘ,^♭ₕ")) end

Objective of representation learning: Represent the input features meaningfully in several different representations as follows.

| Representation | feat dim (𝑑) vs in-dim (𝑛) | feature | target output |
| --- | --- | --- | --- |
| Dimension Reduction | 𝑑 < 𝑛 | 𝐇_f ∈ 𝐑ᵈᕁᴹ | 𝐲 = label (supervised) or 𝐲 = 𝐱 (unsupervised) |
| Expanded Dimension | 𝑑 > 𝑛 | 𝐇_f ∈ 𝐑ᵈᕁᴹ | 𝐲 = label (supervised) or 𝐲 = 𝐱 (unsupervised) |

The feature data is 𝐇_f(𝐱ₖ, ^𝐚_f, ^b_f), where the weights of the feature mapping layer ^𝐚ʲ_f, j=1,…,L, belong to 𝐑ᵈᕁⁿ.

Definition 1: Given a nonlinear piecewise continuous activation function g, we call
{(^𝐚ʲ_f, ^bʲ_f)ⱼ₌₁ᴸ} (^𝐚ʲ_f ∈ 𝐑ᵈᕁⁿ) the 𝐿 optimal general hidden nodes
and 𝐇⃰_f = ∑ⱼ₌₁ᴸ uⱼ⁻¹ g(𝐱, ^𝐚ʲ_f, ^bʲ_f) the optimal feature data if they satisfy:
‖𝐞_L‖ ≤ min_{𝐇ᴸ_f∈ 𝐑ᵈᕁᴹ} ( min_{𝐚ₕ∈ 𝐑ᵐᕁᵈ} ‖𝐲-uₕ⁻¹ g(𝐇ᴸ_f, 𝐚ₕ, bₕ)‖ )   (4)

where 𝐞_L = 𝐲-uₕ⁻¹ g(𝐇⃰_f, ^𝐚ₕ, ^bₕ) and the sequence ‖𝐞_L‖ is decreasing and bounded below by zero.

Remark 1: If the optimal projecting parameters {(^𝐚ʲ_f, ^bʲ_f)ⱼ₌₁ᴸ} (where ^𝐚ʲ_f ∈ 𝐑ᵈᕁⁿ) are obtained in the feature mapping layer,
the original n-dimensional data points 𝐱 will be converted to d-dimensional data points 𝐇⃰_f = ∑ⱼ₌₁ᴸ g(𝐱ₖ, ^𝐚ʲ_f, ^bʲ_f), which satisfy inequality (4).

Thus the purpose is to find optimal projecting parameters that make inequality (4) hold for all data points.

2) Learning Steps

The learning steps are based on the inverse of the activation function (a numpy sketch of the whole procedure follows the list).

Given M arbitrary distinct training samples {(𝐱ₖ,𝐲ₖ)ₖ₌₁ᴹ}, 𝐱ₖ∈ 𝐑ⁿ, 𝐲ₖ∈ 𝐑ᵐ, which are sampled from a continuous system.

  1. Set j=1 and initialize a general node of the feature mapping layer randomly as:
    𝐇⃰ʲ_f = g(^𝐚ʲ_f⋅𝐱 + ^bʲ_f), (^𝐚ʲ_f)ᵀ⋅^𝐚ʲ_f=𝐈, (^bʲ_f)ᵀ⋅^bʲ_f=1,

    where ^𝐚ʲ_f∈ 𝐑ᵈᕁⁿ and ^bʲ_f∈ 𝐑 are the orthogonal random weight and the bias of the feature mapping layer, and 𝐇⃰ʲ_f is the current feature data.

  2. Given a sigmoid or sine activation function g, for any continuous desired outputs 𝐲, the parameters in the (general) ELM learning layer are obtained as:

    • ^𝐚ₕ = g⁻¹(uₙ(𝐲)) ⋅ (𝐇⃰ʲ_f)⁻¹, ^𝐚ₕ∈ 𝐑ᵐᕁᵈ,
    • ^bₕ = √mse(^𝐚ₕ ⋅ 𝐇⃰ʲ_f - g⁻¹(uₙ(𝐲))), ^bₕ∈ 𝐑,
    • $g^{-1}(\cdot) = \begin{cases} \arcsin(\cdot) & \text{if } g(\cdot)=\sin(\cdot) \\ -\log\big(1/(\cdot)-1\big) & \text{if } g(\cdot)=1/(1+e^{-(\cdot)}) \end{cases}$

    where 𝐇⃰⁻¹ = 𝐇⃰ᵀ( (𝐈/C) + 𝐇⃰ 𝐇⃰ᵀ)⁻¹; C is a positive value; uₙ is a normalization function uₙ(𝐲): 𝐑➔(0,1]; g⁻¹ and uₙ⁻¹ represent their inverse functions.

  3. Update the output error 𝐞ⱼ as
    𝐞ⱼ = 𝐲 - uₙ⁻¹ g(𝐇⃰ʲ_f, ^𝐚ₕ, ^bₕ)
    So the error feedback data is 𝐏ⱼ = g⁻¹(uₙ(𝐞ⱼ))⋅(^𝐚ₕ)⁻¹

  4. Set j=j+1, add a new general node (^𝐚ʲ_f, ^bʲ_f) in the feature mapping layer by

    • ^𝐚ʲ_f = g⁻¹( uⱼ(𝐏ⱼ₋₁) ) ⋅ 𝐱⁻¹, ^𝐚ʲ_f∈ 𝐑ᵈᕁⁿ
    • ^bʲ_f = √mse(^𝐚ʲ_f ⋅ 𝐱 - 𝐏ⱼ₋₁), ^bʲ_f∈ 𝐑

    and update the feature data 𝐇⃰ʲ_f = ∑ᵢ₌₁ʲ uᵢ⁻¹ g(𝐱, ^𝐚ⁱ_f, ^bⁱ_f)

  5. Repeat steps 2–4 𝐿-1 times. (Finally, 𝐿 general nodes have been added to the feature mapping layer.) The set of parameters {^𝐚ʲ_f,^bʲ_f}ⱼ₌₁ᴸ is the set of optimal projecting parameters, and the feature data 𝐇⃰ᴸ_f = ∑ⱼ₌₁ᴸ uⱼ⁻¹ g(𝐱, ^𝐚ʲ_f, ^bʲ_f) = 𝐇⃰_f is the optimal feature data.
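
A hedged numpy sketch of the learning steps above (sigmoid case only), following the column-wise convention of these notes (𝐱 ∈ 𝐑ⁿᕁᴹ, 𝐲 ∈ 𝐑ᵐᕁᴹ). The helper names, the exact normalization u, and the final refit of the learning layer are my own assumptions; the multiplication order for 𝐏ⱼ is written the way the dimensions require in this convention:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_inv = lambda z: -np.log(1.0 / z - 1.0)           # g^{-1} for the sigmoid case

def u_fit(Z, eps=0.05):
    """Affine normalization u(.) into (0, 1); returns parameters needed to invert it (my choice)."""
    lo, hi = Z.min(), Z.max()
    scale = (1.0 - 2 * eps) / (hi - lo + 1e-12)
    return lo, scale, eps

def u_apply(Z, p):
    lo, scale, eps = p
    return (Z - lo) * scale + eps

def u_inv(Zn, p):
    lo, scale, eps = p
    return (Zn - eps) / scale + lo

def reg_pinv(H, C=1.0):
    """Regularized Moore-Penrose inverse H^T (I/C + H H^T)^{-1}."""
    return H.T @ np.linalg.inv(np.eye(H.shape[0]) / C + H @ H.T)

def ml_elm_subnet_features(x, y, d, L, C=1.0, rng=None):
    """Sketch of steps 1-5: x (n, M), y (m, M) stored column-wise.

    Returns the feature data H_f (d, M) and the ELM-learning-layer weights (A_h, b_h).
    """
    rng = np.random.default_rng(rng)
    n, M = x.shape
    # Step 1: random general node, with row-orthonormal weights when d <= n; bias set to 1.
    A_f = rng.standard_normal((d, n))
    if d <= n:
        A_f = np.linalg.qr(A_f.T)[0].T
    H_f = sigmoid(A_f @ x + 1.0)
    u_y = u_fit(y)
    for _ in range(L - 1):
        # Step 2: pull the target back through the ELM learning layer.
        A_h = sigmoid_inv(u_apply(y, u_y)) @ reg_pinv(H_f, C)                  # (m, d)
        b_h = np.sqrt(np.mean((A_h @ H_f - sigmoid_inv(u_apply(y, u_y))) ** 2))
        # Step 3: residual error and error-feedback data P (order adapted to column-wise shapes).
        e = y - u_inv(sigmoid(A_h @ H_f + b_h), u_y)
        u_e = u_fit(e)
        P = np.linalg.pinv(A_h) @ sigmoid_inv(u_apply(e, u_e))                 # (d, M)
        # Step 4: new general node obtained by pulling P back to the input.
        u_p = u_fit(P)
        A_j = sigmoid_inv(u_apply(P, u_p)) @ reg_pinv(x, C)                    # (d, n)
        b_j = np.sqrt(np.mean((A_j @ x - P) ** 2))
        H_f = H_f + u_inv(sigmoid(A_j @ x + b_j), u_p)                         # accumulate feature data
    # Refit the ELM learning layer on the final feature data (my addition for usability).
    A_h = sigmoid_inv(u_apply(y, u_y)) @ reg_pinv(H_f, C)
    b_h = np.sqrt(np.mean((A_h @ H_f - sigmoid_inv(u_apply(y, u_y))) ** 2))
    return H_f, A_h, b_h
```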

C. Proof of the Proposed Method

(Proof of Convergence)

Given M arbitrary distinct samples {(𝐱ₖ,𝐲ₖ)}ₖ₌₁ᴹ (𝐱ₖ∈ 𝐑ⁿ, 𝐲ₖ∈ 𝐑ᵐ)

Lemma 2: Given a bounded nonconstant piecewise continuous activation function g, we have
lim_{(𝛂,♭)→(^𝛂,^♭)} ‖g(𝐱,𝛂,♭) - g(𝐱,^𝛂,^♭)‖ = 0, where (^𝛂,^♭) is one of the least-squares solutions of a general linear system 𝛂⋅𝐱+♭.

Remark 2:

  1. Lemma 2 shows that the SLFN training problem can be considered as finding the optimal hidden parameters which satisfy: g(^𝛂₁,^♭₁) + … + g(^𝛂_L,^♭_L) → 𝐲.   (Here 𝛂 (alpha) stands for a basic ELM hidden node.)

  2. Thus training an SLFN is equivalent to finding a least-squares general input weight ^𝐚ₕ of the (linear + activation) system g(^𝐚ₕ⋅𝐱) = 𝐲.

  3. If activation function g is invertible, the input weights matrix can be obtained by pulling back the residual error to the hidden layer.

For example, if g is a sine function,

  • The output of the hidden layer matrix is 𝐲=sin(𝐚ₕ ⋅ 𝐱).
  • Thus, 𝐚ₕ⋅𝐱 = arcsin(𝐲), 𝐲∈ (0,1].
  • The smallest-norm least-squares solution of the linear system sin(𝐚ₕ⋅𝐱)=𝐲 is:
    ^𝐚ₕ = arcsin(𝐲)⋅𝐱⁻¹, where 𝐱⁻¹ is the Moore-Penrose generalized inverse of the matrix 𝐱, i.e., 𝐱⁻¹ = 𝐱ᵀ( (𝐈/C) + 𝐱𝐱ᵀ)⁻¹ (a quick numerical check follows).
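
A quick numerical check of this pullback (a toy example; the shapes and regularization constant are my own, and the normalization u is skipped by drawing targets directly inside (0, 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, M, C = 5, 3, 200, 1e3
x = rng.standard_normal((n, M))                          # inputs, column-wise as in the notes
y = rng.uniform(0.05, 0.95, size=(m, M))                 # targets already inside (0, 1)

# Pull the target back through the sine activation and solve the linear system a_h . x = arcsin(y).
x_pinv = x.T @ np.linalg.inv(np.eye(n) / C + x @ x.T)    # regularized Moore-Penrose inverse of x
A_hat = np.arcsin(y) @ x_pinv                            # a_h = arcsin(y) . x^{-1}

fit = np.sin(A_hat @ x)                                  # network output sin(a_h . x)
print("least-squares residual of the pullback:", np.linalg.norm(fit - y))
```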

Theorem 1: Given M arbitrary distinct samples {(𝐱ᵢ,𝐲ᵢ)ᵢ₌₁ᴹ}, (𝐱ᵢ∈ 𝐑ⁿ, 𝐲ᵢ∈ 𝐑ᵐ) and a sigmoid or sine activation function g, for any continuous desired outputs 𝐲, we have:
the optimal weights: ^𝐚ₕ = argmin_{𝐚ₕ∈ 𝐑ᵐᕁⁿ} ‖u⁻¹(g(𝐱,𝐚ₕ)) - 𝐲‖
the least-squares error: ‖g(𝐱,^𝐚ₕ,^bₕ) - 𝐲‖ ≤ min_{𝐚ₕ∈ 𝐑ᵐᕁⁿ} ‖u⁻¹(g(𝐚ₕ⋅𝐱)) - 𝐲‖
if the parameters are obtained by (similar to Algorithm step-2):

  • ^𝐚ₕ = g⁻¹( u(𝐲))⋅𝐱⁻¹, ^𝐚ₕ ∈ 𝐑ᵐᕁⁿ
  • ^bₕ = √mse(^𝐚ₕ⋅𝐱 - g⁻¹(u(𝐲))), ^bₕ∈ 𝐑
  • $g^{-1}(\cdot) = \begin{cases} \arcsin(\cdot) & \text{if } g(\cdot)=\sin(\cdot) \\ -\log\big(1/(\cdot)-1\big) & \text{if } g(\cdot)=1/(1+e^{-(\cdot)}) \end{cases}$

Proof:

  1. Let 𝛌=𝐚ₕ⋅𝐱, where 𝛌 satisfies g(𝛌) = 𝐲. Normalize 𝐲 to (0,1] by u(𝐲) so that 𝛌∈ 𝐑.
    Thus, for a sine hidden node, 𝛌 = g⁻¹(u(𝐲)) = arcsin(u(𝐲)), while for a sigmoid hidden node, 𝛌 = g⁻¹(u(𝐲)) = -log(1/u(𝐲) - 1).

  2. ^𝐚ₕ is the solution for the linear system (g(𝐚ₕ⋅𝐱)=𝐲).
    For sine activation: ^𝐚ₕ = g⁻¹( u(𝐲) )⋅𝐱⁻¹ = arcsin(u(𝐲))⋅𝐱⁻¹. For sigmoid activation: ^𝐚ₕ = g⁻¹( u(𝐲) )⋅𝐱⁻¹ = -log(1/u(𝐲) - 1)⋅𝐱⁻¹

  3. One of the least-squares solutions of the general linear system 𝐚ₕ⋅𝐱=𝛌 is ^𝐚ₕ = g⁻¹( u(𝐲) )⋅𝐱⁻¹, which means the smallest error can be reached by this solution: ‖^𝐚ₕ⋅𝐱 - 𝛌‖ = min_{𝐚ₕ∈ 𝐑ᵐᕁⁿ} ‖𝐚ₕ⋅𝐱 - g⁻¹( u(𝐲) )‖   (18)

  4. The special solution ^𝐚ₕ = g⁻¹( u(𝐲) )⋅𝐱⁻¹ has the smallest norm among all the least-squares solutions of 𝐚ₕ⋅𝐱 = 𝛌. The error can be further reduced by adding the bias ^bₕ: ^bₕ = √mse(^𝐚ₕ⋅𝐱 - g⁻¹( u(𝐲) ))

  5. Based on eq. (18) and Lemma2, optimization by minimizing the L2-loss can be reformulated as: min_{𝐚ₕ∈ 𝐑ᵐᕁⁿ} ‖u⁻¹( g(𝐚ₕ⋅𝐱) ) - u⁻¹( g(𝛌))‖ = ‖u⁻¹( g(^𝐚ₕ⋅𝐱) ) - u⁻¹( g(𝛌))‖ ≥ ‖u⁻¹( g(^𝐚ₕ⋅𝐱 + ^bₕ) ) - 𝐲‖   (20)

  6. Based on eq. (18) and eq. (20), the optimal weights are proved to be: ^𝐚ₕ = arg min_{𝐚ₕ∈ 𝐑ᵐᕁⁿ} ‖g(𝐱,𝐚ₕ) - 𝐲‖
    And they satisfy: ‖g(𝐱,^𝐚ₕ,^bₕ) - 𝐲‖ ≤ min_{𝐚ₕ∈ 𝐑ᵐᕁⁿ} ‖u⁻¹( g(𝐚ₕ⋅𝐱) ) - 𝐲‖

Based on Lemma 2 and Theorem 1, Theorem 2 is given:

Theorem 2: Given M arbitrary distinct samples (𝐱, 𝐲), 𝐱∈ 𝐑ⁿᕁᴹ, 𝐲∈ 𝐑ᵐᕁᴹ, a sigmoid or sine activation function g, and the initial orthogonal random weights ^𝐚¹_f and bias ^b¹_f. For any continuous desired output 𝐲, the optimal feature data is:
𝐇⃰ᴸ_f(𝐱, (^𝐚¹_f, …, ^𝐚ᴸ_f), (^b¹_f,…,^bᴸ_f)) = ∑ⱼ₌₁ᴸ uⱼ⁻¹ g(^𝐚ʲ_f ⋅ 𝐱 + ^bʲ_f), which satisfies:
‖𝐞_L‖ ≤ min_{^𝐚ʲ_f∈ 𝐑ᵈᕁⁿ} ( min_{𝐚ₕ∈ 𝐑ᵐᕁᵈ} ‖𝐲-uₙ⁻¹ g(𝐇⃰ᴸ_f, 𝐚ₕ, bₕ)‖)   (21)

and ‖𝐞_L‖ is decreasing and bounded below by zero if these parameters are obtained by:

  • 𝐇⃰ʲ_f = ∑ᵢ₌₁ʲ uᵢ⁻¹ g(𝐱, ^𝐚ⁱ_f, ^bⁱ_f),
  • ^𝐚ₕ = g⁻¹(uₙ(𝐲)) ⋅ (𝐇⃰ʲ_f)⁻¹, ^𝐚ₕ∈ 𝐑 ᵐᕁᵈ,
  • ^bₕ = √mse(^𝐚ₕ⋅𝐇⃰ʲ_f - g⁻¹( u(𝐲) )), ^bₕ∈ 𝐑
  • $g^{-1}(\cdot) = \begin{cases} \arcsin(\cdot) & \text{if } g(\cdot)=\sin(\cdot) \\ -\log\big(1/(\cdot)-1\big) & \text{if } g(\cdot)=1/(1+e^{-(\cdot)}) \end{cases}$
  • 𝐞ⱼ = 𝐲 - uₙ⁻¹ g(𝐇⃰ʲ_f, ^𝐚ₕ, ^bₕ),  𝐏ⱼ = g⁻¹(uₙ(𝐞ⱼ))⋅(^𝐚ₕ)⁻¹,
  • ^𝐚ʲ_f = g⁻¹(uⱼ(𝐏ⱼ₋₁)) ⋅ 𝐱⁻¹, ^𝐚ʲ_f∈ 𝐑ᵈᕁⁿ
  • ^bʲ_f = √mse(^𝐚ʲ_f ⋅ 𝐱 - 𝐏ⱼ₋₁), ^bʲ_f∈ 𝐑

Proof:

Based on Theorem 1, the validity of (21) is obvious. So here, we only prove that the error ‖𝐞_L‖ is decreasing and bounded below by zero.

  1. Let Δ = ‖𝐞ⱼ₋₁‖² - ‖𝐲 - uₙ⁻¹g(𝐇⃰ʲ_f, ^𝐚ₕ, ^bₕ)‖² (previous error minus current output error), and take the newest term apart:
    = ‖𝐞ⱼ₋₁‖² - ‖𝐲 - uₙ⁻¹g( (∑ᵢ₌₁ʲ⁻¹ uᵢ⁻¹ g(𝐱, ^𝐚ⁱ_f, ^bⁱ_f) + uⱼ⁻¹ g(𝐱, ^𝐚ʲ_f, ^bʲ_f) ), ^𝐚ₕ, ^bₕ) ‖²   (24)

  2. Let ^Tʲ = uₙ⁻¹g(uⱼ⁻¹g(𝐱, ^𝐚ʲ_f, ^bʲ_f), ^𝐚ₕ, ^bₕ). Because the activation function is a sigmoid or sine function, eq. (24) can be simplified as:
    Δ ≥ ‖𝐞ⱼ₋₁‖² - ‖𝐲 - uₙ⁻¹g( (∑ᵢ₌₁ʲ⁻¹ uᵢ⁻¹ g(𝐱, ^𝐚ⁱ_f, ^bⁱ_f) ), ^𝐚ₕ, ^bₕ) - ^Tʲ‖²
    = ‖𝐞ⱼ₋₁‖² - ‖𝐞ⱼ₋₁ - ^Tʲ‖²   (unfold)
    = ‖𝐞ⱼ₋₁‖² - (‖𝐞ⱼ₋₁‖² - 2⟨𝐞ⱼ₋₁, ^Tʲ⟩ + ‖^Tʲ‖²)   (⟨⋅,⋅⟩ is the Frobenius inner product of two matrices)
    = 2⟨𝐞ⱼ₋₁, ^Tʲ⟩ - ‖^Tʲ‖²
    = ‖^Tʲ‖² ( 2⟨𝐞ⱼ₋₁, ^Tʲ⟩ / ‖^Tʲ‖² - 1 )   (25)

  3. We set ^Tʲ = uₙ⁻¹g(uⱼ⁻¹g(𝐱, ^𝐚ʲ_f, ^bʲ_f), ^𝐚ₕ, ^bₕ) = 𝐞ⱼ₋₁ ± σ. (σ is the variance and 𝐞ⱼ₋₁ is the expectation.)
    So 𝐞ⱼ₋₁ = ^Tʲ ± σ.
    Then ⟨𝐞ⱼ₋₁, ^Tʲ⟩ = ⟨^Tʲ ± σ, ^Tʲ⟩ = ‖^Tʲ‖² ± ⟨σ, ^Tʲ⟩

    Hence, eq. (25) can be reformulated:
    Δ ≥ ‖^Tʲ‖² ( 2(‖^Tʲ‖² ± ⟨σ, ^Tʲ⟩) / ‖^Tʲ‖² - 1 )
    = ‖^Tʲ‖² ( 1 ± 2⟨σ, ^Tʲ⟩/‖^Tʲ‖² )   (Wandong thinks there should be a 2.)

  4. In addition, based on Theorem 1 and eq. (7), there will be:

    • ‖^Tʲ - 𝐞ⱼ₋₁‖ ≤ min_{^𝐚ʲ_f∈ 𝐑ᵈᕁⁿ} ‖uₙ⁻¹g(uⱼ⁻¹g( 𝐱, ^𝐚ʲ_f, ^bʲ_f), ^𝐚ₕ, ^bₕ) -𝐞ⱼ₋₁‖
    • ‖σ‖ ≤ ‖^Tʲ‖

    Thus Δ ≥ 0 can be proved as: Δ ≥ ‖^Tʲ‖² (1 ± ‖σ‖ / ‖^Tʲ‖) ≥ 0   (28)

    Eq. (28) means ‖𝐞ⱼ₋₁‖ ≥ ‖𝐞ⱼ‖ and ‖𝐞‖ is decreasing and bounded below by zero.

Based on Theorem 2, Theorem 3 is given:

Theorem 3: Given M arbitrary distinct samples (𝐱, 𝐲), 𝐱∈ 𝐑ⁿᕁᴹ, 𝐲∈ 𝐑ᵐᕁᴹ, a sigmoid or sine activation function g, and optimal feature data 𝐇⃰ᴸ_f obtained by Algorithm 1,
then lim_{j➝+∞} ‖𝐲 - β₁⋅u₁⁻¹g(𝐇⃰ᴸ_f, 𝐚₁, 𝑏₁) - … - βⱼ⋅uⱼ⁻¹g(𝐇⃰ᴸ_f, 𝐚ⱼ, 𝑏ⱼ)‖ = 0 holds with probability one if :

  • 𝐚ⱼ = g⁻¹( u(𝐲) ) ⋅ (𝐇⃰ᴸ_f)⁻¹, 𝐚ⱼ∈ 𝐑ᵐᕁᵈ
  • bⱼ = √mse(𝐚ⱼ⋅𝐇⃰ᴸ_f - g⁻¹(u(𝐲))), bⱼ∈ 𝐑
  • βⱼ = ⟨𝐞ⱼ₋₁, g(𝐇⃰ᴸ_f, 𝐚ⱼ, bⱼ)⟩ / ‖g(𝐇⃰ᴸ_f, 𝐚ⱼ, bⱼ)‖², βⱼ∈ 𝐑
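
A hedged sketch of growing the ELM learning layer on top of the fixed feature data 𝐇⃰ᴸ_f. The βⱼ projection formula is the one above (from Lemma 1); for simplicity the node parameters (𝐚ⱼ, bⱼ) are drawn randomly here, as in step 4 of the proof below, rather than by the pullback formulas, and β is computed per output dimension:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def grow_learning_layer(H_f, y, n_nodes=50, rng=None):
    """Incrementally add nodes on top of fixed feature data H_f (d, M), targets y (m, M).

    Each node's output weight is the projection coefficient <e_{j-1}, g_j> / ||g_j||^2.
    """
    rng = np.random.default_rng(rng)
    d, M = H_f.shape
    e = y.copy()                                   # residual e_0 = y
    nodes = []
    for _ in range(n_nodes):
        a = rng.uniform(-1.0, 1.0, size=(1, d))    # random node parameters (a_j, b_j)
        b = rng.uniform(-1.0, 1.0)
        g = sigmoid(a @ H_f + b)                   # (1, M) node output on the feature data
        beta = (e * g).sum(axis=1, keepdims=True) / (g * g).sum()   # (m, 1), per output dimension
        e = e - beta * g                           # e_j = e_{j-1} - beta_j * g_j
        nodes.append((a, b, beta))
    return nodes, e                                # the residual norm is non-increasing
```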

Proof:

First prove that the sequence ‖𝐞ⱼᴸ‖ is decreasing and bounded below by zero; then prove that lim_{j➝+∞} ‖𝐞ⱼᴸ‖ = 0.

  1. Based on Theorem 1 and Lemma 1, the network output error satisfies: ‖𝐞ⱼᴸ‖ = ‖𝐲 - β₁⋅u₁⁻¹g(𝐇⃰ᴸ_f, 𝐚₁, 𝑏₁) - … - βⱼ⋅uⱼ⁻¹g(𝐇⃰ᴸ_f, 𝐚ⱼ, 𝑏ⱼ)‖
    ≤ ‖𝐲 -u₁⁻¹g(𝐇⃰ᴸ_f, 𝐚₁, 𝑏₁)‖
    = ‖𝐞₁ᴸ‖

  2. Based on Theorem 2, there will be:
    ‖𝐞ⱼᴸ‖ ≤ ‖𝐞ⱼᴸ⁻¹‖ ≤ … ≤ ‖𝐞ⱼ¹‖

  3. Thus, ‖𝐞ⱼᴸ‖ ≤ ‖𝐞ⱼᴸ⁻¹‖ ≤ … ≤ ‖𝐞ⱼ¹‖ ≤ … ≤ ‖𝐞₁¹‖ and ‖𝐞ⱼᴸ‖ is decreasing and bounded below by 0.

  4. Based on Lemma 1, when all hidden nodes are randomly generated based on any continuous sampling distribution, lim_{n➝∞} ‖f - (fₙ₋₁ + βₙ⋅g(𝐱, 𝛂ₙʳ, ♭ₙʳ) )‖ = 0 holds with probability one if βₙ = ⟨𝐞ₙ₋₁, g(𝐇⃰ᴸ_f, 𝛂ₙʳ, ♭ₙʳ)⟩ / ‖g(𝐇⃰ᴸ_f, 𝛂ₙʳ, ♭ₙʳ)‖².

  5. In addition, ELM theories have shown that almost any nonlinear piecewise continuous random hidden node can be used in ELM, and the resultant networks have universal approximation capabilities.
    According to the definition of general hidden neurons (a general hidden node contains m (basic) hidden nodes), a general hidden node is (𝐚,b) = (𝛂ʳ₁, …, 𝛂ʳₘ, bʳ₁, …, bʳₘ).
    Thus its output is g(𝐇⃰ᴸ_f, 𝐚ⱼʳ, ♭ⱼʳ) ≡ ∑ᵢ₌₁ᵐ g(𝐇⃰ᴸ_f, 𝛂ʳᵢ, bʳᵢ).

    Therefore, lim_{j➝∞} ‖ 𝐲 - β₁⋅u₁⁻¹g(𝐇⃰ᴸ_f, 𝐚₁, 𝑏₁) - … - βⱼ⋅uⱼ⁻¹g(𝐇⃰ᴸ_f, 𝐚ⱼ, 𝑏ⱼ)‖
    = lim_{n➝∞} ‖f- (fₙ₋₁ + βⱼ⋅uⱼ⁻¹g(𝐇⃰ᴸ_f, 𝐚ⱼʳ, 𝑏ⱼʳ)) ‖ = 0

D. Proposed Method With Multinetwork Structures

%%{ init: { 'flowchart': { 'curve': 'basis' } } }%% flowchart LR subgraph in["input feature"] x1((x1)) & xe(("⋮")) & xn((xn)) end subgraph net1["Layer 1"] l11((a,b)) & l1e(("⋮")) & l1L((a,b)) end subgraph net2["Layer 2"] l21((a,b)) & l2e(("⋮")) & l2L((a,b)) end subgraph netC["Layer C"] lC1((a,b)) & lCe(("⋮")) & lCL((a,b)) end x1 & xn --> l11 & l1L l11 & l1L --> l21 & l2L l21 & l2L -.-> lC1 & lCL subgraph out["output y"] direction LR y1((1)) & ye(("⋮")) & ym((m)) end subgraph belm1["Basic ELM 1"] direction LR b11(("aₕ,bₕ")) & b1e(("⋮")) & b1m((aₕ,bₕ)) end subgraph belm2["Basic ELM 2"] direction LR b21((aₕ,bₕ)) & b2e(("⋮")) & b2m((aₕ,bₕ)) end net1 --> feat1["Feature\n data\n 𝐇¹f"] --> belm1 --> out feat1 --> net2 --> feat2["Feature\n data\n 𝐇²f"] --> belm2 --> out netC --> featC["Feature\n data\n 𝐇ᶜf"] subgraph MultiLayer ELM in & net1 & net2 & netC end %% inverse for initialization weights linkStyle 12,13,14,16,17,18 stroke:#f0f

The pink links indicate where the inverse is used to calculate the weights of the corresponding layers (a stacking sketch follows).
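
A compact sketch of the multinetwork stacking in the diagram above, reusing the hypothetical ml_elm_subnet_features helper from the sketch after the learning steps: each layer consumes the previous layer's feature data, and a basic ELM can then be trained on any intermediate 𝐇ᶜ_f.

```python
def ml_elm_stack(x, y, dims, L=10, C=1.0, rng=0):
    """Stacked subnetwork-node layers producing H^1_f, H^2_f, ... as in Sec. III-D.

    dims: feature dimension of each layer. Returns the list of feature matrices,
    each of which can feed a basic ELM (or the next layer).
    """
    feats, current = [], x
    for d in dims:
        H_f, _, _ = ml_elm_subnet_features(current, y, d=d, L=L, C=C, rng=rng)
        feats.append(H_f)
        current = H_f          # the next layer consumes the previous layer's feature data
    return feats
```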