read: Width-Growth Model with Subnetwork Nodes

A Width-Growth Model With Subnetwork Nodes and Refinement Structure for Representation Learning and Image Classification (TII 2020)

Authors: Wandong Zhang et al. Publish date: 2020-03-30 (finished in 2019)

IEEE Trans. Industrial Informatics | G.Drive | G.Scholar

Summary attempt (2023-02-26):

  • Different features are concatenated and then fed into an ā€œI-ELM with subnetwork nodesā€.
  • What is optimized is the combination weights; the feature vectors themselves are not changed.
  • It is the weights (IW, š›ƒ) that are refined.
  • Specifically, each new R-SNN node is improved by adding a portion of unlearned weights acquired from the residual error of the previous node. That is, the weights accumulate on the newest node, so the final R-SNN node contains all the previous training outcomes. What we keep is only the last R-SNN node, i.e., an SLFN.
  • Does that imply the final R-SNN performs best among all the preceding nodes?

(In code) The update process of an SLFN is as follows:

```mermaid
flowchart LR
    subgraph node1[SLFN1]
        h1((h1)) & h2((h2)) & he((he)) & hd((hd))
    end
    In["X\n data\n matrix"] --> IW1 --> h1 & h2 & he & hd
    node1 --- beta1("š›ƒ1") --> Yout1 --> Error1
    Target -->|"pinv"| beta1
    beta1 & Error1 -->|"inv: š›ƒā‹…P=šžā‚"| P["P\n (the H\n yielding\n Error1)"]
    P & In --> IWres["residual\n IW"]
    IWres & IW1 --> sum(("+")) --> IW2
    subgraph node2[SLFN2]
        h21((h1)) & h22((h2)) & h2e((he)) & h2d((hd))
    end
    IW2 --> h21 & h22 & h2e & h2d
    node2 --- beta2("š›ƒ2")
    Target -->|"pinv"| beta2
    beta2 --> Yout2
```
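
A minimal NumPy sketch of this update as I read the flowchart. Shapes, the activation, and the variable names are my own; the paper additionally uses ridge regularization and the inverse activation g⁻¹ (see Sec. III-C below), which I omit here for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy shapes (all made up): N samples, n input dims, D hidden nodes, m classes.
N, n, D, m = 100, 32, 64, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((N, n))        # "X data matrix"
Y = np.eye(m)[rng.integers(0, m, N)]   # one-hot targets

# SLFN1: random input weights IW1, output weights beta1 by pseudoinverse.
IW1 = rng.standard_normal((n, D))
H1 = sigmoid(X @ IW1)
beta1 = np.linalg.pinv(H1) @ Y         # "Target --pinv--> beta1"
E1 = Y - H1 @ beta1                    # residual error of SLFN1

# P: the hidden activation that would have produced E1 (solve P Ā· beta1 = E1),
# then pull residual input weights back through the data matrix.
P = E1 @ np.linalg.pinv(beta1)
IW_res = np.linalg.pinv(X) @ P         # "residual IW"
IW2 = IW1 + IW_res                     # weights accumulate on the newest node

# SLFN2: refined input weights, beta2 again by pseudoinverse of its hidden layer.
H2 = sigmoid(X @ IW2)
beta2 = np.linalg.pinv(H2) @ Y
Yout2 = H2 @ beta2
```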

Abstract

  • A supervised multi-layer subnetwork-based feature refinement and classification model for representation learning.
  • Expand the width of a generalized hidden layer rather than stacking more layers to go deeper.
  • One-shot solution for finding the meaningful latent space to recognize the objects, rather than searching separate spaces to find a generalized feature space.
  • Multimodal fusion: various feature sources are fused into a superstate encoding, instead of the unimodal feature coding of traditional feature representation methods.

ā… . Introduction

(Task & application & list of related research fields & problem & brief overview of existing solutions)

  • Task: high-dimensional data processing and learning
  • Problem definition: selecting the optimal feature descriptors
  • 2 branches of solutions: hand-crafted descriptors and deep-learning-based features.

(Criticize the former feature extraction solutions and introduce proposed method:)

  • Features derived from approaches of those 2 categories are too inflexible to yield a robust model.
  • This method ā€œencodes and refines these? raw features from multiple sources to improve the classification performanceā€.
    For example: 4 extracted features (from AlexNet, ResNet, HMP, and SPF) are concatenated into 1 vector taken as the input to a ā€œ3-layerā€ model, where only a single ā€œgeneralizedā€ hidden layer (latent space) bridges the raw feature space (transformation ax+b) and the final target space (residual error). A sketch of this concatenation follows below.
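
A trivial sketch of that preprocessing step, assuming per-image feature vectors from the four extractors named above (the dimensionalities are placeholders, not the paper's):

```python
import numpy as np

# Hypothetical per-image features from the four extractors (dims made up).
alexnet_feat = np.random.rand(4096)
resnet_feat  = np.random.rand(2048)
hmp_feat     = np.random.rand(1000)
spf_feat     = np.random.rand(500)

# One "supervector" per image, used as one row of the raw input matrix X.
supervector = np.concatenate([alexnet_feat, resnet_feat, hmp_feat, spf_feat])
```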

(Recap deep learning models and mention the theory base of this work)

  • Deep networks often get ā€œtrapped in local minimum and are sensitive to the learning rateā€ because their training foundation is BP.
  • Regression-based feature learning. Least-squares representation learning methods.

(Problems to be solved)
Drawbacks of regression-based approaches:

  1. ā€œblockā€ models? don’t follow a one-shot training philosophy based on the relation between raw data and the target.
  2. A model trained by some ā€œdesignedā€ process has inferior generalization capacity compared with a model derived from a one-shot training strategy (least squares).

Drawbacks of multilayer neural networks & solution

  1. Deeper layer-stacked models suffer from overfitting with limited training samples.
  2. Network-in-network structure enhances the network’s generalization capacity for feature learning. ELM with subnetwork nodes.
  3. Contributions:
    • Subnetwork neural nodes (SNN) realize multilayer representation learning. Unlike an ensembled network, the SNN is trained based on the error term.
    • Feature-space transformation and classification are solved together by iteratively searching for the optimal encoding space (hidden layer).
    • Concatenation of multiple features results in more discriminative representations of the samples.

ā…”. Literature review

A. Conventional Feature Coding

" Supervised method of learning representaiton evaluates the importance of a specific feature through the correlation between features and categories."

Conventional feature coding of images depends on prior knowledge of the problem. Thus, the features are not complete representations.

This paper enhances the feature by fusing (discriminative) hand-crafted features and (class-specific) CNN-based features.

B. Least-Squares Encoding Methods

The least-squares approximation methods, such as random forest and alternating minimization, have been exhaustively investigated in single-layer neural networks.

Related works: Moore-Penrose inverse; universal approximation capacity of I-ELM; ELM autoencoder [14]; features combined with subnetwork nodes [18].

Each SNN is applied as a local feature descriptor. Hence, the subspace features can be extracted? from the original data independently, and the useful features are generated via the combination of these features.

ā…¢. Proposed Method

A. Algorithmic Summary

Two steps:

  1. Preprocessing: concatenate various feature vectors into a single “supervector”.
  2. Train the width-growth model:
    Terminology:
    | layer  | name                        | marker | params                | in               | out                             |
    |--------|-----------------------------|--------|-----------------------|------------------|---------------------------------|
    | input  | Entrance (feature) layer    | š‘“      | š–įµ¢į¶ , š›įµ¢į¶  (random vct)  | š— (supervector)  | linear combination š‡           |
    | hidden | Refinement layer/subspace   | š‘Ÿ      | š–įµ¢Ź³, š›įµ¢Ź³ (šš, b)        | š‡                | partial feature ĪØ               |
    | output | Least-square learning layer | š‘£      | š–įµ¢įµ› (š›ƒ)               | ĪØ                | sum of all partial features: ššŖ |
    |        | residual error              | šž      |                       |                  |                                 |

(An entrance layer and a refinement layer are each an ā€œSNNā€, and their combination is an ā€œR-SNNā€.)

  • Initialization: For the 1st R-SNN, š–ā‚į¶  and š–ā‚Ź³ are randomly generated, yielding a coarse feature ĪØ.
    Then the first least-squares step (pseudoinverse) is performed to calculate š–ā‚įµ› from the target š˜ and ĪØ (see the sketch after the flowchart below).

  • Iteratively add R-SNN nodes (2 ≤ i ≤ L) (refinement subspaces) into the hidden layer (optimal feature space):

    ```mermaid
    flowchart TB
        subgraph In[input feature]
            x1((1)) & x2((2)) & xe(("ā‹®")) & xn((n))
        end
        EnW("Entrance layer\n š–įµ¢į¶ , š›įµ¢į¶ \n random")
        subgraph H["entrance feature š‡"]
            h1((1)) & h2((2)) & he(("ā‹®")) & hD((D))
        end
        RefineW("Refinement layer\n š–įµ¢Ź³, š›įµ¢Ź³")
        subgraph Psi[partial feature ĪØ]
            ĪØ1((1)) & ĪØ2((2)) & ĪØe(("ā‹®")) & ĪØd((d))
        end
        OW("Output layer\n š–įµ¢įµ›")
        subgraph Out["Output vector"]
            o1((1)) & o2((2)) & oe(("ā‹®")) & om((m))
        end
        x1 & x2 & xe & xn --> EnW --> h1 & h2 & he & hD --> RefineW --> ĪØ1 & ĪØ2 & ĪØe & ĪØd --> OW --> o1 & o2 & oe & om
        Out -->|"- šžįµ¢ā‚‹ā‚"| erri["šžįµ¢"]
        erri & OW -.->|pinv| newĪØ("š \n yielding\n šžįµ¢")
        subgraph H1["entrance feature š‡įµ¢ā‚Šā‚"]
            h11((1)) & h12((2)) & h1e(("ā‹®")) & h1D((D))
        end
        In --> EnW1("Entrance layer\n š–įµ¢ā‚Šā‚į¶ , š›įµ¢ā‚Šā‚į¶ \n random") --> h11 & h12 & h1e & h1D
        H1 --> RefineW1("Refinement layer\n š–įµ¢ā‚Šā‚Ź³, š›įµ¢ā‚Šā‚Ź³")
        %% -.-|solved by P| newĪØ
        newĪØ -.-> RefineW1
        subgraph Psi1[partial feature ĪØ]
            ĪØ11((1)) & ĪØ12((2)) & ĪØ1e(("ā‹®")) & ĪØ1d((d))
        end
        RefineW1 --> ĪØ11 & ĪØ12 & ĪØ1e & ĪØ1d --> OW1("Output layer\n š–įµ¢ā‚Šā‚įµ›")
        %% OW1 -.-|solved by| erri
        erri -.-> OW1
        subgraph Out1["Output vector"]
            o11((1)) & o12((2)) & o1e(("ā‹®")) & o1m((m))
        end
        OW1 --> o11 & o12 & o1e & o1m
        Out1 -->|"- šžįµ¢"| erri1["šžįµ¢ā‚Šā‚"] --> newP
    ```
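
My reconstruction of the initialization step in the notation of Sec. III (not copied from the paper; the regularized pseudoinverse form is taken from III-C below):

```latex
% First R-SNN (i = 1): both weight pairs are random; the output weights
% come from a regularized least-squares fit against the target Y.
\mathbf{H}_1^{f} = g\!\left(\mathbf{X}\mathbf{W}_1^{f} + \mathbf{b}_1^{f}\right), \qquad
\Psi_1 = g\!\left(\mathbf{H}_1^{f}\mathbf{W}_1^{r} + \mathbf{b}_1^{r}\right), \qquad
\mathbf{Q}_1 = \Psi_1

\mathbf{W}_1^{v} = \left(\tfrac{\mathbf{I}}{C} + \mathbf{Q}_1^{\top}\mathbf{Q}_1\right)^{-1}\mathbf{Q}_1^{\top}\mathbf{Y}, \qquad
\mathbf{e}_1 = \mathbf{Y} - \mathbf{Q}_1\mathbf{W}_1^{v}
```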

B. Model Definition

  • An SLFN solving the regression problem can be expressed as a single least-squares fit (see the LaTeX sketch after this list).

  • An MLNN applies nested transformations, stacked layer by layer.

  • The proposed method is a generalized SLFN:

    minimize J = ½ ā€–š˜-f(š‡įµ¢į¶ , š–įµ¢Ź³, š›įµ¢Ź³)ā‹…š–_Lᵛ‖²,

    • f(š‡įµ¢į¶ , š–įµ¢Ź³, š›įµ¢Ź³) = āˆ‘įµ¢ā‚Œā‚į“ø g(š‡įµ¢į¶  ā‹… š–įµ¢Ź³ + š›įµ¢Ź³): sum all R-SNN
    • š‡įµ¢į¶  = g(š–įµ¢į¶ , š›įµ¢į¶ , š—)
    • š˜ ∈ ā„į“ŗį•½įµ: expected output, target feature
    • š— ∈ ā„į“ŗį•½āæ: input matrix
    • L : number of R-SNN node
    • g : activateion function
  • 3 differences from other least-squares-based MLNNs

    1. An SNN combines the dimensions of the feature vector and serves as a local feature descriptor, while the R-SNN is the basic unit for refining feature vectors.

    2. The optimal feature is the aggregation of R-SNN nodes added one by one. Each R-SNN is densely connected to the input vector and the output layer, containing two linear projections. Different R-SNNs are independent because they learn from different errors.

    3. The latent space is the aggregation of all R-SNN node subspaces, so parameter training needs no block-wise communication between different spaces. That means feature refinement and classification are done together.
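
For reference, a LaTeX sketch of the three formulations above (the standard SLFN and MLNN forms as I understand them, plus the proposed objective restated); this is my notation, not the paper's:

```latex
% W, b are random; the solved parameters are beta / W^v.

% (1) SLFN regression (single least-squares fit):
\min_{\boldsymbol{\beta}} \; \tfrac{1}{2}\bigl\|\mathbf{Y} - g(\mathbf{X}\mathbf{W} + \mathbf{b})\,\boldsymbol{\beta}\bigr\|^{2}

% (2) MLNN: nested (stacked) transformations:
\mathbf{H}^{(k)} = g\!\left(\mathbf{H}^{(k-1)}\mathbf{W}^{(k)} + \mathbf{b}^{(k)}\right), \qquad \mathbf{H}^{(0)} = \mathbf{X}

% (3) Proposed width-growth objective (restating the text above):
\min \; J = \tfrac{1}{2}\Bigl\|\mathbf{Y} - \textstyle\sum_{i=1}^{L} g\!\left(\mathbf{H}_i^{f}\mathbf{W}_i^{r} + \mathbf{b}_i^{r}\right)\mathbf{W}_L^{v}\Bigr\|^{2}, \qquad
\mathbf{H}_i^{f} = g\!\left(\mathbf{X}\mathbf{W}_i^{f} + \mathbf{b}_i^{f}\right)
```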

C. Proposed Width-Growth Model

  1. Input weights and bias š–įµ¢į¶ , š›įµ¢į¶ : randomly initialized;
    Entrance feature: š‡įµ¢į¶  = g(š—š–įµ¢į¶ + š›įµ¢į¶ );
    Refined partial feature: ĪØįµ¢=g(š‡įµ¢į¶ š–įµ¢Ź³+ š›įµ¢Ź³), where š›įµ¢Ź³ is random;

  2. Output weights: š–įµ¢įµ›=(šˆ/C + ššŖįµ¢įµ€ššŖįµ¢)ā»Ā¹ššŖįµ¢įµ€ā‹…š˜,
    where C is a regularization hyperparameter, and (šˆ/C + ššŖįµ¢įµ€ššŖįµ¢)ā»Ā¹ššŖįµ¢įµ€ is the regularized pseudoinverse of the accumulated feature matrix ššŖįµ¢ (for i = 1, ššŖā‚ = ĪØā‚)
    Error: šžįµ¢ = š˜ - ššŖįµ¢š–įµ¢įµ›

  3. š is the desired matrix generating šžįµ¢ by: šįµ¢ā‹…š–įµ¢įµ›=šžįµ¢, so
    šįµ¢ = šžįµ¢ā‹…(I/C + (š–įµ¢įµ›)įµ€š–įµ¢įµ›)⁻¹(š–įµ¢įµ›)įµ€

  4. Refinement-layer weights of the next R-SNN:
    š–įµ¢ā‚Šā‚Ź³ = (šˆ/C + š‡įµ¢ā‚Šā‚įµ€š‡įµ¢ā‚Šā‚)ā»Ā¹š‡įµ¢ā‚Šā‚įµ€ ā‹… g⁻¹(šįµ¢),
    because g(š‡įµ¢ā‚Šā‚ā‹…š–įµ¢ā‚Šā‚Ź³+ š›įµ¢ā‚Šā‚Ź³) should equal šįµ¢ (with š‡įµ¢ā‚Šā‚ a fresh random entrance feature).
    Next partial feature: ĪØįµ¢ā‚Šā‚ = g(š‡įµ¢ā‚Šā‚ā‹…š–įµ¢ā‚Šā‚Ź³+ š›įµ¢ā‚Šā‚Ź³)

  5. Accumulate the partial feature to the optimal feature: ššŖįµ¢ā‚Šā‚ = ššŖįµ¢ + ĪØįµ¢ā‚Šā‚

  6. Update error: šžįµ¢ā‚Šā‚ = š˜ - ššŖįµ¢ā‚Šā‚š–įµ¢ā‚Šā‚įµ›, where š–įµ¢ā‚Šā‚įµ› is recomputed from ššŖįµ¢ā‚Šā‚ as in step 2

Repeat steps 4-6 L-2 times; the final feature ššŖ_L is the generalized feature corresponding to the best output parameters š–_Lᵛ for classification.
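
A self-contained NumPy sketch of my reading of steps 1-6. The activation (sigmoid), the shapes D/d, the value of C, the loop bookkeeping, and the direct recomputation of the error from š˜ are my assumptions, not the paper's exact procedure:

```python
import numpy as np

def g(z):
    """Activation (sigmoid assumed; the paper's choice may differ)."""
    return 1.0 / (1.0 + np.exp(-z))

def g_inv(p, eps=1e-6):
    """Inverse activation (logit), clipped so out-of-range values stay finite."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def reg_pinv(A, C):
    """Regularized pseudoinverse (I/C + AᵀA)ā»Ā¹Aįµ€, as in steps 2-4."""
    return np.linalg.solve(np.eye(A.shape[1]) / C + A.T @ A, A.T)

def train_width_growth(X, Y, L=5, D=128, d=64, C=1e3, seed=0):
    """My reading of Sec. III-C: grow L R-SNN nodes, accumulating features in Q."""
    rng = np.random.default_rng(seed)
    N, n = X.shape

    # Step 1: first R-SNN -- entrance and refinement weights are random.
    Wf, bf = rng.standard_normal((n, D)), rng.standard_normal(D)
    Wr, br = rng.standard_normal((D, d)), rng.standard_normal(d)
    H = g(X @ Wf + bf)                 # entrance feature H_1
    Q = g(H @ Wr + br)                 # Q_1 = Psi_1

    # Step 2: output weights by regularized least squares, then the error.
    Wv = reg_pinv(Q, C) @ Y
    e = Y - Q @ Wv

    for _ in range(L - 1):
        # Step 3: P is the feature that would have generated the current error
        # (P Ā· Wv ā‰ˆ e), via the regularized right-inverse of Wv.
        P = e @ reg_pinv(Wv, C)

        # Fresh random entrance layer for the next R-SNN.
        Wf, bf = rng.standard_normal((n, D)), rng.standard_normal(D)
        H = g(X @ Wf + bf)

        # Step 4: refinement weights of the next R-SNN, solved so that
        # g(HĀ·Wr + br) approximates P; the bias stays random as in step 1.
        br = rng.standard_normal(d)
        Wr = reg_pinv(H, C) @ g_inv(P)
        Psi = g(H @ Wr + br)           # next partial feature

        # Step 5: accumulate the partial feature into the optimal feature.
        Q = Q + Psi

        # Step 6: refresh output weights and error from the accumulated feature
        # (recomputed directly from Y here; the note's incremental form differs).
        Wv = reg_pinv(Q, C) @ Y
        e = Y - Q @ Wv

    return Q, Wv                       # generalized feature Q_L and classifier W_Lᵛ

# Toy usage (shapes arbitrary):
# Q, Wv = train_width_growth(np.random.rand(200, 50),
#                            np.eye(4)[np.random.randint(0, 4, 200)])
```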

Ref