read: Width-Growth Model with Subnetwork Nodes

A Width-Growth Model With Subnetwork Nodes and Refinement Structure for Representation Learning and Image Classification (TII 2020)

Authors: Wandong Zhang et al. Publish date: 2020-03-30 (finished in 2019)

IEEE Trans. Industrial Informatics | G.Drive | G.Scholar

Summary attempt (2023-02-26):

  • Different features are concatenated and then fed into an ā€œI-ELM with subnetwork nodesā€.
  • What is optimized is the combination weights; the feature vectors themselves are not changed.
  • It is the weights (IW, š›ƒ) that are refined.
  • Specifically, each new R-SNN node is improved by adding a portion of unlearned weights acquired from the residual error of the previous node. That is, the weights accumulate on the newest node, so the final R-SNN node contains all the previous training outcomes. What we keep is only the last R-SNN node, i.e., an SLFN.
  • Does that imply the final R-SNN performs best among all the preceding nodes?

(In code) The update process of an SLFN is as follows:

```mermaid
flowchart LR
    subgraph node1[SLFN1]
        h1((h1)) & h2((h2)) & he((he)) & hd((hd))
    end
    In["X\n data\n matrix"] --> IW1 --> h1 & h2 & he & hd
    node1 --- beta1("š›ƒ1") --> Yout1 --> Error1
    Target -->|"pinv"| beta1
    beta1 & Error1 -->|"inv: š›ƒā‹…P=šžā‚"| P["P\n (the H\n yielding\n Error1)"]
    P & In --> IWres["residual\n IW"]
    IWres & IW1 --> sum(("+")) --> IW2
    subgraph node2[SLFN2]
        h21((h1)) & h22((h2)) & h2e((he)) & h2d((hd))
    end
    IW2 --> h21 & h22 & h2e & h2d
    node2 --- beta2("š›ƒ2")
    Target -->|"pinv"| beta2
    beta2 --> Yout2
```
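
A minimal NumPy sketch of this update as I read the flowchart. Shapes, the activation, and the variable names are my own; the paper additionally uses ridge regularization and the inverse activation g⁻¹ (see Sec. III-C below), which I omit here for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy shapes (all made up): N samples, n input dims, D hidden nodes, m classes.
N, n, D, m = 100, 32, 64, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((N, n))        # "X data matrix"
Y = np.eye(m)[rng.integers(0, m, N)]   # one-hot targets

# SLFN1: random input weights IW1, output weights beta1 by pseudoinverse.
IW1 = rng.standard_normal((n, D))
H1 = sigmoid(X @ IW1)
beta1 = np.linalg.pinv(H1) @ Y         # "Target --pinv--> beta1"
E1 = Y - H1 @ beta1                    # residual error of SLFN1

# P: the hidden activation that would have produced E1 (solve P Ā· beta1 = E1),
# then pull residual input weights back through the data matrix.
P = E1 @ np.linalg.pinv(beta1)
IW_res = np.linalg.pinv(X) @ P         # "residual IW"
IW2 = IW1 + IW_res                     # weights accumulate on the newest node

# SLFN2: refined input weights, beta2 again by pseudoinverse of its hidden layer.
H2 = sigmoid(X @ IW2)
beta2 = np.linalg.pinv(H2) @ Y
Yout2 = H2 @ beta2
```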

Abstract

  • A supervised multi-layer subnetwork-based feature refinement and classification model for representation learning.
  • Expand the width of a generalized hidden layer rather than stacking more layers to go deeper.
  • One-shot solution for finding the meaningful latent space to recognize the objects, rather than searching separate spaces to find a generalized feature space.
  • Multimodal fusion: various feature sources are fused into a superstate encoding, instead of the unimodal feature coding of traditional feature representation methods.

ā… . Introduction

(Task & application & list of related research fields & problem & brief overview of existing solutions)

  • Task: high-dimensional data processing and learning
  • Problem definition: selecting the optimal feature descriptors
  • 2 branches of solutions: hand-crafted descriptors and deep-learning-based features.

(Criticize the former feature extraction solutions and introduce proposed method:)

  • Features derived from approaches of those 2 categories are too inflexible to yield a robust model.
  • This method ā€œencodes and refines these? raw features from multiple sources to improve the classification performanceā€.
    For example: 4 extracted features (from AlexNet, ResNet, HMP, and SPF) are concatenated into 1 vector taken as the input to a ā€œ3-layerā€ model, where only a single ā€œgeneralizedā€ hidden layer (latent space) bridges the raw feature space (transformation ax+b) and the final target space (residual error). A sketch of this concatenation follows below.
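
A trivial sketch of that preprocessing step, assuming per-image feature vectors from the four extractors named above (the dimensionalities are placeholders, not the paper's):

```python
import numpy as np

# Hypothetical per-image features from the four extractors (dims made up).
alexnet_feat = np.random.rand(4096)
resnet_feat  = np.random.rand(2048)
hmp_feat     = np.random.rand(1000)
spf_feat     = np.random.rand(500)

# One "supervector" per image, used as one row of the raw input matrix X.
supervector = np.concatenate([alexnet_feat, resnet_feat, hmp_feat, spf_feat])
```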

(Recap deep learning models and mention the theory base of this work)

  • Deep networks often get ā€œtrapped in local minimum and are sensitive to the learning rateā€ because their training foundation is BP.
  • Regression-based feature learning. Least-squares representation learning methods.

(Problems to be solved)
Drawbacks of regression-based approaches:

  1. ā€œblockā€ models? don’t follow a one-shot training philosophy based on the relation between raw data and the target.
  2. A model trained by some ā€œdesignedā€ process has inferior generalization capacity compared with a model derived from a one-shot training strategy (least squares).

Drawbacks of multilayer neural networks & solution

  1. Deeper layer-stacked models suffer from overfitting with limited training samples.
  2. Network-in-network structure enhances the network’s generalization capacity for feature learning. ELM with subnetwork nodes.
  3. Contributions:
    • Subnetwork neural nodes (SNN) realize multilayer representation learning. Unlike an ensembled network, the SNN is trained based on the error term.
    • Feature-space transformation and classification are solved together by iteratively searching for the optimal encoding space (hidden layer).
    • Concatenation of multiple features results in more discriminative representations of the samples.

ā…”. Literature review

A. Conventional Feature Coding

" Supervised method of learning representaiton evaluates the importance of a specific feature through the correlation between features and categories."

Conventional feature coding of images depends on prior knowledge of the problem. Thus, the features are not complete representations.

This paper enhances the feature by fusing (discriminative) hand-crafted features and (class-specific) CNN-based features.

B. Least-Squares Encoding Methods

The least-squares approximation methods, such as random forest and alternating minimization, have been exhaustively investigated in single-layer neural networks.

Related works: Moore-Penrose inverse; universal approximation capacity of I-ELM; ELM autoencoder [14]; features combined with subnetwork nodes [18].

Each SNN is applied as a local feature descriptor. Hence, the subspace features can be extracted? from the original data independently, and the useful features are generated via the combination of these features.

ā…¢. Proposed Method

A. Algorithmic Summary

Two steps:

  1. Preprocessing: concatenate various feature vectors into a single “supervector”.
  2. Train the width-growth model:
    Terminology:
    | layer  | name                        | marker | params                | in               | out                             |
    |--------|-----------------------------|--------|-----------------------|------------------|---------------------------------|
    | input  | Entrance (feature) layer    | š‘“      | š–įµ¢į¶ , š›įµ¢į¶  (random vct)  | š— (supervector)  | linear combination š‡           |
    | hidden | Refinement layer/subspace   | š‘Ÿ      | š–įµ¢Ź³, š›įµ¢Ź³ (šš, b)        | š‡                | partial feature ĪØ               |
    | output | Least-square learning layer | š‘£      | š–įµ¢įµ› (š›ƒ)               | ĪØ                | sum of all partial features: ššŖ |
    |        | residual error              | šž      |                       |                  |                                 |

(An entrance layer and a refinement layer are each an ā€œSNNā€, and their combination is an ā€œR-SNNā€.)

  • Initialization: For the 1st R-SNN, š–ā‚į¶  and š–ā‚Ź³ are randomly generated, yielding a coarse feature ĪØ.
    Then the first least-squares step (pseudoinverse) is performed to calculate š–ā‚įµ› from the target š˜ and ĪØ (see the sketch after the flowchart below).

  • Iteratively add R-SNN nodes (2 ≤ i ≤ L) (refinement subspaces) into the hidden layer (optimal feature space):

    ```mermaid
    flowchart TB
        subgraph In[input feature]
            x1((1)) & x2((2)) & xe(("ā‹®")) & xn((n))
        end
        EnW("Entrance layer\n š–įµ¢į¶ , š›įµ¢į¶ \n random")
        subgraph H["entrance feature š‡"]
            h1((1)) & h2((2)) & he(("ā‹®")) & hD((D))
        end
        RefineW("Refinement layer\n š–įµ¢Ź³, š›įµ¢Ź³")
        subgraph Psi[partial feature ĪØ]
            ĪØ1((1)) & ĪØ2((2)) & ĪØe(("ā‹®")) & ĪØd((d))
        end
        OW("Output layer\n š–įµ¢įµ›")
        subgraph Out["Output vector"]
            o1((1)) & o2((2)) & oe(("ā‹®")) & om((m))
        end
        x1 & x2 & xe & xn --> EnW --> h1 & h2 & he & hD --> RefineW --> ĪØ1 & ĪØ2 & ĪØe & ĪØd --> OW --> o1 & o2 & oe & om
        Out -->|"- šžįµ¢ā‚‹ā‚"| erri["šžįµ¢"]
        erri & OW -.->|pinv| newĪØ("š \n yielding\n šžįµ¢")
        subgraph H1["entrance feature š‡įµ¢ā‚Šā‚"]
            h11((1)) & h12((2)) & h1e(("ā‹®")) & h1D((D))
        end
        In --> EnW1("Entrance layer\n š–įµ¢ā‚Šā‚į¶ , š›įµ¢ā‚Šā‚į¶ \n random") --> h11 & h12 & h1e & h1D
        H1 --> RefineW1("Refinement layer\n š–įµ¢ā‚Šā‚Ź³, š›įµ¢ā‚Šā‚Ź³")
        %% -.-|solved by P| newĪØ
        newĪØ -.-> RefineW1
        subgraph Psi1[partial feature ĪØ]
            ĪØ11((1)) & ĪØ12((2)) & ĪØ1e(("ā‹®")) & ĪØ1d((d))
        end
        RefineW1 --> ĪØ11 & ĪØ12 & ĪØ1e & ĪØ1d --> OW1("Output layer\n š–įµ¢ā‚Šā‚įµ›")
        %% OW1 -.-|solved by| erri
        erri -.-> OW1
        subgraph Out1["Output vector"]
            o11((1)) & o12((2)) & o1e(("ā‹®")) & o1m((m))
        end
        OW1 --> o11 & o12 & o1e & o1m
        Out1 -->|"- šžįµ¢"| erri1["šžįµ¢ā‚Šā‚"] --> newP
    ```
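
My reconstruction of the initialization step in the notation of Sec. III (not copied from the paper; the regularized pseudoinverse form is taken from III-C below):

```latex
% First R-SNN (i = 1): both weight pairs are random; the output weights
% come from a regularized least-squares fit against the target Y.
\mathbf{H}_1^{f} = g\!\left(\mathbf{X}\mathbf{W}_1^{f} + \mathbf{b}_1^{f}\right), \qquad
\Psi_1 = g\!\left(\mathbf{H}_1^{f}\mathbf{W}_1^{r} + \mathbf{b}_1^{r}\right), \qquad
\mathbf{Q}_1 = \Psi_1

\mathbf{W}_1^{v} = \left(\tfrac{\mathbf{I}}{C} + \mathbf{Q}_1^{\top}\mathbf{Q}_1\right)^{-1}\mathbf{Q}_1^{\top}\mathbf{Y}, \qquad
\mathbf{e}_1 = \mathbf{Y} - \mathbf{Q}_1\mathbf{W}_1^{v}
```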

B. Model Definition

  • An SLFN solving the regression problem can be expressed as a single least-squares fit (see the LaTeX sketch after this list).

  • An MLNN applies nested transformations, stacked layer by layer.

  • The proposed method is a generalized SLFN:

    minimize J = ½ ā€–š˜-f(š‡įµ¢į¶ , š–įµ¢Ź³, š›įµ¢Ź³)ā‹…š–_Lᵛ‖²,

    • f(š‡įµ¢į¶ , š–įµ¢Ź³, š›įµ¢Ź³) = āˆ‘įµ¢ā‚Œā‚į“ø g(š‡įµ¢į¶  ā‹… š–įµ¢Ź³ + š›įµ¢Ź³): sum all R-SNN
    • š‡įµ¢į¶  = g(š–įµ¢į¶ , š›įµ¢į¶ , š—)
    • š˜ ∈ ā„į“ŗį•½įµ: expected output, target feature
    • š— ∈ ā„į“ŗį•½āæ: input matrix
    • L : number of R-SNN node
    • g : activateion function
  • 3 differences from other least-squares-based MLNNs

    1. An SNN combines the dimensions of the feature vector and serves as a local feature descriptor, while the R-SNN is the basic unit for refining feature vectors.

    2. The optimal feature is the aggregation of R-SNN nodes added one by one. Each R-SNN is densely connected to the input vector and the output layer, containing two linear projections. Different R-SNNs are independent because they learn from different errors.

    3. The latent space is the aggregation of all R-SNN node subspaces, so parameter training needs no block-wise communication between different spaces. That means feature refinement and classification are done together.
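
For reference, a LaTeX sketch of the three formulations above (the standard SLFN and MLNN forms as I understand them, plus the proposed objective restated); this is my notation, not the paper's:

```latex
% W, b are random; the solved parameters are beta / W^v.

% (1) SLFN regression (single least-squares fit):
\min_{\boldsymbol{\beta}} \; \tfrac{1}{2}\bigl\|\mathbf{Y} - g(\mathbf{X}\mathbf{W} + \mathbf{b})\,\boldsymbol{\beta}\bigr\|^{2}

% (2) MLNN: nested (stacked) transformations:
\mathbf{H}^{(k)} = g\!\left(\mathbf{H}^{(k-1)}\mathbf{W}^{(k)} + \mathbf{b}^{(k)}\right), \qquad \mathbf{H}^{(0)} = \mathbf{X}

% (3) Proposed width-growth objective (restating the text above):
\min \; J = \tfrac{1}{2}\Bigl\|\mathbf{Y} - \textstyle\sum_{i=1}^{L} g\!\left(\mathbf{H}_i^{f}\mathbf{W}_i^{r} + \mathbf{b}_i^{r}\right)\mathbf{W}_L^{v}\Bigr\|^{2}, \qquad
\mathbf{H}_i^{f} = g\!\left(\mathbf{X}\mathbf{W}_i^{f} + \mathbf{b}_i^{f}\right)
```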

C. Proposed Width-Growth Model

  1. Input weights and bias š–įµ¢į¶ , š›įµ¢į¶ : randomly initialized;
    Entrance feature: š‡įµ¢į¶  = g(š—š–įµ¢į¶ + š›įµ¢į¶ );
    Refined partial feature: ĪØįµ¢=g(š‡įµ¢į¶ š–įµ¢Ź³+ š›įµ¢Ź³), where š›įµ¢Ź³ is random;

  2. Output weights: š–įµ¢įµ›=(šˆ/C + ššŖįµ¢įµ€ššŖįµ¢)ā»Ā¹ššŖįµ¢įµ€ā‹…š˜,
    where C is a regularization hyperparameter, and (šˆ/C + ššŖįµ¢įµ€ššŖįµ¢)ā»Ā¹ššŖįµ¢įµ€ is the regularized pseudoinverse of the accumulated feature matrix ššŖįµ¢ (for i = 1, ššŖā‚ = ĪØā‚)
    Error: šžįµ¢ = š˜ - ššŖįµ¢š–įµ¢įµ›

  3. š is the desired matrix generating šžįµ¢ by: šįµ¢ā‹…š–įµ¢įµ›=šžįµ¢, so
    šįµ¢ = šžįµ¢ā‹…(I/C + (š–įµ¢įµ›)įµ€š–įµ¢įµ›)⁻¹(š–įµ¢įµ›)įµ€

  4. Refinement-layer weights of the next R-SNN:
    š–įµ¢ā‚Šā‚Ź³ = (šˆ/C + š‡įµ¢ā‚Šā‚įµ€š‡įµ¢ā‚Šā‚)ā»Ā¹š‡įµ¢ā‚Šā‚įµ€ ā‹… g⁻¹(šįµ¢),
    because g(š‡įµ¢ā‚Šā‚ā‹…š–įµ¢ā‚Šā‚Ź³+ š›įµ¢ā‚Šā‚Ź³) should equal šįµ¢ (with š‡įµ¢ā‚Šā‚ a fresh random entrance feature).
    Next partial feature: ĪØįµ¢ā‚Šā‚ = g(š‡įµ¢ā‚Šā‚ā‹…š–įµ¢ā‚Šā‚Ź³+ š›įµ¢ā‚Šā‚Ź³)

  5. Accumulate the partial feature to the optimal feature: ššŖįµ¢ā‚Šā‚ = ššŖįµ¢ + ĪØįµ¢ā‚Šā‚

  6. Update error: šžįµ¢ā‚Šā‚ = š˜ - ššŖįµ¢ā‚Šā‚š–įµ¢ā‚Šā‚įµ›, where š–įµ¢ā‚Šā‚įµ› is recomputed from ššŖįµ¢ā‚Šā‚ as in step 2

Repeat steps 4-6 L-2 times; the final feature ššŖ_L is the generalized feature corresponding to the best output parameters š–_Lᵛ for classification.
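
A self-contained NumPy sketch of my reading of steps 1-6. The activation (sigmoid), the shapes D/d, the value of C, the loop bookkeeping, and the direct recomputation of the error from š˜ are my assumptions, not the paper's exact procedure:

```python
import numpy as np

def g(z):
    """Activation (sigmoid assumed; the paper's choice may differ)."""
    return 1.0 / (1.0 + np.exp(-z))

def g_inv(p, eps=1e-6):
    """Inverse activation (logit), clipped so out-of-range values stay finite."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def reg_pinv(A, C):
    """Regularized pseudoinverse (I/C + AᵀA)ā»Ā¹Aįµ€, as in steps 2-4."""
    return np.linalg.solve(np.eye(A.shape[1]) / C + A.T @ A, A.T)

def train_width_growth(X, Y, L=5, D=128, d=64, C=1e3, seed=0):
    """My reading of Sec. III-C: grow L R-SNN nodes, accumulating features in Q."""
    rng = np.random.default_rng(seed)
    N, n = X.shape

    # Step 1: first R-SNN -- entrance and refinement weights are random.
    Wf, bf = rng.standard_normal((n, D)), rng.standard_normal(D)
    Wr, br = rng.standard_normal((D, d)), rng.standard_normal(d)
    H = g(X @ Wf + bf)                 # entrance feature H_1
    Q = g(H @ Wr + br)                 # Q_1 = Psi_1

    # Step 2: output weights by regularized least squares, then the error.
    Wv = reg_pinv(Q, C) @ Y
    e = Y - Q @ Wv

    for _ in range(L - 1):
        # Step 3: P is the feature that would have generated the current error
        # (P Ā· Wv ā‰ˆ e), via the regularized right-inverse of Wv.
        P = e @ reg_pinv(Wv, C)

        # Fresh random entrance layer for the next R-SNN.
        Wf, bf = rng.standard_normal((n, D)), rng.standard_normal(D)
        H = g(X @ Wf + bf)

        # Step 4: refinement weights of the next R-SNN, solved so that
        # g(HĀ·Wr + br) approximates P; the bias stays random as in step 1.
        br = rng.standard_normal(d)
        Wr = reg_pinv(H, C) @ g_inv(P)
        Psi = g(H @ Wr + br)           # next partial feature

        # Step 5: accumulate the partial feature into the optimal feature.
        Q = Q + Psi

        # Step 6: refresh output weights and error from the accumulated feature
        # (recomputed directly from Y here; the note's incremental form differs).
        Wv = reg_pinv(Q, C) @ Y
        e = Y - Q @ Wv

    return Q, Wv                       # generalized feature Q_L and classifier W_Lᵛ

# Toy usage (shapes arbitrary):
# Q, Wv = train_width_growth(np.random.rand(200, 50),
#                            np.eye(4)[np.random.randint(0, 4, 200)])
```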

Ref