When dimensionality increases, data becomes increasingly sparse.
Concepts such as density and distance become less meaningful (demonstrated in the sketch below).
The number of possible subspace combinations grows exponentially with dimensionality.
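A minimal numpy sketch (illustrative only, not from the source) of this distance-concentration effect: as the dimension grows, the gap between the nearest and farthest neighbor shrinks relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, nearest and farthest neighbors become nearly
# equidistant, so distance-based notions like density lose contrast.
for dim in [2, 10, 100, 1000]:
    points = rng.uniform(size=(1000, dim))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative distance contrast={contrast:.3f}")
```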
Dimensionality Reduction
Eliminates irrelevant features and reduces noise
$X$ is a set of $N$ features: $X=\{X_1, X_2, \cdots, X_N\}$. A reduced set $X'$ is a transformation of $X$ and consists of $d$ features, where $d < N$:
$$
X' = T(X) = \{ X_1',\ X_2',\ \cdots,\ X_d'\} \\
T: \R^N \rightarrow \R^d,\quad d < N
$$
Avoids the curse of dimensionality. Reduces the time and space required for computations.
Two ways:
Feature Extraction: transforms the data into a lower-dimensional space
Wavelet transforms and PCA (see the sketch after this list)
Feature Selection: the transformation is limited to selecting a subset of the original features
Filters, Wrappers, Embedded.
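As a concrete illustration of feature extraction, here is a minimal PCA sketch (the toy data are an assumption, not from the source): it learns a linear map $T: \R^N \rightarrow \R^d$ by projecting onto the $d$ directions of largest variance.

```python
import numpy as np

def pca_transform(X, d):
    """Project samples (rows of X) onto the top-d principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:d].T

# Toy data: 100 samples with N=5 features, extracted down to d=2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca_transform(X, d=2).shape)  # (100, 2)
```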
Three Types of Features
A relevant feature is neither irrelevant nor redundant to the target concept.
An irrelevant feature carries no useful information for the learning task, and may cause greater computational cost and overfitting.
A redundant feature duplicates information already contained in other features (a simple correlation check is sketched below).
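One simple way to flag redundant features, sketched here as an assumption rather than a method prescribed by the source, is to look for highly correlated feature pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2 * X[:, 0] + 0.01 * rng.normal(size=200)  # feature 3 duplicates feature 0

corr = np.corrcoef(X, rowvar=False)       # feature-by-feature correlation matrix
i, j = np.triu_indices_from(corr, k=1)    # all feature pairs (upper triangle)
redundant = [(int(a), int(b)) for a, b in zip(i, j) if abs(corr[a, b]) > 0.95]
print(redundant)  # [(0, 3)]
```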
Feature Selection
Assume a binary classification model, $X \rightarrow \text{Model} \rightarrow Y \in \{0, 1\}$, where $X$ consists of
$N$ different features, e.g., age, weight, temperature, blood pressure, etc.
$$X = \{X_1, X_2, X_3, \ldots, X_N\}$$
$N$ can be small or relatively large, e.g., an image of size $300 \times 300$ gives $N = 90{,}000$.
Class Separation Criterion
Evaluates how well the data separate into classes based on the selected features (a Fisher-score sketch follows below).
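One classic separation criterion is the Fisher score, which compares between-class distance to within-class spread for each feature. This is a minimal sketch of one such criterion (binary classes and toy data assumed), not the only possible choice:

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature ratio of between-class distance to within-class variance."""
    X0, X1 = X[y == 0], X[y == 1]
    between = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    within = X0.var(axis=0) + X1.var(axis=0)
    return between / within

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 4))
X[:, 0] += 3 * y                 # feature 0 separates the classes well
print(fisher_score(X, y))        # largest score on feature 0
```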
Filter Methods: statistical analysis without using a predictive model
e.g., statistical dependence, information gain, chi-square, log-likelihood ratio, interclass distance, or other information-theoretic measures
Wrapper Methods: evaluate feature subsets with a pre-determined predictive model or classifier
Hybrid Methods: combine the wrapper and filter approaches
Filter Approaches
Greedily select $d$ features, which are then used to train a predictive model $h_M$ with $M$ samples
Evaluation is independent of the predictive models or classifiers.
The objective function evaluates the information content and statistical measures of feature subsets
Role: evaluates each feature individually or a batch of features
Major steps:
Evaluating and ranking features
Choosing the features with the highest ranks to induce models (a ranking sketch follows at the end of this section)
Advantages:
Fast execution: the non-iterative computation is much faster than a training session of a predictive model
Generality: filters evaluate intrinsic properties of the data, rather than their interactions with a particular predictive model, so the final subset is general for any subsequent predictive model
Disadvantages:
Tendency to select large subsets: when the objective function is monotonic, adding more features only makes it larger
Independence: the selection ignores the performance of the features on the actual predictive model
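A minimal filter sketch of the two major steps above, using scikit-learn's mutual-information estimator as the ranking score (the toy data and the choice of score are assumptions, not from the source):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 10))
X[:, 2] += 2 * y                  # make feature 2 strongly informative
X[:, 7] += y                      # make feature 7 weakly informative

# Step 1: evaluate and rank features by an information-theoretic score
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]

# Step 2: keep the d highest-ranked features to induce a model
d = 3
X_selected = X[:, ranking[:d]]
print("selected features:", ranking[:d])
```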
Wrapper Approaches
A predictive model $h_M$ is trained to find the best subset $S^*$
A predictive model is wrapped inside the selection system.
Maximize the separation criterion $J$ or minimize the estimated error $\epsilon$ (a forward-selection sketch follows).
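A minimal wrapper sketch (greedy forward selection; the logistic-regression model and toy data are assumptions, not from the source): each candidate subset is scored by the cross-validated accuracy of the wrapped model, i.e., minimizing the estimated error $\epsilon$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, d):
    """Greedy wrapper: grow the subset by the feature that best improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(d):
        def cv_accuracy(f):
            model = LogisticRegression(max_iter=1000)  # the wrapped predictive model
            return cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        best = max(remaining, key=cv_accuracy)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 8))
X[:, 1] += 2 * y                      # plant one informative feature
print(forward_selection(X, y, d=2))   # feature 1 is picked first
```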