ADMEWORKS ModelBuilder - Feature Selection - Iterative SVMFS
This command opens the dialog for SVM-based feature selection. The method extends the popular SVM-based feature selection method known as “Recursive Feature Elimination (RFE)”. It builds a ranking of the features, using criteria obtained from the SVM classifier. The original RFE algorithm follows a greedy strategy: it eliminates one feature at each iteration. Extensions of RFE instead eliminate, at each iteration, a chunk of the features with the smallest criterion values (for example, a chunk whose size is the square root of the number of remaining features).
Entropy-based elimination is an extension of the “Entropy-based RFE” method. It uses the entropy of the feature ranking to decide how many features should be removed. It also needs a “mean threshold” parameter, which controls the size of the set of features eliminated at each iteration (the higher the “mean threshold” parameter, the larger the number of descriptors that may be removed). Entropy-based elimination is very fast when the criterion values differ widely and the majority of them are high (the mean of the criterion values is higher than the “mean threshold” parameter).
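The one-at-a-time and chunked elimination strategies can be illustrated with a minimal Python sketch. It is not the ADMEWORKS implementation: `rfe_ranking`, `sqrt_chunk` and the `score_features` callback are hypothetical names, and the callback stands in for retraining the SVM and evaluating the chosen criterion on the surviving features.

```python
import math

def sqrt_chunk(n_remaining):
    """Chunk schedule: drop floor(sqrt(n)) features per iteration, at least 1."""
    return max(1, math.isqrt(n_remaining))

def rfe_ranking(features, score_features, chunk_size=lambda n: 1):
    """Rank features by recursive elimination.

    features:        list of feature names.
    score_features:  hypothetical callback; given the surviving features it
                     returns {feature: criterion value}, e.g. from an SVM
                     retrained on those features.
    chunk_size:      how many features to drop, given how many remain
                     (default: one per iteration, as in the original RFE).
    Returns (ranking from most to least important, number of iterations).
    """
    remaining = list(features)
    eliminated = []          # least important features end up first
    iterations = 0
    while remaining:
        scores = score_features(remaining)
        k = max(1, min(len(remaining), chunk_size(len(remaining))))
        # Drop the k features with the smallest criterion values.
        worst = sorted(remaining, key=lambda f: scores[f])[:k]
        for f in worst:
            remaining.remove(f)
        eliminated.extend(worst)
        iterations += 1
    return eliminated[::-1], iterations
```

With 16 features, the one-at-a-time schedule needs 16 scoring rounds, while `sqrt_chunk` needs 7 (chunks of 4, 3, 3, 2, 2, 1 and 1), which is why chunked elimination is attractive when each round retrains a classifier.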
The iterative SVMFS method is very effective when the training set has no more than 500 samples, because the complexity of the method depends on the number of support vectors. The best parameters for the SVM classifier (C and the kernel parameters) should be found experimentally, by comparing cross-validation results for different values.
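A cross-validated search over C and the kernel parameters could look like the following sketch. It only shows the bookkeeping: `fit_and_score` is a hypothetical callback that would train an SVM with the given parameters on the training indices and return its accuracy on the held-out indices. The parameter name `gamma` and the grid values used with it are assumptions, not ADMEWORKS settings.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, nearly equal folds.

    Samples are assumed to be shuffled beforehand.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search_cv(n_samples, grid, fit_and_score, k=5):
    """Return (best_params, best_mean_score) over every grid combination.

    grid:           {parameter name: list of candidate values}.
    fit_and_score:  hypothetical callback (params, train_idx, test_idx) -> score,
                    where a higher score is better.
    """
    folds = k_fold_indices(n_samples, k)
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, test_idx in enumerate(folds):
            # Train on every fold except the i-th, test on the i-th.
            train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
            scores.append(fit_and_score(params, train_idx, test_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

A typical call would pass something like `{"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}` as the grid. Exponentially spaced values are a common starting point, not a recommendation from this manual.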
There is no general answer to the question of which criterion is best for a given training set. Experimental results show that the ||w||^2 and R^2||w||^2 criteria, and the gradients of ||w||^2, R^2||w||^2 and the span estimate, give very good results for many datasets. All of them are based on upper bounds on the LOO (leave-one-out) error, and all of them choose the subset of descriptors that is optimal with respect to the LOO error. In total, there are three criteria (||w||^2, span estimate, R^2||w||^2) and three gradients of these criteria. The gradient of a criterion measures the sensitivity of the feature elimination process with respect to this criterion.
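As an illustration of what such a gradient can look like, consider the simplest case. The derivation below is a sketch under stated assumptions, not the formula used by the program: it assumes a linear kernel, introduces hypothetical per-feature scaling factors \sigma_k (with \sigma_k = 1 meaning "feature kept as is"), and holds the SVM multipliers \alpha_i fixed while differentiating.

```latex
% Linear kernel with each feature k multiplied by a scaling factor \sigma_k:
\[
  \|w(\sigma)\|^2
  = \sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_k \sigma_k^2 \, x_{ik} x_{jk}
  = \sum_k \sigma_k^2 \, w_k^2 ,
  \qquad
  w_k = \sum_i \alpha_i y_i x_{ik} .
\]
% Sensitivity of the criterion to feature k, evaluated at \sigma = 1:
\[
  \left. \frac{\partial \|w\|^2}{\partial \sigma_k} \right|_{\sigma = 1}
  = 2\, w_k^2 .
\]
```

Under these assumptions, removing feature k changes ||w||^2 by w_k^2 and the gradient equals 2 w_k^2, so both orderings coincide. For nonlinear kernels the two rankings may differ, which is one reason to offer both a criterion and its gradient.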
A detailed explanation of each criterion:
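||w||^2 measures the size of the margin that divides the training set into two classes. In the standard SVM formulation the margin width is 2/||w||, so a smaller ||w||^2 means a larger margin. The larger the margin, the smaller the generalization error (samples are well separated).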
grad ||w||^2 measures the sensitivity of the feature elimination process with respect to ||w||^2.
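Span estimate measures the “span” of the support vectors, that is, the distance from each support vector to a constrained linear combination of the remaining support vectors.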
Span est. gradient measures the sensitivity of the feature elimination process with respect to the span estimate criterion.
R^2||w||^2 is the radius/margin criterion. “R” is the radius of the smallest sphere that contains all samples of the training set, and “||w||^2” is the margin term described above. This criterion is quite expensive to compute, because finding “R” requires solving an additional optimization problem.
grad R^2||w||^2 measures the sensitivity of the feature elimination process with respect to the R^2||w||^2 criterion.
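To make the ||w||^2 and R^2||w||^2 quantities concrete, the sketch below computes them for a linear kernel from a given SVM solution. Everything here is an illustrative assumption rather than the program's algorithm: the multipliers `alphas` are taken as given, the work is done in input space (linear kernel only), and R^2 is approximated with a simple farthest-point iteration instead of the exact optimization that the manual describes as expensive. The span estimate is omitted because it requires a further optimization per support vector.

```python
import math

def weight_vector(alphas, labels, support_vectors):
    """w = sum_i alpha_i * y_i * x_i (valid for a linear kernel only)."""
    w = [0.0] * len(support_vectors[0])
    for a, y, x in zip(alphas, labels, support_vectors):
        for k, x_k in enumerate(x):
            w[k] += a * y * x_k
    return w

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def approx_radius_squared(points, iterations=1000):
    """Approximate R^2 of the smallest sphere enclosing all points.

    Starts at an arbitrary point and repeatedly moves the centre towards the
    farthest point with a shrinking step 1/(t+1). The returned value is never
    smaller than the true R^2, because the final sphere encloses all points.
    """
    center = list(points[0])
    for t in range(1, iterations + 1):
        farthest = max(points, key=lambda p: squared_distance(p, center))
        center = [c + (f - c) / (t + 1) for c, f in zip(center, farthest)]
    return max(squared_distance(p, center) for p in points)

def criteria(alphas, labels, support_vectors, samples):
    """Collect the linear-kernel quantities discussed in the text."""
    w = weight_vector(alphas, labels, support_vectors)
    w_norm_sq = sum(w_k ** 2 for w_k in w)
    return {
        "w_norm_sq": w_norm_sq,                      # ||w||^2
        "margin": 2.0 / math.sqrt(w_norm_sq),        # margin width 2/||w||
        "per_feature": [w_k ** 2 for w_k in w],      # change in ||w||^2 if feature k is removed
        "radius_margin": approx_radius_squared(samples) * w_norm_sq,  # approx. R^2||w||^2
    }
```

For two samples (1, 0) and (-1, 0) with labels +1 and -1 and multipliers 0.5 each, this gives w = (1, 0), so ||w||^2 = 1 and the margin width is 2. The second feature has a per-feature value of 0 and would be eliminated first.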