- Types (filter, wrapper, embedded, and hybrid methods): information gain, chi-square, correlation, MAD, stepwise selection, logistic regression, random forest
- Genetic algorithm for feature selection
Feature selection is the process of reducing the number of input variables when developing a predictive model. Adding redundant variables reduces the generalization capability of the model and may also reduce the overall accuracy of a classifier. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.
The goal of feature selection in machine learning is to find the best set of features that allows one to build useful models of studied phenomena.
One way to think about feature selection methods is in terms of supervised and unsupervised methods.
Supervised Techniques: These techniques can be used for labeled data, and are used to identify the relevant features for increasing the efficiency of supervised models like classification and regression.
Unsupervised Techniques: These techniques can be used for unlabeled data.
We can summarize feature selection as follows.
A. Filter methods
B. Wrapper methods
C. Embedded methods
D. Hybrid methods
A. Filter Methods:
Filter methods score each feature with a statistic that is computed independently of any predictive model and keep the features that pass a cutoff. With a variance filter, for example, we simply compute the variance of each feature and select the subset of features based on a user-specified threshold, e.g., “keep all features that have a variance greater than or equal to x” or “keep the top k features with the largest variance.” We assume that features with a higher variance may contain more useful information, but note that we are not taking the relationship between feature variables or feature and target variables into account, which is one of the drawbacks of filter methods.
- Information gain
- Chi-square test
- Fisher score
- Correlation coefficient
- Variance threshold
- Mean Absolute Difference (MAD)
- Dispersion ratio
Information gain calculates the reduction in entropy from the transformation of a dataset. It can be used for feature selection by evaluating the Information gain of each variable in the context of the target variable.
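As a minimal sketch of this idea, the snippet below scores each feature by its estimated mutual information with the class label using scikit-learn's `mutual_info_classif`. The Iris dataset is an assumption made for illustration; it is not the dataset used in the article.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Iris is a stand-in dataset (an assumption, not from the article).
data = load_iris()
X, y = data.data, data.target

# Estimate the mutual information between each feature and the class label.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Rank features from most to least informative.
ranking = sorted(zip(data.feature_names, mi_scores),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features at the top of the ranking reduce uncertainty about the target the most and are the first candidates to keep.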
For a classification problem with categorical input variables, two suitable measures are the chi-squared test and mutual information (information gain) from the field of information theory:
- Chi-Squared test (contingency tables).
- Mutual Information.
We calculate the chi-square statistic between each feature and the target and select the desired number of features with the best chi-square scores. For the test to be valid, the variables have to be categorical and sampled independently, and each cell should have an expected frequency greater than 5.
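The procedure can be sketched with a contingency table per feature and `scipy.stats.chi2_contingency`. The synthetic columns `related` and `noise` are hypothetical names invented for this example: the first agrees with the target about 80% of the time, the second is independent of it.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic categorical data (an assumption made for illustration).
rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)
related = np.where(rng.random(n) < 0.8, target, 1 - target)  # tracks the target
noise = rng.integers(0, 3, size=n)                           # independent of the target
df = pd.DataFrame({"related": related, "noise": noise, "target": target})

scores = {}
for col in ["related", "noise"]:
    # Contingency table of feature categories against target classes.
    table = pd.crosstab(df[col], df["target"])
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    # The rule of thumb from the text: expected frequencies above 5.
    assert expected.min() > 5
    scores[col] = (chi2_stat, p_value)

# Rank features by chi-square statistic, best first.
ranked = sorted(scores, key=lambda c: scores[c][0], reverse=True)
print(ranked, scores)
```

Keeping the first k entries of `ranked` gives the "desired number of features" described above.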
Fisher score is one of the most widely used supervised feature selection methods. The algorithm ranks the variables by their Fisher score in descending order, and we can then select the top-ranked variables as the case requires.
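The article's own implementation is not shown here, so the following is a minimal NumPy sketch under an assumption: it uses one common definition of the Fisher score, the ratio of between-class scatter to within-class scatter computed per feature. The helper name `fisher_score` and the Iris dataset are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

def fisher_score(X, y):
    """Per-feature ratio of between-class to within-class scatter (assumed definition)."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    return between / within

data = load_iris()
scores = fisher_score(data.data, data.target)

# Feature indices ordered from highest to lowest score.
ranked_idx = np.argsort(scores)[::-1]
for i in ranked_idx:
    print(f"{data.feature_names[i]}: {scores[i]:.2f}")
```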
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. The values range between -1.0 and 1.0. The logic behind using correlation for feature selection is that the good variables are highly correlated with the target. Furthermore, variables should be correlated with the target but should be uncorrelated among themselves.
If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only really needs one of them, as the second one does not add additional information. We will use the Pearson Correlation here.
We need to set an absolute value, say 0.5, as the threshold for selecting the variables. If we find that the predictor variables are correlated among themselves, we can drop the variable that has the lower correlation coefficient with the target variable. We can also compute multiple correlation coefficients to check whether more than two variables are correlated with each other. This phenomenon is known as multicollinearity.
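A minimal sketch of the two steps, on synthetic data invented for this example: first keep the features whose absolute Pearson correlation with the target is at least 0.5, then, for any pair of kept features that are strongly correlated with each other, drop the one with the weaker link to the target. The 0.8 cutoff for "correlated among themselves" is an assumption; the text only fixes the 0.5 threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption): x2 nearly duplicates x1, x4 is unrelated to y.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
y = x1 + x3 + 0.5 * rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4, "y": y})

corr = df.corr(method="pearson").abs()
target_corr = corr["y"].drop("y")

# Step 1: keep features whose |r| with the target meets the threshold.
relevant = target_corr[target_corr >= 0.5].index.tolist()

# Step 2: among correlated pairs, drop the feature less correlated with the target.
selected = list(relevant)
for i, a in enumerate(relevant):
    for b in relevant[i + 1:]:
        if a in selected and b in selected and corr.loc[a, b] > 0.8:
            weaker = a if target_corr[a] < target_corr[b] else b
            selected.remove(weaker)

print(selected)
```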
The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
The selector's get_support method returns a Boolean vector in which True means that the variable does not have zero variance.
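As a small illustration with scikit-learn's `VarianceThreshold`, the toy matrix below (made up for this example) has two constant columns, which the default threshold of 0.0 removes.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: columns 0 and 3 are constant across all samples.
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

selector = VarianceThreshold()      # default threshold=0.0 drops zero-variance features
X_reduced = selector.fit_transform(X)

mask = selector.get_support()       # True where the feature is kept
print(mask)                         # [False  True  True False]
print(X_reduced.shape)              # (3, 2)
```

Passing a positive `threshold` removes low-variance features as well as constant ones.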
Mean Absolute Difference (MAD):
‘The mean absolute difference (MAD) computes the absolute difference from the mean value. The main difference between the variance and MAD measures is the absence of the square in the latter. The MAD, like the variance, is also a scale variant.’ This means that the higher the MAD, the higher the discriminatory power.
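The computation is short enough to write directly in NumPy. The matrix below is made up for illustration: the first column varies a lot, the second is constant, and the third barely moves.

```python
import numpy as np

# Toy data (an assumption made for illustration).
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 10.0, 5.1],
              [3.0, 10.0, 4.9],
              [4.0, 10.0, 5.0]])

# Mean absolute difference from the column mean, one value per feature.
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)

# Rank features from highest to lowest MAD.
ranked_idx = np.argsort(mad)[::-1]
print(mad, ranked_idx)
```

Because the MAD depends on the scale of each feature, features should be on comparable scales before their MAD values are compared.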
‘Another measure of dispersion applies the arithmetic mean (AM) and the geometric mean (GM). For a given (positive) feature Xi on n patterns, the AM and GM are given by

$$\mathrm{AM}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}, \qquad \mathrm{GM}_i = \left(\prod_{j=1}^{n} X_{ij}\right)^{1/n},$$

respectively; since AMi ≥ GMi, with equality holding if and only if Xi1 = Xi2 = … = Xin, then the ratio

$$R_i = \frac{\mathrm{AM}_i}{\mathrm{GM}_i} \in [1, +\infty)$$

can be used as a dispersion measure. Higher dispersion implies a higher value of Ri, thus a more relevant feature. Conversely, when all the feature samples have (roughly) the same value, Ri is close to 1, indicating a low relevance feature.’
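A minimal NumPy sketch of the dispersion ratio on made-up positive data follows. The geometric mean is computed through logarithms, which is an implementation choice made here for numerical stability.

```python
import numpy as np

# Toy positive-valued data (an assumption): column 0 is spread out, column 1 is constant.
X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [4.0, 5.0],
              [8.0, 5.0]])

am = X.mean(axis=0)                    # arithmetic mean per feature
gm = np.exp(np.log(X).mean(axis=0))    # geometric mean per feature, via logs
ratio = am / gm                        # dispersion ratio, always >= 1

print(ratio)
```

The constant column gets a ratio of 1, while the spread-out column gets a ratio above 1 and would be ranked as more relevant.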
B. Wrapper Methods:
Wrappers require some method to search the space of possible feature subsets, assessing the quality of each subset by training and evaluating a classifier on it. The feature selection process is therefore tied to the specific machine learning algorithm that we are trying to fit on a given dataset. It typically follows a greedy search, evaluating candidate combinations of features against the evaluation criterion. Wrapper methods usually result in better predictive accuracy than filter methods, at a higher computational cost.
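The scikit-learn library also provides selection utilities that can be applied once a statistic has been calculated for each input variable against the target.
Two of the more popular are SelectKBest, which keeps the top k variables, and SelectPercentile, which keeps the top percentile of variables.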
Forward Feature Selection:
Forward selection starts with no features and inserts features into the regression model one by one. First, the regressor with the highest correlation with the target is selected for inclusion, which is also the regressor that produces the largest F-statistic when testing the significance of the model. This is an iterative method wherein we start with the best-performing variable against the target. Next, we select another variable that gives the best performance in combination with the first selected variable. This process continues until the preset criterion is achieved.
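One way to sketch forward selection is scikit-learn's `SequentialFeatureSelector` (available from version 0.24). Note the assumptions: it scores candidate features by cross-validated model performance rather than by correlation or an F-statistic, the stopping criterion is a fixed count of three features, and the diabetes dataset is used only as a convenient example.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target

# Start from an empty set and greedily add the feature that most improves the CV score.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,   # preset stopping criterion (an assumption)
    direction="forward",
    cv=5,
)
sfs.fit(X, y)

selected = [name for name, keep in zip(data.feature_names, sfs.get_support()) if keep]
print(selected)
```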
Backward Feature Elimination:
This method works in exactly the opposite way to the Forward Feature Selection method. Here, we start with all the features available and build a model. Next, we remove the variable whose removal gives the best evaluation measure value. This process is continued until the preset criterion is achieved.
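The loop below is a hand-written sketch of backward elimination, so each step is visible. The k-nearest-neighbours classifier, the Iris dataset, 5-fold cross-validated accuracy as the evaluation measure, and the stopping point of two features are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X, y = data.data, data.target
model = KNeighborsClassifier(n_neighbors=5)

def cv_score(cols):
    """Mean 5-fold accuracy of the model on the given feature columns."""
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

remaining = list(range(X.shape[1]))   # start with every feature
n_keep = 2                            # preset stopping criterion (an assumption)

while len(remaining) > n_keep:
    # Score the model with each feature left out in turn.
    candidates = []
    for f in remaining:
        subset = [c for c in remaining if c != f]
        candidates.append((cv_score(subset), f))
    # Remove the feature whose removal gives the best score.
    best_score, to_remove = max(candidates)
    remaining.remove(to_remove)

selected = [data.feature_names[i] for i in remaining]
print(selected)
```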
This method along with the one discussed above is also known as the Sequential Feature Selection method.
Stepwise elimination is a hybrid of forward and backward elimination and starts similarly to the forward method, i.e., with no regressors. Features are then selected as described in forward feature selection, but after each step, regressors are checked for elimination as per backward elimination. The hope is that as we enter new variables that are better at explaining the dependent variable, variables already included may become redundant.
The function performs forward feature selection based on p-values from statsmodels.api.OLS.
Arguments:
- X: pandas.DataFrame with candidate features
- y: list-like with the target
- threshold_in: include a feature if its p-value < threshold_in (in forward_regression)
- threshold_out: exclude a feature if its p-value > threshold_out (in backward_regression)
- verbose: whether to print the sequence of inclusions and exclusions
Returns: list of selected features
In the forward feature selection run, the threshold was set to 0.05, meaning that only variables with a p-value below that threshold were selected: in this case, 11 of 13 features. A more stringent criterion would eliminate more variables, although the 0.05 cutoff is already fairly stringent.
Exhaustive Feature Selection:
This is the most robust feature selection method covered so far. It is a brute-force evaluation of each feature subset: it tries every possible combination of the variables and returns the best-performing subset. Because n features give 2^n − 1 non-empty subsets, it is only practical for small feature sets.
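The brute-force search can be sketched with `itertools.combinations`. The Iris dataset (4 features, hence 15 non-empty subsets), the k-nearest-neighbours classifier, and 5-fold cross-validated accuracy as the score are assumptions chosen to keep the example fast.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X, y = data.data, data.target
model = KNeighborsClassifier(n_neighbors=5)
n_features = X.shape[1]

# Evaluate every non-empty subset of feature indices.
results = {}
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        score = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()
        results[subset] = score

best_subset = max(results, key=results.get)
best_score = results[best_subset]
print([data.feature_names[i] for i in best_subset], round(best_score, 3))
```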
Recursive Feature Elimination:
‘Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.
Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.’
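A short sketch with scikit-learn's `RFE` follows. The linear regression estimator (which exposes `coef_`), the diabetes dataset, and the target of four features are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target

# Drop one feature per round (step=1) until four remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4, step=1)
rfe.fit(X, y)

selected = [name for name, keep in zip(data.feature_names, rfe.support_) if keep]
print(selected)
print(rfe.ranking_)   # 1 marks a selected feature; larger values were pruned earlier
```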
C. Embedded Methods:
These methods combine the benefits of both the wrapper and filter methods: they account for interactions between features while maintaining a reasonable computational cost. Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract the features that contribute the most to the training in that iteration.
LASSO Regularization (L1):
Regularization consists of adding a penalty to the different parameters of the machine learning model to reduce the freedom of the model, i.e., to avoid over-fitting. In linear model regularization, the penalty is applied over the coefficients that multiply each of the predictors. Among the different types of regularization, Lasso (L1) has the property that it is able to shrink some of the coefficients to exactly zero. The corresponding features can therefore be removed from the model.
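To illustrate, the sketch below fits scikit-learn's `Lasso` inside `SelectFromModel` on synthetic data in which only the first three of eight features have non-zero true coefficients. The data, the coefficient values, and the penalty strength `alpha=0.2` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data (an assumption): features 3..7 have no effect on y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_coef = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 0.5 * rng.normal(size=200)

# The L1 penalty drives the coefficients of uninformative features to zero.
selector = SelectFromModel(Lasso(alpha=0.2))
selector.fit(X, y)

print(selector.estimator_.coef_.round(2))
print(selector.get_support())          # True for features with a non-zero coefficient
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

A larger `alpha` removes more features; in practice it is usually tuned by cross-validation.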
Random Forest Importance:
Random forests are a bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, or in other words by the decrease in impurity (Gini impurity) averaged over all trees. Nodes with the greatest decrease in impurity occur near the start of the trees, while nodes with the least decrease in impurity occur near the end of the trees. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
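As a brief sketch, scikit-learn exposes the impurity-based ranking through the `feature_importances_` attribute of a fitted forest. The Iris dataset and the choice of 200 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]

for i in order:
    print(f"{data.feature_names[i]}: {importances[i]:.3f}")
```

Keeping the first few entries of `order` yields the subset of most important features.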
Logistic regression using statsmodels.api
D. Hybrid methods
Genetic Algorithms (GA)
In machine learning, GAs have two main uses. The first is for optimization, such as finding the best weights for a neural network. The second is for supervised feature selection. In this use case, “genes” represent individual features and the “organism” represents a candidate set of features. Each organism in the “population” is graded on a fitness score such as model performance on a hold-out set. The fittest organisms survive and reproduce, repeating until the population converges on a solution some generations later.
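The loop described above can be sketched in plain NumPy. This is a simplified illustration, not the article's implementation: the breast cancer dataset, the population size, the number of generations, the mutation rate, single-point crossover, and a fixed number of generations in place of a convergence check are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

def fitness(mask):
    """Hold-out accuracy of logistic regression on the features where mask is True."""
    if not mask.any():
        return 0.0
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train[:, mask], y_train)
    return model.score(X_val[:, mask], y_val)

pop_size, n_generations, mutation_rate = 12, 5, 0.05   # hypothetical settings
n_features = X.shape[1]

# Each organism is a Boolean mask over the features (its genes).
population = rng.random((pop_size, n_features)) < 0.5

for _ in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]
    parents = population[order[: pop_size // 2]]        # the fittest half survives
    children = []
    while len(children) < pop_size - len(parents):
        p1, p2 = parents[rng.choice(len(parents), size=2, replace=False)]
        cut = rng.integers(1, n_features)                # single-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_features) < mutation_rate    # random mutation
        children.append(np.where(flip, ~child, child))
    population = np.vstack([parents, np.array(children)])

scores = np.array([fitness(ind) for ind in population])
best_mask = population[np.argmax(scores)]
best_score = scores.max()
print(int(best_mask.sum()), "features selected, hold-out accuracy", round(best_score, 3))
```

Because the same hold-out set is used to score every generation, the final accuracy is optimistic; a separate test set would be needed for an unbiased estimate.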
Training a logistic regression model on the features chosen by the genetic algorithm, and comparing its accuracy with that of the model trained on all features, showed how accuracy improved with better feature selection.
The next article will cover feature extraction techniques such as Principal Component Analysis, Singular Value Decomposition, and Linear Discriminant Analysis. These methods help to reduce the dimensionality of the data, or the number of variables, while preserving the variance of the data.
‘Feature Selection for Data and Pattern Recognition’ by Urszula Stańczyk and Lakhmi C. Jain
‘Efficient feature selection filters for high-dimensional data’ by Artur J. Ferreira and Mário A. T. Figueiredo