feature_selection | feature selection by using random forest | Data Mining library
kandi X-RAY | feature_selection Summary
Use random forest to find important features.
Community Discussions
Trending Discussions on feature_selection
QUESTION
I want to do feature selection, and I used a random forest classifier, but in two different ways. I used sklearn.feature_selection.SelectFromModel(estimator=RandomForestClassifier(...)) and also used a random forest classifier standalone. It was surprising to find that, although I used the same classifier, the results were different: except for some two features, all the others were different. Can someone explain why this is so? Maybe it is because the parameters change in these two cases?
ANSWER
Answered 2021-Jun-06 at 17:10

This could be because SelectFromModel refits the estimator by default, and sklearn.ensemble.RandomForestClassifier has two pseudo-random parameters: bootstrap, which is set to True by default, and max_features, which is set to 'auto' by default.

If you did not set a random_state in your RandomForestClassifier estimator, it will most likely yield different results every time you fit the model, because of the randomness introduced by the bootstrap and max_features parameters, even on the same training data.

bootstrap=True means that each tree is trained on a random sample (with replacement) of a certain percentage of the observations in the training dataset. max_features='auto' means that when building each node, only the square root of the number of features in your training data is considered when picking the cutoff point that most reduces the Gini impurity.

You can do two things to ensure you get the same results:

- Train your estimator first and then use SelectFromModel(random_forest_classifier, prefit=True), so the already-fitted forest is reused rather than refit.
- Declare the RandomForestClassifier with a random seed and then use SelectFromModel as usual.

Needless to say, both options require you to pass the same X and y data.
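For illustration, here is a minimal sketch of the two options; the make_classification data and the random_state value are placeholders for the asker's own setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Illustrative data; substitute your own X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Option 1: fit the forest yourself, then wrap it with prefit=True so
# SelectFromModel reuses the already-fitted estimator.
rf = RandomForestClassifier(random_state=42).fit(X, y)
selector_prefit = SelectFromModel(rf, prefit=True)
mask_prefit = selector_prefit.get_support()

# Option 2: fix the random seed and let SelectFromModel do the fitting.
selector_refit = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
mask_refit = selector_refit.get_support()

# With the same seed and the same X, y, both selections agree.
print((mask_prefit == mask_refit).all())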
QUESTION
I was using TPOTClassifier() and got the following pipeline as my optimal pipeline. I am attaching the pipeline code that it produced. Can someone explain the pipeline's processing steps and their order?
...ANSWER
Answered 2021-May-20 at 14:28

make_union just unions multiple datasets, and FunctionTransformer(copy) duplicates all the columns, so the nested make_union and FunctionTransformer(copy) calls make several copies of each feature. That seems very odd, except that with ExtraTreesClassifier it has the effect of "bootstrapping" the feature selections. See also Issue 581 for an explanation of why these are generated in the first place; basically, adding copies is useful in stacking ensembles, and the genetic algorithm used by TPOT means it needs to generate those first before exploring such ensembles. It is recommended there that running more iterations of the genetic algorithm may clean up such artifacts.

After that, things are straightforward: you perform a univariate feature selection and fit an extra-random-trees classifier.
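As a rough illustration only (not the asker's exact exported pipeline, which is not shown here), a TPOT-style pipeline of that shape might look like this; the percentile and n_estimators values are made up:

from copy import copy
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
    # Each FunctionTransformer(copy) passes the data through unchanged,
    # so the union stacks several copies of every feature side by side.
    make_union(
        FunctionTransformer(copy),
        make_union(FunctionTransformer(copy), FunctionTransformer(copy)),
    ),
    # Univariate feature selection over the duplicated feature matrix.
    SelectPercentile(score_func=f_classif, percentile=50),
    # Final extra-random-trees classifier.
    ExtraTreesClassifier(n_estimators=100, random_state=0),
)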
QUESTION
I'm trying to use SequentialFeatureSelector, and for the estimator parameter I'm passing it a pipeline that includes a step that imputes the missing values:
ANSWER
Answered 2021-Feb-08 at 21:16

Scikit-learn's documentation does not state that SequentialFeatureSelector works with pipeline objects. It only states that the class accepts an unfitted estimator. In view of this, you could remove the classifier from your pipeline, preprocess X, and then pass it along with an unfitted classifier for feature selection, as shown in the example below.
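A minimal sketch of that approach, with SimpleImputer and LogisticRegression standing in for whatever the original pipeline contained:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy data with missing values; substitute your own X and y.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0],
              [1.0, 2.0, 3.0],
              [2.0, np.nan, 1.0],
              [5.0, 6.0, 2.0]])
y = np.array([0, 1, 0, 1, 0, 1])

# Preprocess X outside the selector ...
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# ... then pass the preprocessed data together with an unfitted classifier.
sfs = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=2, cv=2)
sfs.fit(X_imputed, y)
print(sfs.get_support())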
QUESTION
I am trying to scale my data within the cross-validation folds of an mlens SuperLearner pipeline. When I use StandardScaler in the pipeline (as demonstrated below), I receive the following warning:
/miniconda3/envs/r_env/lib/python3.7/site-packages/mlens/parallel/_base_functions.py:226: MetricWarning: [pipeline-1.mlpclassifier.0.2] Could not score pipeline-1.mlpclassifier. Details: ValueError("Classification metrics can't handle a mix of binary and continuous-multioutput targets") (name, inst_name, exc), MetricWarning)
Of note, when I omit the StandardScaler() the warning disappears, but the data is not scaled.
...ANSWER
Answered 2021-Apr-06 at 21:50

You are currently passing your preprocessing steps as two separate arguments when calling the add method. You can instead combine them as follows:
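A sketch under the assumption that mlens's add() accepts the estimator list plus a single preprocessing list of transformers (applied inside each fold); the MLPClassifier, second base learner, and meta-learner choices here are illustrative:

from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

ensemble = SuperLearner(scorer=accuracy_score, folds=5, random_state=42)

# Pass the scaler (and any other transformers) as one preprocessing list,
# rather than as a second positional argument next to the estimators.
ensemble.add(
    [MLPClassifier(max_iter=500), LogisticRegression(max_iter=500)],
    preprocessing=[StandardScaler()],
)
ensemble.add_meta(LogisticRegression())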
QUESTION
I want to run machine learning algorithms from the sklearn library on all my cores, using the Dask and joblib libraries.

My code for joblib.parallel_backend with Dask:
...ANSWER
Answered 2021-Apr-06 at 13:49

The Dask joblib backend will not be able to parallelize all scikit-learn models, only some of them, as indicated in the Parallelism docs. This is because many scikit-learn models only support sequential training, either due to their algorithm implementations or because parallel support has not been added.

Dask will only be able to parallelize models that have an n_jobs parameter, which indicates that the scikit-learn model is written in a way that supports parallel training. RFE and DecisionTreeClassifier do not have an n_jobs parameter. I wrote this gist that you can run to get a full list of the models that support parallel training.
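A sketch of the pattern with a model that does expose n_jobs (RandomForestClassifier); the LocalCluster sizing and the toy dataset are illustrative:

import joblib
from dask.distributed import Client, LocalCluster
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative local cluster; point Client at your own scheduler if you have one.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)  # creating the Client registers the "dask" joblib backend

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# RandomForestClassifier has n_jobs, so its trees can be trained in parallel
# and the work dispatched through the Dask backend.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
with joblib.parallel_backend("dask"):
    model.fit(X, y)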
QUESTION
According to RandomizedSearchCV documentation (emphasis mine):
param_distributions: dict or list of dicts
Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.
If my understanding of the above is correct, both algorithms (XGBClassifier and LogisticRegression) in the following example should be sampled with high probability (>99%), given n_iter = 10.
...ANSWER
Answered 2021-Mar-24 at 02:51

Yes, this is incorrect behavior. There's an Issue filed: when all the entries are lists (none are scipy distributions), the current code selects points from the ParameterGrid, which means it will disproportionately choose points from the larger dictionary-grid in your list.

Until a fix gets merged, you might be able to work around this by using a scipy distribution for something you don't care about, say for verbose?
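A sketch of that workaround; the pipeline step name "clf", the RandomForestClassifier standing in for XGBClassifier, and the parameter ranges are all invented for illustration:

from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical pipeline with a swappable "clf" step.
pipe = Pipeline([("clf", LogisticRegression())])

param_distributions = [
    {
        "clf": [RandomForestClassifier(random_state=0)],
        "clf__max_depth": [3, 5, 7],
        # One rvs-capable distribution is enough to stop the sampler from
        # flattening everything into a single ParameterGrid.
        "clf__n_estimators": stats.randint(50, 200),
    },
    {
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0, 10.0],
    },
]

search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, random_state=0)
search.fit(X, y)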
QUESTION
The following code works using a random forest model to give me a chart showing feature importance:
...ANSWER
Answered 2021-Mar-22 at 17:54

Logistic regression does not have an attribute for ranking features. If you want to show feature importance, you can visualize the coefficients instead. Basically, we assume that bigger coefficients contribute more to the model, but you have to be sure that the features are on the same scale, otherwise this assumption is not correct. Note that some coefficients can be negative, so your plot will look different; if you want to order them as in your plot, you can convert them to their absolute values.

After you fit the logistic regression model, you can visualize the coefficients:
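A minimal sketch using a built-in dataset; the breast cancer data, the scaling step, and the horizontal bar chart are just illustrative choices:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scaling puts all features on the same scale so coefficient magnitudes are comparable.
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X, data.target)

# Absolute values handle the negative coefficients mentioned above.
coefs = pd.Series(np.abs(model.coef_[0]), index=data.feature_names)
coefs.sort_values().plot.barh(figsize=(6, 8))
plt.title("Logistic regression coefficient magnitudes")
plt.tight_layout()
plt.show()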
QUESTION
I would like to be able to reproduce sklearn SelectKBest results when using GridSearchCV by performing the grid-search CV myself. However, I find that my code produces different results. Here is a reproducible example:
ANSWER
Answered 2021-Mar-15 at 17:45

Edit: restructured my answer, since it seems you are after more of a "why?" and "how should I?" than a "how can I?"

The Issue

The scorer that you're using in GridSearchCV isn't being passed the output of predict_proba like it is in your loop version. It's being passed the output of decision_function. For SVMs, the argmax of the probabilities may differ from the decisions, as described here:

How I Would Fix It

The cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores:

- the "argmax" of the scores may not be the argmax of the probabilities
- in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.

Use SVC(probability=False, ...) in both the Pipeline/GridSearchCV approach and the loop, and decision_function in the loop instead of predict_proba. According to the blurb above, this will also speed up your code.

To make your loop match GridSearchCV, leaving the GridSearchCV approach alone:
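A sketch of such a loop, with make_classification data and a SelectKBest + SVC pipeline standing in for the original code, and roc_auc as the assumed scoring metric:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("svc", SVC(probability=False, random_state=0)),  # no Platt scaling
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    pipe.fit(X[train_idx], y[train_idx])
    # Score on decision_function output, which is what GridSearchCV's
    # roc_auc scorer uses when predict_proba is unavailable.
    decision = pipe.decision_function(X[test_idx])
    scores.append(roc_auc_score(y[test_idx], decision))

print(np.mean(scores))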
QUESTION
I'm using slightly modified code from here: Ensemble Methods: Tuning a XGBoost model with Scikit-Learn
When I execute it, I keep getting this error:
...ANSWER
Answered 2021-Mar-07 at 01:02

There are 4 features (Number1, Color1, Number2, Trait1).

SelectKBest will select the K most explanatory features out of the original set, so K should be a value greater than 0 and lower than or equal to the total number of features.

You are setting the GridSearch object to always use 10 in this line:
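The original grid is not shown, but an illustrative fix looks like this: keep the searched k values within the number of available features (here at most 4). The GradientBoostingClassifier stands in for the XGBoost model from the tutorial.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# With only 4 features, k can be at most 4; searching k=10 raises an error.
param_grid = {"select__k": [1, 2, 3, 4]}
search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(X_train, y_train) would now run without the "k > n_features" error.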
QUESTION
I have been experimenting with RFECV on the Boston dataset.
My understanding, thus far, is that to prevent data leakage it is important to perform activities such as this only on the training data and not on the whole dataset.
I performed RFECV on just the training data, and it indicated that 13 of the 14 features are optimal. However, I then ran the same process on the whole dataset, and this time around, it indicated that only 6 of the features are optimal - which seems more likely.
To illustrate:
...ANSWER
Answered 2021-Feb-19 at 16:40

It certainly seems like unexpected behavior, especially when, as you say, you can reduce the test size to 10% or even 5% and find a similar disparity, which seems very counter-intuitive. The key to understanding what's going on here is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you're using to split the data has a shuffle parameter which defaults to True. So if you look at the X_train dataset, you'll see that the index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it is getting a biased subset of the data in each split of X, but a more representative/random subset of the data in each split of X_train. If you pass shuffle=False to train_test_split, you'll see that the two results are much closer (or, alternatively and probably better, try shuffling the index of X).
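A sketch of the comparison with shuffle=False, using the California housing data as a stand-in for the now-removed Boston dataset; the RandomForestRegressor estimator and cv=5 are illustrative:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# California housing as a stand-in for the Boston data used in the question.
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# shuffle=False keeps X_train ordered like X, so both RFECV runs see similarly
# ordered splits and the selected feature counts come out much closer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, shuffle=False
)

estimator = RandomForestRegressor(n_estimators=20, random_state=0)
n_on_train = RFECV(estimator, cv=5).fit(X_train, y_train).n_features_
n_on_full = RFECV(estimator, cv=5).fit(X, y).n_features_
print(n_on_train, n_on_full)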
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install feature_selection
You can use feature_selection like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.