feature_selection | feature selection by using random forest | Data Mining library
kandi X-RAY | feature_selection Summary
Use random forest to find important features.
Community Discussions
Trending Discussions on feature_selection
QUESTION
I want to do feature selection, and I used a random forest classifier, but in two different ways. I used sklearn.feature_selection.SelectFromModel(estimator=RandomForestClassifier(...)) and also used a random forest classifier standalone. It was surprising to find that, although I used the same classifier, the results were different: except for some two features, all the others were different. Can someone explain why this is so? Maybe it is because the parameters change in these two cases?
ANSWER
Answered 2021-Jun-06 at 17:10

This could be because SelectFromModel refits the estimator by default, and sklearn.ensemble.RandomForestClassifier has two pseudo-random parameters: bootstrap, which is set to True by default, and max_features, which is set to 'auto' by default.

If you did not set a random_state in your RandomForestClassifier estimator, it will most likely yield different results every time you fit the model, because of the randomness introduced by the bootstrap and max_features parameters, even on the same training data.

bootstrap=True means that each tree is trained on a random sample (with replacement) of a certain percentage of the observations in the training dataset. max_features='auto' means that when building each node, only the square root of the number of features in your training data is considered when picking the cutoff point that most reduces the Gini impurity.

You can do two things to ensure you get the same results:

- Train your estimator first and then use SelectFromModel(random_forest_classifier, prefit=True), so the already-fitted forest is reused rather than refit.
- Declare the RandomForestClassifier with a random seed and then use SelectFromModel as usual.

Needless to say, both options require you to pass the same X and y data.
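For illustration, here is a minimal sketch of the two options; the make_classification data and the random_state value are placeholders for the asker's own setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Illustrative data; substitute your own X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Option 1: fit the forest yourself, then wrap it with prefit=True so
# SelectFromModel reuses the already-fitted estimator.
rf = RandomForestClassifier(random_state=42).fit(X, y)
selector_prefit = SelectFromModel(rf, prefit=True)
mask_prefit = selector_prefit.get_support()

# Option 2: fix the random seed and let SelectFromModel do the fitting.
selector_refit = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
mask_refit = selector_refit.get_support()

# With the same seed and the same X, y, both selections agree.
print((mask_prefit == mask_refit).all())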
QUESTION
I was using TPOTClassifier() and got the following pipeline as my optimal pipeline. I am attaching the pipeline code that it produced. Can someone explain the pipeline's processing steps and their order?
...ANSWER
Answered 2021-May-20 at 14:28

make_union just unions multiple datasets, and FunctionTransformer(copy) duplicates all the columns, so the nested make_union and FunctionTransformer(copy) calls make several copies of each feature. That seems very odd, except that with ExtraTreesClassifier it has the effect of "bootstrapping" the feature selections. See also Issue 581 for an explanation of why these are generated in the first place; basically, adding copies is useful in stacking ensembles, and the genetic algorithm used by TPOT means it needs to generate those first before exploring such ensembles. It is recommended there that running more iterations of the genetic algorithm may clean up such artifacts.

After that, things are straightforward: you perform a univariate feature selection and fit an extra-random-trees classifier.
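As a rough illustration only (not the asker's exact exported pipeline, which is not shown here), a TPOT-style pipeline of that shape might look like this; the percentile and n_estimators values are made up:

from copy import copy
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

pipeline = make_pipeline(
    # Each FunctionTransformer(copy) passes the data through unchanged,
    # so the union stacks several copies of every feature side by side.
    make_union(
        FunctionTransformer(copy),
        make_union(FunctionTransformer(copy), FunctionTransformer(copy)),
    ),
    # Univariate feature selection over the duplicated feature matrix.
    SelectPercentile(score_func=f_classif, percentile=50),
    # Final extra-random-trees classifier.
    ExtraTreesClassifier(n_estimators=100, random_state=0),
)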
QUESTION
I'm trying to use SequentialFeatureSelector, and for the estimator parameter I'm passing it a pipeline that includes a step that imputes the missing values:
ANSWER
Answered 2021-Feb-08 at 21:16

Scikit-learn's documentation does not state that SequentialFeatureSelector works with pipeline objects. It only states that the class accepts an unfitted estimator. In view of this, you could remove the classifier from your pipeline, preprocess X, and then pass it along with an unfitted classifier for feature selection, as shown in the example below.
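A minimal sketch of that approach, with SimpleImputer and LogisticRegression standing in for whatever the original pipeline contained:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy data with missing values; substitute your own X and y.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0],
              [1.0, 2.0, 3.0],
              [2.0, np.nan, 1.0],
              [5.0, 6.0, 2.0]])
y = np.array([0, 1, 0, 1, 0, 1])

# Preprocess X outside the selector ...
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# ... then pass the preprocessed data together with an unfitted classifier.
sfs = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=2, cv=2)
sfs.fit(X_imputed, y)
print(sfs.get_support())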
QUESTION
I am trying to scale my data within the cross-validation folds of an mlens SuperLearner pipeline. When I use StandardScaler in the pipeline (as demonstrated below), I receive the following warning:
/miniconda3/envs/r_env/lib/python3.7/site-packages/mlens/parallel/_base_functions.py:226: MetricWarning: [pipeline-1.mlpclassifier.0.2] Could not score pipeline-1.mlpclassifier. Details: ValueError("Classification metrics can't handle a mix of binary and continuous-multioutput targets") (name, inst_name, exc), MetricWarning)
Of note, when I omit the StandardScaler() the warning disappears, but the data is not scaled.
...ANSWER
Answered 2021-Apr-06 at 21:50

You are currently passing your preprocessing steps as two separate arguments when calling the add method. You can instead combine them as follows:
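A sketch under the assumption that mlens's add() accepts the estimator list plus a single preprocessing list of transformers (applied inside each fold); the MLPClassifier, second base learner, and meta-learner choices here are illustrative:

from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

ensemble = SuperLearner(scorer=accuracy_score, folds=5, random_state=42)

# Pass the scaler (and any other transformers) as one preprocessing list,
# rather than as a second positional argument next to the estimators.
ensemble.add(
    [MLPClassifier(max_iter=500), LogisticRegression(max_iter=500)],
    preprocessing=[StandardScaler()],
)
ensemble.add_meta(LogisticRegression())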
QUESTION
I want to run machine learning algorithms from the sklearn library on all my cores, using the Dask and joblib libraries.

My code for joblib.parallel_backend with Dask:
...ANSWER
Answered 2021-Apr-06 at 13:49

The Dask joblib backend will not be able to parallelize all scikit-learn models, only some of them, as indicated in the Parallelism docs. This is because many scikit-learn models only support sequential training, either due to their algorithm implementations or because parallel support has not been added.

Dask will only be able to parallelize models that have an n_jobs parameter, which indicates that the scikit-learn model is written in a way that supports parallel training. RFE and DecisionTreeClassifier do not have an n_jobs parameter. I wrote this gist that you can run to get a full list of the models that support parallel training.
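A sketch of the pattern with a model that does expose n_jobs (RandomForestClassifier); the LocalCluster sizing and the toy dataset are illustrative:

import joblib
from dask.distributed import Client, LocalCluster
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative local cluster; point Client at your own scheduler if you have one.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)  # creating the Client registers the "dask" joblib backend

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# RandomForestClassifier has n_jobs, so its trees can be trained in parallel
# and the work dispatched through the Dask backend.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
with joblib.parallel_backend("dask"):
    model.fit(X, y)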
QUESTION
According to RandomizedSearchCV documentation (emphasis mine):
param_distributions: dict or list of dicts
Dictionary with parameters names (str) as keys and distributions or lists of parameters to try. Distributions must provide a rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.
If my understanding of the above is correct, both algorithms (XGBClassifier and LogisticRegression) in the following example should be sampled with high probability (>99%), given n_iter = 10.
...ANSWER
Answered 2021-Mar-24 at 02:51

Yes, this is incorrect behavior. There's an Issue filed: when all the entries are lists (none are scipy distributions), the current code selects points from the ParameterGrid, which means it will disproportionately choose points from the larger dictionary-grid in your list.

Until a fix gets merged, you might be able to work around this by using a scipy distribution for something you don't care about, say for verbose?
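A sketch of that workaround; the pipeline step name "clf", the RandomForestClassifier standing in for XGBClassifier, and the parameter ranges are all invented for illustration:

from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical pipeline with a swappable "clf" step.
pipe = Pipeline([("clf", LogisticRegression())])

param_distributions = [
    {
        "clf": [RandomForestClassifier(random_state=0)],
        "clf__max_depth": [3, 5, 7],
        # One rvs-capable distribution is enough to stop the sampler from
        # flattening everything into a single ParameterGrid.
        "clf__n_estimators": stats.randint(50, 200),
    },
    {
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0, 10.0],
    },
]

search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, random_state=0)
search.fit(X, y)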
QUESTION
The following code works using a random forest model to give me a chart showing feature importance:
...ANSWER
Answered 2021-Mar-22 at 17:54

Logistic regression does not have an attribute for ranking features. If you want to show feature importance, you can visualize the coefficients instead. Basically, we assume that bigger coefficients contribute more to the model, but you have to be sure that the features are on the same scale, otherwise this assumption is not correct. Note that some coefficients can be negative, so your plot will look different; if you want to order them as in your plot, you can convert them to their absolute values.

After you fit the logistic regression model, you can visualize the coefficients:
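A minimal sketch using a built-in dataset; the breast cancer data, the scaling step, and the horizontal bar chart are just illustrative choices:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scaling puts all features on the same scale so coefficient magnitudes are comparable.
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X, data.target)

# Absolute values handle the negative coefficients mentioned above.
coefs = pd.Series(np.abs(model.coef_[0]), index=data.feature_names)
coefs.sort_values().plot.barh(figsize=(6, 8))
plt.title("Logistic regression coefficient magnitudes")
plt.tight_layout()
plt.show()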
QUESTION
I would like to be able to reproduce sklearn SelectKBest results when using GridSearchCV by performing the grid-search CV myself. However, I find that my code produces different results. Here is a reproducible example:
ANSWER
Answered 2021-Mar-15 at 17:45

Edit: restructured my answer, since it seems you are after more of a "why?" and "how should I?" than a "how can I?"

The Issue

The scorer that you're using in GridSearchCV isn't being passed the output of predict_proba like it is in your loop version. It's being passed the output of decision_function. For SVMs, the argmax of the probabilities may differ from the decisions, as described here:

How I Would Fix It

The cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores:

- the "argmax" of the scores may not be the argmax of the probabilities
- in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.

Use SVC(probability=False, ...) in both the Pipeline/GridSearchCV approach and the loop, and decision_function in the loop instead of predict_proba. According to the blurb above, this will also speed up your code.

To make your loop match GridSearchCV, leaving the GridSearchCV approach alone:
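A sketch of such a loop, with make_classification data and a SelectKBest + SVC pipeline standing in for the original code, and roc_auc as the assumed scoring metric:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("svc", SVC(probability=False, random_state=0)),  # no Platt scaling
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    pipe.fit(X[train_idx], y[train_idx])
    # Score on decision_function output, which is what GridSearchCV's
    # roc_auc scorer uses when predict_proba is unavailable.
    decision = pipe.decision_function(X[test_idx])
    scores.append(roc_auc_score(y[test_idx], decision))

print(np.mean(scores))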
QUESTION
I'm using slightly modified code from here: Ensemble Methods: Tuning a XGBoost model with Scikit-Learn
When I execute it, I keep getting this error:
...ANSWER
Answered 2021-Mar-07 at 01:02

There are 4 features (Number1, Color1, Number2, Trait1).

SelectKBest will select the K most explanatory features out of the original set, so K should be a value greater than 0 and lower than or equal to the total number of features.

You are setting the GridSearch object to always use 10 in this line:
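The original grid is not shown, but an illustrative fix looks like this: keep the searched k values within the number of available features (here at most 4). The GradientBoostingClassifier stands in for the XGBoost model from the tutorial.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# With only 4 features, k can be at most 4; searching k=10 raises an error.
param_grid = {"select__k": [1, 2, 3, 4]}
search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(X_train, y_train) would now run without the "k > n_features" error.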
QUESTION
I have been experimenting with RFECV on the Boston dataset.
My understanding, thus far, is that to prevent data leakage it is important to perform activities such as this only on the training data and not on the whole dataset.
I performed RFECV on just the training data, and it indicated that 13 of the 14 features are optimal. However, I then ran the same process on the whole dataset, and this time around, it indicated that only 6 of the features are optimal - which seems more likely.
To illustrate:
...ANSWER
Answered 2021-Feb-19 at 16:40

It certainly seems like unexpected behavior, especially when, as you say, you can reduce the test size to 10% or even 5% and find a similar disparity, which seems very counter-intuitive. The key to understanding what's going on here is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you're using to split the data has a shuffle parameter which defaults to True. So if you look at the X_train dataset, you'll see that the index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it is getting a biased subset of the data in each split of X, but a more representative/random subset of the data in each split of X_train. If you pass shuffle=False to train_test_split, you'll see that the two results are much closer (or, alternatively and probably better, try shuffling the index of X).
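A sketch of the comparison with shuffle=False, using the California housing data as a stand-in for the now-removed Boston dataset; the RandomForestRegressor estimator and cv=5 are illustrative:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# California housing as a stand-in for the Boston data used in the question.
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# shuffle=False keeps X_train ordered like X, so both RFECV runs see similarly
# ordered splits and the selected feature counts come out much closer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, shuffle=False
)

estimator = RandomForestRegressor(n_estimators=20, random_state=0)
n_on_train = RFECV(estimator, cv=5).fit(X_train, y_train).n_features_
n_on_full = RFECV(estimator, cv=5).fit(X, y).n_features_
print(n_on_train, n_on_full)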
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install feature_selection
You can use feature_selection like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.