OneHotEncode | python script to deploy | Machine Learning library
kandi X-RAY | OneHotEncode Summary
A python script to deploy One-Hot encoding in Pandas Dataframes
Top functions reviewed by kandi - BETA
- One-hot encode a Pandas dataframe.
OneHotEncode Key Features
OneHotEncode Examples and Code Snippets
from OneHotEncode.OneHotEncode import *
df,dropped_cols,all_new_cols,new_col_dict = OneHotEncode(df,Categorical_column_list,check_numerical=False,max_var=20)
pandas_dataframe -> The Pandas DataFrame object that contains the column(s) you want to one-hot encode
df['col_name'].fillna('most_frequent_category',inplace=True)
df['col_name'].fillna('Other',inplace=True)
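The two fillna lines above are alternative imputation strategies to apply before encoding. A minimal self-contained sketch of the most-frequent-category variant (the column name and data are made up):

```python
import pandas as pd

df = pd.DataFrame({"col_name": ["a", "b", None, "a"]})

# Fill missing values with the most frequent category in the column
most_frequent = df["col_name"].mode()[0]
df["col_name"] = df["col_name"].fillna(most_frequent)
print(df["col_name"].tolist())  # ['a', 'b', 'a', 'a']
```

The 'Other' variant is identical except that a fixed sentinel category replaces the mode.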
class OneHotEncoder(_BaseEncoder):
    def __init__(self, categories='auto', drop=None, sparse=True,
                 dtype=np.float64, handle_unknown='error'):
        self.categories = categories
        self.drop = drop
        self.sparse = sparse
        self.dtype = dtype
        self.handle_unknown = handle_unknown
# example dataframe
df = pd.DataFrame({'col1':[1,2,3],
                   'col2':['a','b','a'],
                   'col3':[4,5,6],
                   'col4':['aaa', 'bbb', 'bbb']})

   col1 col2  col3 col4
0     1    a     4  aaa
1     2    b     5  bbb
2     3    a     6  bbb
df.Majors.str.get_dummies(sep=',')

   Ceramics  Dance  Drawing  Visual Arts  Writing  Architecture  Biology  ...
0         0      0        0            1        0             0        0  ...
1         1    ...
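The str.get_dummies call above splits a delimited string column into one indicator column per distinct token. A self-contained version with made-up Majors values:

```python
import pandas as pd

df = pd.DataFrame({"Majors": ["Dance,Drawing", "Ceramics", "Drawing,Writing"]})

# Split on the separator and one-hot encode each distinct token;
# columns come out in sorted order
dummies = df["Majors"].str.get_dummies(sep=",")
print(dummies)
```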
onehotencoder1 = OneHotEncoder(categorical_features = [0])
X = onehotencoder1.fit_transform(X).toarray()

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22.
labelencoder_x = LabelEncoder()
X[:, 0] = labelencoder_x.fit_transform(X[:, 0])
onehotencoder_x = OneHotEncoder(categorical_features=[0])
X = onehotencoder_x.fit_transform(X).toarray()
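The categorical_features argument used above no longer exists in current scikit-learn; the modern equivalent selects the column with ColumnTransformer. A sketch with made-up data:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red", 1.0], ["blue", 2.0], ["red", 3.0]], dtype=object)

# One-hot encode column 0, pass the remaining columns through unchanged
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(X)
print(X_encoded)
```

Categories are sorted, so the output columns here are (blue, red, original numeric column).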
s = pd.Series(list('abca'))
Output:
0 a
1 b
2 c
3 a
pd.get_dummies(s)
Output:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
pipe1=make_pipeline(OneHot
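The truncated pipe1 line above presumably builds a pipeline around OneHotEncoder. A hedged, self-contained completion (the classifier choice and data are made up):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["blue"], ["red"], ["green"]]
y = [0, 1, 0, 1]

# Encode the categorical feature, then fit a classifier on the encoded matrix;
# handle_unknown="ignore" keeps prediction from failing on unseen categories
pipe1 = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())
pipe1.fit(X, y)
print(pipe1.predict([["red"]]))
```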
Community Discussions
Trending Discussions on OneHotEncode
QUESTION
For reference:
- Python 3.8.3
- sklearn 1.0.2
I have a scikit-learn pipeline that formats some data for me, described below. I define my pipeline like so:
ANSWER
Answered 2022-Apr-11 at 22:11

I guess this post may help. Namely, the problem should just be sklearn's version. The PRs referenced in what I posted a couple of months ago seem to have just been merged, though no new release has shipped since then. Installing the current development version of sklearn, scikit-learn 1.1.dev0, should do the trick (it did for me, at least).
You can install the nightly builds as such: pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn -U
Here's an example on a toy dataset:
QUESTION
I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used FunctionTransformer to ensure the ColumnTransformer can hstack correctly when putting together the resulting matrix.
ANSWER
Answered 2022-Apr-09 at 06:20

I think you should really look back over your basics again. Your question tells me you don't understand the function well enough to implement it effectively. Ask again when you've done enough research on your own to not embarrass yourself.
QUESTION
I'm using sklearn.pipeline to transform my features and fit a model, so my general flow looks like this: column transformer --> general pipeline --> model. I would like to be able to extract feature names from the column transformer (since the following step, the general pipeline, applies the same transformation to all columns, e.g. nan_to_zero) and use them for model explainability (e.g. feature importance). I'd also like it to work with custom transformer classes.
Here is the set up:
...ANSWER
Answered 2022-Apr-01 at 19:57

It seems the problem is generated by the encode="ordinal" parameter passed to the KBinsDiscretizer constructor. The bug is tracked in GitHub issue #22731 and GitHub issue #22841 and solved with PR #22735.
Indeed, you might see that by specifying encode="onehot" you get a consistent result:
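A minimal sketch of that workaround on illustrative data (not the asker's pipeline):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.arange(12, dtype=float).reshape(-1, 1)

# encode="onehot" returns one indicator column per bin
# instead of a single column of ordinal bin codes
disc = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform")
X_binned = disc.fit_transform(X)
print(X_binned.toarray())
```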
QUESTION
I wanted to put all of the custom transformations I apply to my data in a pipe. I thought that I could use it as pipe.fit_transform(X) to transform my X before using it in a model, but I also thought that I'd be able to append the model itself to the pipeline with pipe.steps.append(('model', self.model)) and simply use it as one.
Unfortunately, after everything was built I noticed that I was getting different results when transforming the data and using it directly in a model vs doing everything in one pipeline. Has anyone experienced anything like this?
Adding code:
...ANSWER
Answered 2022-Mar-29 at 18:07

The one transformer that stands out to me is data_cat_mix, specifically the count-of-level columns. When applied to train+test, these are consistent (but leak test information); when applied separately, the values in train will generally be much higher (just from its size being three times larger), so the model doesn't really understand how to treat them in the test set.
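The leakage described above can be avoided by learning the count-of-level mapping on the training set only and reusing it on the test set. A hedged sketch (column names and data are made up):

```python
import pandas as pd

train = pd.DataFrame({"cat": ["a", "a", "b", "a", "c"]})
test = pd.DataFrame({"cat": ["a", "b", "d"]})

# Learn level counts from train only, then map them onto both sets;
# unseen test levels get 0 instead of leaking test-set information
counts = train["cat"].value_counts()
train["cat_count"] = train["cat"].map(counts)
test["cat_count"] = test["cat"].map(counts).fillna(0)
print(test["cat_count"].tolist())  # [3.0, 1.0, 0.0]
```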
QUESTION
I have been doing some NLP categorisation tasks and noticed that my models train much faster if I use post-padding instead of pre-padding, and was wondering why that is the case.
I am using Google Colab to train these model with the GPU runtime. Here is my preprocessing code:
...ANSWER
Answered 2022-Mar-20 at 12:56

This is related to the underlying LSTM implementation. There are in fact two: a "native TensorFlow" one and a highly optimized pure CUDA implementation which is MUCH faster. However, the latter can only be used under specific conditions (certain parameter settings etc.). You can find details in the docs. The main point here is:
Inputs, if use masking, are strictly right-padded.
This implies that the pre-padding version does not use the efficient implementation, which explains the much slower runtime. I don't think there is a reasonable workaround here except for sticking with post-padding.
Note that sometimes TensorFlow actually outputs a warning message that it had to use the inefficient implementation. However, for me this has been inconsistent. Maybe keep an eye out for any additional warnings produced in the pre-padding case.
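The pre- vs post-padding distinction boils down to where the zeros go, mirroring the padding="pre"/"post" argument of Keras's pad_sequences. A dependency-free sketch of the two modes:

```python
import numpy as np

def pad(sequences, maxlen, mode="post"):
    """Pad integer sequences with zeros, either after ("post") or before ("pre")."""
    out = np.zeros((len(sequences), maxlen), dtype=int)
    for i, seq in enumerate(sequences):
        seq = seq[:maxlen]
        if mode == "post":
            out[i, :len(seq)] = seq   # data at the start, zeros at the end
        else:
            out[i, -len(seq):] = seq  # zeros at the start, data at the end
    return out

seqs = [[1, 2, 3], [4, 5]]
print(pad(seqs, 4, "post"))  # [[1 2 3 0] [4 5 0 0]]
print(pad(seqs, 4, "pre"))   # [[0 1 2 3] [0 0 4 5]]
```

Only the right-padded ("post") layout satisfies the strictly right-padded condition quoted above.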
QUESTION
The code I have so far is below and it works perfectly. However, for each number of features tested I would like to print the RFE attributes rfe.support_[i] and rfe.ranking_[i], along with the names of the selected features. Since i refers to the column index, the first attribute returns True or False (whether the column was selected) and the second returns its ranking.
In other words, I would like to print the columns considered in each RFE, so that they do not remain abstract.
...ANSWER
Answered 2022-Feb-26 at 22:29

The point is that you haven't explicitly fitted the 'DecisionTreeRegressor_2' pipeline. Indeed, though cross_val_score already takes care of fitting the estimator (as you might see here), cross_val_score does not return the estimator instance the way the .fit() method does. Therefore you're not able to access the RFE instance attributes.
Here's a toy example from your setting:
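Such a toy example might look like the following sketch: fit a separate RFE explicitly (instead of relying on cross_val_score) so that support_ and ranking_ are available afterwards. The synthetic data and feature names are made up:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, n_informative=3, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Fit RFE explicitly so its attributes can be inspected afterwards
rfe = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=2)
rfe.fit(X, y)

selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]
print(selected)        # names of the retained features
print(rfe.ranking_)    # 1 for selected features; higher = eliminated earlier
```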
QUESTION
I already referred to the posts here, here and here. Don't mark it as duplicate.
I am working on a binary classification problem where my dataset has categorical and numerical columns.
However, some of the categorical columns have a mix of numeric and string values. Nonetheless, they only indicate the category name.
For instance, I have a column called biz_category which has values like A, B, C, 4, 5 etc.
I guess the below error is thrown due to values like 4 and 5.
Therefore, I tried the below to convert them into the category datatype (but it still doesn't work):
ANSWER
Answered 2022-Feb-20 at 14:22

SMOTE requires the values in each categorical/numerical column to have a uniform datatype. Essentially, you cannot have mixed datatypes in any of the columns, in this case your biz_category column. Also, merely casting the column to categorical type does not necessarily mean that the values in that column will have a uniform datatype.
One possible solution to this problem is to re-encode the values in those columns which have mixed data types; for example, you could use LabelEncoder, but I think in your case simply changing the dtype to string would also work.
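Casting the mixed column to plain strings before resampling might look like this sketch (the resampling step itself is omitted; the point is the uniform dtype):

```python
import pandas as pd

df = pd.DataFrame({"biz_category": ["A", "B", "C", 4, 5]})

# Mixed int/str values in one column break SMOTE; cast everything to str
df["biz_category"] = df["biz_category"].astype(str)
print(df["biz_category"].tolist())  # ['A', 'B', 'C', '4', '5']
```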
QUESTION
I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model. For example, say I want the following small pipeline:
- A Stacking Regressor with two regressors:
  - A pipeline which:
    - Performs data imputation
    - 1-hot encodes categorical features
    - Performs linear regression
  - A pipeline which:
    - Performs data imputation
    - Performs regression using a Decision Tree
Unfortunately this is not possible, because StackingRegressor doesn't accept NaN in its input data. This is so even if its regressors know how to handle NaN, as they would in my case, where the regressors are actually pipelines which perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor. Now my pipeline looks like this:
- Perform data imputation
- Apply a Stacking Regressor with two regressors:
  - A pipeline which:
    - 1-hot encodes categorical features
    - Standardises numerical features
    - Performs linear regression
  - An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the entire minimal working example in this gist, with comments):
...ANSWER
Answered 2022-Feb-18 at 21:31

Imo the issue has to be ascribed to StackingRegressor. Actually, I am not an expert on its usage and I have not yet explored its source code, but I've found sklearn issue #16473, which seems to imply that << the concatenation [of regressors and meta_regressors] does not preserve dataframe >> (though this refers to the sklearn StackingRegressor, rather than the mlxtend one).
Indeed, have a look at what happens once you replace it with your sr_linear pipeline:
QUESTION
I am working on a machine learning project with very sparsely labeled data. There are several categorical features, resulting in roughly one hundred different classes between the features.
For example:
...ANSWER
Answered 2022-Feb-09 at 18:43

I have never really worked with sparse matrices, but one way is to remove the column corresponding to your nan value. Get the categories_ from your model and create a Boolean mask where it is not nan (I use pd.Series.notna, but there are probably other ways) and create a new (or reassign the) sparse matrix. Basically, add to your code:
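A possible rendering of that suggestion (illustrative data, not the asker's code; assumes a scikit-learn version where OneHotEncoder treats NaN as its own category, which it does from 0.24 on):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"cat": ["a", "b", np.nan, "a"]})

# nan is treated as its own category, so it gets its own indicator column
enc = OneHotEncoder()
X = enc.fit_transform(df[["cat"]])

# Boolean mask over the learned categories: keep everything that is not nan
mask = pd.Series(enc.categories_[0]).notna().to_numpy()
X_clean = X[:, np.flatnonzero(mask)]  # drop the nan indicator column
print(X_clean.toarray())
```

Rows whose original value was NaN end up as all-zero rows in the cleaned matrix.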
QUESTION
I want to match the output np array with the features to make a new pandas dataframe
Here is my pipeline:
...ANSWER
Answered 2022-Feb-09 at 11:13

The point is that, as of today, some transformers expose a method .get_feature_names_out() and some others do not, which generates problems whenever - for instance - you want to create a well-formatted DataFrame from the np.array outputted by a Pipeline or ColumnTransformer instance. (Also, afaik, .get_feature_names() was deprecated in the latest versions in favor of .get_feature_names_out().)
For what concerns the transformers that you are using, StandardScaler belongs to the first category (those exposing the method), while both SimpleImputer and OrdinalEncoder belong to the second. The docs show the exposed methods within the Methods paragraphs. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline, but it would cause problems as well on your categorical_preprocessing and continuous_preprocessing pipelines (as in both cases at least one transformer lacks the method) and on the preprocessing ColumnTransformer instance.
There's an ongoing attempt in sklearn to enrich all estimators with the .get_feature_names_out() method. It is tracked in GitHub issue #21308 which, as you might see, branches into many PRs (each one dealing with a specific module): for instance, #21079 for the preprocessing module, which will enrich the OrdinalEncoder among others, and #21078 for the impute module, which will enrich the SimpleImputer. I guess they'll be available in a new release as soon as all the referenced PRs are merged.
In the meanwhile, imo, you should go with a custom solution that fits your needs. Here's a simple example, which does not necessarily resemble your exact need, but which is meant to show a possible way of proceeding:
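One such custom solution (purely illustrative, not the answerer's exact code: the helper name, columns, and fallback behaviour are assumptions) asks each fitted transformer for its output names and falls back to the input column names when .get_feature_names_out() is missing:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def safe_feature_names(column_transformer):
    """Collect output feature names from a fitted ColumnTransformer,
    falling back to the input columns for transformers that do not
    expose get_feature_names_out()."""
    names = []
    for name, trans, cols in column_transformer.transformers_:
        if trans == "drop":
            continue
        if hasattr(trans, "get_feature_names_out"):
            names.extend(trans.get_feature_names_out(cols))
        else:
            names.extend(cols)  # fallback: assume a 1:1 column mapping
    return names

X = pd.DataFrame({"num1": [1.0, np.nan, 3.0], "num2": [4.0, 5.0, 6.0]})
ct = ColumnTransformer([
    ("impute", SimpleImputer(), ["num1"]),
    ("scale", StandardScaler(), ["num2"]),
])
out = ct.fit_transform(X)
df_out = pd.DataFrame(out, columns=safe_feature_names(ct))
print(df_out.columns.tolist())  # ['num1', 'num2']
```

The fallback only works for transformers that map input columns 1:1 to output columns; one-to-many transformers (like one-hot encoders) without the method would need special handling.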
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install OneHotEncode
You can use OneHotEncode like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.