kandi X-RAY | imbalanced-learn Summary
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Top functions reviewed by kandi - BETA
- Create a classification report
- Compute sensitivity and specificity support
- Decorator to make an index balanced scoring
- Calculate the geometric mean
- Generate an imbalanced dataset
- Construct a sampling strategy dictionary
- Check the sampling strategy
- Calculate sampling strategy
- Check if the given classifier has the correct classification
- Simulate an imbalanced dataset
- Plot scatter plot
- Default sampling strategy
- Return the sampling strategy for the given sampling type
- Fits a balanced Batch model
- Make a marginal plot
- Plot the decision function
- Returns the sampling strategy for each class
- Return a function to resolve linkcode
- Check that the sampler fits correctly
- Create a classification dataset
- Returns the sampling strategy for the given sampling type
- Import keras module
- Parametrize estimator
- Check whether samples are in danger or are noise
- Decorator to make a scoring function
- Calculate the geometric mean score
- Fetch datasets from Zenodo
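Several of the entries above deal with constructing and checking a sampling-strategy dictionary. As a rough illustration of the idea (a pure-Python sketch, not imbalanced-learn's actual implementation), an over-sampling strategy can be expressed as a mapping from each minority class to the sample count it should be raised to:

```python
from collections import Counter

def oversampling_strategy(y):
    # Illustrative sketch only -- not imbalanced-learn's actual code.
    # For an over-sampling strategy that targets everything but the majority,
    # each minority class is resampled up to the majority class count.
    counts = Counter(y)
    n_majority = max(counts.values())
    return {label: n_majority for label, n in counts.items() if n < n_majority}

print(oversampling_strategy([0] * 10 + [1] * 3))  # {1: 10}
```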
imbalanced-learn Key Features
imbalanced-learn Examples and Code Snippets
from imblearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from evopreprocess.
""" ============================== Compare over-sampling samplers ============================== The following example attends to make a qualitative comparison between the different over-sampling algorithms available in the imbalanced-learn package.
""" =============================== Compare under-sampling samplers =============================== The following example attends to make a qualitative comparison between the different under-sampling algorithms available in the imbalanced-learn pack
""" ========================================================== Fitting model on imbalanced datasets and how to fight bias ========================================================== This example illustrates the problem induced by learning on datasets
Trending Discussions on imbalanced-learn
My code is:...
ANSWER: Answered 2022-Mar-24 at 14:44
I think you're looking at the wrong documentation. That one is for version 0.3.0-dev, so I checked https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.TomekLinks.html -- this parameter has been deprecated in a newer version.
Also, according to the documentation, it seems you have to specify it in the make_classification function, as below:
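As a hedged sketch of that suggestion (parameter values are illustrative), the class imbalance is declared up front through make_classification's weights argument rather than a sampler parameter:

```python
from collections import Counter
from sklearn.datasets import make_classification

# Sketch: the class ratio is specified via `weights` when generating the data.
X, y = make_classification(
    n_samples=1000,
    n_classes=2,
    weights=[0.9, 0.1],  # roughly 90% majority / 10% minority
    random_state=0,
)
print(Counter(y))
```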
I already referred these two posts:
Please don't mark this as a duplicate.
I am trying to get the feature names from a bagging classifier (which does not have inbuilt feature importance).
I have the below sample data and code based on those related posts linked above...
ANSWER: Answered 2022-Mar-19 at 12:08
You could call the load_iris function without any parameters; this way, the return of the function will be a Bunch object (a dictionary-like object) with some attributes. The most relevant, for your use case, would be bunch.data (the feature matrix),
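A minimal sketch of that suggestion:

```python
from sklearn.datasets import load_iris

# Calling load_iris() with no arguments returns a Bunch, a dictionary-like
# object whose entries can also be accessed with dot notation.
bunch = load_iris()
print(bunch.data.shape)      # (150, 4) -- the feature matrix
print(bunch.feature_names)   # names of the four feature columns
print(bunch.target.shape)    # (150,) -- the labels
```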
I am working on a binary classification problem where my dataset has categorical and numerical columns.
However, some of the categorical columns have a mix of numeric and string values. Nonetheless, they only indicate the category name.
For instance, I have a column called biz_category which has values like
I guess the below error is thrown due to values like 4 and 5. Therefore, I tried the below to convert them into the category datatype (but it still doesn't work).
ANSWER: Answered 2022-Feb-20 at 14:22
SMOTE requires the values in each categorical/numerical column to have a uniform datatype. Essentially, you cannot have mixed datatypes in any of the columns (in this case, your biz_category column). Also, merely casting the column to categorical type does not necessarily mean that the values in that column will have a uniform datatype.

One possible solution is to re-encode the values in the columns that have mixed datatypes; for example, you could use LabelEncoder, but I think in your case simply converting the values to string would also work.
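A minimal pure-Python sketch of that re-encoding idea (the example values are made up; scikit-learn's LabelEncoder would do the same job):

```python
# A column with mixed int/str category labels, as in the biz_category
# situation described above. Values here are invented for illustration.
biz_category = [4, "retail", 5, "wholesale", 4]

as_str = [str(v) for v in biz_category]                 # uniform dtype
labels = {v: i for i, v in enumerate(sorted(set(as_str)))}
encoded = [labels[v] for v in as_str]
print(encoded)  # [0, 2, 1, 3, 0]
```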
I have an Azure ML Workspace which comes by default with some pre-installed packages.
I tried to install...
ANSWER: Answered 2022-Feb-15 at 14:23
I want to download a UNIX Python wheel onto my Windows PC, to later install it on a UNIX server with no internet access
ANSWER: Answered 2022-Feb-09 at 12:20
On my Ubuntu x86_64:
I am trying to install conda on EMR; below is my bootstrap script. It looks like conda is getting installed, but it is not being added to the environment variables. When I manually update the $PATH variable on the EMR master node, it can identify conda. I want to use conda on Zeppelin.

I also tried adding the config below while launching my EMR instance; however, I still get the error mentioned below....
ANSWER: Answered 2022-Feb-05 at 00:17
I got conda working by modifying the script as below; the EMR Python versions were colliding with the conda version:
I am trying to run AI Fairness 360 metrics on scikit-learn (imbalanced-learn) algorithms, but I have a problem with my code. The problem is that when I apply imbalanced-learn algorithms like SMOTE, they return a numpy array, while AI Fairness 360 preprocessing methods return a BinaryLabelDataset, and the metrics expect an object of the BinaryLabelDataset class. I am stuck on how to convert my arrays to a BinaryLabelDataset to be able to use the measures.

My preprocessing algorithm needs to receive X, Y, so I split the dataset into X and Y before calling the SMOTE method. Before using SMOTE the dataset was a StandardDataset and it was fine to use the metrics, but the problem arises after I use SMOTE because it converts the data to a numpy array.
I got the following error after running the code :...
ANSWER: Answered 2021-Sep-21 at 17:34
You are correct that the problem is with y_pred. You can concatenate it to X_test, transform it to a StandardDataset object, and then pass that one to the BinaryLabelDatasetMetric. The output object will have the methods for calculating different fairness metrics. I do not know what your dataset looks like, but here is a complete reproducible example that you can adapt to do this process for your dataset.
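The concatenation step can be sketched as follows (the data and column names are made up; the final StandardDataset and BinaryLabelDatasetMetric calls from aif360 are only indicated in a comment):

```python
import numpy as np
import pandas as pd

# Glue predictions back onto the test features as a labelled DataFrame,
# which is the kind of input a StandardDataset-style constructor expects.
X_test = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, 0.3]])
y_pred = np.array([1, 0, 1])

df = pd.DataFrame(X_test, columns=["feat_a", "feat_b"])
df["label"] = y_pred
print(df.shape)  # (3, 3)
# From here, df would be handed to aif360's StandardDataset constructor,
# and the result passed to BinaryLabelDatasetMetric.
```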
I'm trying to install a conda environment using the command:...
ANSWER: Answered 2021-Dec-22 at 18:02
This particular environment specification ends up installing well over 300 packages, and not a single one of them is constrained by the specification. That is a huge SAT problem to solve, and Conda will struggle with it. Mamba will solve faster, but providing additional constraints can vastly reduce the solution space.
At minimum, specify a Python version (major.minor), such as python=3.9. This is the single most effective constraint.
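A sketch of such a constraint in an environment YAML (the package names and bounds are illustrative):

```yaml
name: devenv
channels:
  - conda-forge
dependencies:
  - python=3.9        # major.minor pin: the single most effective constraint
  - numpy>=1.20       # minimum bounds on central packages shrink the search space
  - scikit-learn
```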
Beyond that, putting minimum requirements on central packages (those that are dependencies of others) can help, such as a minimum NumPy version.

Lack of Modularization
I assume the name "devenv" means this is a development environment, so I get that one wants all these tools immediately at hand. However, Conda environment activation is so simple, and most IDE tooling these days (Spyder, VSCode, Jupyter) encourages separation of infrastructure and the execution kernel. Being more thoughtful about how environments (emphasis on the plural) are organized and work together can go a long way toward a sustainable and painless data science workflow.
The environment at hand has multiple red flags in my book:
- conda-build should be in base and only in base
- snakemake should be in a dedicated environment
- notebook (i.e., Jupyter) should be in a dedicated environment, co-installed with nb_conda_kernels; all kernel environments need are
I'd probably also have the linting/formatting packages separated, but that's less of an issue. The real killer, though, is snakemake: it's just a massive piece of infrastructure, and I'd strongly encourage keeping it separated.
I tried looking for a similar problem but I can't find an answer and the error does not help much. I'm kinda frustrated at this point. Thanks for the help. I'm calculating the closest distance from a point....
ANSWER: Answered 2021-Oct-11 at 14:21
- Have noted that your data is on Kaggle, so start by sourcing it
- There really is only one issue: the shapely.geometry.MultiPoint() constructor does not work with a filtered series. Pass it a numpy array instead and it works.
- Full code below; have randomly selected a point to serve as
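The fix from the second bullet can be sketched without shapely (the data and column names are made up; the resulting array is the kind of input shapely.geometry.MultiPoint would accept):

```python
import numpy as np
import pandas as pd

# A boolean-filtered selection is converted to a plain NumPy array before
# being handed to a constructor that rejects pandas objects.
df = pd.DataFrame({
    "x": [0.0, 1.0, 2.0],
    "y": [0.0, 1.0, 4.0],
    "keep": [True, False, True],
})

coords = df.loc[df["keep"], ["x", "y"]].to_numpy()  # ndarray, not a DataFrame
print(coords.shape)  # (2, 2)
# MultiPoint(coords) would now work where the filtered pandas object failed.
```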
I want to deploy a machine learning model and have the environment yml file and the model pickle file. When I include scikit-learn=0.23.2 in the dependencies, conda automatically uninstalls this scikit-learn version and installs scikit-learn=0.24.2. Therefore, I get the following warning when I load the pickle file.
UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.23.2 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
Here is the environment:...
ANSWER: Answered 2021-Aug-13 at 16:38
Whatever is in the pip: section of a Conda environment YAML gets installed after the Conda environment is created, and is run with the pip install -U command. The -U gives pip permission to upgrade any existing packages if that is necessary to install the specified packages. In this particular case, the version of imblearn must be incompatible with the scikit-learn version you have selected.
Technically, you should be using imblearn, as stated in the package description. That also means you don't even need to install from PyPI, since imbalanced-learn is available through Conda Forge.
If you require having scikit-learn=0.23, then you must use imbalanced-learn=0.7. This should be under the regular dependencies, not in the pip: section.
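A sketch of the suggested environment file (the name and channel are illustrative):

```yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - scikit-learn=0.23
  - imbalanced-learn=0.7   # under regular dependencies, not under pip:
```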
No vulnerabilities reported
You can use imbalanced-learn like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.