estimators | Machine Learning Versioning made Simple | Machine Learning library
kandi X-RAY | estimators Summary
Machine Learning Versioning made Simple
Top functions reviewed by kandi - BETA
- Save a NumPy array to disk
- Save an object
- Set the data to be displayed
- Get the shape of a DataFrame
- Set the object
- Load the object
- Get an object property
- Compute the hash of an object
- Set the X-test data
- Return the proxy object
- Return the estimator object
- Set the proxy for the prediction
- Set the y-test data
- Set the estimator's estimator
estimators Key Features
estimators Examples and Code Snippets
Community Discussions
Trending Discussions on estimators
QUESTION
Starting from a Kaggle tutorial, I try to run the code below with the data (available to download from here):
Code:
...ANSWER
Answered 2022-Mar-17 at 10:58
The reason behind this is that StandardScaler returns a numpy.ndarray of your feature values (the same shape as pandas.DataFrame.values, but standardized), and you need to convert it back to a pandas.DataFrame with the same column names.
Here's the part of your code that needs changing.
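The fix can be sketched as follows (the small frame below is a hypothetical stand-in for the tutorial's data, not the original code):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature frame standing in for the tutorial's dataset
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)  # returns a plain numpy.ndarray

# Wrap the array back into a DataFrame, reusing the original column names
df_scaled = pd.DataFrame(scaled, columns=df.columns, index=df.index)
```

Reusing `df.columns` and `df.index` keeps the result drop-in compatible with any later code that selects columns by name.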
QUESTION
The documentation on how to use SageMaker estimators is scattered and sometimes obsolete or incorrect. Is there a one-stop location that gives a comprehensive view of how to use the SageMaker SDK Estimator to train and save models?
...ANSWER
Answered 2022-Mar-12 at 19:39
There is no single resource from AWS that provides a comprehensive view of how to use the SageMaker SDK Estimator to train and save models.
Alternative: Overview Diagram
I put together a diagram and a brief explanation to give an overview of how a SageMaker Estimator runs training. SageMaker sets up a Docker container for the training job in which:
- Environment variables are set as described in SageMaker Docker Container Environment Variables.
- Training data is set up under /opt/ml/input/data.
- Training script code is set up under /opt/ml/code.
- The /opt/ml/model and /opt/ml/output directories are set up to store training outputs.
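A minimal training script can locate those directories through the environment variables SageMaker injects into the container (a sketch; the variable names come from the SageMaker docs, and the fallbacks match the in-container layout above so the script also runs outside SageMaker):

```python
import os

# SageMaker exposes the container layout through environment variables;
# the fallback values below are the documented in-container paths.
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
output_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data")

print("training data:", train_dir)
print("model artifacts:", model_dir)
print("other outputs:", output_dir)
```

Anything the script writes under `model_dir` is packaged by SageMaker into the model.tar.gz artifact at the end of the job.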
QUESTION
In the docs it is said that the metaclassifier is trained through cross_val_predict. From my perspective this means that the data is split into folds, and all base estimators predict values on one fold while being trained on all the other folds, and that procedure repeats for every fold. Then the metaclassifier is trained on the predictions the base estimators made on those folds. Is that correct? If so, doesn't it contradict the note that estimators_ "are fitted on the full X", given that the base estimators are trained on several folds, not the full X?
ANSWER
Answered 2022-Mar-02 at 15:08
There is no contradiction, because estimators_ is not used when training the metaclassifier. After the cross-val predictions are made, you don't actually have fitted base estimators (or rather, you have multiple copies of each, depending on your cv parameter). For predicting on new data, you need a single fitted copy of each base estimator; those are obtained by fitting on the full X, and are stored in the attribute estimators_.
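To see this concretely, here is a small sketch (toy data and arbitrarily chosen base estimators, not the question's setup) showing that a fitted StackingClassifier ends up with exactly one refitted copy of each base estimator in estimators_:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the question's dataset
X, y = make_classification(n_samples=200, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)

# One copy of each base estimator, refitted on the full X; these (not the
# per-fold copies used by cross_val_predict) are what predict() relies on.
print(len(stack.estimators_))
```

The per-fold copies used to build the metaclassifier's training data are discarded after fitting; only the full-X copies survive in estimators_.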
QUESTION
Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with spark as a compute backend, all in SageMaker using their Training Job api. To clarify:
- I have to use LightGBM in general, there is no option here.
- The reason I need to use spark compute backend is because the training with the current dataset does not fit in memory anymore.
- I want to use SageMaker Training job setting so I could use SM Hyperparameter optimisation job to find the best hyperparameters for LightGBM. While LightGBM spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case SM handles resource provisioning for you, creates an instance, and so on. What I am confused about is that, to use the Spark compute backend, I would need an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
There is also the SageMaker PySpark SDK. However, the SageMakerEstimator API from that package does not support on-the-fly cluster configuration either.
Does anyone know a way how to run a Sagemaker training job that would use an EMR cluster so that later the same job could be used for hyperparameter tuning activities?
One way I see is to run an EMR cluster in the background, and then just create a regular SM estimator job that would connect to the EMR cluster and do the training, essentially running a spark driver program in SM Estimator job.
Has anyone done anything similar in the past?
Thanks
...ANSWER
Answered 2022-Feb-25 at 12:57
Thanks for your questions. Here are answers:
SageMaker PySpark SDK https://sagemaker-pyspark.readthedocs.io/en/latest/ does the opposite of what you want: being able to call a non-spark (or spark) SageMaker job from a Spark environment. Not sure that's what you need here.
Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to find and use a third-party, external parameter search library; for example Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom docker-based jobs, on one or multiple machines. If you can fit your Spark code within SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and will indeed allow you to use SM Model Tuning. I'd recommend:
- having each SM job create a new transient cluster (auto-terminating after the step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster;
- using the cheapest possible instance type for the SM estimator, because it needs to stay up for the whole duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...).
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.
QUESTION
I tried to create a stacking regressor to predict multiple outputs with SVR and a neural network as estimators, and linear regression as the final estimator.
...ANSWER
Answered 2022-Feb-25 at 00:19
Imo the point here is the following. On one side, NN models do support multi-output regression tasks on their own, which might be solved by defining an output layer similar to the one you built, namely with a number of nodes equal to the number of outputs (though, with respect to your construction, I would specify a linear activation with activation=None rather than a sigmoid activation).
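For illustration, the same point can be shown with scikit-learn's MLPRegressor, whose output layer uses a linear ('identity') activation and which accepts a multi-column y directly (a sketch on synthetic data, not the question's model):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Two regression targets -> two nodes in the output layer
y = np.column_stack([X.sum(axis=1), X[:, 0] - X[:, 1]])

mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)

print(mlp.predict(X[:3]).shape)  # one column per output
print(mlp.out_activation_)       # 'identity': a linear output activation
```

This mirrors the advice above: a multi-output regressor's final layer should have one node per target and no squashing activation.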
QUESTION
I am learning about multiclass classification using scikit-learn. My goal is to develop code that tries to include all the metrics needed to evaluate the classification. This is my code:
...ANSWER
Answered 2022-Feb-12 at 22:05
The point of refit is that the model will be refitted using the best parameter set found before and the entire dataset. To find the best parameters, cross-validation is used, which means that the dataset is always split into a training and a validation set; i.e., not the entire dataset is used for training here.
When you define multiple metrics, you have to tell scikit-learn how it should determine what best means for you. For convenience, you can just specify any of your scorers to be used as the decider so to say. In that case, the parameter set that maximizes this metric will be used for refitting.
If you want something more sophisticated, like taking the parameter set that returned the highest mean of all scorers, you have to pass a function to refit that given all the created metrics returns the index of the corresponding best parameter set. This parameter set will then be used to refit the model.
Those metrics will be passed as a dictionary with strings as keys and NumPy arrays as values. Those NumPy arrays have as many entries as parameter sets that have been evaluated. You will find a lot of things in there; probably the most relevant is mean_test_*scorer-name*. Those arrays contain, for each tested parameter set, the mean of that scorer computed across the CV splits.
In code, to get the index of the parameter set that returns the highest mean across all scorers, you can do the following:
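The original snippet was not preserved here, but a refit callable along those lines might look like this (a sketch; the function name is mine, and it assumes every scorer is greater-is-better):

```python
import numpy as np

def refit_best_mean(cv_results):
    # One row per scorer, one column per evaluated parameter set,
    # taken from the mean_test_<scorer-name> entries of cv_results_.
    means = np.array(
        [v for k, v in cv_results.items() if k.startswith("mean_test_")]
    )
    # Average across scorers, then pick the winning parameter set's index.
    return int(means.mean(axis=0).argmax())
```

Passing `refit=refit_best_mean` to GridSearchCV makes the search refit the final model on the parameter set with the highest mean score across all scorers.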
QUESTION
I tried to construct a pipeline that has some optional steps. However, I would like to optimize hyperparameters for those steps as I want to get the best option between not using them and using them with different configurations (in my case SelectFromModel - sfm).
...ANSWER
Answered 2022-Jan-26 at 16:03
Referring to this example, you could just make a list of dictionaries: one containing sfm and its related parameters, and the other one replacing the step with "passthrough".
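Such a list of dictionaries might look like the following (a sketch with toy data and arbitrary estimators; the first dict tunes sfm, the second disables the step entirely):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

pipe = Pipeline([
    ("sfm", SelectFromModel(RandomForestClassifier(n_estimators=25, random_state=0))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One dict tunes the optional step, the other replaces it with "passthrough",
# so the search compares "with selection" against "without selection".
param_grid = [
    {"sfm__threshold": ["mean", "median"]},
    {"sfm": ["passthrough"]},
]

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The grid evaluates three candidates in total: two thresholds for sfm plus the passthrough variant.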
QUESTION
I'm writing TeX files on overleaf, and suddenly I got an error:
...ANSWER
Answered 2022-Jan-18 at 10:10
The problem is the {\iffalse}\fi in the abc bib entry. This syntax makes no sense; just remove it.
QUESTION
We're taking a fresh look at how to review possible outliers in large data sets. We've sorted out some code for IQR and fences, MAD (Median Absolute Deviation), and Double MAD. Those three sound reasonably good at coping with series that include a lot of variabilities, but they're sensitive to asymmetry in the series. Our values are commonly skewed.
Double MAD proves less susceptible, as it splits the distribution in two and performs the MAD scoring on each half. So, points on either side of the overall median do not distort results on the other side of the median. As I understand it, what I know comes from here:
https://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
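The double-MAD idea described above can be sketched in a few lines (Python here only to illustrate the arithmetic; the cutoff of 3 is a conventional choice, not taken from the post):

```python
import numpy as np

def double_mad_scores(x):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    # A separate MAD for each side of the median copes with skewed data:
    # points on one side never influence the scale used on the other.
    left_mad = np.median(np.abs(x[x <= med] - med))
    right_mad = np.median(np.abs(x[x >= med] - med))
    mad = np.where(x <= med, left_mad, right_mad)
    scores = np.zeros_like(x)
    nonzero = mad != 0
    scores[nonzero] = np.abs(x[nonzero] - med) / mad[nonzero]
    return scores

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
scores = double_mad_scores(data)
outliers = [v for v, s in zip(data, scores) if s > 3]  # conventional cutoff
print(outliers)  # -> [100]
```

The same arithmetic (medians, absolute deviations, a CASE on which side of the median a point falls) translates fairly directly into SQL aggregates.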
All of these estimators depend on quantiles, and it sounds like the Harrell-Davis quantile estimator improves the quality of these other methods:
https://aakinshin.net/posts/harrell-davis-double-mad-outlier-detector/
MAD, Double MAD, and Harrell-Davis seem to be widely used in the sciences, academia, and stats generally. You can get everything in R, but we're hoping to do some outlier checking directly in Postgres. (RDS deployment, no R.)
Does this ring a bell? Has anyone seen code like this for Postgres or any other SQL idiom?
And, not to give a misimpression, I'm not a stats person and have zero ability to translate greek formulas into SQL code. But, I can do okay translating between SQL idioms and following basic concepts.
...ANSWER
Answered 2022-Jan-15 at 05:14
Now I know why people do this sort of work in R: because R is fantastic for this kind of work. If anyone comes across this in the future, go get R. It's a compact, easy-to-use, easy-to-learn language with a great IDE.
If you've got a Postgres server where you can install PL/R, so much the better. PL/R is written to use the DBI and RPostgreSQL R packages to connect with Postgres. Meaning, you should be able to develop your code in RStudio, and then add the bits of wrapping required to make it run in PL/R within your Postgres server.
For outliers, I'm happy with univOutl (Univariate Outliers) so far, which provides 10 common, and less common, methods.
QUESTION
I want to build a quantile regressor based on XGBRegressor, the scikit-learn wrapper class for XGBoost. I have the following two versions: the second version is simply trimmed from the first one, but it no longer works.
I am wondering why I need to put every parameter of XGBRegressor in its child class's initialization. What if I just want to take all the default parameter values except for max_depth?
(My XGBoost is version 1.4.2.)
No.1 the full version that works as expected:
...ANSWER
Answered 2021-Dec-26 at 11:58
I am not an expert with scikit-learn, but it seems that one of the requirements of various objects used by this framework is that they can be cloned by calling the sklearn.base.clone method. This appears to be something that the existing XGBRegressor class does, so it is something your subclass of XGBRegressor must also do.
What may help is to pass any other unexpected keyword arguments as a **kwargs parameter. In your constructor, kwargs will contain a dict of all the other keyword parameters that weren't assigned to other constructor parameters. You can pass this dict of parameters on to the call to the superclass constructor by referring to them as **kwargs again; this will cause Python to expand them out:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install estimators
You can use estimators like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.