LightGBM | high-performance gradient boosting (GBT, GBDT, GBRT) | Machine Learning library
kandi X-RAY | LightGBM Summary
A fast, distributed, high-performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
LightGBM Key Features
LightGBM Examples and Code Snippets
import xgboost
import shap
# train an XGBoost model
X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.Explainer(model)
shap_values = explainer(X)
# coding: utf-8
import copy
import json
import pickle
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
print('Loading data...')
# load or create your dataset
# (the snippet was truncated here; the following lines follow LightGBM's advanced_example.py)
binary_example_dir = Path(__file__).absolute().parents[1] / 'binary_classification'
df_train = pd.read_csv(str(binary_example_dir / 'binary.train'), header=None, sep='\t')
df_test = pd.read_csv(str(binary_example_dir / 'binary.test'), header=None, sep='\t')
from pathlib import Path
import h5py
import numpy as np
import pandas as pd
import lightgbm as lgb
class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        """
        Construct a sequence object from HDF5 with required interface.
        """
        # completion of the truncated example (lgb.Sequence needs these methods)
        self.data = hdf_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
# coding: utf-8
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb
print('Loading data...')
# load or create your dataset (the snippet was truncated here; the following lines follow LightGBM's sklearn_example.py)
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')
import lightgbm as lgb
from sklearn.datasets import make_blobs
X, y = make_blobs(
n_samples=1000,
n_features=5,
centers=2,
random_state=708
)
params = {
    "objective": "binary",
    "min_data_in_leaf": 5,
    "min_data_in_bin": 5,
}
# completion of the truncated snippet: train on the synthetic data
bst = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
shap.initjs()
data = load_breast_cancer()
Distances are measured by training univariate XGBoost models of y for all the features, and then predicting the output of these models using univariate XGBoost models of other features. If one feature can effectively predict the output of another feature's model, the two features are redundant.
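A minimal usage sketch along these lines (the dataset is illustrative; shap.utils.hclust needs xgboost installed):
import shap
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# distances come from univariate XGBoost predictability, as described above
clustering = shap.utils.hclust(X, y)
print(clustering.shape)  # linkage-style matrix: (n_features - 1, 4)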
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
n_samples=20,
n_informative=5,
n_features=5,
    random_state=708,
)
# completion of the truncated snippet; the training step is assumed from the imports
model = LGBMRegressor().fit(X, y)
print(r2_score(y, model.predict(X)))
>>> idx_of_first_uncertainty_row = model_uncertain.iloc[0].index
>>> row_in_test_data = X.loc[idx_of_first_uncertainty_row]
>>> np.isclose(model_uncertain.iloc[0].values, X_train.loc[3].values).all()
import lightgbm as lgb
import numpy as np
import pandas as pd
X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=10)  # completion of the truncated line; num_boost_round assumed
Community Discussions
Trending Discussions on LightGBM
QUESTION
I have this kind of data (columns):
...ANSWER
Answered 2022-Mar-28 at 20:56
Assuming you have split df based on this question.
First, save the indices for each fold into an array of (train, test) tuples, i.e.:
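A minimal sketch of that step, assuming scikit-learn's KFold (the data and fold count are illustrative):
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
# one (train_indices, test_indices) tuple per fold
folds = [(train_idx, test_idx) for train_idx, test_idx in KFold(n_splits=5).split(X)]
print(folds[0])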
QUESTION
Take this simple code:
...ANSWER
Answered 2022-Mar-26 at 20:09
But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column [...]
Correct.
so this can't be more than 1, can it?
That depends on whether the score can "worsen" by more than 1. The default for the scoring parameter of permutation_importance is None, which uses the model's score function. For LGBMRegressor (and most regressors), that is the R2 score, which has a maximum of 1 but can take arbitrarily large negative values, so the score can indeed worsen by an arbitrarily large amount.
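A minimal sketch illustrating this (synthetic data, illustrative settings): permuting a dominant feature can push R2 well below zero, so the mean importance exceeds 1.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 10 * X[:, 0]  # the target depends almost entirely on feature 0

model = LGBMRegressor(min_child_samples=1).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0's importance can exceed 1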
QUESTION
I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier. So I need to output SHAP values in probability space, instead of raw SHAP values. The package does not appear to have any option to output probabilities.
The example code below is what I use to generate a dataframe of SHAP values and do a force_plot for the first data sample. Does anyone know how I should modify the code to change the output? I'm new to SHAP values and the shap package. Thanks a lot in advance.
ANSWER
Answered 2022-Mar-14 at 13:40
You can consider running your output values through a softmax() function. For reference, it is defined as:
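A minimal sketch of that definition (scipy ships softmax directly; the input values are illustrative):
import numpy as np
from scipy.special import softmax

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))                   # exp(x) / exp(x).sum()
print(np.exp(x) / np.exp(x).sum())  # same values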
QUESTION
I want to optimize the hyperparameters of my LightGBM model. I used a Bayesian optimization process to do so. Sadly, my algorithm fails to converge.
MRE
...ANSWER
Answered 2022-Mar-21 at 22:34
This is related to a change in scipy 1.8.0: one should use -np.squeeze(res.fun) instead of -res.fun[0] (https://github.com/fmfn/BayesianOptimization/issues/300).
The comments in the bug report indicate that reverting to scipy 1.7.0 fixes this. A fix has been proposed in the BayesianOptimization package (https://github.com/fmfn/BayesianOptimization/pull/303), but it has not been merged and released yet, so you could either:
- fall back to scipy 1.7.0
- use the forked github version of BayesianOptimization with the patch (https://github.com/samFarrellDay/BayesianOptimization)
- apply the patch in issue 303 manually on your system
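A small sketch of why the proposed fix is shape-safe (the values are illustrative): np.squeeze handles both the 1-element array older scipy returned and a 0-d result, whereas [0] indexing fails on the latter.
import numpy as np

fun_old = np.array([1.5])  # 1-element array, as returned by older scipy
fun_new = np.array(1.5)    # 0-d array; fun_new[0] would raise IndexError
print(-np.squeeze(fun_old), -np.squeeze(fun_new))  # works for both shapes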
QUESTION
I'm using shap.utils.hclust to figure out which features are redundant, following the documentation.
Reproducible example:
...ANSWER
Answered 2022-Mar-20 at 16:16
Under the hood, even tree models for classification are trained as regression tasks. SHAP calls this the "raw" output space; TensorFlow would call it logits. To convert from raw to probability space, a sigmoid or softmax is applied. So, answering your first question:
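A minimal, self-contained sketch of that raw-to-probability conversion for a binary LightGBM classifier (the dataset and model settings are illustrative; the list handling covers shap versions that return one array per class):
import lightgbm as lgb
import shap
from scipy.special import expit
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X)
if isinstance(sv, list):  # some shap versions return [class0, class1]
    sv, base = sv[1], explainer.expected_value[1]
else:
    base = explainer.expected_value
# base value + SHAP sum is the raw (log-odds) score; expit maps it to a probability
print(expit(base + sv[0].sum()))
print(model.predict_proba(X[:1])[0, 1])  # should match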
QUESTION
I'm trying to port some "parallel" Python code over to Azure Databricks. The code runs perfectly fine locally, but somehow doesn't on Azure Databricks. The code leverages the multiprocessing library, and more specifically the starmap function.
The code goes like this:
...ANSWER
Answered 2021-Aug-22 at 09:31
You should stop trying to reinvent the wheel and instead start to leverage the built-in capabilities of Azure Databricks. Because Apache Spark (and Databricks) is a distributed system, machine learning on it should also be distributed. There are two approaches to that:
The training algorithm is implemented in a distributed fashion - a number of such algorithms are packaged into Apache Spark and included in the Databricks Runtimes.
Use machine learning implementations designed to run on a single node, but train multiple models in parallel - that is what typically happens during hyperparameter optimization, and it is what you're trying to do.
The Databricks Runtime for Machine Learning includes the Hyperopt library, which is designed to find good hyperparameters efficiently, without trying all combinations of the parameters. It also includes the SparkTrials API, which is designed to parallelize computations for single-machine ML models such as scikit-learn. The documentation includes a number of examples of using that library with single-node ML algorithms that you can use as a base for your work - for example, there is an example for scikit-learn, and a sketch of the same pattern with LightGBM follows after this answer.
P.S. When you run the code with multiprocessing, the code is executed only on the driver node, and the rest of the cluster isn't utilized at all.
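A minimal sketch of that Hyperopt + SparkTrials pattern with LightGBM (the search space, data, and parallelism are illustrative; a Databricks ML runtime with hyperopt and pyspark is assumed):
import lightgbm as lgb
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

def objective(params):
    model = lgb.LGBMClassifier(
        num_leaves=int(params['num_leaves']),
        learning_rate=params['learning_rate'],
    )
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    return {'loss': -score, 'status': STATUS_OK}  # hyperopt minimizes the loss

space = {
    'num_leaves': hp.quniform('num_leaves', 8, 128, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
}

# SparkTrials evaluates trials in parallel on the cluster's workers
best = fmin(objective, space, algo=tpe.suggest, max_evals=20,
            trials=SparkTrials(parallelism=4))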
QUESTION
How do I get the correct row from a dataframe which has been sliced?
To show what I mean, look at this code sample:
...ANSWER
Answered 2022-Mar-05 at 11:56
How can I get the row in the X_test dataframe which is related to the first row in the model_uncertain dataframe?
You're on the right track:
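A minimal, self-contained sketch of the idea (the frame and labels are illustrative): a slice keeps the original index labels, so .loc with that label recovers the matching row in the source frame.
import pandas as pd

X_test = pd.DataFrame({'a': [10, 20, 30]}, index=[3, 5, 7])
sliced = X_test.iloc[1:]   # slicing preserves the original index labels
idx = sliced.index[0]      # 5
print(X_test.loc[idx])     # the matching row in the original frame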
QUESTION
Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as a compute backend, all in SageMaker using their Training Job API. To clarify:
- I have to use LightGBM; there is no option here.
- The reason I need to use a Spark compute backend is that training with the current dataset no longer fits in memory.
- I want to use the SageMaker Training Job setting so I can use an SM hyperparameter optimisation job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case, SM handles resource provisioning for you, creates an instance, and so on. What I am confused about is that, essentially, to use a Spark compute backend I would need an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
Now, there is also that thing called the SageMaker PySpark SDK. However, the SageMakerEstimator API provided by that package does not support on-the-fly cluster configuration either.
Does anyone know a way to run a SageMaker training job that uses an EMR cluster, so that later the same job could be used for hyperparameter tuning activities?
One way I see is to run an EMR cluster in the background, and then create a regular SM estimator job that connects to the EMR cluster and does the training, essentially running a Spark driver program in the SM Estimator job.
Has anyone done anything similar in the past?
Thanks
...ANSWER
Answered 2022-Feb-25 at 12:57
Thanks for your questions. Here are answers:
The SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a non-Spark (or Spark) SageMaker job from a Spark environment. I'm not sure that's what you need here.
Running Spark in SageMaker jobs: while you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have two options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to find and use a third-party, external parameter search library; for example, Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom Docker-based jobs, on one or multiple machines. If you can fit your Spark code within the SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code to build a custom Training container.
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and it will indeed allow you to use SM Model Tuning. I'd recommend:
- having each SM job create a new transient cluster (auto-terminated after the step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster.
- using the cheapest possible instance type for the SM estimator, because it will need to stay up for the full duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...).
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.
QUESTION
I have trained a model using lightgbm.sklearn.LGBMClassifier from the lightgbm package. I can find the number of columns and the column names of the training data from the model, but I have not found a way to get the number of rows of the training data. Is it possible to do so? The best solution would be to obtain the training data from the model, but I have not come across anything like that.
ANSWER
Answered 2022-Feb-16 at 02:48
The tree structure of a LightGBM model includes information about how many records from the training data would fall into each node in the tree if that node were a leaf node. In LightGBM's code, this value is called internal_count.
Since all data matches the root node of each tree, in most situations you can use that information to figure out, given a LightGBM model, how many instances were in the training data.
Consider the following example, using lightgbm==3.3.2 and Python 3.8.8.
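The example itself was cut off in this excerpt; a sketch along the same lines (the dataset is illustrative):
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=5).fit(X, y)

# dump_model() exposes the tree structure, including internal_count
root = clf.booster_.dump_model()['tree_info'][0]['tree_structure']
print(root['internal_count'])  # 500: every training row passes through the root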
QUESTION
I'm learning how to use Airflow to build a machine learning pipeline.
But I haven't found a way to pass a pandas dataframe generated in one task to another task... It seems I would need to convert the data to JSON format or save it in a database within each task?
In the end, I had to put everything in one task... Is there any way to pass a dataframe between Airflow tasks?
Here's my code:
...ANSWER
Answered 2021-Nov-08 at 09:59
Although it is used in many ETL tasks, Airflow is not the right choice for that kind of operation; it is intended for workflow, not dataflow. But there are many ways to do that without passing the whole dataframe between tasks.
You can pass information about the data using xcom.push and xcom.pull:
a. Save the outcome of the first task somewhere (json, csv, etc.).
b. Push information about the saved file (e.g. file name, path) with xcom.push.
c. Read this file name using xcom.pull from the other task and perform the needed operation.
Or do everything above using database tables:
a. In task_1 you can download data from table_1 into a dataframe, process it and save it in another table_2 (df.to_sql()).
b. Pass the name of the table using xcom.push.
c. From the other task, get table_2 using xcom.pull and read it with pd.read_sql().
Information on how to use XCom is available in the Airflow examples, for instance: https://github.com/apache/airflow/blob/main/airflow/example_dags/tutorial_etl_dag.py
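A minimal sketch of the first pattern using the TaskFlow API (Airflow 2.x assumed; the DAG, task, and path names are illustrative):
import pandas as pd
import pendulum
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=pendulum.datetime(2022, 1, 1), catchup=False)
def pass_dataframe_by_path():
    @task
    def extract() -> str:
        df = pd.DataFrame({'a': [1, 2, 3]})
        path = '/tmp/intermediate.csv'   # illustrative shared location
        df.to_csv(path, index=False)
        return path  # small values like a path go through XCom automatically

    @task
    def transform(path: str):
        df = pd.read_csv(path)  # reload the dataframe from the shared location
        print(df['a'].sum())

    transform(extract())

pass_dataframe_by_path()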
IMHO there are many other, better ways; I have just described what I tried.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install LightGBM
[Examples](https://github.com/microsoft/LightGBM/tree/master/examples) showing command line usage of common tasks.
[Features](https://github.com/microsoft/LightGBM/blob/master/docs/Features.rst) and algorithms supported by LightGBM.
[Parameters](https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst) is an exhaustive list of the customizations you can make.
[Distributed Learning](https://github.com/microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst) and [GPU Learning](https://github.com/microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst) can speed up computation.
[Laurae++ interactive documentation](https://sites.google.com/view/lauraepp/parameters) is a detailed guide for hyperparameters.
[FLAML](https://www.microsoft.com/en-us/research/project/fast-and-lightweight-automl-for-large-scale-data/articles/flaml-a-fast-and-lightweight-automl-library/) provides automated tuning for LightGBM ([code examples](https://microsoft.github.io/FLAML/docs/Examples/AutoML-for-LightGBM/)).
[Optuna Hyperparameter Tuner](https://medium.com/optuna/lightgbm-tuner-new-optuna-integration-for-hyperparameter-optimization-8b7095e99258) provides automated tuning for LightGBM hyperparameters ([code examples](https://github.com/optuna/optuna/tree/master/examples/lightgbm)).
[Understanding LightGBM Parameters (and How to Tune Them using Neptune)](https://neptune.ai/blog/lightgbm-parameters-guide).
[How we update readthedocs.io](https://github.com/microsoft/LightGBM/blob/master/docs/README.rst).
Check out the [Development Guide](https://github.com/microsoft/LightGBM/blob/master/docs/Development-Guide.rst).