LightGBM | high-performance gradient boosting (GBT, GBDT, GBRT) | Machine Learning library
kandi X-RAY | LightGBM Summary
A fast, distributed, high-performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
LightGBM Key Features
LightGBM Examples and Code Snippets
import xgboost
import shap
# train an XGBoost model
X, y = shap.datasets.boston()
model = xgboost.XGBRegressor().fit(X, y)
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)
explainer = shap.Explainer(model)
shap_values = explainer(X)
# coding: utf-8
import copy
import json
import pickle
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
print('Loading data...')
# load or create your dataset
# (the snippet was truncated here; the following lines follow LightGBM's advanced_example.py)
binary_example_dir = Path(__file__).absolute().parents[1] / 'binary_classification'
df_train = pd.read_csv(str(binary_example_dir / 'binary.train'), header=None, sep='\t')
df_test = pd.read_csv(str(binary_example_dir / 'binary.test'), header=None, sep='\t')
from pathlib import Path
import h5py
import numpy as np
import pandas as pd
import lightgbm as lgb
class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        """
        Construct a sequence object from HDF5 with required interface.
        """
        # completion of the truncated example (lgb.Sequence needs these methods)
        self.data = hdf_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
# coding: utf-8
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb
print('Loading data...')
# load or create your dataset (the snippet was truncated here; the following lines follow LightGBM's sklearn_example.py)
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')
import lightgbm as lgb
from sklearn.datasets import make_blobs
X, y = make_blobs(
n_samples=1000,
n_features=5,
centers=2,
random_state=708
)
params = {
    "objective": "binary",
    "min_data_in_leaf": 5,
    "min_data_in_bin": 5,
}
# completion of the truncated snippet: train on the synthetic data
bst = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
shap.initjs()
data = load_breast_cancer()
Distances are measured by training univariate XGBoost models of y for all the features, and then predicting the output of these models using univariate XGBoost models of other features. If one feature can effectively predict the output of another feature's model, the two features are redundant.
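A minimal usage sketch along these lines (the dataset is illustrative; shap.utils.hclust needs xgboost installed):
import shap
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# distances come from univariate XGBoost predictability, as described above
clustering = shap.utils.hclust(X, y)
print(clustering.shape)  # linkage-style matrix: (n_features - 1, 4)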
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# 20-row input data
X, y = make_regression(
n_samples=20,
n_informative=5,
n_features=5,
    random_state=708,
)
# completion of the truncated snippet; the training step is assumed from the imports
model = LGBMRegressor().fit(X, y)
print(r2_score(y, model.predict(X)))
>>> idx_of_first_uncertainty_row = model_uncertain.iloc[0].index
>>> row_in_test_data = X.loc[idx_of_first_uncertainty_row]
>>> np.isclose(model_uncertain.iloc[0].values, X_train.loc[3].values).all()
import lightgbm as lgb
import numpy as np
import pandas as pd
X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=10)  # completion of the truncated line; num_boost_round assumed
Community Discussions
Trending Discussions on LightGBM
QUESTION
I have this kind of data (columns):
...ANSWER
Answered 2022-Mar-28 at 20:56
Assuming you have split df based on this question.
First, save the indices for each fold into an array of (train, test) tuples, i.e.:
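A minimal sketch of that step, assuming scikit-learn's KFold (the data and fold count are illustrative):
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
# one (train_indices, test_indices) tuple per fold
folds = [(train_idx, test_idx) for train_idx, test_idx in KFold(n_splits=5).split(X)]
print(folds[0])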
QUESTION
Take this simple code:
...ANSWER
Answered 2022-Mar-26 at 20:09
But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column [...]
Correct.
so this can't be more than 1, can it?
That depends on whether the score can "worsen" by more than 1. The default for the scoring parameter of permutation_importance is None, which uses the model's score function. For LGBMRegressor (and most regressors), that is the R2 score, which has a maximum of 1 but can take arbitrarily large negative values, so the score can indeed worsen by an arbitrarily large amount.
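A minimal sketch illustrating this (synthetic data, illustrative settings): permuting a dominant feature can push R2 well below zero, so the mean importance exceeds 1.
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 10 * X[:, 0]  # the target depends almost entirely on feature 0

model = LGBMRegressor(min_child_samples=1).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0's importance can exceed 1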
QUESTION
I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier. So I need to output SHAP values in probability space, instead of raw SHAP values. The package does not appear to have any option to output probabilities.
The example code below is what I use to generate a dataframe of SHAP values and do a force_plot for the first data sample. Does anyone know how I should modify the code to change the output? I'm new to SHAP values and the shap package. Thanks a lot in advance.
ANSWER
Answered 2022-Mar-14 at 13:40
You can consider running your output values through a softmax() function. For reference, it is defined as:
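A minimal sketch of that definition (scipy ships softmax directly; the input values are illustrative):
import numpy as np
from scipy.special import softmax

x = np.array([1.0, 2.0, 3.0])
print(softmax(x))                   # exp(x) / exp(x).sum()
print(np.exp(x) / np.exp(x).sum())  # same values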
QUESTION
I want to optimize the hyperparameters of my LightGBM model. I used a Bayesian optimization process to do so. Sadly, my algorithm fails to converge.
MRE
...ANSWER
Answered 2022-Mar-21 at 22:34
This is related to a change in scipy 1.8.0: one should use -np.squeeze(res.fun) instead of -res.fun[0] (https://github.com/fmfn/BayesianOptimization/issues/300).
The comments in the bug report indicate that reverting to scipy 1.7.0 fixes this. A fix has been proposed in the BayesianOptimization package (https://github.com/fmfn/BayesianOptimization/pull/303), but it has not been merged and released yet, so you could either:
- fall back to scipy 1.7.0
- use the forked github version of BayesianOptimization with the patch (https://github.com/samFarrellDay/BayesianOptimization)
- apply the patch in issue 303 manually on your system
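A small sketch of why the proposed fix is shape-safe (the values are illustrative): np.squeeze handles both the 1-element array older scipy returned and a 0-d result, whereas [0] indexing fails on the latter.
import numpy as np

fun_old = np.array([1.5])  # 1-element array, as returned by older scipy
fun_new = np.array(1.5)    # 0-d array; fun_new[0] would raise IndexError
print(-np.squeeze(fun_old), -np.squeeze(fun_new))  # works for both shapes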
QUESTION
I'm using shap.utils.hclust to figure out which features are redundant, following the documentation.
Reproducible example:
...ANSWER
Answered 2022-Mar-20 at 16:16
Under the hood, even tree models for classification are trained as regression tasks. SHAP calls this the "raw" output space; TensorFlow would call it logits. To convert from raw to probability space, a sigmoid or softmax is applied. So, answering your first question:
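A minimal, self-contained sketch of that raw-to-probability conversion for a binary LightGBM classifier (the dataset and model settings are illustrative; the list handling covers shap versions that return one array per class):
import lightgbm as lgb
import shap
from scipy.special import expit
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = lgb.LGBMClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X)
if isinstance(sv, list):  # some shap versions return [class0, class1]
    sv, base = sv[1], explainer.expected_value[1]
else:
    base = explainer.expected_value
# base value + SHAP sum is the raw (log-odds) score; expit maps it to a probability
print(expit(base + sv[0].sum()))
print(model.predict_proba(X[:1])[0, 1])  # should match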
QUESTION
I'm trying to port some "parallel" Python code over to Azure Databricks. The code runs perfectly fine locally, but somehow doesn't on Azure Databricks. The code leverages the multiprocessing library, and more specifically the starmap function.
The code goes like this:
...ANSWER
Answered 2021-Aug-22 at 09:31
You should stop trying to reinvent the wheel and instead start to leverage the built-in capabilities of Azure Databricks. Because Apache Spark (and Databricks) is a distributed system, machine learning on it should also be distributed. There are two approaches to that:
The training algorithm is implemented in a distributed fashion - a number of such algorithms are packaged into Apache Spark and included in the Databricks Runtimes.
Use machine learning implementations designed to run on a single node, but train multiple models in parallel - that is what typically happens during hyperparameter optimization, and it is what you're trying to do.
The Databricks Runtime for Machine Learning includes the Hyperopt library, which is designed to find good hyperparameters efficiently, without trying all combinations of the parameters. It also includes the SparkTrials API, which is designed to parallelize computations for single-machine ML models such as scikit-learn. The documentation includes a number of examples of using that library with single-node ML algorithms that you can use as a base for your work - for example, there is an example for scikit-learn, and a sketch of the same pattern with LightGBM follows after this answer.
P.S. When you run the code with multiprocessing, the code is executed only on the driver node, and the rest of the cluster isn't utilized at all.
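A minimal sketch of that Hyperopt + SparkTrials pattern with LightGBM (the search space, data, and parallelism are illustrative; a Databricks ML runtime with hyperopt and pyspark is assumed):
import lightgbm as lgb
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

def objective(params):
    model = lgb.LGBMClassifier(
        num_leaves=int(params['num_leaves']),
        learning_rate=params['learning_rate'],
    )
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    return {'loss': -score, 'status': STATUS_OK}  # hyperopt minimizes the loss

space = {
    'num_leaves': hp.quniform('num_leaves', 8, 128, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
}

# SparkTrials evaluates trials in parallel on the cluster's workers
best = fmin(objective, space, algo=tpe.suggest, max_evals=20,
            trials=SparkTrials(parallelism=4))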
QUESTION
How do I get the correct row from a dataframe which has been sliced?
To show what I mean, look at this code sample:
...ANSWER
Answered 2022-Mar-05 at 11:56
How can I get the row in the X_test dataframe which is related to the first row in the model_uncertain dataframe?
You're on the right track:
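A minimal, self-contained sketch of the idea (the frame and labels are illustrative): a slice keeps the original index labels, so .loc with that label recovers the matching row in the source frame.
import pandas as pd

X_test = pd.DataFrame({'a': [10, 20, 30]}, index=[3, 5, 7])
sliced = X_test.iloc[1:]   # slicing preserves the original index labels
idx = sliced.index[0]      # 5
print(X_test.loc[idx])     # the matching row in the original frame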
QUESTION
Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as a compute backend, all in SageMaker using their Training Job API. To clarify:
- I have to use LightGBM; there is no option here.
- The reason I need to use a Spark compute backend is that training with the current dataset no longer fits in memory.
- I want to use the SageMaker Training Job setting so I can use an SM hyperparameter optimisation job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case, SM handles resource provisioning for you, creates an instance, and so on. What I am confused about is that, essentially, to use a Spark compute backend I would need an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
Now, there is also that thing called the SageMaker PySpark SDK. However, the SageMakerEstimator API provided by that package does not support on-the-fly cluster configuration either.
Does anyone know a way to run a SageMaker training job that uses an EMR cluster, so that later the same job could be used for hyperparameter tuning activities?
One way I see is to run an EMR cluster in the background, and then create a regular SM estimator job that connects to the EMR cluster and does the training, essentially running a Spark driver program in the SM Estimator job.
Has anyone done anything similar in the past?
Thanks
...ANSWER
Answered 2022-Feb-25 at 12:57
Thanks for your questions. Here are answers:
The SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a non-Spark (or Spark) SageMaker job from a Spark environment. I'm not sure that's what you need here.
Running Spark in SageMaker jobs: while you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have two options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to find and use a third-party, external parameter search library; for example, Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom Docker-based jobs, on one or multiple machines. If you can fit your Spark code within the SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code to build a custom Training container.
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and it will indeed allow you to use SM Model Tuning. I'd recommend:
- having each SM job create a new transient cluster (auto-terminated after the step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster.
- using the cheapest possible instance type for the SM estimator, because it will need to stay up for the full duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...).
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.
QUESTION
I have trained a model using lightgbm.sklearn.LGBMClassifier from the lightgbm package. I can find the number of columns and the column names of the training data from the model, but I have not found a way to get the number of rows of the training data. Is it possible to do so? The best solution would be to obtain the training data from the model, but I have not come across anything like that.
ANSWER
Answered 2022-Feb-16 at 02:48
The tree structure of a LightGBM model includes information about how many records from the training data would fall into each node in the tree if that node were a leaf node. In LightGBM's code, this value is called internal_count.
Since all data matches the root node of each tree, in most situations you can use that information to figure out, given a LightGBM model, how many instances were in the training data.
Consider the following example, using lightgbm==3.3.2 and Python 3.8.8.
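The example itself was cut off in this excerpt; a sketch along the same lines (the dataset is illustrative):
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=5).fit(X, y)

# dump_model() exposes the tree structure, including internal_count
root = clf.booster_.dump_model()['tree_info'][0]['tree_structure']
print(root['internal_count'])  # 500: every training row passes through the root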
QUESTION
I'm learning how to use Airflow to build a machine learning pipeline.
But I haven't found a way to pass a pandas dataframe generated in one task to another task... It seems I would need to convert the data to JSON format or save it in a database within each task?
In the end, I had to put everything in one task... Is there any way to pass a dataframe between Airflow tasks?
Here's my code:
...ANSWER
Answered 2021-Nov-08 at 09:59
Although it is used in many ETL tasks, Airflow is not the right choice for that kind of operation; it is intended for workflow, not dataflow. But there are many ways to do that without passing the whole dataframe between tasks.
You can pass information about the data using xcom.push and xcom.pull:
a. Save the outcome of the first task somewhere (json, csv, etc.).
b. Push information about the saved file (e.g. file name, path) with xcom.push.
c. Read this file name using xcom.pull from the other task and perform the needed operation.
Or do everything above using database tables:
a. In task_1 you can download data from table_1 into a dataframe, process it and save it in another table_2 (df.to_sql()).
b. Pass the name of the table using xcom.push.
c. From the other task, get table_2 using xcom.pull and read it with pd.read_sql().
Information on how to use XCom is available in the Airflow examples, for instance: https://github.com/apache/airflow/blob/main/airflow/example_dags/tutorial_etl_dag.py
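A minimal sketch of the first pattern using the TaskFlow API (Airflow 2.x assumed; the DAG, task, and path names are illustrative):
import pandas as pd
import pendulum
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=pendulum.datetime(2022, 1, 1), catchup=False)
def pass_dataframe_by_path():
    @task
    def extract() -> str:
        df = pd.DataFrame({'a': [1, 2, 3]})
        path = '/tmp/intermediate.csv'   # illustrative shared location
        df.to_csv(path, index=False)
        return path  # small values like a path go through XCom automatically

    @task
    def transform(path: str):
        df = pd.read_csv(path)  # reload the dataframe from the shared location
        print(df['a'].sum())

    transform(extract())

pass_dataframe_by_path()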
IMHO there are many other, better ways; I have just described what I tried.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install LightGBM
[Examples](https://github.com/microsoft/LightGBM/tree/master/examples) showing command line usage of common tasks.
[Features](https://github.com/microsoft/LightGBM/blob/master/docs/Features.rst) and algorithms supported by LightGBM.
[Parameters](https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst) is an exhaustive list of the customizations you can make.
[Distributed Learning](https://github.com/microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst) and [GPU Learning](https://github.com/microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst) can speed up computation.
[Laurae++ interactive documentation](https://sites.google.com/view/lauraepp/parameters) is a detailed guide for hyperparameters.
[FLAML](https://www.microsoft.com/en-us/research/project/fast-and-lightweight-automl-for-large-scale-data/articles/flaml-a-fast-and-lightweight-automl-library/) provides automated tuning for LightGBM ([code examples](https://microsoft.github.io/FLAML/docs/Examples/AutoML-for-LightGBM/)).
[Optuna Hyperparameter Tuner](https://medium.com/optuna/lightgbm-tuner-new-optuna-integration-for-hyperparameter-optimization-8b7095e99258) provides automated tuning for LightGBM hyperparameters ([code examples](https://github.com/optuna/optuna/tree/master/examples/lightgbm)).
[Understanding LightGBM Parameters (and How to Tune Them using Neptune)](https://neptune.ai/blog/lightgbm-parameters-guide).
[How we update readthedocs.io](https://github.com/microsoft/LightGBM/blob/master/docs/README.rst).
Check out the [Development Guide](https://github.com/microsoft/LightGBM/blob/master/docs/Development-Guide.rst).