LightGBM | high performance gradient boosting (GBT, GBDT, GBRT) | Machine Learning library

by microsoft | C++ | Version: 4.1.0 | License: MIT

kandi X-RAY | LightGBM Summary

LightGBM is a C++ library typically used in Artificial Intelligence, Machine Learning, Deep Learning, and PyTorch applications. It has no reported bugs or vulnerabilities, a permissive license, and medium support. You can download it from GitHub.

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

            Support

              LightGBM has a medium active ecosystem.
              It has 15042 star(s) with 3730 fork(s). There are 443 watchers for this library.
              There were 2 major release(s) in the last 6 months.
              There are 262 open issues and 2739 have been closed. On average issues are closed in 62 days. There are 22 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of LightGBM is 4.1.0

            Quality

              LightGBM has 0 bugs and 0 code smells.

            Security

              LightGBM has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              LightGBM code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              LightGBM is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              LightGBM releases are available to install and integrate.
              Installation instructions are available. Examples and code snippets are not available.
              It has 12630 lines of code, 565 functions and 39 files.
              It has high code complexity. Code complexity directly impacts maintainability of the code.

            LightGBM Key Features

            No Key Features are available at this moment for LightGBM.

            LightGBM Examples and Code Snippets

            import xgboost
            import shap
            # train an XGBoost model
            X, y =
            model = xgboost.XGBRegressor().fit(X, y)
            # explain the model's predictions using SHAP
            # (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark,   
            LightGBM - advanced example
            Python | Lines of Code: 133 | License: Permissive (MIT License)
            # coding: utf-8
            import copy
            import json
            import pickle
            from pathlib import Path
            import numpy as np
            import pandas as pd
            from sklearn.metrics import roc_auc_score
            import lightgbm as lgb
            print('Loading data...')
            # load or create your dataset
            LightGBM - dataset from multi hdf5
            Python | Lines of Code: 58 | License: Permissive (MIT License)
            from pathlib import Path
            import h5py
            import numpy as np
            import pandas as pd
            import lightgbm as lgb
            class HDFSequence(lgb.Sequence):
                def __init__(self, hdf_dataset, batch_size):
                    Construct a sequence object from HDF5 with re  
            LightGBM - sklearn example
            Python | Lines of Code: 55 | License: Permissive (MIT License)
            # coding: utf-8
            from pathlib import Path
            import numpy as np
            import pandas as pd
            from sklearn.metrics import mean_squared_error
            from sklearn.model_selection import GridSearchCV
            import lightgbm as lgb
            print('Loading data...')
            # load or create your d  
            copy iconCopy
            import lightgbm as lgb
            from sklearn.datasets import make_blobs
            X, y = make_blobs(
            params = {
                "objective": "binary",
                "min_data_in_leaf": 5,
            How to output Shap values in probability and make force_plot from binary classifier
            Python | Lines of Code: 78 | License: Strong Copyleft (CC BY-SA 4.0)
            import pandas as pd
            import numpy as np
            import shap
            import lightgbm as lgbm
            from sklearn.model_selection import train_test_split
            from sklearn.datasets import load_breast_cancer
            from scipy.special import expit
            data = load_bre
            How to filter redundant features using shap.utils.hclust not only by visual inspection barplot?
            Python | Lines of Code: 17 | License: Strong Copyleft (CC BY-SA 4.0)
            Distances are measured by training univariate XGBoost models 
            of y for all the features, and then predicting the output of these
            models using univariate XGBoost models of other features. If one 
            feature can effectively predict the output o
            Why LightGBM Python-package gives bad prediction using for regression task?
            Python | Lines of Code: 32 | License: Strong Copyleft (CC BY-SA 4.0)
            from lightgbm import LGBMRegressor
            from sklearn.metrics import r2_score
            from sklearn.datasets import make_regression
            # 20-row input data
            X, y = make_regression(
            How indexing of sliced data frame works in pandas
            Python | Lines of Code: 5 | License: Strong Copyleft (CC BY-SA 4.0)
            >>> idx_of_first_uncertainty_row = model_uncertain.iloc[0].index
            >>> row_in_test_data = X.loc[idx_of_first_uncertainty_row]
            >>> np.isclose(model_uncertain.iloc[0].values, X_train.loc[3].v
            Python - decision tree in lightgbm with odd values
            Python | Lines of Code: 20 | License: Strong Copyleft (CC BY-SA 4.0)
            import lightgbm as lgb
            import numpy as np
            import pandas as pd
            X = np.linspace(1, 2, 100)[:, None]
            y = X[:, 0]**2
            ds = lgb.Dataset(X, y)
            params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
            bst = l

            Community Discussions


            How to train with TimeSeriesSplit from sklearn?
            Asked 2022-Mar-28 at 20:56

            I have this kind of data (columns):



            Answered 2022-Mar-28 at 20:56

            Assuming you have a splits DataFrame based on this question, first save the indices for each fold into an array of (train, test) tuples, i.e.:
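As a minimal sketch of that idea, assuming the folds come straight from scikit-learn's `TimeSeriesSplit` rather than from a pre-built splits DataFrame, the (train, test) index tuples can be collected like this. The toy array `X` and the name `folds` are placeholders, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 time-ordered rows, 2 features (placeholder for the real dataset).
X = np.arange(24).reshape(12, 2)

# Each split trains on an expanding window and tests on the block that follows it.
tscv = TimeSeriesSplit(n_splits=3)

# Save the indices for each fold as a list of (train, test) tuples.
folds = [(train_idx, test_idx) for train_idx, test_idx in tscv.split(X)]

for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

A list of tuples in this shape can be passed as the `cv` argument of scikit-learn helpers such as `GridSearchCV`, which accept an iterable of (train, test) index arrays.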



            How come you can get a permutation feature importance greater than 1?
            Asked 2022-Mar-26 at 20:09

            Take this simple code:



            Answered 2022-Mar-26 at 20:09

            But I thought that the permutation_importance mean for a feature was the amount that the score was changed on average by permuting the feature column[...]


            so this can't be more than 1 can it?

            That depends on whether the score can "worsen" by more than 1. The default for the scoring parameter of permutation_importance is None, which uses the model's score function. For LGBMRegressor (and most regressors), that's the R2 score, which has a maximum of 1 but can take arbitrarily large negative values, so indeed the score can worsen by an arbitrarily large amount.
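A small illustration of the point, using a toy stand-in model rather than an actual LGBMRegressor: the hypothetical `predict` function below reproduces the target exactly from one feature, and reversing that feature (one particular permutation) pushes the R2 score far enough below zero that the drop exceeds 1.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy setup: the target equals the single feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x.copy()

def predict(feature):
    """Hypothetical 'model' that has learned y = x perfectly."""
    return feature

# Score on the intact feature: a perfect fit.
baseline_score = r2_score(y, predict(x))

# Score after permuting the feature column (here: reversing it).
permuted_score = r2_score(y, predict(x[::-1]))

# Permutation importance for this one shuffle is the drop in score.
importance = baseline_score - permuted_score
print(baseline_score, permuted_score, importance)
```

Because R2 is unbounded below, the difference between the baseline and permuted scores is not capped at 1.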



            How to output Shap values in probability and make force_plot from binary classifier
            Asked 2022-Mar-22 at 03:32

            I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier. So I need to output Shap values in probability, instead of normal Shap values. It does not appear to have any options to output in term of probability.

            The example code below is what I use to generate dataframe of Shap values and do a force_plot for the first data sample. Does anyone know how I should modify the code to change the output? I'm new to Shap value and the Shap package. Thanks a lot in advance.



            Answered 2022-Mar-14 at 13:40

            You can consider running your output values through a softmax() function. For reference, it is defined as:
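A minimal NumPy sketch of that transformation, assuming the values being converted are raw log-odds margins from a binary classifier. The function names and the sample margin are illustrative only. For two classes, softmax over the pair (0, margin) reduces to the logistic sigmoid of the margin.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a raw log-odds margin to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Softmax over a 1-D array of raw scores (shifted for numerical stability)."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw margin for the positive class of one sample.
raw_margin = 0.8

p_sigmoid = sigmoid(raw_margin)
# Binary case: the negative class has an implicit score of 0.
p_softmax = softmax(np.array([0.0, raw_margin]))[1]

print(p_sigmoid, p_softmax)
```

Note that the transformation is applied to the summed raw output (base value plus contributions), not to each SHAP value separately, because the sigmoid is not additive.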



            BayesianOptimization fails due to float error
            Asked 2022-Mar-21 at 22:34

            I want to optimize my HPO of my lightgbm model. I used a Bayesian Optimization process to do so. Sadly my algorithm fails to converge.




            Answered 2022-Mar-21 at 22:34

            This is related to a change in scipy 1.8.0. One should use -np.squeeze( instead of [0].
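The exact expression affected in the BayesianOptimization package is not shown here, so the following is only a sketch of why `np.squeeze` is the more robust choice, under the assumption that the value may arrive either as a plain float or as a one-element array depending on the scipy version. The helper name `negate_scalar` is hypothetical.

```python
import numpy as np

def negate_scalar(value):
    """Negate a value that may be a float or a one-element array."""
    # np.squeeze accepts both forms and yields a 0-d result in either case.
    return -float(np.squeeze(value))

as_array = np.array([0.25])  # one-element array
as_float = 0.25              # plain scalar

print(negate_scalar(as_array), negate_scalar(as_float))

# Indexing with [0] only works for the array form.
try:
    as_float[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True
print("indexing a float with [0] fails:", indexing_failed)
```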


            The comments in the bug report indicate that reverting to scipy 1.7.0 fixes this.

            It seems a fix has been proposed in the BayesianOptimization package:

            But this has not been merged and released yet, so you could either:



            How to filter redundant features using shap.utils.hclust not only by visual inspection barplot?
            Asked 2022-Mar-20 at 16:26

            I'm using shap.utils.hclust to figure out which features are redundant and following the documentation

            Reproducible example:



            Answered 2022-Mar-20 at 16:16
            1. Underneath, even tree models for classification are regression tasks. SHAP calls it the "raw" feature output space; TensorFlow would call it logits. To convert from raw to probability space, sigmoid or softmax is used. So, answering your first question:



            Parallelizing Python code on Azure Databricks
            Asked 2022-Mar-07 at 22:17

            I'm trying to port over some "parallel" Python code to Azure Databricks. The code runs perfectly fine locally, but somehow doesn't on Azure Databricks. The code leverages the multiprocessing library, and more specifically the starmap function.

            The code goes like this:



            Answered 2021-Aug-22 at 09:31

            You should stop trying to reinvent the wheel and instead leverage the built-in capabilities of Azure Databricks. Because Apache Spark (and Databricks) is a distributed system, machine learning on it should also be distributed. There are two approaches to that:

            1. The training algorithm is implemented in a distributed fashion. A number of such algorithms are packaged into Apache Spark and included in Databricks Runtimes.

            2. Use machine learning implementations designed to run on a single node, but train multiple models in parallel. That is what typically happens during hyperparameter optimization, and it is what you're trying to do.

            The Databricks Runtime for Machine Learning includes the Hyperopt library, which is designed to find the best hyperparameters efficiently without trying all combinations of the parameters, so they are found faster. It also includes the SparkTrials API, which is designed to parallelize computations for single-machine ML models such as scikit-learn. The documentation includes a number of examples of using that library with single-node ML algorithms that you can use as a base for your work; for example, here is an example for scikit-learn.

            P.S. When you're running the code with multiprocessing, then the code is executed only on the driver node, and the rest of the cluster isn't utilized at all.



            How indexing of sliced data frame works in pandas
            Asked 2022-Mar-05 at 11:56

            How do I get the correct row from a DataFrame that has been sliced?

            To show what I mean, look at this code sample:



            Answered 2022-Mar-05 at 11:56

            How can I get the row in the X_test dataframe which is related to the first row in the model_uncertain data frame?

            You're on the right track:
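As a small sketch of the general mechanism (not the original answer's code), a sliced DataFrame keeps the index labels of its parent, so the label of a row in the slice can be used with `.loc` on the parent. The frame contents below are made up; only the names `X` and `model_uncertain` echo the question.

```python
import pandas as pd

# Parent frame with the default integer index 0..4 (made-up values).
X = pd.DataFrame({"feature": [10, 20, 30, 40, 50]})

# A boolean slice keeps the original labels (here 2, 3, 4), not 0, 1, 2.
model_uncertain = X[X["feature"] > 25]

# Label of the first row of the slice.
first_label = model_uncertain.index[0]

# .loc looks up by label, so this is the matching row in the parent frame.
row_by_label = X.loc[first_label]

# .iloc looks up by position, so position 0 of the parent is a different row.
row_by_position = X.iloc[0]

print(first_label, row_by_label["feature"], row_by_position["feature"])
```

The same label is also available as `model_uncertain.iloc[0].name`, since a row extracted with `.iloc` is a Series whose name is its original index label.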



            How to integrate pipeline fitting and hyperparameter optimisation in AWS Sagemaker?
            Asked 2022-Feb-25 at 12:57

            Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with spark as a compute backend, all in SageMaker using their Training Job api. To clarify:

            1. I have to use LightGBM in general, there is no option here.
            2. The reason I need to use spark compute backend is because the training with the current dataset does not fit in memory anymore.
            3. I want to use SageMaker Training job setting so I could use SM Hyperparameter optimisation job to find the best hyperparameters for LightGBM. While LightGBM spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.

            Now, I know the general approach to running custom training in SM: build a container in a certain way, and then just pull it from ECR and kick-off a training job/hyperparameter tuning job through sagemaker.Estimator API. Now, in this case SM would handle resource provisioning for you, would create an instance and so on. What I am confused about is that essentially, to use spark compute backend, I would need to have an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.

            Now, there is also that thing called Sagemaker Pyspark SDK. However, the provided SageMakerEstimator API from that package does not support on-the-fly cluster configuration either.

            Does anyone know a way how to run a Sagemaker training job that would use an EMR cluster so that later the same job could be used for hyperparameter tuning activities?

            One way I see is to run an EMR cluster in the background, and then just create a regular SM estimator job that would connect to the EMR cluster and do the training, essentially running a spark driver program in SM Estimator job.

            Has anyone done anything similar in the past?




            Answered 2022-Feb-25 at 12:57

            Thanks for your questions. Here are answers:

            • SageMaker PySpark SDK does the opposite of what you want: being able to call a non-spark (or spark) SageMaker job from a Spark environment. Not sure that's what you need here.

            • Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:

              • SageMaker Processing has a built-in Spark Container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (that works with Training only). If you use this, you will have to find and use a third-party, external parameter search library ; for example Syne Tune from AWS itself (that supports bayesian optimization)

              • SageMaker Training can run custom docker-based jobs, on one or multiple machines. If you can fit your Spark code within SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container

            Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and will indeed allow you to use SM Model Tuning. I'd recommend:

            • each SM job to create a new transient cluster (auto-terminate after step) to keep costs low and to avoid tuning results being polluted by inter-job contention that could arise if running everything on the same cluster.
            • use the cheapest possible instance type for the SM estimator, because it will need to stay up for the whole duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...)

            In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs for the sole purpose of leveraging the bayesian search API to find an inference configuration that minimizes cost.



            Is it possible to get the number of rows of the training set from a LGBMClassifier?
            Asked 2022-Feb-16 at 07:37

            I have trained a model using lightgbm.sklearn.LGBMClassifier from the lightgbm package. I can find out the number of columns and column names of the training data from the model but I have not found a way to find the number of rows of the training data. Is it possible to do so? The best solution would be to obtain the training data from the model but I have not come across anything like that.



            Answered 2022-Feb-16 at 02:48

            The tree structure of a LightGBM model includes information about how many records from the training data would fall into each node in the tree if that node were a leaf node. In LightGBM's code, this value is called internal_count.

            Since all data matches the root node of each tree, in most situations you can use that information to figure out, given a LightGBM model, how many instances were in the training data.

            Consider the following example, using lightgbm==3.3.2 and Python 3.8.8.



            How to pass pandas dataframe to airflow tasks
            Asked 2022-Jan-25 at 15:57

            I'm learning how to use airflow to build machine learning pipeline.

            But didn't find a way to pass pandas dataframe generated from 1 task into another task... It seems that need to convert the data to JSON format or save the data in database within each task?

            Finally, I had to put everything in 1 task... Is there anyway to pass dataframe between airflow tasks?

            Here's my code:



            Answered 2021-Nov-08 at 09:59

            Although it is used in many ETL tasks, Airflow is not the right choice for that kind of operation; it is intended for workflow, not dataflow. But there are many ways to do that without passing the whole dataframe between tasks.

            You can pass information about the data using xcom_push and xcom_pull:

            a. Save the outcome of the first task somewhere (json, csv, etc.)

            b. Pass information about the saved file (e.g. file name, path) to xcom_push.

            c. Read this filename using xcom_pull from the other task and perform the needed operation.
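Steps a to c can be sketched with two plain Python callables, run here without Airflow so the handoff is easy to follow. The function names, the file name and the sample data are hypothetical. In a real DAG these would be task callables, the returned path would be pushed to XCom (a PythonOperator pushes its return value by default), and the second task would fetch it with `xcom_pull`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def produce_data(output_dir):
    """Step a and b: save the dataframe to disk and hand back only its path."""
    df = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 0]})
    path = Path(output_dir) / "features.csv"
    df.to_csv(path, index=False)
    # In Airflow, this small string is what would travel through XCom.
    return str(path)

def consume_data(path):
    """Step c: read the file named by the upstream task and work on it."""
    df = pd.read_csv(path)
    return len(df)

# Local stand-in for the scheduler: run the first callable, pass its
# return value (the path) to the second one.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = produce_data(tmp_dir)
    n_rows = consume_data(saved_path)

print(n_rows)
```

Only the path crosses the task boundary, which keeps the XCom payload small. On a multi-worker deployment the storage location would need to be reachable from every worker, for example shared or object storage.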


            Everything above using some database tables:

            a. In task_1 you can download data from table_1 into a dataframe, process it and save it in another table_2 (df.to_sql()).

            b. Pass the name of the table using xcom_push.

            c. From the other task get table_2 using xcom_pull and read it with pd.read_sql().

            Information on how to use xcom you can get from airflow examples. Example:

            IMHO there are many other better ways, I have just written what I tried.


            Community Discussions, Code Snippets contain sources that include Stack Exchange Network


            No vulnerabilities reported

            Install LightGBM

            The primary documentation is generated from this repository. If you are new to LightGBM, follow the installation instructions on the documentation site.
            Examples showing command line usage of common tasks.
            Features and algorithms supported by LightGBM.
            Parameters is an exhaustive list of customization you can make.
            Distributed Learning and GPU Learning can speed up computation.
            Laurae++ interactive documentation is a detailed guide for hyperparameters.
            FLAML provides automated tuning for LightGBM (code examples).
            Optuna Hyperparameter Tuner provides automated tuning for LightGBM hyperparameters (code examples).
            Understanding LightGBM Parameters (and How to Tune Them using Neptune).
            How we update
            Check out the Development Guide.


          • PyPI

            pip install lightgbm

          • CLONE
          • HTTPS


          • CLI

            gh repo clone microsoft/LightGBM

          • sshUrl

