# LightGBM vs. XGBoost: Who Will Take the Crown?

Published: 2025-09-14

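## 0. Introduction

If you are active in the machine learning community, you will know about boosting machines and what they can do. The family has evolved from AdaBoost to today's widely used XGBoost, which has become a de facto winning algorithm in Kaggle competitions because of how powerful it is. When the dataset is very large, however, XGBoost can take a long time to train.

Most people are probably less familiar with Light Gradient Boosting (LightGBM), but after reading this article you should have a solid understanding of it. A natural question arises: why do we need yet another boosting algorithm, and is it better than XGBoost?

Note: this article assumes some familiarity with GBMs and XGBoost. If you are new to them, learn how they work first and then come back to this article.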

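## 1. What is LightGBM?

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. It is used for ranking, classification, regression, and many other machine learning tasks.

Like other boosting frameworks it is built on decision trees, but LightGBM grows each tree leaf-wise (best-first), always splitting the leaf that gives the largest loss reduction, whereas other boosting algorithms usually grow trees depth-wise or level-wise. When grown to the same number of leaves, a leaf-wise tree can therefore reduce the loss more than a level-wise tree, which tends to give better accuracy. LightGBM is also remarkably fast, which is where the "Light" in its name comes from.

The above is the LightGBM authors' own brief summary of what sets it apart.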
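
To make the leaf-wise versus level-wise contrast concrete, here is a toy sketch. It does not use LightGBM at all: the split gains are made-up numbers, nodes are numbered heap-style (root is 1, children of node `i` are `2i` and `2i+1`), and the function names are hypothetical. With a budget of four leaves, level-wise growth must split every node on a level, while leaf-wise growth always splits the leaf with the largest gain.

```python
import heapq

# Hypothetical split gains keyed by node id (root = 1, children of i are 2i, 2i+1).
# Nodes that are not listed are assumed to have zero gain.
GAINS = {1: 10.0, 2: 8.0, 3: 1.0, 4: 6.0, 5: 5.0, 6: 0.5, 7: 0.2}


def gain(node):
    return GAINS.get(node, 0.0)


def grow_level_wise(max_leaves):
    """Split every leaf of the current level while the leaf budget allows it."""
    leaves, total = [1], 0.0
    while len(leaves) * 2 <= max_leaves:
        total += sum(gain(n) for n in leaves)
        leaves = [child for n in leaves for child in (2 * n, 2 * n + 1)]
    return total, sorted(leaves)


def grow_leaf_wise(max_leaves):
    """Repeatedly split the single leaf with the largest gain (best-first)."""
    heap = [(-gain(1), 1)]  # max-heap via negated gains
    total = 0.0
    while len(heap) < max_leaves:
        neg_gain, node = heapq.heappop(heap)
        total += -neg_gain
        for child in (2 * node, 2 * node + 1):
            heapq.heappush(heap, (-gain(child), child))
    return total, sorted(node for _, node in heap)


print(grow_level_wise(4))  # (19.0, [4, 5, 6, 7])
print(grow_leaf_wise(4))   # (24.0, [3, 5, 8, 9])
```

With the same four leaves, the leaf-wise tree collects a total gain of 24.0 against 19.0 for the level-wise tree, but it is also one level deeper, which hints at why leaf-wise growth needs a depth limit.
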

Figure: how a decision tree grows in XGBoost (level-wise).

Figure: how a decision tree grows in LightGBM (leaf-wise).

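Leaf-wise splitting increases model complexity and can lead to overfitting. This can be countered by setting the `max_depth` parameter, which limits the maximum depth of each tree.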
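
A quick way to see why a depth limit reins in leaf-wise growth: a binary tree of depth d has at most 2^d leaves, so `max_depth` puts a hard ceiling on the leaf count no matter how many leaves are requested. The helper below is a hypothetical sketch (its name is not part of LightGBM), and it assumes the convention that a non-positive `max_depth` means "no limit".

```python
def effective_leaf_cap(num_leaves, max_depth):
    """Upper bound on the leaves of one tree, given both settings.

    Assumes a non-positive max_depth means there is no depth limit.
    """
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


# Example settings chosen for illustration only.
for num_leaves, max_depth in [(31, -1), (150, 7), (31, 3)]:
    cap = effective_leaf_cap(num_leaves, max_depth)
    print(f"num_leaves={num_leaves}, max_depth={max_depth} -> at most {cap} leaves")
```

In the second case the requested 150 leaves cannot all be used, because a tree of depth 7 has room for only 128.
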

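Next, we will cover how to install LightGBM and use it to run a model. We will then compare LightGBM with XGBoost experimentally and make the case that you should take up LightGBM in a "light manner".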

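## 2. Advantages of LightGBM

Let us start with the advantages of LightGBM.

- Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, that is, it buckets continuous feature values into discrete bins, which speeds up training.
- Lower memory usage: replacing continuous values with discrete bins reduces memory consumption.
- Better accuracy than other boosting algorithms in many cases: leaf-wise splitting produces more complex trees than level-wise splitting, which is the main factor behind the higher accuracy. It can sometimes lead to overfitting, which can be prevented by setting the `max_depth` parameter.
- Ability to handle large datasets: thanks to the reduced training time, LightGBM copes well with large-scale data.
- Support for parallel learning.

## 3. Installing LightGBM

This section describes how to install LightGBM on various operating systems. Since Windows, Linux, and macOS are the most common desktop operating systems, we cover installation on each of the three in turn.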

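### 3.1 Windows

Because Windows is not an open-source operating system, building from source on it has always been a bit more work for developers: a suitable toolchain has to be installed before the LightGBM source code can be compiled. The main C/C++ build environments on Windows are Microsoft's Visual Studio (or MSBuild) and the open-source MinGW64. Installation with each is described below.

Note that both environments require Git for Windows and CMake to be installed.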

#### 3.1.1 With Visual Studio (MSBuild)

```shell
git clone --recursive https://www./link/2aae855e4fce821396110897f2513c60 LightGBM
cd LightGBM
mkdir build
cd build
cmake -DCMAKE_GENERATOR_PLATFORM=x64 ..
cmake --build . --target ALL_BUILD --config Release
```

The resulting exe and dll files are placed in the LightGBM/Release directory.

#### 3.1.2 With MinGW64

```shell
git clone --recursive https://www./link/2aae855e4fce821396110897f2513c60 LightGBM
cd LightGBM
mkdir build
cd build
cmake -G "MinGW Makefiles" ..
mingw32-make.exe -j
```

The resulting exe and dll files are placed in the LightGBM/ directory.

### 3.2 Linux

On Linux we also build with CMake. Run the following shell commands:

```shell
git clone --recursive https://www./link/2aae855e4fce821396110897f2513c60 LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j
```

### 3.3 macOS

LightGBM depends on OpenMP to compile, which Apple's Clang does not support, so use gcc/g++ instead. Run the following commands to build:

```shell
brew install cmake
brew install gcc --without-multilib
git clone --recursive https://www./link/2aae855e4fce821396110897f2513c60 LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j
```
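
The build steps above produce the native LightGBM binaries. The code later in this article uses the Python packages `lightgbm` and `xgboost`, which are assumed here to be installed separately (this article does not cover that step). A minimal sketch for checking whether Python can find them, using only the standard library; `is_importable` is a hypothetical helper name:

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


for name in ("lightgbm", "xgboost"):
    status = "found" if is_importable(name) else "not found"
    print(f"{name}: {status}")
```

The check only locates the module without importing it, so it is safe to run even when one of the packages is missing.
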
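
Before building our first LightGBM model, let us look at some of LightGBM's parameters to better understand how it works.

## 4. Important LightGBM parameters

- `task`: default `train`; options `train`, `prediction`. Specifies the task to perform, either training or prediction.
- `application`: default `regression`; type enum. Options:
  - `regression`: regression task
  - `binary`: binary classification
  - `multiclass`: multiclass classification
  - `lambdarank`: lambdarank application
- `data`: type string. The training data that LightGBM learns from.
- `num_iterations`: default 100; type int. Number of boosting iterations, i.e. the number of boosted trees.
- `num_leaves`: default 31; type int. Number of leaves per tree.
- `device`: default `cpu`; options `cpu`, `gpu`. The device used for training. Choosing `gpu` can make training faster.
- `min_data_in_leaf`: minimum number of data points in a leaf.
- `feature_fraction`: default 1. Fraction of the features used in each iteration.
- `bagging_fraction`: default 1. Fraction of the data used in each iteration; generally used to speed up training and avoid overfitting.
- `min_gain_to_split`: default 0. Minimum gain required to perform a split.
- `max_bin`: maximum number of bins used to bucket feature values.
- `min_data_in_bin`: minimum number of data points in a bin.
- `num_threads`: default OpenMP_default; type int. Number of threads LightGBM runs with.
- `label`: type string. Specifies the label column.
- `categorical_feature`: type string. Specifies which features are treated as categorical during training.
- `num_class`: default 1; type int. Only needed for multiclass classification.

## 5. LightGBM vs. XGBoost

Now let us compare the performance of LightGBM and XGBoost by training both on the same dataset.

The dataset contains information about individuals from various countries. Our goal is to predict, from the other attributes, whether a person's annual income exceeds 50K (a binary outcome). It has 32561 observations and 14 features describing each individual. The dataset is available at http://archive.ics.uci.edu/ml/datasets/Adult.

A good grasp of the dataset's predictor variables will make the code below easier to follow.
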
```python
# Import standard libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Import LightGBM and XGBoost
import lightgbm as lgb
import xgboost as xgb

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the training dataset 'adult.csv' into 'data' using pandas
data = pd.read_csv('adult.csv', header=None)

# Assign names to the columns
data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital_Status', 'occupation', 'relationship', 'race', 'sex',
                'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country', 'Income']

# Glimpse of the dataset
data.head()

# Label-encode the target variable
l = LabelEncoder()
l.fit(data.Income)
l.classes_
data.Income = Series(l.transform(data.Income))
data.Income.value_counts()

# One-hot encode the categorical features
categorical = ['workclass', 'education', 'marital_Status', 'occupation',
               'relationship', 'race', 'sex', 'native_country']
one_hot = [pd.get_dummies(data[col]) for col in categorical]

# Remove the original categorical columns
data.drop(categorical, axis=1, inplace=True)

# Merge the one-hot encoded features with the dataset
data = pd.concat([data] + one_hot, axis=1)

# Remove duplicate columns (keeps the first occurrence of each column name)
_, i = np.unique(data.columns, return_index=True)
data = data.iloc[:, i]

# The target variable is 'Income', with values 1 or 0.
# Separate the data into the feature set x and the target y
x = data.drop('Income', axis=1)
y = data.Income

# Impute missing values in the target variable
y = y.fillna(y.mode()[0])

# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3)
```
### 5.1 Using XGBoost

```python
from datetime import datetime
from sklearn.metrics import accuracy_score

# The data is stored in a DMatrix object; label defines the outcome variable
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test)

# Set parameters for XGBoost. eta is an alias of learning_rate, so only
# learning_rate is set; verbosity=0 silences log output.
parameters = {'max_depth': 7, 'verbosity': 0, 'objective': 'binary:logistic',
              'eval_metric': 'auc', 'learning_rate': .05}

# Train the model and time it
num_round = 50
start = datetime.now()
xg = xgb.train(parameters, dtrain, num_round)
stop = datetime.now()

# Execution time of the model
# (a datetime.timedelta: days, seconds, microseconds)
execution_time_xgb = stop - start
print(execution_time_xgb)

# Predict on the test set
ypred = xg.predict(dtest)
print(ypred)

# Convert probabilities into 1 or 0 with a threshold of .5
ypred = np.where(ypred >= .5, 1, 0)

# Calculate the accuracy of the model
accuracy_xgb = accuracy_score(y_test, ypred)
print(accuracy_xgb)
```
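
The XGBoost run above is timed by subtracting two `datetime.now()` readings. As an alternative sketch, `time.perf_counter()` from the standard library is a monotonic, high-resolution clock intended for measuring short intervals. The `timed` helper below is hypothetical, and the workload is a stand-in so that the example runs without any training data.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in this article's setting it would be the training call.
result, seconds = timed(sum, range(1_000_000))
print(result, f"{seconds:.4f} s")
```

In the setting of this section, the same helper could wrap the training call, for example `timed(xgb.train, parameters, dtrain, num_round)`.
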
### 5.2 Using LightGBM

```python
from sklearn.metrics import roc_auc_score

train_data = lgb.Dataset(x_train, label=y_train)

# Set parameters for LightGBM
param = {'num_leaves': 150, 'objective': 'binary', 'max_depth': 7,
         'learning_rate': .05, 'max_bin': 200}
param['metric'] = ['auc', 'binary_logloss']
# max_depth is set to 7 in both XGBoost and LightGBM for a fair comparison.

# Train the model and time it
num_round = 50
start = datetime.now()
lgbm = lgb.train(param, train_data, num_round)
stop = datetime.now()

# Execution time of the model
execution_time_lgbm = stop - start
print(execution_time_lgbm)

# Predict on the test set
ypred2 = lgbm.predict(x_test)
print(ypred2[0:5])  # show the first 5 predictions

# Convert probabilities into 0 or 1 with a threshold of .5
ypred2 = np.where(ypred2 >= .5, 1, 0)

# Calculate accuracy
accuracy_lgbm = accuracy_score(y_test, ypred2)
print(accuracy_lgbm)
print(y_test.value_counts())

# ROC AUC for XGBoost and LightGBM. ypred and ypred2 were thresholded above,
# so these scores are computed from hard 0/1 labels, not probabilities.
auc_xgb = roc_auc_score(y_test, ypred)
print(auc_xgb)
auc_lgbm = roc_auc_score(y_test, ypred2)
print(auc_lgbm)

# Build the dataframe 'comparison_df' to compare LightGBM and XGBoost
comparison_dict = {'accuracy score': (accuracy_lgbm, accuracy_xgb),
                   'auc score': (auc_lgbm, auc_xgb),
                   'execution time': (execution_time_lgbm, execution_time_xgb)}
comparison_df = DataFrame(comparison_dict)
comparison_df.index = ['LightGBM', 'xgboost']
print(comparison_df)
```
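### 5.3 Performance comparison

The table below lists the metrics for each algorithm:

| Algorithm | Accuracy score | AUC score | Execution time (s) |
| --- | --- | --- | --- |
| LightGBM | 0.861501 | 0.764492 | 0.283759 |
| XGBoost | 0.861398 | 0.764284 | 2.047220 |
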
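On these results, LightGBM shows only a marginal improvement over XGBoost in accuracy and AUC. The key difference is the training time: LightGBM trained about 7 times faster than XGBoost here (0.28 s versus 2.05 s), and the gap can be expected to grow as the amount of training data increases.

This illustrates LightGBM's advantage when training on large datasets, especially in competitions where time is limited.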
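
One detail is worth keeping in mind when reading the AUC column: in the code above, the predictions are thresholded to 0/1 before `roc_auc_score` is called. AUC computed from hard labels generally differs from AUC computed from the predicted probabilities, and is typically lower. The sketch below shows the effect on made-up numbers with a small pure-Python AUC; `roc_auc` is a hypothetical helper written for this illustration, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
probs = [0.1, 0.4, 0.45, 0.8, 0.9, 0.6]
hard_labels = [1 if p >= 0.5 else 0 for p in probs]

auc_from_probs = roc_auc(y_true, probs)
auc_from_labels = roc_auc(y_true, hard_labels)
print(f"AUC from probabilities: {auc_from_probs:.3f}")  # 0.889
print(f"AUC from 0/1 labels:    {auc_from_labels:.3f}")  # 0.667
```

Thresholding discards the ranking among predictions on the same side of 0.5, which is why the second number is lower in this example.
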
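### 5.4 Detailed comparison

| Item | XGBoost | LightGBM |
| --- | --- | --- |
| Regularization | L1/L2 | L1/L2 |
| Column sampling | Yes | Yes |
| Exact gradient | Yes | Yes |
| Approximate algorithm | Yes | No |
| Sparse data | Yes | Yes |
| Distributed parallelism | Yes | Yes |
| Cache | Yes | No |
| Out of core | Yes | No |
| Weighted data | Yes | Yes |
| Tree growth | level-wise | leaf-wise |
| Underlying algorithm | pre-sorted | histogram |
| Maximum tree depth control | Yes | Yes |
| Dropout | No | Yes |
| Bagging | Yes | Yes |
| Use cases | regression, classification, rank | regression, classification, lambdarank |
| GPU support | No | Yes |
| Network communication | point-to-point | collective communication |
| Categorical features | not optimized | optimized |
| Continued training with input GBDT model | No | Yes |
| Continued training with input | No | Yes |
| Early stopping (training and prediction) | No | Yes |
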
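
The histogram-based approach that this comparison attributes to LightGBM can be sketched in a few lines. The example below is a simplification under stated assumptions: it uses plain equal-frequency cut points, which is not necessarily how LightGBM constructs its bins, and the function names are hypothetical. Only the `max_bin` name comes from the parameter list earlier in this article.

```python
from bisect import bisect_right
from collections import Counter


def make_bin_edges(values, max_bin):
    """Pick up to max_bin - 1 cut points at evenly spaced ranks."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, max_bin):
        cut = ordered[(k * n) // max_bin]
        if not edges or cut > edges[-1]:  # skip repeated cut points
            edges.append(cut)
    return edges


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


# A made-up continuous feature, for illustration only.
feature = [0.1, 0.4, 0.35, 2.0, 5.5, 7.2, 9.9, 3.3]
edges = make_bin_edges(feature, max_bin=4)
bins = assign_bins(feature, edges)

print(edges)                          # [0.4, 3.3, 7.2]
print(bins)                           # [0, 1, 0, 1, 2, 3, 3, 2]
print(sorted(Counter(bins).items()))  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

After binning, a split search only has to consider the bin boundaries (at most `max_bin - 1` candidates) rather than every distinct feature value, and each value can be stored as a small integer.
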
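## 6. Tuning LightGBM parameters

LightGBM uses leaf-wise rather than depth-wise splitting, which lets it converge faster but can also lead to overfitting. Here is a quick guide to tuning its parameters.

### 6.1 For the best fit

- `num_leaves`: sets the number of leaves per tree. In theory, `num_leaves = 2^(max_depth)`, which is the leaf count of a full tree of that depth. For LightGBM this is a poor estimate, because it splits leaf-wise rather than depth-wise, and a leaf-wise tree with the same number of leaves is usually much deeper. `num_leaves` should therefore be set to a value smaller than `2^(max_depth)`, otherwise the model may overfit. Beyond this upper bound there is no direct relationship between `num_leaves` and `max_depth`, so the two should be tuned separately.
- `min_data_in_leaf`: another important parameter for controlling overfitting. Setting it too small can cause overfitting, so it should be chosen to suit the data. For large datasets, values in the hundreds or thousands are appropriate.
- `max_depth`: the maximum depth of each tree, i.e. an upper limit on the number of levels it can grow.

### 6.2 For faster speed

- `bagging_fraction`: trains each iteration on a subsample of the data (bagging), which speeds things up.
- `feature_fraction`: sets the subset of features used in each iteration.
- `max_bin`: a smaller `max_bin` saves time, because bucketing feature values into fewer bins is computationally cheaper.

### 6.3 For better accuracy

- Use a larger training dataset.
- `num_leaves`: a larger value makes trees deeper and raises accuracy, but setting it too high leads to overfitting.
- `max_bin`: a larger value has an effect similar to increasing `num_leaves`, and it also slows training down.

## 7. Conclusion

In this article I have tried to give an intuitive picture of LightGBM. One drawback of using it today is its relatively small user base, though that is likely to change soon. Although it is faster than XGBoost and, in the experiment above, marginally more accurate, it is still not widely used, largely because little documentation is available.

Even so, the algorithm has shown very strong results compared with other existing boosting algorithms. I strongly recommend trying LightGBM alongside other boosting algorithms and seeing the difference for yourself.

It may be too early to crown LightGBM, but it has certainly challenged XGBoost's position. One word of caution: as with any other machine learning algorithm, make sure you tune the parameters properly before training a model.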