Exploring LightGBM: Leaf-Wise Growth with GBDT and GOSS

LightGBM is a highly efficient gradient boosting framework. It has gained traction for its speed and performance, particularly with large and complex datasets. Developed by Microsoft, it is designed to handle large volumes of data more efficiently than traditional gradient boosting implementations.

In this post, we will experiment with the LightGBM framework on the Ames Housing dataset. In particular, we will shed some light on its versatile boosting strategies: Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These strategies offer distinct advantages, and we will compare their performance and characteristics.

We begin by setting up LightGBM and proceed to examine its application in both theoretical and practical contexts.

Let’s get started.

LightGBM
Photo by Marcus Dall Col. Some rights reserved.

Overview

This post is divided into four parts; they are:

  • Introduction to LightGBM and Initial Setup
  • Testing LightGBM’s GBDT and GOSS on the Ames Dataset
  • Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy
  • Comparing Feature Importance in LightGBM’s GBDT and GOSS Models

Introduction to LightGBM and Initial Setup

LightGBM (Light Gradient Boosting Machine) was developed by Microsoft. It is a machine learning framework that provides the components and utilities needed to build, train, and deploy machine learning models. The models are based on decision tree algorithms and use gradient boosting at their core. The framework is open source and can be installed on your system using the following command:

pip install lightgbm

This command will download and install the LightGBM package along with its necessary dependencies.
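
Once installed, a quick way to confirm that the package is importable is to print its version. This is a minimal sanity check; the version number shown will depend on your installation.

# Quick sanity check that LightGBM is installed and importable
import lightgbm as lgb

print(lgb.__version__)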

While LightGBM, XGBoost, and Gradient Boosting Regressor (GBR) are all based on the principle of gradient boosting, several key distinctions set LightGBM apart due to both its default behaviors and a range of optional parameters that enhance its functionality:

  • Exclusive Feature Bundling (EFB): As a default feature, LightGBM employs EFB to reduce the number of features, which is particularly useful for high-dimensional sparse data. This process is automatic, helping to manage data dimensionality efficiently without extensive manual intervention.
  • Gradient-Based One-Side Sampling (GOSS): Available as an optional boosting strategy, GOSS retains instances with large gradients. The gradient represents how much the loss function would change if the model’s prediction for that instance changed slightly; a large gradient means the current prediction is far from the actual target value. Such instances are often referred to as “under-trained” because they mark areas where the model still performs poorly and needs more focus during training. GOSS therefore keeps all large-gradient instances in its sampling process, ensuring these critical data points are always included in the training subset. Instances with small gradients, whose predictions are already close to the actual values and are considered “well-trained,” are only randomly sampled (and re-weighted to keep the data distribution roughly unbiased), which speeds up training while concentrating effort where the model needs the most improvement.
  • Leaf-wise Tree Growth: Whereas both GBR and XGBoost typically grow trees level-wise, LightGBM’s default tree-growth strategy is leaf-wise. Unlike level-wise growth, where all nodes at a given depth are split before moving to the next level, LightGBM grows trees by splitting the leaf that yields the largest decrease in the loss function. This can produce asymmetric, irregular trees of greater depth, which can be more expressive and efficient than balanced trees grown level-wise. A short sketch of how GOSS and leaf-wise growth are controlled through the scikit-learn API follows this list.
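
As a minimal sketch of how these options surface in LightGBM’s scikit-learn API, the snippet below enables GOSS and shows the main knobs for leaf-wise growth. The parameter values shown are LightGBM’s defaults and are purely illustrative, not recommendations.

# Minimal sketch: enabling GOSS and constraining leaf-wise growth
# (values shown are LightGBM's defaults, used here purely for illustration)
import lightgbm as lgb

model = lgb.LGBMRegressor(
    boosting_type='goss',   # retain large-gradient rows, sample the rest
    top_rate=0.2,           # fraction of largest-gradient rows always kept
    other_rate=0.1,         # fraction of the remaining rows randomly sampled
    num_leaves=31,          # cap on leaves per tree under leaf-wise growth
    max_depth=-1            # -1 leaves depth unbounded; set a positive value to limit it
)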

These are a few characteristics of LightGBM that differentiate it from the traditional GBR and XGBoost. With these unique advantages in mind, we are prepared to delve into the empirical side of our exploration.

Testing LightGBM’s GBDT and GOSS on the Ames Dataset

Building on our understanding of LightGBM’s distinct features, this segment shifts from theory to practice. We will utilize the Ames Housing dataset to rigorously test two specific boosting strategies within the LightGBM framework: the standard Gradient Boosting Decision Tree (GBDT) and the innovative Gradient-based One-Side Sampling (GOSS). We aim to explore these techniques and provide a comparative analysis of their effectiveness.

# Import libraries to run LightGBM
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
 
# Load the Ames Housing Dataset
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
 
# Convert categorical columns to 'category' dtype
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
 
# Define the default GBDT model
gbdt_model = lgb.LGBMRegressor()
gbdt_scores = cross_val_score(gbdt_model, X, y, cv=5, scoring='r2')
print(f"Average R² score for default Light GBM (with GBDT): {gbdt_scores.mean():.4f}")
 
# Define the GOSS model
goss_model = lgb.LGBMRegressor(boosting_type='goss')
goss_scores = cross_val_score(goss_model, X, y, cv=5, scoring='r2')
print(f"Average R² score for Light GBM with GOSS: {goss_scores.mean():.4f}")

Results:

Average R² score for default Light GBM (with GBDT): 0.9145
Average R² score for Light GBM with GOSS: 0.9109

The initial results from our 5-fold cross-validation experiments provide intriguing insights into the performance of the two models. The default GBDT model achieved an average R² score of 0.9145, demonstrating robust predictive accuracy. On the other hand, the GOSS model, which specifically targets instances with large gradients, recorded a slightly lower average R² score of 0.9109.

The slight difference in performance might be attributed to the way GOSS prioritizes certain data points over others, which can be particularly beneficial in datasets where mispredictions are more concentrated. However, in a relatively homogeneous dataset like Ames, the advantages of GOSS may not be fully realized.
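
One way to probe whether GOSS’s sampling is what holds it back on this dataset is to vary its top_rate and other_rate parameters, which control how many large-gradient and small-gradient instances are kept. The sketch below reruns the same 5-fold cross-validation over a few illustrative settings; it assumes X and y are prepared exactly as in the code above.

# Sketch: vary GOSS sampling rates and rerun the same cross-validation
# Assumes X and y are prepared as in the code block above
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

for top_rate, other_rate in [(0.1, 0.1), (0.2, 0.1), (0.3, 0.2)]:
    model = lgb.LGBMRegressor(boosting_type='goss',
                              top_rate=top_rate,      # share of largest-gradient rows kept
                              other_rate=other_rate)  # share of remaining rows sampled
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"top_rate={top_rate}, other_rate={other_rate}: "
          f"Average R² score = {scores.mean():.4f}")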

Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy

One of the distinguishing features of LightGBM is its ability to construct decision trees leaf-wise rather than level-wise. This leaf-wise approach allows trees to grow by optimizing loss reductions, potentially leading to better model performance but posing a risk of overfitting if not properly tuned. In this section, we explore the impact of varying the number of leaves in a tree.

# Experiment with Leaf-wise Tree Growth
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score
 
# Load the Ames Housing Dataset
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
 
# Convert categorical columns to 'category' dtype
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
 
# Define a range of leaf sizes to test
leaf_sizes = [5, 10, 15, 31, 50, 100]
 
# Results storage
results = {}
 
# Experiment with different leaf sizes for GBDT
results['GBDT'] = {}
print("Testing different 'num_leaves' for GBDT:")
for leaf_size in leaf_sizes:
    model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=leaf_size)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results['GBDT'][leaf_size] = scores.mean()
    print(f"num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}")
 
# Experiment with different leaf sizes for GOSS
results['GOSS'] = {}
print("\nTesting different 'num_leaves' for GOSS:")
for leaf_size in leaf_sizes:
    model = lgb.LGBMRegressor(boosting_type='goss', num_leaves=leaf_size)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results['GOSS'][leaf_size] = scores.mean()
    print(f"num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}")

Results:

Testing different 'num_leaves' for GBDT:
num_leaves = 5: Average R² score = 0.9150
num_leaves = 10: Average R² score = 0.9193
num_leaves = 15: Average R² score = 0.9158
num_leaves = 31: Average R² score = 0.9145
num_leaves = 50: Average R² score = 0.9111
num_leaves = 100: Average R² score = 0.9101
 
Testing different 'num_leaves' for GOSS:
num_leaves = 5: Average R² score = 0.9151
num_leaves = 10: Average R² score = 0.9168
num_leaves = 15: Average R² score = 0.9130
num_leaves = 31: Average R² score = 0.9109
num_leaves = 50: Average R² score = 0.9117
num_leaves = 100: Average R² score = 0.9124

The results from our cross-validation experiments provide insightful data on how the num_leaves parameter influences the performance of GBDT and GOSS models. Both models perform optimally at a num_leaves setting of 10, achieving the highest R² scores. This indicates that a moderate level of complexity suffices to capture the underlying patterns in the Ames Housing dataset without overfitting. This finding is particularly interesting, given that the default setting for num_leaves in LightGBM is 31.

For GBDT, increasing the number of leaves beyond 10 leads to a steady decrease in performance, suggesting that the extra complexity detracts from the model’s ability to generalize. GOSS appears slightly more tolerant of higher leaf counts, with scores recovering a little at 50 and 100 leaves, although they never return to the peak reached at 10 leaves.

This experiment underscores the importance of tuning num_leaves in LightGBM. By carefully selecting this parameter, we can effectively balance model accuracy and complexity, ensuring robust performance across different data scenarios. Further experimentation with other parameters in conjunction with num_leaves could potentially unlock even better performance and stability.
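
As a starting point for that further experimentation, here is a hedged sketch that pairs num_leaves=10 with two other common complexity controls, max_depth and min_child_samples. The grid is illustrative rather than exhaustive, and the code assumes X and y are prepared as in the earlier blocks.

# Sketch: combine num_leaves=10 with other complexity controls
# Assumes X and y are prepared as in the earlier code blocks
from itertools import product

import lightgbm as lgb
from sklearn.model_selection import cross_val_score

for max_depth, min_child in product([-1, 4, 6], [10, 20]):
    model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=10,
                              max_depth=max_depth,          # -1 means no depth limit
                              min_child_samples=min_child)  # minimum rows per leaf
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"max_depth={max_depth}, min_child_samples={min_child}: "
          f"Average R² score = {scores.mean():.4f}")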

Comparing Feature Importance in LightGBM’s GBDT and GOSS Models

After fine-tuning the num_leaves parameter and assessing the basic performance of the GBDT and GOSS models, we now shift our focus to understanding the influence of individual features within these models. In this section, we explore the most important features by each boosting strategy through visualization.

Here is the code that achieves this:

# Importing libraries to compare feature importance between GBDT and GOSS: 
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import seaborn as sns
 
# Prepare data
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))
 
# Set up K-fold cross-validation
kf = KFold(n_splits=5)
gbdt_feature_importances = []
goss_feature_importances = []
 
# Iterate over each split
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train GBDT model with optimal num_leaves
    gbdt_model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=10)
    gbdt_model.fit(X_train, y_train)
    gbdt_feature_importances.append(gbdt_model.feature_importances_)
    
    # Train GOSS model with optimal num_leaves
    goss_model = lgb.LGBMRegressor(boosting_type='goss', num_leaves=10)
    goss_model.fit(X_train, y_train)
    goss_feature_importances.append(goss_model.feature_importances_)
 
# Average feature importance across all folds for each model
avg_gbdt_feature_importance = np.mean(gbdt_feature_importances, axis=0)
avg_goss_feature_importance = np.mean(goss_feature_importances, axis=0)
 
# Convert to DataFrame
feat_importances_gbdt = pd.DataFrame({'Feature': X.columns, 'Importance': avg_gbdt_feature_importance})
feat_importances_goss = pd.DataFrame({'Feature': X.columns, 'Importance': avg_goss_feature_importance})
 
# Sort and take the top 10 features
top_gbdt_features = feat_importances_gbdt.sort_values(by='Importance', ascending=False).head(10)
top_goss_features = feat_importances_goss.sort_values(by='Importance', ascending=False).head(10)
 
# Plotting
plt.figure(figsize=(16, 12))
plt.subplot(1, 2, 1)
sns.barplot(data=top_gbdt_features, y='Feature', x='Importance', orient='h', palette='viridis')
plt.title('Top 10 LightGBM GBDT Features', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=14)
 
plt.subplot(1, 2, 2)
sns.barplot(data=top_goss_features, y='Feature', x='Importance', orient='h', palette='viridis')
plt.title('Top 10 LightGBM GOSS Features', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=14)
 
plt.tight_layout()
plt.show()

Using the same Ames Housing dataset, we applied a k-fold cross-validation method to maintain consistency with our previous experiments. However, this time, we concentrated on extracting and analyzing the feature importance from the models. Feature importance, which indicates how useful each feature is in constructing the boosted decision trees, is crucial for interpreting the behavior of machine learning models. It helps in understanding which features contribute most to the predictive power of the model, providing insights into the underlying data and the model’s decision-making process.

Here’s how we performed the feature importance extraction:

  1. Model Training: Each model (GBDT and GOSS) was trained across different folds of the data with the optimal num_leaves parameter set to 10.
  2. Importance Extraction: After training, each model’s feature importance was extracted. This importance reflects the number of times a feature is used to make key decisions (splits) in the trees; a sketch of the gain-based alternative follows this list.
  3. Averaging Across Folds: The importance was averaged over all folds to ensure that our results were stable and representative of the model’s performance across different subsets of the data.
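
Note that feature_importances_ reports split-based importance by default. A minimal sketch of the gain-based alternative, which weights each feature by the total loss reduction its splits achieve, is shown below; it assumes a fitted model such as the gbdt_model from the loop above.

# Sketch: gain-based importance as an alternative to split counts
# Assumes gbdt_model has been fitted as in the loop above
import pandas as pd

gain_importance = gbdt_model.booster_.feature_importance(importance_type='gain')
gain_df = pd.DataFrame({'Feature': X.columns, 'Importance': gain_importance})
print(gain_df.sort_values(by='Importance', ascending=False).head(10))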

The visualizations produced by the plotting code above succinctly present these differences in feature importance between the GBDT and GOSS models.

The analysis revealed interesting patterns in feature prioritization by each model. Both the GBDT and GOSS models exhibited a strong preference for “GrLivArea” and “LotArea,” highlighting the fundamental role of property size in determining house prices. Additionally, both models ranked “Neighborhood” highly, underscoring the importance of location in the housing market.

However, the models began to diverge in their prioritization from the fourth feature onwards. The GBDT model showed a preference for “BsmtFinSF1,” indicating the value of finished basements. On the other hand, the GOSS model, which prioritizes instances with larger gradients to correct mispredictions, emphasized “OverallQual” more strongly.

As we conclude this analysis, it’s evident that the differences in feature importance between the GBDT and GOSS models provide valuable insights into how each model perceives the relevance of various features in predicting housing prices.


Author: Vinod Chugani