How to choose the number of estimators for Gradient Boosting

In Data Science, there are many algorithms available these days. One useful technique is to combine them in a single model to get the best out of each, resulting in a more accurate model.

In Scikit-Learn you will find the Random Forest algorithm, which is the bagging kind of ensemble model. On the other hand, you will also find boosting models, which train the estimators in sequence, passing the result of one model to the next one, which tries to improve the predictions, until they reach an optimal result.

When creating a Gradient Boosting estimator, you will find the hyperparameter n_estimators, with a default value of 100 trees to be created to get to a result. Many times, we just leave it at the default or maybe increase it as needed, even using Grid Search techniques.

In this post, we will see a simple way to arrive at a single number to use when training our model.

Gradient Boosting can be imported from Scikit-Learn with from sklearn.ensemble import GradientBoostingRegressor. The Gradient Boosting algorithm can be used either for classification or for regression. It is a tree-based estimator, meaning that it is composed of many decision trees.
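
For reference, the default number of trees can be checked directly on an instance:

from sklearn.ensemble import GradientBoostingRegressor
# The default ensemble size is 100 trees
gbr = GradientBoostingRegressor()
print(gbr.n_estimators)
[OUT]: 100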

The result of Tree 1 will generate errors. Those errors will be used as the input for Tree 2. Once again, the errors of the last model will be used as the input for the next one, until it reaches the n_estimators value.

Each model will fit the errors of the previous one. Image by the author.

Since each estimator fits the error of the previous one, the expectation is that the combination of the predictions will be better than any of the estimators alone. After each iteration, we make the model more complex, reducing bias but, on the flip side, increasing variance. So we must know when to stop.
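
To make this sequence concrete, here is a minimal hand-rolled sketch of two boosting stages on toy data (illustrative only; scikit-learn's actual implementation also applies a learning rate and loss-specific gradients):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy quadratic
rng = np.random.RandomState(0)
X_toy = rng.uniform(-1, 1, size=(100, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(0, 0.05, size=100)

# Tree 1 fits the target; Tree 2 fits the errors (residuals) of Tree 1
tree1 = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy)
residuals = y_toy - tree1.predict(X_toy)
tree2 = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)

# The combined prediction is the sum of the stages
y_pred = tree1.predict(X_toy) + tree2.predict(X_toy)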

Let’s see how to do that now.

The code for this exercise is simple. All we must do is loop over the predictions after each boosting iteration and check at which one we had the lowest error.

Let’s begin by choosing a dataset. We will use the car_crashes dataset, which ships with the seaborn library (so it is open data, under the BSD license).

# Dataset
import seaborn as sns
df = sns.load_dataset('car_crashes')

Here’s a quick look at the data. We will try to estimate total using the other features as predictors. Since the output is a real number, we’re talking about a regression model.
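
If you are following along, df.head() reproduces the view shown below:

# Peek at the first rows of the data
df.head()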

Car Crashes dataset, from seaborn. Image by the author.

Quickly looking at the correlations.

# Correlations (numeric columns only)
df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')
Correlations in the dataset. Image by the author.

OK, no major multicollinearity. We can see that ins_premium and ins_losses don’t correlate very well with total, so we will not consider them in the model.

If we check for missing data, there is none:

# Missing values per column
df.isnull().sum()
[OUT]: 0 for every column

Nice, so let’s split the data now.

# X and y
from sklearn.model_selection import train_test_split
X = df.drop(['ins_premium', 'ins_losses', 'abbrev', 'total'], axis=1)
y = df['total']
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

We can create a pipeline to scale the data and model it (scaling is really not very necessary here, since the values are already on the same scale, in the tens). Next, we fit the model and predict the results.

I am using 500 estimators with a learning_rate of 0.3.

The learning rate is the size of the step we take to get to the minimum error. If we use a value that is too high, we may step past the minimum. If we use one that is too small, we may not even get close to it. So, a rule of thumb you can consider: with a large number of estimators, you can afford lower values of learning rate; with just a few estimators, prefer higher values.
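
To illustrate that trade-off, here is a quick sketch reusing the split from above (the learning rates are illustrative values, not a recommendation):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Same ensemble size, different step sizes
for lr in [0.5, 0.3, 0.1, 0.03]:
    model = GradientBoostingRegressor(n_estimators=500, learning_rate=lr,
                                      random_state=22).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'learning_rate={lr}: test RMSE = {rmse:.2f}')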

# Pipeline: scaler + model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
steps = [('scale', StandardScaler()),
         ('GBR', GradientBoostingRegressor(n_estimators=500, learning_rate=0.3))]
# Instantiate the Pipeline and fit
pipe = Pipeline(steps).fit(X_train, y_train)
# Predict
preds = pipe.predict(X_test)

Now, evaluating.

# RMSE of the predictions
import numpy as np
from sklearn.metrics import mean_squared_error
print(f'RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 1)}')
[OUT]: RMSE: 1.1
# Mean of the true y values
print(f'Data y mean: {round(y.mean(), 1)}')
[OUT]: Data y mean: 15.8

Good. Our RMSE is about 6.9% of the mean. So we’re off by this much, on average.
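
That percentage is just the RMSE divided by the mean of the target, reusing the objects from above:

# Relative error: RMSE as a fraction of the target mean
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f'RMSE is {rmse / y.mean():.1%} of the mean')  # ≈ 6.9% with this split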

Now let’s look at a way to tune our model by choosing the optimal number of estimators to train, the one that gives us the lowest error.

Like I said, we don’t really have to scale this data because it is already on the same scale. So let’s fit the model, this time without the pipeline.

# Model
gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.3).fit(X_train, y_train)

Now for the good stuff. Gradient Boosting has a method, staged_predict(), that lets us iterate over the predictions of each stage of the trained ensemble, from 1 to 500 estimators, without refitting anything. So, we will create a loop that goes through the 500 stages of the gbr model, predicts the results with staged_predict(), calculates the mean squared error, and stores each result in the list errors.

# Loop over the staged predictions and store the error at each stage
errors = [mean_squared_error(y_test, preds) for preds in gbr.staged_predict(X_test)]
# Optimal number of estimators (stages are 0-indexed, hence the +1)
best_n_estimators = np.argmin(errors) + 1

Next, we can plot the result.

# Plot the error at each ensemble size
g = sns.lineplot(x=range(1, 501), y=errors)
g.set_title(f'Best number of estimators at {best_n_estimators}', size=15);
Best number of estimators. Image by the author.

We see that the lowest error rate is with 34 estimators. So, let’s retrain our model with 34 estimators and compare with the result from the model trained with the pipeline.

# Retrain
gbr = GradientBoostingRegressor(n_estimators=34, learning_rate=0.3).fit(X_train, y_train)
# Predictions
preds2 = gbr.predict(X_test)

Evaluating…

# RMSE of the predictions
print(f'RMSE: {round(np.sqrt(mean_squared_error(y_test, preds2)), 1)}')
[OUT]: RMSE: 1.0
# Mean of the true y values
print(f'Data y mean: {round(y.mean(), 1)}')
[OUT]: Data y mean: 15.8

We went from being 6.9% off to 6.3% off, roughly a 9% relative improvement ((6.9 − 6.3) / 6.9 ≈ 0.09). Let’s look at a few predictions.

Predictions from both models. Image by the author.

Interesting results. Some of the predictions of the second model are better than those of the first.

We learned how to determine the best number of estimators for tweaking a GradientBoostingRegressor from Scikit-Learn. This hyperparameter can make a difference in this kind of ensemble model, which trains its estimators in sequence.

Sometimes, after a few iterations, the model starts to overfit: its variance increases too much, hurting the predictions.

We saw that a simple loop can help us find the optimal solution in this case. But for large datasets this can be expensive to compute, so an idea would be to try a lower n_estimators first and see whether you reach the minimum error soon enough.
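
As a side note, scikit-learn’s GradientBoostingRegressor also supports built-in early stopping through the validation_fraction and n_iter_no_change parameters, which avoids training all the trees in the first place. A sketch with illustrative parameter values:

# Stop adding trees once the internal validation score stops improving
gbr_es = GradientBoostingRegressor(
    n_estimators=500,         # upper bound, not necessarily reached
    learning_rate=0.3,
    validation_fraction=0.1,  # share of the training data held out internally
    n_iter_no_change=10,      # patience, in boosting iterations
    random_state=22,
).fit(X_train, y_train)
print(gbr_es.n_estimators_)   # number of stages actually fitted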

Here’s the complete code on GitHub.

If you liked this content, follow my blog.

Find me on LinkedIn.

This exercise was based on the excellent textbook by Aurélien Géron, in the reference.
