Multi-Output Gradient Boosting for probability distributions

TLDR – Some thoughts on Gradient Boosting with multidimensional output and how to apply it to output full probability distributions in boosting models. If you need more than just point estimates for your boosting problem, this article might be helpful.

Introduction

Gradient Boosting is arguably one of the most popular Machine Learning algorithms nowadays. Combining multiple weak learners in order to generate a strong one seems almost too good to be true at first.

Nevertheless, packages like xgboost regularly shine in many online competitions. Their ability to ‘just work’ makes boosting models a favorable tool in industry applications, too.

Popular libraries like LightGBM or the aforementioned xgboost provide many tools for a variety of different use-cases. One particular feature, however, namely boosting with arbitrary multi-output losses, doesn’t seem to be available in these packages yet.

A quick Google search will provide some explanations on how to use sklearn’s MultiOutputRegressor in such cases.
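
For illustration, a minimal sketch of that approach could look as follows; the toy dataset and hyperparameters below are placeholders of my own choosing, not part of any recommended recipe:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

#toy dataset with two continuous target dimensions (purely illustrative)
X, y = make_regression(n_samples=500, n_features=10, n_targets=2, random_state=0)

#fit one independent Gradient Boosting model per output dimension
model = MultiOutputRegressor(GradientBoostingRegressor()).fit(X, y)
print(model.predict(X).shape)  #(500, 2) - one column per output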

This solution can work for suitable loss functions. Still, there are cases where the MultiOutputRegressor-approach would fail. In these situations, we need to dive deeper into the background of Gradient Boosting and implement some elements manually.

“There are cases where the intuitive solution for Multi-Output Gradient Boosting will fail. In these situations, we need to dive deeper into the theory of the algorithm and implement some elements manually.”

Hence, today, I want to give you an idea of how Multi-Output Gradient Boosting can be done ‘manually’. The focus will be on ‘boosting’ multiple parameters of a target variable’s probability distribution. If you are not interested in the theory, feel free to skip the following section.


Theory (skip this section at your liking)

The general idea behind Gradient Boosting can roughly be summarized in three steps:

  1. Use an initial shallow learner to minimize a given loss
  2. Let a second shallow learner learn to predict the loss’ derivative with respect to the first shallow learner’s prediction
  3. Add the second learner’s prediction times a constant to the first learner’s prediction

The constant in 3) is chosen such that the combined prediction minimizes the target loss. Afterwards, we repeat steps 2) and 3) until a maximum number of iterations is reached.

NOTE: Most explanations of Gradient Boosting consider the negative derivative in 2). As in standard gradient descent, this is necessary for minimization. However, since the learned constant in 3) can take either sign, the sign of the derivative can be ignored.

For simplicity, let us therefore work with the positive derivatives.

In-depth explanation of step 1) – Initialization

Technically, we start with an initial prediction at iteration k=0:
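
$$F_0(x) = c$$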

This initial prediction can be an optimized constant such that
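
$$c = \arg\min_{\tilde{c}} \sum_{i=1}^{N} L(y_i, \tilde{c})$$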

for an arbitrary loss. For simplicity, we will define a fixed constant without optimizing it first.

For a continuous target, the loss function is usually the mean squared error (MSE) although that is not the only possibility. As you can see in the xgboost documentation, there are many more options available.

The only requirement for our loss is that it has to be differentiable with respect to the predictions.

In-depth explanation of step 2) – Gradient estimation

Obviously, we need to calculate the loss function’s derivative in this step. It’s easiest to start with the general formulation and then look at a concrete example. For an arbitrary, properly differentiable loss we have:
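
$$\frac{\partial L\big(y, F_{k-1}(x)\big)}{\partial F_{k-1}(x)}$$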

Next, let us use the MSE for an actual example. The MSE is simply
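
$$L\big(y, F_{k-1}(x)\big) = \big(y - F_{k-1}(x)\big)^2$$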

Plugging this into the above derivative formula, we get
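
$$\frac{\partial L\big(y, F_{k-1}(x)\big)}{\partial F_{k-1}(x)} = -2\big(y - F_{k-1}(x)\big)$$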

Hence, Gradient Boosting on an MSE loss is equivalent to using the estimation residuals. Notice that this is not the case for Gradient Boosting in general.

The next-iteration learner now has to estimate the residual:
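
$$f_k(x) \approx y - F_{k-1}(x)$$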

Or, for the general case:
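
$$f_k(x) \approx \frac{\partial L\big(y, F_{k-1}(x)\big)}{\partial F_{k-1}(x)}$$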

This is done per observation in your dataset. For clarity, let us introduce a subscript index i to denote a single instance:
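
$$f_k(x_i) \approx \frac{\partial L\big(y_i, F_{k-1}(x_i)\big)}{\partial F_{k-1}(x_i)}, \quad i = 1, \dots, N$$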

In-depth explanation of step 3) – Model update

Finally, we need to update the full model as follows:
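
$$F_k(x) = F_{k-1}(x) + \gamma_k f_k(x)$$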

This is done through a simple optimization problem over the full dataset of size N:
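
$$\gamma_k = \arg\min_{\gamma} \sum_{i=1}^{N} L\big(y_i, F_{k-1}(x_i) + \gamma f_k(x_i)\big)$$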

We can either provide gradient information to an optimizer or use e.g. scipy.optimize.minimize for black-box (without explicit derivatives) optimization.

From here, we only need to re-iterate the above steps until we have a model with K+1 shallow learners in total:
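
$$F_K(x) = F_0(x) + \sum_{k=1}^{K} \gamma_k f_k(x)$$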

Personally, I find it pretty amazing that such a simple algorithm produces such great results in practice.


Multi-Output Gradient Boosting

Finally, we can switch to the actual topic of this article. Our most important element for this case is a proper multi-output loss function. That loss has to be able to condense outputs from multiple models into a single quantity.

Let us introduce some notation for the multi-output case with M outputs:
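
$$F(x) = \big(F^{(1)}(x), \dots, F^{(M)}(x)\big), \qquad y = \big(y^{(1)}, \dots, y^{(M)}\big)$$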

Multi-Output MSE

The easiest extension for multi-output, continuous regression is the sum of individual MSEs:
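
$$L\big(y, F(x)\big) = \sum_{m=1}^{M} \big(y^{(m)} - F^{(m)}(x)\big)^2$$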

Now, we need to calculate the derivative for each output which yields
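
$$\frac{\partial L\big(y, F(x)\big)}{\partial F^{(m)}(x)} = -2\big(y^{(m)} - F^{(m)}(x)\big)$$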

This tells us, essentially, that we can run a separate Gradient Boosting instance for each output. In such cases, the MultiOutputRegressor will work without further ado.

To demonstrate why such separation is not necessarily possible, consider a multi-class problem with M target classes.

Multi-Class Gradient Boosting

We use a multi-class crossentropy loss and softmax-transform the outputs to obtain valid class probabilities. This yields
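
$$L\big(y, F(x)\big) = -\sum_{m=1}^{M} y^{(m)} \log \frac{\exp\big(F^{(m)}(x)\big)}{\sum_{j=1}^{M} \exp\big(F^{(j)}(x)\big)}$$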

Using onehot-encoding for the targets, only one element in the above sum is non-zero. We denote the index of that element as m* and get rid of the sum:
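
$$L\big(y, F(x)\big) = -\log \frac{\exp\big(F^{(m^*)}(x)\big)}{\sum_{j=1}^{M} \exp\big(F^{(j)}(x)\big)}$$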

Now, for the derivatives, we obtain
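
$$\frac{\partial L\big(y, F(x)\big)}{\partial F^{(m)}(x)} = \frac{\exp\big(F^{(m)}(x)\big)}{\sum_{j=1}^{M} \exp\big(F^{(j)}(x)\big)} - \mathbb{1}\{m = m^*\}$$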

As you can see, the derivative for each output depends on the outputs of all other Gradient Boosting instances. Hence, it is not possible anymore to just treat this multi-output problem as M separate problems.

This is different from the MSE example above.

The broader implication is that we can only use the MultiOutputRegressor-approach from before if the loss decomposes into independent per-output terms. Once the per-output derivative does not depend on the respective output alone, the resulting Gradient Boosting problem becomes non-trivial.

“Once the per-output derivative does not depend on the respective output alone, the resulting Gradient Boosting problem becomes non-trivial.”

For multi-class problems, we luckily have the necessary algorithms ready in the standard libraries. Fancier setups, however, will likely require a manual implementation at the moment.


Gradient Boosting for probability distributions

One of those fancier setups is predicting the parameters of a conditional probability distribution. Consider a general probabilistic regression setup:
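
$$p(y \mid x)$$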

Our goal here is to predict a probability distribution for target y given input x. In plain linear regression, this typically looks as follows:
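
$$p(y \mid x) = \mathcal{N}\big(y \mid \beta^T x + \beta_0,\; \sigma^2\big)$$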

This conditional probability is simply a Gaussian whose mean depends linearly on the input; the variance is presumed to be constant. The corresponding parameters can easily be estimated via maximum likelihood.

Now, let us replace the linear mean and constant variance terms each with a Gradient Boosting model:
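
$$p(y \mid x) = \mathcal{N}\big(y \mid F_\mu(x),\; F_\sigma(x)^2\big)$$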

In order to optimize our model, we use the negative log-likelihood as the loss function:
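
$$L\big(y, F_\mu(x), F_\sigma(x)\big) = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log F_\sigma(x)^2 + \frac{\big(y - F_\mu(x)\big)^2}{2 F_\sigma(x)^2}$$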

Notice that this is indeed a multi-output problem for two distributional parameters. The target variable itself however is still one-dimensional.

“The conditional Gaussian model is indeed a multi-output problem for two distributional parameters. The target variable itself however is still one-dimensional.”

Finally, we can calculate the necessary derivatives. For the Gradient Boosting model of the mean, we have:
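
$$\frac{\partial \log p(y \mid x)}{\partial F_\mu(x)} = \frac{1}{F_\sigma(x)^2}\big(y - F_\mu(x)\big)$$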

For the Gradient Boosting model of the standard deviation, we have
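
$$\frac{\partial \log p(y \mid x)}{\partial F_\sigma(x)} = -\frac{1}{F_\sigma(x)} + \frac{\big(y - F_\mu(x)\big)^2}{F_\sigma(x)^3}$$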

And that’s it for the theory section. We are now ready to build a POC level implementation.


A quick demonstration

NumPy, scikit-learn, SciPy and Matplotlib offer everything we need for our proof of concept.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from scipy.optimize import minimize

Loss and derivative functions

First, we define the log-likelihood function of a plain normal distribution:

def normal_log_likelihood(y,mu,sigma):
    return -0.5*np.log(2*np.pi) -0.5*np.log(sigma**2) - 1/(2*sigma**2)*(y-mu)**2

In order to properly use scipy.optimize.minimize, we create a wrapper function that can be plugged into the optimizer as a lambda.

Notice that the wrapper returns the negative of the log-likelihood: since we want to maximize the log-likelihood with an algorithm that performs minimization, we have to minimize its negative instead.

def loss_wrapper(y, current_mu_prediction, current_sigma_prediction, mugrad, sigmagrad, mugamma, sigmagamma):
    mu = current_mu_prediction + mugamma*mugrad
    sigma = current_sigma_prediction + sigmagamma*sigmagrad
    
    return -np.sum(normal_log_likelihood(y,mu,sigma))

Finally, we need the derivatives of the loss function:

def normal_mu_gradient(y,mu,sigma):
    #derivative of the log-likelihood with respect to mu
    return 1/sigma**2 * (y-mu)


def normal_sigma_gradient(y,mu,sigma):
    #derivative of the log-likelihood with respect to sigma
    return -1/sigma + 1/(sigma**3)*(y-mu)**2

Sample data

Our data generating process should be non-linear and have non-constant variance. For visualization, we also keep the input data one-dimensional.

A simple process that fulfills these requirements is
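
$$y = 2\sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}\big(0,\, (0.25 + 0.5|x|)^2\big), \qquad x \sim \mathcal{U}(-3, 3)$$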

The choice was completely arbitrary. We only want to see if our model works at this point.

sample_size = 2500

np.random.seed(123)

#training data
X = np.random.uniform(size=(sample_size,1))*6 -3
y = 2*np.sin(X) + np.random.normal(size=(sample_size,1))*(0.25+0.5*np.abs(X))


#simple evaluation data (just a linspace from -3 to 3)
Xline = np.linspace(-3,3,500).reshape(-1,1)
yline = 2*np.sin(Xline)

plt.figure(figsize=(12,8))
plt.scatter(X,y,alpha=0.15,s=5,color="blue", label="Sample")
plt.plot(Xline,yline,color="green",lw=2, label="Data Generating Process")
plt.fill_between(Xline.reshape(-1),
                 2*np.sin(Xline.reshape(-1))-(0.5+np.abs(Xline.reshape(-1))),
                 2*np.sin(Xline.reshape(-1))+(0.5+np.abs(Xline.reshape(-1))),
                 color="green",alpha=0.2)
plt.legend()
plt.grid()

Sample data (blue dots) from the data generating process with non-linear mean (green line) and non-constant variance (green area denotes two standard deviations from the mean)

Running the model

To keep things as simple as possible, we won’t use any Python classes. In case you want to build something on top of this bare-bones implementation, feel free to build wrapper classes.

Our base learners will be simple decision tree stumps; we will use 100 of them.

n_trees = 100

In order to keep track of the training predictions and gammas, we use four numpy arrays altogether. This implies separate gammas for the mean and the standard deviation models. On the one hand, this introduces more risk of overfitting.

On the other hand, the increased flexibility could improve results. This is a trade-off to be considered, and the choice to use separate gammas was arbitrary as well.

We set the initial mean prediction to zero; the initial standard deviation prediction is set to one.

Additionally, we store all training predictions in an N x n_trees matrix and all gammas in a 1 x n_trees matrix. This allows us to simply multiply and sum the respective prediction and gamma columns for the aggregated boosting output.

boosted_mus = np.zeros(shape=(len(X),n_trees))
boosted_sigmas = np.ones(shape=(len(X),n_trees))

boosted_mu_gammas = np.ones(shape=(1,n_trees))
boosted_sigma_gammas = np.ones(shape=(1,n_trees))

The base learners will be stored in lists for later usage.

mu_models = []
sigma_models = []

Now we can run the training loop. As stated earlier, we will use scipy.optimize.minimize here. In order to not bloat this article too much, we use it as a black-box optimizer, i.e. we don't provide any gradient or Hessian information to the function.

In a production-ready implementation we might want to do so, as results could possibly improve.

for i in range(1,n_trees):
    
    # get current predictions for training data
    # => multiply the tree predictions with gamma and sum up; also add the initial prediction
    current_mus = np.sum(boosted_mus[:,:i]*boosted_mu_gammas[:,:i],1)
    current_sigmas = np.sum(boosted_sigmas[:,:i]*boosted_sigma_gammas[:,:i],1)
    
    #calculate derivatives (sign doesn't matter due to learning gamma later on)
    mu_grads = normal_mu_gradient(y.reshape(-1),current_mus,current_sigmas)
    sigma_grads = normal_sigma_gradient(y.reshape(-1),current_mus,current_sigmas)
    
    #Boosting tree for mean model
    mu_tree = DecisionTreeRegressor(max_depth=1).fit(X,mu_grads)
    boosted_mus[:,i] = mu_tree.predict(X)
    mu_models.append(mu_tree)
    
    #Boosting tree for standard deviation model
    sigma_tree = DecisionTreeRegressor(max_depth=1).fit(X,sigma_grads)
    boosted_sigmas[:,i] = sigma_tree.predict(X)
    sigma_models.append(sigma_tree)
       
    #learn optimized gammas - without gradient information, the minimizer will use finite differences to estimate the underlying gradient
    mu_gamma, sigma_gamma = minimize(lambda x: loss_wrapper(y.reshape(-1), current_mus, current_sigmas, boosted_mus[:,i], boosted_sigmas[:,i], x[0], x[1]),
                                     np.zeros(2),method="BFGS").x
    
    boosted_mu_gammas[0,i] = mu_gamma
    boosted_sigma_gammas[0,i] = sigma_gamma
    
    
#log-likelihood after training
final_mus = np.sum(boosted_mus[:,:]*boosted_mu_gammas[:,:],1)
final_sigmas = np.sum(boosted_sigmas[:,:]*boosted_sigma_gammas[:,:],1)

loglik = np.sum(normal_log_likelihood(y.reshape(-1), final_mus, final_sigmas))

print("Final log-likelihood: {}".format(loglik))
Final log-likelihood: -3157.0824954786885

The evaluation process is similar to the training process:

boosted_mus_eval = np.zeros(shape=(len(Xline),n_trees))
boosted_sigmas_eval = np.ones(shape=(len(Xline),n_trees))

for i in range(1,n_trees):
    
    boosted_mus_eval[:,i] = mu_models[i-1].predict(Xline)
    boosted_sigmas_eval[:,i] = sigma_models[i-1].predict(Xline)
    
    
mupred = np.sum(boosted_mus_eval*boosted_mu_gammas[:,:],1)
#take the absolute value since the boosted sigma output is not constrained to be positive
sigmapred = np.abs(np.sum(boosted_sigmas_eval*boosted_sigma_gammas[:,:],1))


plt.figure(figsize=(12,10))

plt.scatter(X,y,alpha=0.15,s=3,color="blue")


plt.plot(Xline,mupred,color="red")
plt.fill_between(Xline.reshape(-1),mupred-sigmapred*2,mupred+sigmapred*2,color="red",alpha=0.2)

plt.plot(Xline,2*np.sin(Xline),color="green")
plt.fill_between(Xline.reshape(-1),2*np.sin(Xline.reshape(-1))-(0.5+np.abs(Xline.reshape(-1))),2*np.sin(Xline.reshape(-1))+(0.5+np.abs(Xline.reshape(-1))),color="green",alpha=0.2)


plt.grid(alpha=0.25)
Predicted mean function (red line) and corresponding variance (red area = 2 standard deviations from mean)

The results for both mean and standard deviation look reasonable. We could likely improve results with more fine-tuning – here are some ideas:

  • Use a more robust optimization procedure: As mentioned earlier, we should ideally provide gradient and Hessian functions to the optimizer. Autograd packages like TensorFlow and PyTorch could do that automatically.
  • Fine-tune the model hyperparameters: The number of trees and their respective depth would be the obvious place to start.
  • Use a more sophisticated boosting algorithm: Our boosting algorithm is pretty bare-bones. There exist many variations and advancements that could easily outperform this implementation.

These considerations would likely help to further improve our results.


Conclusion – what else can we do?

The normal distribution was probably the most obvious choice, but there are many other interesting options. Consider, for example, a skewed version of the Gaussian distribution.

Most data in practice are far from symmetric around the mean. Accounting for such behavior could turn out to be fairly advantageous.

However, the risk of overfitting definitely increases with the number of distributional parameters. Each new parameter in our target distribution means another Gradient Boosting model, after all.

To mitigate that risk, regularization will be necessary. The simplest form of regularization might be decreasing the number of tree stumps (n_trees) in our algorithm. This might already suffice before we even have to consider more sophisticated regularization approaches.

All in all, Gradient Boosting with probabilistic outputs can be fairly helpful when you need to assess the noise in your target variable. I wrote two other pieces here and here about why I believe this is generally a good idea.

“All in all, Gradient Boosting with probabilistic outputs can be fairly helpful in case you need to assess the noise in your target variable.”

Let me know in the comments if you have any questions or feedback in general.
