Why probability and uncertainty should be an integral part of regression models (II)



In the first part of this series on probabilistic machine learning, we were focusing on important information in the second moment of the target variable in a regression setting. Speaking more pragmatically, we could improve our model by looking at the conditional variance, a.k.a heteroskedasticity.

This required us to adjust our optimization criterion and use Maximum Likelihood instead of well known mean-squared or median-absolute error objectives. Today, we will look at so-called multimodal problems where the probabilistic paradigm can help us again with finding a solution that might be superior to the deterministic alternative.

A quick primer on multimodal data and mixture distributions

The challenge of multimodal data is quite well known in the statistics community but still - in my perception - not that much present in industrial-grade data science. While there certainly exist more severe issues than multimodality, it might nevertheless become a bigger roadblock in specific datasets. As an introductory example, we will start with a simple Gaussian distribution:

In [80]:
using Zygote
using Flux
using DistributionsAD
using LinearAlgebra
using Distributions
using Plots
using StatsPlots
using LaTeXStrings
using Random
In [6]:
plot(collect(-4.5:0.1:4.5),pdf.(Normal(),collect(-4.5:0.1:4.5)), label = "Population distribution")
density!(randn(2500), label="Sample distribution")

As you can see, we only have one mode (distributional peak/local density maximum) for the well-behaved Gaussian case. Imagine this was a sample of a quantity, say $Q$, from individuals from some group A. Now, let us introduce a second group B for which we sample the same quantity but from a different Normal distribution:

In [7]:
plot(collect(-4.5:0.1:4.5),pdf.(Normal(3,1),collect(-4.5:0.1:4.5)), label = "Population distribution - Group B", legend=:topleft)
density!(randn(2500).+3, label="Sample distribution - Group B")
In [8]:
plot(collect(-7.5:0.1:7.5),pdf.(Normal(),collect(-7.5:0.1:7.5)), label = "Population distribution - Group A", legend=:topleft)
plot!(collect(-7.5:0.1:7.5),pdf.(Normal(3,1),collect(-7.5:0.1:7.5)), label = "Population distribution - Group B", legend=:topleft)

Now suppose the data-generating process is as follows:

  1. Randomly select either group A or group B with probabilities $p_A$ and $p_B=1-p_A$
  2. Sample $Q$ from the selected group.

Equivalently, we could sample $Q$ from the union of both populations aggregated into a single population of size $N$ with $p_A\cdot N$ individuals from group A and $p_B\cdot N$ individuals from group B.

This data generating process is easily simulated - we take $p_A=0.6,p_B=0.4$:

In [9]:
population_draw = rand(2500)

a_draw = randn(2500)[population_draw.<=0.6]
b_draw = (randn(2500).+3)[population_draw.>0.6]

full_draw = vcat(a_draw,b_draw)


As you can see, the resulting sample distribution has two local maxima - modes - and looks like a mixture of the two Normal distributions. Not suprisingly, it is also called a mixture of normal distributions. Their probability density function is quite convenient:

$$p_{AB-Mixture}(x)=\pi_A\cdot p_{\mathcal{N}_A}(x)+\pi_B\cdot p_{\mathcal{N}_B}(x)$$

where $p$ denotes the respective density functions. We can obviously extend this to mixtures with arbitrarily many - say $C$-many - Gaussian components resulting in the density:

$$p_{\mathcal{N}-mixture}(x)=\sum_{c=1}^C \pi_c\cdot p_{\mathcal{N}_c}(x),\quad \sum_{c=1}^C \pi_c=1,\,\pi_c\geq0$$

Since we can easily express the density function in closed form, we can also use maximum likelihood for parameter estimation. While there exists a problem with identifiability of normal mixtures that should be mentioned, we won't bother with it here. However, I'd suggest anyone who is interested to try this method out in practice to read up on this issue, for example here.

Mixture models and switching regression

Following the above, it becomes quite easy to extend the idea of mixtures of distributions to mixtures of regression and classification models. Take for example a typical Gaussian linear regression model in its distributional representation:


where $p_{\mathcal{N}(x\beta,\sigma^2)}(y)$ is the probability density of a Normal distribution with mean $x\beta$ and standard deviation $\sigma$:


This makes it straightforward to define a mixture of $C$ Gaussian linear regression models through

$$p_{LM-mixture}(y|x;[\beta_C],[\sigma_C])=\sum_{c=1}^C\pi_c\cdot p_{\mathcal{N}(x\beta_c,\sigma_c)}(y)$$

where $[\beta_C]=\beta_1,...,\beta_C$ is simply used as an abbreviation of the series of parameter vectors for every mixture component (same goes for $[\sigma_C]$). Let's create a simple example dataset consisting of $C=2$ components:

In [67]:

X = rand(250) .* 4 .- 2

pi1 = 0.3
pi2 = 0.7

a1 = 3. 
b1 = 3.
sigma1 = 0.5

a2 = -3. 
b2 = 0.5
sigma2 = 1.

y = [rand(MixtureModel([Normal(a1+b1*X[i],sigma1),Normal.(a2+b2*X[i],sigma2)],[pi1, pi2])) for i in 1:250]

scatter(X,y, legend=:none)

Notice that the parameter vector $\beta$ is split into $a$ (intercept) and $b$ (coefficient) in order to make life a little easier by avoiding the addition of an additional one-vector to $X$.

Technically, we could still fit a model consisting of two linear models, $m_1$ and $m_2$, in a non-probabilistic way to this problem through an adjusted loss function:


The loss of the aggregate model equals the minimal loss generated by the best fitting model at any datapoint. In Julia, this looks as follows:

In [68]:
model = Dense(1,2) #a two output-neuron linear neural network is equivalent to two separate linear regression models

loss(x,y) = mean(minimum((model(x).-y).^2, dims=1)) #we use the mse for l(yhat,y)

params = Flux.params(model)
opt = ADAM(0.05)

for i in 1:200
    grads = Zygote.gradient(()->loss(transpose(X),transpose(y)), params)
    Flux.Optimise.update!(opt, params, grads)
    if i%25==0
In [69]:
scatter(X,y, legend=:topleft,label="Data",alpha=0.3)

plot!(collect(-2:0.1:2), model(transpose(collect(-2:0.1:2)))[1,:],lw=3, label="Model 1")
plot!(collect(-2:0.1:2), model(transpose(collect(-2:0.1:2)))[2,:],lw=3, label="Model 2")

Once both models have been fit, we can easily estimate the mixture probabilities ($\pi_1$ and $\pi_2$) by assigning each training datapoint to the model with lowest training error.

In [87]:
model_match = map(x->x[1],argmin((model(transpose(X)).-transpose(y)).^2,dims=1))[:]

pi_hat_1 = mean(model_match.==1)
pi_hat_2 = mean(model_match.==2)

println("Estimate π1: $pi_hat_1")
println("Estimate π2: $pi_hat_2")
Estimate π1: 0.308
Estimate π2: 0.692

However, this method becomes infeasible once the $\pi_i$ aren't static anymore but a direct function of $x$, for example $\pi_1=\sigma(x\gamma)$, with $\sigma(\cdot)$ being a sigmoid function:

In [96]:

X = rand(250) .* 4 .- 2

gamma = 0.75

a1 = 3. 
b1 = 3.
sigma1 = 0.5

a2 = -3. 
b2 = 0.5
sigma2 = 1.

y = [rand(MixtureModel([Normal(a1+b1*X[i],sigma1),Normal.(a2+b2*X[i],sigma2)],[σ(gamma*X[i]),1-σ(gamma*X[i])])) for i in 1:250]

scatter(X,y, legend=:none)

Notice how the scatter of the lower model's datapoints becomes much sparser for increasing values of $x$ and vice-versa for the upper model. At this point, we definitely need a probabilistic solution to this estimation problem - at least if we are interested in the probabilities of either model being 'picked' at a given datapoint. This can obviously be important in many practical cases. A maximum likelihood estimation for this can be implemented quite nicely in Julia:

In [134]:

model2 = Dense(1,2)

sigma_hat_1 = randn(1,1) #random initialization for all parameters
sigma_hat_2 = randn(1,1)

gamma_hat = randn(1,1)

function prob_mixture_loss(x,y)
    N = size(x)[2]
    lpdf_buffer = Zygote.Buffer(zeros(N),N)
    for i in 1:N
        mean_pred = model2(x[:,i][:,:])[:]
        component1 = Normal(mean_pred[1],abs(sigma_hat_1[1]))
        component2 = Normal(mean_pred[2],abs(sigma_hat_2[1]))
        pi_hat1 = σ(gamma_hat[1]*x[1,i])
        pi_hat2 = 1 - pi_hat1
        lpdf_buffer[i] = log(pi_hat1*pdf(component1, y[1,i])+pi_hat2*pdf(component2, y[1,i]))
    return -mean(copy(lpdf_buffer))

params = Flux.params(model2, sigma_hat_1, sigma_hat_2, gamma_hat)
opt = ADAM(0.1)

for i in 1:500 #bigger model, let it run a little longer
    grads = Zygote.gradient(()->prob_mixture_loss(transpose(X),transpose(y)), params)
    Flux.Optimise.update!(opt, params, grads)
    if i%100==0

Checking the $\gamma$-parameter that defines the change in component probabilities as $x$ is increasing, we see that the maximum likelihood estimation was successful:

In [149]:
println("True γ: $gamma")
println("Estimate for γ: $gamma_hat")
True γ: 0.75
Estimate for γ: [0.7497093072070276]

For the rest of the parameters, we can do another plot:

In [150]:
scatter(X,y, legend=:topleft,label="Data",alpha=0.3)

plot!(collect(-2:0.1:2), model2(transpose(collect(-2:0.1:2)))[1,:],lw=3, label="Model 1"; ribbon=2*sigma_hat_1)
plot!(collect(-2:0.1:2), model2(transpose(collect(-2:0.1:2)))[2,:],lw=3, label="Model 2"; ribbon=2*sigma_hat_2)

Looking good!

For real-world datasets, things might obviously not go as smoothly as when playing around with toy data. Mixture-models might actually be overkill or less efficient for unimodal data. As is often the case, a check of model residuals could give some clues if such a complex approach is necessary or not.


Hopefully, this post gave you another reason to take a deeper look into the world of the statistical modeling and learning. While there is certainly no silver-bullet that is appropriate for all possible modeling problems, the probabilistic approach can be a very powerful method in your toolbox. Apart from that, there are still more arguments in favor of probabilistic models left to demontrate which I will definitely do at some point in the future.

Leave Comment

Your email address will not be published. Required fields are marked *