Why probability and uncertainty should be an integral part of Regression models (I)



Thanks to many great developers and modern ML-libaries, training advanced Machine Learning models today can be done as easily as writing a few lines of Python code without having to deal with the mathematical complexity of the underlying algorithms.

This development has helped making the formerly arcance arts of self-learning algorithms available to a much broader range of people from diverse backgrounds and subsequently paved the way to solve many domain specific problems.

Often, it looks like Machine Learning is able to see patterns within seemingly random and noisy data. However personally, I sometimes get the impression that there's a widespread believe that given enough data, Machine Learning will one day be able to eliminate every element of chance from our world.

While there are elements of randomness that can be reduced or sometimes even fully eliminated with more and more data, there are other aspects of stochasticity that will hardly or never go away, regardless of how much data is used to tackle a given modeling task.

Model uncertainty VS. inherent randomness

Typically, the problem of randomness or uncertainty in modern Machine Learning literature (see the introductory paragraphs from here for an overview) is split into uncertainty that stems from the ML model having insufficient or even wrong knowledge about the task at hand on the one side and nondeterminism in the data generating process itself on the other side. As a simple mental model, you can use the following decomposition:


The formula does not necessarily mean that model uncertainty and nondeterminism both add up in an exact or meaningful manner but should merely demonstrate what aspects to consider when dealing with probability in Machine Learning. Although aspects of uncertainty in Machine Learning are typically mentioned in more safety-critical applications like self-driving cars that your Regression models might likely never face, I want to strongly encourage everyone to consider using probabilistic and tools for uncertainty regardless of application context.

For an example of model uncertainty, think of yourself trying to predict the outcome of a soccer match between two foreign teams - say red and blue - that you have never seen or heard of before. Without any further knowledge about the teams, your best guess would be to ascribe both of them a 50-50 chance of winning. However, you could easily update your knowledge by looking up both teams on the internet.

Once you find out about blue competing successfully in their country's major league while red have scored last place in the last four seasons, your best bet would be on blue winnning the game. You have just updated your own personal model of the world!

Nevertheless, unless you are into high-stakes gambling, you would not bet your whole life savings on blue as there isn't a guarantee that blue won't have a very bad day (certain of their victory, they might have been out for far too many drinks the night before the game) or red suddenly rising up from their past soccer slumber. The data generating process, which team will win or lose, is definitely not certain until the referee has ended the game.

This is the aspect of actual nondeterminism: No matter how much data or knowledge you collect, neither you nor any Machine Learning algorithm could predict the outcome with full certainty.

To point out another aspect that lies somewhere in between model uncertainty and data stochasticity, think of the common example of a fair coin flip, say the coin is unbiased and the flip itself is not flawed. Presuming fully deterministic laws of physics, you could argue that the outcome is only random because your knowledge about the world and hence your model is incomplete.

If you knew exactly all variables about the coin flip experiment - size and weight of the coin, force of the flip, position of all particles in the world - you could theoretically predict the outcome of the flip with 100% certainty. However, collecting all relevant data is clearly impossible and at least for the foreseeable future, we won't be able to build a model that can achieve the theoretically possible 100% accuracy in coin-flip forecasting.

Hence, we sometimes need to assume nondeterminism merely because there is no feasible way to get any closer to the ground truth process.

In this post and the next one, I want to focus primarily on data uncertainty in Regression problems and present a few examples that should stress why a probabilistic view on Regression can save you a lot of trouble.

Conditional Variance in Regression tasks

Personally, I deem the typical classification case to be inherently probabilistic by construction in most models. Most commonly used classification models operate on class probabilities as an output and fix the predicted class by running an $argmax$ over said probabilities. In addition, the very nature of discrete probability makes things somewhat nicer than the continuous case.

But this just as a very brief comparison, let's dive into the first example that is concerned with conditional variance.

Although the case of conditional variance, a.k.a. heteroscedasticity, is commonly taught in many introductory econometrics classes, I found it often falling short in an introductory Machine Learning context. Have a look at the following two examples in Julia:

In [9]:
using Plots
using Random
using Suppressor
using LaTeXStrings
using Flux
using Zygote
using Distributions 
using DistributionsAD


x1 = rand(250) .* 2 .- 1
y1 = randn(250)

x2 = rand(250) .* 2 .- 1
y2 = randn(250) .* (1. .+0.5 .*x2).^2

@suppress plot(scatter(x1, y1, ylims=(-6,6), title=L"y\sim\mathcal{N}(0,1)"),
            scatter(x2,y2,ylims=(-6,6),title=L"y\sim\mathcal{N}(0, (1+0.5x)^2)"),
            size = (900,400), legend=:none, xlab="x", ylab="y")

Going the commonly used, non-probabilistic road of predictive Regression modeling, the Mean-Squared-Error(MSE)-best model you could arrive at for this example is

$$\hat{y}=0+0\cdot x=0.$$

Graphically, this would obviously look like this:

In [10]:
@suppress plot(scatter(x1,y1, alpha=0.5,ylims=(-6,6),label=:none,
                scatter(x2,y2, alpha=0.5,ylims=(-6,6),label=:none,
                    title=L"y\sim\mathcal{N}(0, (1+0.5x)^2)"),
                size = (900,400), xlab="x", ylab="y")

plot!(collect(range(-1,1,length=100)),zeros(100), subplot=1, w=3, label="MSE-minimizer", legend=:topleft)
plot!(collect(range(-1,1,length=100)),zeros(100), subplot=2, w=3, label="MSE-minimizer", legend=:topleft)

While both model predictions make sense, the result is somewhat unsatisfactory for the second example. Although we know that the variance of $y$ varies quite a lot for different values of $y$, our model has now way to account for this and we are leaving important information on the table.

Imagine someone asking you for how certain you are about your prediction $\hat{y}_i$ given $X_i$, i.e. how much deviation from your prediction should we expect the actual realization to have? In the first example, your uncertainty would always be the same - in the second example however, you can see from the plot that the smaller $x$ the less deviation from zero we would expect for $y$.

The model we chose however is only able to express the mean of $Y|X$ as a function of $X$, i.e. we have

$$\mathbb{E}\big[Y|X \big]=f(X)$$

Unless there exist features $X=(X^{(1)},X^{(2)},...)$ such that $Y$ conditioned on all those features collapses to a deterministic variable, the mean would be insufficient to describe everything we know about $Y|X$.

Another way to demonstrate this is the commonly seen additive "mean-model plus error" formulation


- notice that this is equivalent to $Y\sim\mathcal{N}(f(x),\sigma^2).$; also in the theoretical sections of this post, I will parametrize the Normal distribution through their mean-variance notation. For all practical parts, I will use the mean-standard-deviation notation to stay consistent with Julia's parametrization. Both notations are more or less interchangeable but keep this in mind when taking a deeper look at the code.

While this formulation is probabilistic and more or less simplistic, what happens with MSE-error model fitting is that we lose the probabilistic aspect. In the end, we are only keeping the mean-term:

$$Y=f(X)\quad\big(=\mathbb{E}\big[Y|X \big]\big)$$

At first, the mean might look much more interesting than the variance term. However as seen above, once the mean becomes uninformative, we should take a further look.

You should also be aware that an uninformative mean does not necessarily mean that your Regression line is flat (or a hyperplane in higher dimensions). It could well be the case that there exists a non-degenerate functional form between your inputs and the mean of your target but due to high variance, the prediction becomes useless to derive any practical value from it even under the best circumstances.

In summary for the above example, there is no information in $X$ about the mean of $Y$ but rather about its variance. Hence, we now need to look for a model of the conditional variance:


Let's start with the Maximum Likelihood formulation of a plain Linear Regression model where $y_i$ is a univariate target and $X_i$ is a single observation of possibly multiple features and a potential intercept, $\beta$ is the typical parameter vector:

$$p(y_i|X_i;\beta, \sigma)=\phi(y_i;X_i\beta,\sigma)$$

where $\phi(x;\mu, \sigma^2)$ the pdf of a Normal distribution with mean $\mu$ and standard deviation $\sigma$ (switching back to mean-standard-deviation from now). By averaging over the log-pdfs we get the typical Maximum Likelihood objective


that we can either solve analytically or via white- or black-box (autodiff) Gradient Descent. In our case however, we have constant mean and varying variance so this objective and hence most prepackaged Linear Regression implementations won't work.

Luckily, with nowadays' modern automatic differentiation technology, we can simply reformulate the model to fit our case via

$$p(y_i|X_i;\mu, \alpha, \delta)=\phi(y_i;\mu,(\alpha + \delta X_i)^2)$$

and let packages like Tensorflow or in this case Zygote do the rest:

In [11]:
#initialize model parameters
param_μ̂ = randn(1)
param_α̂ = randn(1)
param_δ̂ = randn(1)

#define maximum likelihood objective for full dataset
ml_objective(y,X,μ,α,δ) = -mean(logpdf.(Normal.(μ[1],(α[1].+δ[1].*X).^2),y))
call_objective() = ml_objective(y2,x2,param_μ̂,param_α̂,param_δ̂)

params = Flux.params(param_μ̂, param_α̂, param_δ̂)

opt = ADAM(0.5)

#optimize parameters
for i in 1:300
    grads = Zygote.gradient(()->call_objective(), params)
    Flux.Optimise.update!(opt, params, grads)
    if i%60==0
        println("Iteration: $i Current loss: $(call_objective())")
Iteration: 60 Current loss: 2.9095537969865175
Iteration: 120 Current loss: 1.7282593456664699
Iteration: 180 Current loss: 1.3229449298869607
Iteration: 240 Current loss: 1.3224025175528205
Iteration: 300 Current loss: 1.322402047186301
In [12]:
println("μ̂ = $(param_μ̂[1]) - Ground truth: 0.0")
println("α̂ = $(param_α̂[1]) - Ground truth: 1.0")
println("δ̂ = $(param_δ̂[1]) - Ground truth: 0.5")
μ̂ = -0.003276088416309766 - Ground truth: 0.0
α̂ = 1.0057005534022454 - Ground truth: 1.0
δ̂ = 0.49543195911731375 - Ground truth: 0.5
In [13]:
X_interp = collect(-1:0.1:1)
std_interp = param_α̂[1] .+ param_δ̂[1] .* X_interp
mu_interp = repeat(param_μ̂,length(X_interp))

scatter(x2,y2,ylims=(-6,6),title="Predicted conditional distribution", label=:none, legend=:topleft)
plot!(X_interp, mu_interp; ribbon= 2 .*std_interp, w=3, label="Predicted Distribution", color="orange")

Apparently, our approach worked and we were able to closely learn the parameters of the underlying distribution. Now, this primarily worked because we already knew both the functional form and the correct distribution model (Gaussian).

While the latter case is not as easy to tackle in the general case, we can solve the former issue very conveniently by using two Neural Networks, $\psi_\mu$ and $\psi_\sigma$. As their naming implies, $\psi_\mu$ will provide us the mean and $\psi_\sigma$ the standard deviation of the Normal Distribution:

$$p(y_i|X_i;\psi_\mu, \psi_\sigma)=\phi(y_i;\psi_\mu(X_i), \psi_\sigma(X_i))$$

Of course we need to ensure that the output of $\psi_\sigma(\cdot)$ is again strictly positive to avoid computational issues. Also, we might need more data compared to the last example in order for the two models to learn the underlying process. This is due to the inductive bias of our model being much less restrictive and hence the models, particularly the variance model, could suffer from severe overfitting.

Let's create a more complex example and try to solve it with our approach:

In [14]:

#sampling double the data to avoid overfitting the networks; this is particularly challenging with 
#highly flexible distributional Regression models
X = rand(1,500) .* 6 .- 3
X_linspace = hcat(collect(-3:0.01:3)...)

#create some arbitrary nonlinear data
mu_generator(x) = 4*sin(0.5*x^2)
sigma_generator(x) = 1+4*(sin(0.5*x))^2

y = randn(1,500) .* sigma_generator.(X) .+ mu_generator.(X)
mu_linspace = mu_generator.(X_linspace)
sigma_linspace = sigma_generator.(X_linspace)

scatter(X[:],y[:], label=:none)
plot!(X_linspace[:], mu_linspace[:]; ribbon = 2 .*sigma_linspace[:], w=3,color="blue",
        label="Ground truth distribution", legend=:bottom)
In [15]:
ψ_μ = Chain(Dense(1,10,selu), Dense(10,10,selu), Dense(10,1))
ψ_σ = Chain(Dense(1,10,selu), Dense(10,10,selu), Dense(10,1), x->selu.(x).-selu(-Inf))

#define maximum likelihood objective for full dataset
ml_objective(y,X) = -mean(logpdf.(Normal.(ψ_μ(X)[:], ψ_σ(X)[:]),y[:]))
call_objective(y, X) = ml_objective(y, X)

params = Flux.params(ψ_μ, ψ_σ)

opt = ADAM(0.05)

Zygote.@nograd rand
#increase the training epochs; use batch training
for i in 1:1000
    samp = rand(1:500,200)
    Xbatch = X[:,samp]
    ybatch = y[:,samp]
    grads = Zygote.gradient(()->call_objective(ybatch, Xbatch), params)
    Flux.Optimise.update!(opt, params, grads)
    if i%200==0
        println("Iteration: $i Current loss: $(call_objective())")
Iteration: 200 Current loss: 1.322402047186301
Iteration: 400 Current loss: 1.322402047186301
Iteration: 600 Current loss: 1.322402047186301
Iteration: 800 Current loss: 1.322402047186301
Iteration: 1000 Current loss: 1.322402047186301
In [16]:
mu_predicted = ψ_μ(X_linspace)
sigma_predicted = ψ_σ(X_linspace)

scatter(X[:],y[:], alpha=0.5, label=:none, legend=:bottom, title="Predicted conditional distribution")
plot!(X_linspace[:], mu_linspace[:]; ribbon = 2 .*sigma_linspace[:], w=3,
        color = "blue", label="Ground truth distribution")
plot!(X_linspace[:], mu_predicted[:]; ribbon = 2 .*sigma_predicted[:], w=3, label="Predicted distribution",