Bayesian harmonic regression and a poor man’s spike-and-slab prior

BayesianHarmonicRegression

Introduction

Often in time-series forecasting, we only have access to small datasets - not because somebody messed up the data collection process but because records beyond a certain point in the past simply don't exist. This makes forecasting such data exceptionally difficult and trying to fit sophisticated deep-learning models will easily turn into an overfitting nightmare.

While $L_{1/2}$ regularization techniques might be helpful, I have become much more convinced of bayesian methods to tackle such problems over the years. My main argument in favor of bayesian modeling is their ability to deal with prior knowledge and the possibility to quantify parameter uncertainty. By parameter uncertainty (I will explain in a future post why I wouldn't use the more popular term 'model uncertainty'), I mean the possibility that another parameter configuration might lead to a similarly good description of our data.

Today, I want to use such a bayesian approach in combination with a harmonic regression model in order to model the employment to population ratio for Germany. This time-series appears to be driven by periodic patterns which also makes sense from an economic perspective where many phenomena are presumed to happen in a cyclic manner. We will therefore use sines and cosines to model such behavior.


Preparing workspace and dataset

At this point, I am mostly working with Julia whose advantages I outlined briefly in a past blogpost. If you haven't tried out Julia yet, I can only recommend to give it a shot - in my opinion, it is definitely worth it.

While there are already a lot of very cool Julia packages for bayesian statistics and machine learning (Turing being one of them), I preferred to implement a variational inference scheme myself using mostly the Flux autodiff toolset:

In [1]:
using Flux
using Zygote
using Plots
using StatsPlots
using Distributions
using DistributionsAD
using DSP
using LinearAlgebra
using Random
using CSV
using DataFrames
using StatsBase
using KernelDensity


As already mentioned, the dataset consists of thirty years of yearly employment to population ratio, a metric that ranges between 0 and 100 percentage points. We'll remove the date axis for this example and only focus on the actual time-series.


In [2]:
df = CSV.File("SLEMPTOTLSPZSDEU.csv") |> DataFrame
y = Matrix(transpose(df[:,2][:,:]))

plot(y[:], legend=:topleft, label="Employment/Population ratio", title = "Raw time-series, dates removed")
Out[2]:


In order to ease working with gradient descent algorithms, let's normalize the data. We won't do any model validation today, so there is no need to be careful about leakage of future information into the training set here.


In [3]:
ym = mean(y)
ys = std(y)

y = (y.-ym)./ys
Out[3]:
1×30 Array{Float64,2}:
 0.653711  0.147278  -0.463609  …  1.41743  1.64175  1.72394  1.71495


Since we are using a harmonic regression model, we need to manually define and create the inputs from the time-axis. The most obvious way to do that is to simply use the sequence from $1$ to $T$, where $T$ is the length of our time-series. I will also create a slightly longer test-series that contains ten additional 'years' in order to inspect out-of-sample forecasts from our model.

In addition, the time steps in the test-series will be smaller in order for the resulting harmonic function plots to look a little smoother. For the training sequence, this wouldn't make sense as we don't observe any values between years.

In [4]:
train_time = Matrix(transpose(collect(1:size(y)[2])[:,:]))
test_time = Matrix(transpose(collect(1:0.1:(10+size(y)[2]))[:,:]))
Out[4]:
1×391 Array{Float64,2}:
 1.0  1.1  1.2  1.3  1.4  1.5  1.6  …  39.5  39.6  39.7  39.8  39.9  40.0


Crafting a customized Bayesian harmonic regression model

Let's start with the fundamental definition of our model. If we take another look at this source, we see that our general model structure should look like this:

$$\hat{y}_t=\mu + \sum_{h=1}^H\alpha_h\cdot sin(2\pi\omega_h t) + \sum_{h=1}^H\beta_h\cdot cos(2\pi\omega_h t)+\epsilon,$$


with the respective righthand side elements as follows:

  • $\mu$ is simply the mean or bias of our model. Even though we normalized our time series to mean zero, the possibility to move the preditictions up and down the y-axis, provides a welcome increase in overall model flexibility.

  • $H$ denotes the amount of frequencies to consider. Since a harmonic model can deal with time on an arbitrary level of precision, we could potentially include as many arbitrarily high frequencies as we like to. However, this does not provide any further value beyond the level of precision available in our training data. If we only have yearly observations at our disposal, it doesn't make much sense to include frequencies for daily fluctuations. The same goes for arbitrarily low frequencies - with 30 years of training data, we shouldn't try to account for a 100 year seasonal pattern.

  • $\alpha_h,\beta_h$ account for the actual regression factors of the harmonic components. Those are the most important quantities we need to estimate or learn respectively.

  • The $\omega_h$ define the actual frequencies under consideration. While we could try to learn the best possible frequencies as well, this is quite difficult in practice from my experience. Hence, we will use a heuristic to choose our set of frequencies.

  • $\epsilon$ is the noise term. We could also craft a valid bayesian model without accounting for data noise, but I highly recommend using a meaningful probabilistic formulation as emphasized in part one and part two of my posts on why probability shouldn't be neglected.

Equipped with the general structure we are now able to concretize each component.

I) Creating the model features through $\omega_h$

For $\omega_h$ and $H$ respectively, I based my approach on the recommendations from the harmonic regression source mentioned before. However, instead of creating $H=T/2$ different $\omega_h$, evenly spaced in the interval $[1/T,0.5]$, I decided to use only $T/2-5$, calculated as follows (assume for the remainder of this section that $T$ is either an even number or else replace $T$ by $T+1$):

  1. Create $\tilde{\omega}_1$ to $\tilde{\omega}_{T/2}$ as $T/2$ evenly spaced points in $[1/T,0.5]$
  2. Set $\omega_1=\tilde{\omega}_1/2$
  3. Set $\omega_{[2:10]}=\tilde{\omega}_{[1:9]}$
  4. Use $\omega_{[1:10]}$ for the model, i.e. $H=10$

This procedure removes the top-5 highest frequencies and adds an additional, lower one. The rationale behind this approach was to account for the rather slow drifts that became apparent in the time-series plot. While we are going to apply some nice Bayesian regularization methods to our model, we can already express our prior belief of non-existent high frequencies by simply removing the respective features.


II) Choosing an appropriate noise term $\epsilon$

Although $y_t$ is constrained to be in $[0,100]$, I decided to still use a zero mean normal distribution for the noise, $\epsilon\sim\mathcal{N}(0,\sigma)$. This can obviously only be an approximation at best, since our model predictions are then defined over $\mathbb{R}$. However, given that our training data only takes on values roughly in $[51,59]$ without large fluctuations, I presumed that the probability of realizations outside $[0,100]$ should be negligible for a well performing model.

The conditional likelihood of our predictions can now be written as

$$p(\hat{y}|t)\sim\mathcal{N}(\mu + \sum_{h=1}^H\alpha_h\cdot sin(2\pi\omega_h t) + \sum_{h=1}^H\beta_h\cdot cos(2\pi\omega_h t),\sigma).$$

If the normal approximation turned out to be problematic, we could easily try a truncated normal distribution or a beta distribution as possible, but less standard alternatives.


III) Poor man's spike-and-slab priors for $\alpha_h,\beta_h$

As we are using a Bayesian model, we need to specify prior distributions for our model parameters. Since the $\alpha_h,\beta_h$ are the most important parameters in our model, let's start with those. A fairly popular method is simply choosing $\mathcal{N}(0,\eta)$ as a prior distribution for all parameters. While this is a valid solution, I found this setting to be somewhat inappropriate here:

We can be relatively certain, that employment to population is only driven by a very limited amount of different seasonal patterns, probably one major and one minor cycle. Hence, the influence of all other frequencies on our forecast should ideally be exactly zero - not just fairly small in absolute terms. In my opinion, this prior assumption isn't expressed properly in those common zero-mean prior distributions but we want to actually exclude irrelevant features from having any effect at all.

Luckily this has been addressed in spike-and-slab priors; see also here for a more technical introduction. For any model parameter $\gamma$, a spike-and-slab prior distribution is defined in a two-stage manner:

$$ \theta\sim\mathcal{Ber}(p)\\ \gamma\sim\left\{ \begin{array}{ll} \mathcal{N}(0,\eta),& \theta=1 \\ \delta(0),& \theta=0 \\ \end{array} \right. $$



Where $\delta(0)$ dentoes a dirac-delta distribution with all probability mass at $0$. Put another way we multiply the draw from bernoulli random variable $\theta$ (spike) with the draw from a zero-mean gaussian variable (slab). Hence, $\gamma$ becomes a discrete-continuous mixture distribution with non-zero probability of being exactly zero. This poses a fairly obvious issue of using variational inference on this problem where we typically work with continuous distributions.

In order to solve this issue we can use a quick-fix continuous relaxation of the discrete bernoulli prior and define our prior as follows:

$$\gamma=s(C\theta)\cdot\kappa$$


with $s(x)=\frac{1}{1+e^{-x}}$ the well-known sigmoid function, $\theta\sim\mathcal{N}(\mu_\theta,\sigma_\theta)$, $\kappa\sim\mathcal{N}(0,\sigma_\kappa)$ and $C$ some positive constant. As $C\rightarrow\infty$, $\gamma$ becomes equivalent to a bernoulli random variable with parameter $p=P(\theta\geq 0)=1-CDF_{\mathcal{N}(\mu_\theta,\sigma_\theta)}(0)$. Hence, we can approximate our spike prior-belief by adjusting the mean of $\theta$ and fixing $C$ to a large value. On the other hand, we need to be careful not to set $C$ too large in order to avoid our gradients vanishing towards zero.

This formulation is obviously not a 100% clean spike-and-slab prior but should be a sufficiently good approximation of roughly the same belief. In addition, we don't need to rely on MCMC sampling tools for estimation but can employ the benefits of variational inference and straightforward stochastic gradient descent.

As concrete priors for this example, I set for all $\alpha_{h}$ (and $\beta_{h}$ accordingly)

$$\theta_{a_h}\sim\mathcal{N}(-1,1)\\ \kappa_{a_h}\sim\mathcal{N}(0,1)\\ $$

and any of the corresponding coefficients is calculated as $\alpha_h(\theta_{\alpha_h}, \kappa_{\alpha_h})=s(C\theta_{\alpha_h})\cdot\kappa_{\alpha_h}$ (respectively, $\beta_h(\theta_{\beta_h}, \kappa_{\beta_h})=s(C\theta_{\beta_h})\cdot\kappa_{\beta_h}$). Also, I set $C=100$ which proved both large enough for reasonable approximation while still preserving meaningful gradients. With the 'spike' prior's mean parameter at $\mu_{\theta_h}=-1$ we have roughly a $1-CDF_{\mathcal{N}(-1,1)}(0)\approx0.15$ prior probability of the corresponding frequency being relevant.

We see that we express our prior belief about the model coefficients only implicitly via $\alpha_h(\cdot,\cdot)$ without calculating their exact distribution. Also, we assume independence among all parameters, which is reasonable unless we had a more substantial clue about the seasonal patterns of our time-series.


IV) Priors for the remaining parameters

As you might have noticed, our time-series model still lacks prior distributions for the model mean parameter $\mu$ and model variance $\sigma$. In order to make our lives easier later on, we use normal distributions once more:

$$\mu\sim\mathcal{N}(0,1)\\ \log(\sigma)\sim\mathcal{N}(0,1).$$

The second prior is again a little hacky as we would ideally express our prior belief directly for standard deviation $\sigma$ itself, e.g. via a lognormal distribution. However with this approach, we end up with all our prior distributions being gaussian which will make variational inference much easier later on. Think of this as a pragmatic but technically rather sloppy approach.


V) Variational inference and ELBO

As I've just mentioned, using all-gaussian, independent priors, variational inference becomes much easier to do. If you need an introduction to variational inference, I recommend Martin Krasser's post on variational inference neural networks which I believe is really accessible and doesn't come with too many formulas to begin with.

We now have our prior distribution $p(\phi)=p(\mu)p(\log(\sigma))\prod_{h=1}^Hp(\theta_{\alpha_h})p(\kappa_{\alpha_h})p(\theta_{\beta_h})p(\kappa_{\beta_h})$ being a product distribution of independent gaussians as described in the previous two paragraphs; equivalently, we can use a $2H+2$-dimensional multivariate gaussian distribution $p(\phi|m_p,\Sigma_p)$ with diagonal covariance matrix. Notice that I use $\phi$ as a shorthand notation for the concatentation of all our model parameters.

Given that our priors are all gaussian, it is sensible to consider gaussian variational distributions here as well. As in additiona the amount of parameters is rather low, I decided to use an actual multivariate variational gaussian distribution $q(\phi|m_q,\Sigma_q)$ with full covariance. That way, dependencies among the posterior marginals can actually be expressed and won't be lost as would be the case with independent variational marginals.

Putting everything so far together, we obtain the following evidence lower bound, ELBO:


$$ELBO=\mathbb{E}_{q(\phi|m_q,\Sigma_q)}\left[\log p(y|t,\phi)\right]-KL\left(q(\phi|m_q,\Sigma_q)||p(\phi|m_p,\Sigma_p)\right)$$



For $\mathbb{E}_{q(\phi|m_q,\Sigma_q)}\left[\log p(y|t,\phi)\right]$ we won't find an easy analytical expression and hence approximate the expected log-likelihood via Monte-Carlo sampling from $q(\phi|m_q,\Sigma_q)$. We can apply the usual reparametrization trick by decomposing $\Sigma_q$ into $L_q^T L_q$. This enables us to draw samples from $q(\phi|m_q,\Sigma_q)$ by sampling from an $2H+2$-dimensional standard multivariate gaussian distribution $G\sim\mathcal{MVN}(0,I)$ and applying transformation $L_q G+m_q\sim\mathcal{MVN(m_q,\Sigma_q)}$. This allows us to calculate the gradients with respect to the variational distribution's parameters.

Finally, we need to calculate the Kullback-Leibler divergence between prior and variational distribution, which is luckily available in closed form, given we are working with two multivariate gaussians (see this post for reference):


$$KL\left(q(\phi|m_q,\Sigma_q)||p(\phi|m_p,\Sigma_p)\right)=\\ \frac{1}{2}\left[2H+2 - log(det(\Sigma_q))+tr(\Sigma_q)+\sum_j^{2H+2}(m_{qj}-m_{pj})^2\right] $$

where $m_{qj}$ denotes the $j$-th element of mean-vector $m_q$. Having unit variance (standard deviation) for the priors apparently allows for heavy simplification of the KL-term.

Thus, our ultimate objective is to maximize the $ELBO$ with respect to $m_q, \Sigma_q$.


Putting things into code

At last, we are ready implement our model and the necessary objectives in Julia code. Let's start with the raw model itself:

In [20]:
mutable struct HarmonicBayes
    
    #put the prior only into the KL-term, we don't necessarily need it here
    
    variational_m
    variational_L #use a decomposed representation of the covariance
    
    Ω #put all ω_j into one vector, remember to exclude it from gradient descent
        
end


function HarmonicBayes(train_time)
    
    T = collect(1:train_time)
    
    Ω = periodogram(T).freq[:][1:10] #DSP.periodogram creates the initial ω_j quite easily
    Ω[1] = Ω[2]/2
    
    variational_m = randn(1,2+4*length(Ω))./train_time #scale variational parameters to avoiv bad local minima with high variance
    
    a = randn(2+4*length(Ω),2+4*length(Ω))./train_time
    
    variational_L = Matrix(cholesky(transpose(a)*a).U) #use cholesky as a nice triangular initialization of L
    
    Ω = Ω[:,:]

    
    HarmonicBayes(variational_m, variational_L, Ω)
end


function (m::HarmonicBayes)(x)
    
    K = size(m.variational_L)[2]
    
    S = transpose(reshape(collect(1:Int(K-2))[:,:],(Int((K-2)/4),4)))
    
    factor_sample =  randn(1,K) * m.variational_L .+ m.variational_m
     
    mu_sin = (σ.(factor_sample[1:1,S[1,:]].*100).*factor_sample[1:1,S[2,:]]) * sin.(2pi.*m.Ω.*x) 
    mu_cos = (σ.(factor_sample[1:1,S[3,:]].*100).*factor_sample[1:1,S[4,:]]) * cos.(2pi.*m.Ω.*x)
    
    sigma = exp(factor_sample[1,end-1]) #re-transform model standard deviation
    bias = factor_sample[1,end]
    
    return bias .+ mu_sin.+mu_cos, sigma
end
In [21]:
function loglikeloss(m::HarmonicBayes, x, y)
    
    mu, sigma = m(x) 
    
    -sum(logpdf.((Normal.(mu[:],sigma)),y[:]))
end



get_diag(diagonal) = diagm(0=>diagonal) #small helperfunction
Zygote.@nograd get_diag


function klloss(m::HarmonicBayes)
    
    d = size(m.variational_L)[2]
    S1 = transpose(m.variational_L)*m.variational_L #
    
    #bulk calculation of all prior means
    prior_means = hcat( -1 .*ones(1,Int((d-2)/4)), zeros(1,Int((d-2)/4)), -1 .*ones(1,Int((d-2)/4)), zeros(1,Int((d-2)/4)), zeros(1,2))
    
    mr = m.variational_m.-prior_means
    
    0.5*(-d-log(det(S1))+tr(S1).+sum(mr.^2))[1] #see KL-formula above
end



function loss(m::HarmonicBayes, x, y,n=20) #'loss' = negative elbo --> minimizing this is equivalent to maximizing the elbo
    
    kloss = klloss(m)
    
    ll_buffer = Zygote.Buffer(zeros(n),n) #need buffer to store gradients over loop
    
    for i in 1:n
        ll_buffer[i] = loglikeloss(m,x,y) #mc sampling
    end
    
    mean(copy(ll_buffer)) + kloss
end
Out[21]:
loss (generic function with 2 methods)


Next comes model training. Notice that I used random seeds a lot in order to make everything fully reproducible.

In [22]:
Random.seed!(123)
model = HarmonicBayes(size(train_time)[2])

params = Flux.params(model.variational_m, model.variational_L)
opt = ADAM(0.005)
Out[22]:
HarmonicBayes([0.03967559603287589 0.06827265692630799 … -0.03749477769645206 -0.011479863770771424], [0.24135946238058875 -7.725313379691138e-5 … -0.029732569657401196 0.040623592300501034; 0.0 0.21004979441336247 … 0.05187613306586608 0.03552381811517941; … ; 0.0 0.0 … 0.020972344043625853 0.06326780490439869; 0.0 0.0 … 0.0 3.517901747649959e-5], [0.016666666666666666; 0.03333333333333333; … ; 0.26666666666666666; 0.3])
In [23]:
Random.seed!(123)
for i in 1:1000
    grads = Zygote.gradient(()->loss(model, train_time,y,50),params)
    Flux.Optimise.update!(opt, params, grads)
    
    if i%100==0
        println(loss(model, train_time,y,25))
    end
end
Out[23]:
ADAM(0.005, (0.9, 0.999), IdDict{Any,Any}())


Inspecting the results

Now, we can enjoy our newly trained bayesian harmonic model and inspect its results. We are looking for the posterior predictive distribution given the parameter posteriors. The posterior predictive is unfortunately intractable as well and we have to rely on sampling once more. Luckily, sampling from our target is very cheap so we can create a fairly large sample in order to obtain a highly reliable estimate of the true posterior predictive distribution.

In [46]:
Random.seed!(123)
pp_samples = []
mu_samples = []


for i in 1:30000

    m, s = model(test_time)
    push!(mu_samples, m.*ys.+ym)
    push!(pp_samples, (randn(1,size(test_time)[2]).*s .+m).*ys.+ym)
    
end

pp_mu = mean(vcat(pp_samples...),dims=1)[:]
pp_std = std(vcat(pp_samples...),dims=1)[:]

mu_mu = mean(vcat(mu_samples...),dims=1)[:]
mu_std = std(vcat(mu_samples...),dims=1)[:]


pp = (Normal.(pp_mu, pp_std)) #calculate some predictive intervals, this 
lower = quantile.(pp,0.025) 
upper = quantile.(pp,0.975)

mup = (Normal.(mu_mu, mu_std))
lowermu = quantile.(mup,0.025)
uppermu = quantile.(mup,0.975)


plot(test_time[:],mean.(pp),label="Aleatoric uncertainty (noise)", title="Posterior predictive distribution and 10 year forecast",
        legend=:topleft;ribbon=(mean.(pp).-lower, upper.-mean.(pp)),color="blue")
plot!(test_time[:],mean.(mup),label="Epistemic uncertainty (parameters)";ribbon=(mean.(mup).-lowermu, uppermu.-mean.(mup)),color="orange")
plot!(test_time[:],mean.(mup),label="Posterior predictive mean",color="green", lw=2)
scatter!(train_time[:],y[:].*ys.+ym,label="Training data",color="red")
Out[46]:


We see that our model produced very reasonable results. The training data lies in a 95% confidence interval and a long-term trend has been identified correctly. The forecast itself looks reasonable up to the point where the forecast intervals are not getting wider over time. Our model is unfortunately not expressive enough to reasonably express an increase in variance and hence uncertainty over time. This is one of the first points of improvement I would consider, as uncertainty should increase the further we are trying to forecast the future.

Initially, I considered modeling the model variance as some gaussian process $GP(m(t),k(t,t'))$, where the mean could have been modeled by our actual mean model. For the covariance/kernel, we could have chosen something stationary, for example a plain Wiener Process. While this would have probably resulted in a more reasonable forecast intervalt, this post could have easily doubled in length. You should nevertheless get the general idea.

Still we can do some more inspection, for example draw a few actual posterior predictive samples:

In [48]:
Random.seed!(123)
plot(test_time[:],model(test_time)[1][:].*ys.+ym,legend=:none, title="Posterior predictive draws")
plot!(test_time[:],model(test_time)[1][:].*ys.+ym)
plot!(test_time[:],model(test_time)[1][:].*ys.+ym)
plot!(test_time[:],model(test_time)[1][:].*ys.+ym)

scatter!(train_time[:],y[:].*ys.+ym,color="red")
Out[48]:


Besides three draws that look fairly similar to the predictive means, there is one sample that seems to better express the shorter-term seasonal patterns that are slightly visible in the training data. Given that we actually put quite some regularization on the parameters in the spike part of the spike-and-slab prior, it appears reasonable that the model turned out to have low posterior probability of this more complex explanation (long-term + short-term instead of long-term only) being relevant.

Indeed, three samples indicate the presence of a long term trend only. Some evidence points to the existence of an additional short-term trend which might have been more probable had we not expressed such a strongly regularizing prior belief. Nevertheless, the short-term seasonality pattern seems to get less prevalent at the end of the training data, so this might actually result in better forecasts anyway.

Given that this post is already quite packed, this concludes our posterior analysis. It might be interesting and fruitful to take a closer look at the respective posterior frequency probabilities and check if the relevant frequencies make sense from a theoretical point of view.


Conclusion

As we have seen, the variational inference toolkit provides us with a lot of powerful tools for flexible bayesian modeling. Personally, I prefer variational inference as an alternative to MCMC methods - besides being easier to use for larger datasets, I find that MCMC algorithms require a lot of finetuning in order to converge to a sensible result. This is of course only my personal experience and somebody who is better trained in applying these methods, might easily tell you the opposite.

Hopefully, this article gave you some inspiration to dive a little deeper into the seemingly endless options that are possible with bayesian priors. Personally, I think that working with better priors yields a lot of potential to improve on small datasets given an ever increasing set of new machine learning models.

Leave Comment

Your email address will not be published. Required fields are marked *