Introduction
In my last post I introduced a simple linear time-series model using indicator functions for forecasting. In the meantime, I have been experimenting with some other ideas for simple models with good predictive power. One approach that has been on my mind for quite a while now is commonly known to Data Scientists as binning. A Statistician might be more familiar with the term discretization, or reducing the scale level of a random variable from numerical to categorical. An advantage of this technique is the reduction of noise - however, this comes at the cost of losing a considerable amount of information.
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression
Dataset
Since this is another time-series forecasting model, I decided to stick with the passengers dataset from last time. I didn't use the shampoo dataset because it seemed too small for the model to achieve acceptable predictive power.
data = pd.read_csv("passengers.csv",index_col = 0).iloc[:,0]
plt.figure(figsize = (12,6))
plt.plot(data)
test_period = 36
train = data.iloc[:-test_period]
test = data.iloc[-test_period:]
Preprocessing
Unlike the last model, this approach cannot deal with non-stationary data directly, so the trend in the data has to be removed via plain differencing.
trend_removed = train.diff()
plt.figure(figsize = (12,6))
plt.plot(trend_removed)
As is visible in the plot of the differenced time-series, there is still a problem with heteroscedasticity which has to be solved as well. Since the variance of the dataset seems to be increasing over time, I assumed a linearly increasing variance:
$$\sigma^2_X(t)=t\cdot\sigma^2_X(1)$$
If the assumption is correct, a simple division of $X_t$ by $\sqrt{t}$ should solve the problem, since $\mathrm{Var}\left(X_t/\sqrt{t}\right)=\sigma^2_X(t)/t=\sigma^2_X(1)$ is then constant over time:
t_train = np.arange(len(train))
trend_removed = trend_removed / np.sqrt(t_train + 1)
plt.figure(figsize = (12,6))
plt.plot(trend_removed)
The plot still looks a little non-stationary in the first few periods, so I (arbitrarily) removed the first 5 data points as well.
trend_removed = trend_removed.iloc[5:]
Binning the data
To bin the data, $K\in\mathbb{Z}^+$ intervals on $\mathbb{R}$ are defined. Each continuous observation $x_t$ is then replaced by an indicator $x_t^*=k,\quad k\in\{1,\dots,K\}$, where $k$ is the interval that $x_t$ falls into. While there are countless ways to define the intervals, I chose them to be evenly spaced between $\min(x_t)$ and $\max(x_t)$. The number of intervals was chosen arbitrarily again - cross-validation could be applied here (a rough sketch of this is given at the end of the post).
n_bins = 10
bins = np.linspace(trend_removed.min(), trend_removed.max(), n_bins)
binned = np.digitize(trend_removed, bins)
binned_series = pd.Series(binned, index = trend_removed.index)
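As a quick illustration (this toy example is not part of the original pipeline), np.digitize maps each value to the index of the interval it falls into, with 0 reserved for values below the first edge and the number of edges for values at or above the last edge:
# Toy example of np.digitize (illustration only): with edges [0, 1, 2],
# values below the first edge get index 0 and values >= the last edge get index 3.
demo_edges = np.linspace(0, 2, 3)                # array([0., 1., 2.])
demo_values = np.array([-0.5, 0.3, 1.7, 2.0])
print(np.digitize(demo_values, demo_edges))      # [0 1 2 3]
Since every observation in trend_removed lies between its minimum and maximum, the resulting codes above fall into {1, ..., 10}.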
The mean of realizations $x_t$ in each interval is saved in a dictionary in order to map the interval category back to actual realizations.
bin_means = {}
for binn in range(1, n_bins+1):
    bin_means[binn] = trend_removed[binned == binn].mean()
To forecast future realizations, the classic approach of using $S\in\mathbb{Z}^+$ lagged realizations of $x_t^*$ will be applied. The number of lags I chose was $12$, assuming that there is no auto-dependence in the process beyond a horizon of one year.
lagged_list = []
for s in range(13):
    lagged_list.append(binned_series.shift(s))
lagged_frame = pd.concat(lagged_list, axis=1).dropna()
train_X = lagged_frame.iloc[:,1:]
train_y = lagged_frame.iloc[:,0]
Training regressors and regressands now look like this:
train_X.head()
train_y.head()
To map the predicted 'class' back to the bin means from before, I wrote a quick function that takes the predicted class as an input and returns the corresponding mean.
def get_mean_from_class(prediction):
    return bin_means[prediction[0]]
Predictive Model
Since we are now dealing with a categorical variable, Naive Bayes looked like a reasonable and interesting model to try out - especially since there is no need to create dummy variables for the sklearn implementation. Interestingly, Bernoulli Naive Bayes produced nonsensical predictions, even though it would make much more sense to treat the regressors (train_X) as categorical variables. Gaussian Naive Bayes worked much better, probably because there is still a natural order in the regressors - that's why I recommend experimenting with a large variety of models, even if some assumptions are not met perfectly.
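The rest of the post sticks with Gaussian Naive Bayes. As a hedged side experiment (not part of the original pipeline; CategoricalNB requires scikit-learn >= 0.22), the different Naive Bayes variants could be compared on in-sample classification accuracy like this:
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

# Compare a few Naive Bayes variants on in-sample accuracy (illustration only).
# The bin codes are cast back to int because shift() turned them into floats.
X_int = train_X.astype(int)
for candidate in (GaussianNB(), BernoulliNB(), CategoricalNB()):
    candidate.fit(X_int, train_y)
    accuracy = (candidate.predict(X_int) == train_y).mean()
    print(type(candidate).__name__, round(accuracy, 3))
BernoulliNB binarizes the features at 0 by default, so all the positive bin codes collapse to 1, which is one plausible reason for its poor predictions here.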
model = GaussianNB()
model.fit(train_X, train_y)
pred_insample = model.predict(train_X)
pred_insample = pd.DataFrame(pred_insample, index = train_y.index)
resulting_prediction = pd.Series(np.nan, index = train_y.index)
for row in range(len(pred_insample)):
    resulting_prediction.iloc[row] = get_mean_from_class(pred_insample.values[row])
plt.figure(figsize = (12,6))
plt.plot(trend_removed)
plt.plot(resulting_prediction)
The in-sample predictions look good despite the fact that there are actually only 10 possible prediction values.
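A quick sanity check of that statement (not in the original code): since every prediction is mapped to a bin mean, the fitted series can take at most n_bins distinct levels.
# Count the distinct levels in the in-sample fit; at most n_bins are possible.
print(resulting_prediction.nunique(), "distinct in-sample prediction values")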
Out-of-sample forecasts need to be calculated iteratively, since the lagged regressors beyond the training period are themselves predictions.
prediction_frame = pd.DataFrame(np.nan, index = test.index, columns = range(train_X.shape[1]))
predictions = pd.Series(np.nan, index = test.index)
prediction_frame.iloc[0,1:] = train_X.iloc[-1,:-1].values
prediction_frame.iloc[0,0] = train_y.iloc[-1]
for i in range(len(test)):
    pred = model.predict(prediction_frame.iloc[i,:].values.reshape(1,-1))
    pred_num = get_mean_from_class(pred.reshape(-1))
    predictions.iloc[i] = pred_num
    try:
        prediction_frame.iloc[i+1,1:] = prediction_frame.iloc[i,:-1].values
        prediction_frame.iloc[i+1,0] = pred[0]
    except IndexError:
        pass
Finally, the forecast has to be transformed back to the original scale:
trend_test = np.arange(len(train), len(train) + len(test))
final_prediction = predictions.cumsum() * np.sqrt(trend_test + 1) + train.iloc[-1]
Results
plt.figure(figsize = (12,6))
plt.plot(test)
plt.plot(final_prediction)
np.sqrt(np.mean((test-final_prediction)**2))
A pretty good result compared to the accuracy from last time. Nevertheless, there is still room for improvement, especially in terms of reducing the number of free parameters in the "algorithm" - for example, the number of bins could be selected automatically, as this parameter is hard to choose properly. However, this post was just meant as a proof of concept for this approach.
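One way to do this would be to pick the number of bins by time-series cross-validation on the transformed training series. The following is only a rough sketch under the assumptions of this post; the helper name, the candidate grid and the choice of 3 splits are made up for illustration:
from sklearn.model_selection import TimeSeriesSplit

# Rough sketch: grid-search n_bins by walk-forward cross-validation (illustrative only).
def cv_rmse_for_bins(series, n_bins, n_lags=12, n_splits=3):
    edges = np.linspace(series.min(), series.max(), n_bins)
    codes = pd.Series(np.digitize(series, edges), index=series.index)
    means = {k: series[codes == k].mean() for k in range(1, n_bins + 1)}
    frame = pd.concat([codes.shift(s) for s in range(n_lags + 1)], axis=1).dropna()
    X, y = frame.iloc[:, 1:].values, frame.iloc[:, 0].values
    target = series.loc[frame.index].values   # continuous values to score against
    fold_mse = []
    for tr, te in TimeSeriesSplit(n_splits=n_splits).split(X):
        fold_model = GaussianNB().fit(X[tr], y[tr])
        fitted = np.array([means[int(c)] for c in fold_model.predict(X[te])])
        fold_mse.append(np.mean((target[te] - fitted) ** 2))
    return np.sqrt(np.mean(fold_mse))

for candidate_bins in [5, 10, 15, 20]:
    print(candidate_bins, round(cv_rmse_for_bins(trend_removed, candidate_bins), 3))
Whether such a search actually beats the hand-picked value of 10 bins is left open here.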