Some interesting observations with Distance Correlation coefficients


Introduction

A common problem in multivariate exploratory data-analysis is finding relationships between different variables. While this can often be approached by plotting variables against each other, things quickly become tedious or infeasible when the number of variables goes into the hundreds. In that case, it becomes necessary at some point to switch to quantitative measures of interaction and restrict plotting to a few interesting cases.

Hence, the question arises which quantitative measure of interaction is the 'best' out of a set of many candidates. The commonly known Pearson Correlation coefficient already fails to capture non-linear relationships, and its Spearman and Kendall variants, while handling monotonic non-linearities, still require some assumptions - especially continuity of the variables in question.
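To illustrate the limitation, here is a minimal sketch (not part of the original notebook, and assuming the StatsBase package is available for its corspearman function): for a symmetric, non-monotonic relationship such as $y=x^2$ on $[-1,1]$, both coefficients come out close to zero even though $y$ is fully determined by $x$.

using Random, Statistics
using StatsBase: corspearman   # assumption: StatsBase is installed

Random.seed!(1)
x = rand(1000) .* 2 .- 1    # uniform on [-1, 1]
y = x .^ 2                  # fully determined by x, but not monotonic

cor(x, y)            # Pearson correlation: close to 0
corspearman(x, y)    # Spearman correlation: also close to 0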

An interesting alternative that I personally deem to be very powerful is a fairly new measure called Distance Correlation.

Quick introduction to Distance Correlations

As the name implies, the concept is based on mathematical distances, namely the Euclidean one:

$$d(x,y)=\sqrt{\sum_{i=1}^d(x_i-y_i)^2}$$

For two samples $A:=a_1,\dots,a_n$ and $B:=b_1,\dots,b_n$ (read 'feature vectors'), we need to calculate two distance matrices $D(A), D(B)$ - one for each feature - with elements

$$D_{ij}(X)=d(x_i,x_j)$$

Next, we need to standardize (double-center) both distance matrices: from each element we subtract the corresponding row and column means and add back the grand mean of the matrix:

$$\tilde{D}_{ij}(\cdot)=D_{ij}(\cdot) - \bar{D}_{i\cdot}(\cdot) - \bar{D}_{\cdot j}(\cdot) + \bar{D}_{\cdot\cdot}(\cdot)$$
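As a tiny worked example of this centering step (my own addition): for $n=2$ points at distance $d$ from each other, the distance matrix, its means, and its centered version are

$$D=\begin{pmatrix}0 & d\\ d & 0\end{pmatrix},\qquad \bar{D}_{i\cdot}=\bar{D}_{\cdot j}=\bar{D}_{\cdot\cdot}=\frac{d}{2},\qquad \tilde{D}=\frac{d}{2}\begin{pmatrix}-1 & 1\\ 1 & -1\end{pmatrix}$$

so that, after centering, every row and every column of $\tilde{D}$ sums to zero - a property that holds in general.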

From there, we can calculate distance covariance and variances and finally the distance correlation:

$$dCov(A,B)=\frac{1}{n^2}\sum_{i,j}\tilde{D}_{ij}(A)\cdot\tilde{D}_{ij}(B)$$

$$dVar(A)=\frac{1}{n^2}\sum_{i,j}\tilde{D}_{ij}(A)^2,\quad dVar(B)=\frac{1}{n^2}\sum_{i,j}\tilde{D}_{ij}(B)^2$$

$$dCor(A,B)=\frac{dCov(A,B)}{\sqrt{dVar(A)\cdot dVar(B)}}$$

Notice that $0\leq dCor \leq 1$, with $1$ meaning perfect dependence and $0$ meaning statistical independence. While the original paper on distance correlation derives all results for random real vectors $A,B\subseteq\mathbb{R}^d$ and Euclidean distances, I was quite curious whether Distance Correlation would still produce sensible results with both of these assumptions dropped.

Everything below should be taken with a grain of salt, as the results are only empirical and might not hold in the general case. Nevertheless, with some pragmatism, the approach might add some neat techniques to the data-analysis toolbox.

In [1]:
using Distributions
using Plots
using Random
using LinearAlgebra
using MultivariateStats
In [2]:
# Pairwise distance matrix between the rows of X; defaults to the Euclidean distance.
function calculate_distance_matrix(X, dist_fun = (x,y)->sqrt(sum((x.-y).^2)))
    N = size(X, 1)

    distances = zeros(N, N)

    # the matrix is symmetric, so only the upper triangle needs to be computed
    for row in 1:N
        for col in row:N
            distances[row,col] = distances[col, row] = dist_fun(X[row,:], X[col,:])
        end
    end

    return distances
end

# Double-centering of a distance matrix: subtract row and column means,
# add back the grand mean.
function normalize_distance_matrix(M)
    col_means = mean(M, dims=1)
    row_means = mean(M, dims=2)
    grand_mean = mean(M)

    M .- col_means .- row_means .+ grand_mean
end


# Distance Correlation between samples X and Y (possibly of different dimension),
# following the formulas above.
function calculate_distance_correlation(X, Y,
                                dist_fun_X = (x,y)->sqrt(sum((x.-y).^2)),
                                dist_fun_Y = (x,y)->sqrt(sum((x.-y).^2)))

    N = size(X, 1)

    dist_mat_X = calculate_distance_matrix(X, dist_fun_X)
    dist_mat_Y = calculate_distance_matrix(Y, dist_fun_Y)

    dmX_normalized = normalize_distance_matrix(dist_mat_X)
    dmY_normalized = normalize_distance_matrix(dist_mat_Y)

    distance_covariance = sum(dmX_normalized .* dmY_normalized) / N^2
    distance_var_X = sum(dmX_normalized .^ 2) / N^2
    distance_var_Y = sum(dmY_normalized .^ 2) / N^2

    distance_correlation = distance_covariance / sqrt(distance_var_X * distance_var_Y)
end
Out[2]:
calculate_distance_correlation (generic function with 3 methods)
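Before moving on, here is a quick sanity check of the implementation (my own addition, not from the original notebook) against the properties stated above: a perfectly linear relationship should give a coefficient of $1$, a deterministic non-linear one a clearly positive value, and independent samples a value close to (but, for a finite sample, not exactly) $0$.

# Sanity check: linear, non-linear and independent cases
Random.seed!(42)
a = randn(500)

calculate_distance_correlation(a, 3 .* a .+ 1)    # linear dependence: 1.0 up to rounding
calculate_distance_correlation(a, a .^ 2)         # non-linear dependence: clearly positive
calculate_distance_correlation(a, randn(500))     # independent samples: small, close to 0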

Capturing interaction effects in a feature selection problem for regression

Perhaps one of the most prominent problems in supervised learning: from a set of features or variables $X_i\subseteq\mathbb{R}^d$ we want to predict a target $Y_i\subseteq\mathbb{R}^e$, where often $e=1$. Most of the time, not all features actually have predictive value, and irrelevant ones often reduce the performance of our predictive model. Therefore, we try to select only the features that do exhibit predictive power.

Many measures of predictive power face a major issue - feature interaction: features can look irrelevant on their own but become highly important when combined with other features. The classical correlation measures from before, for example, only look at the marginal relationship of each variable with the target and completely lack a sense of interaction. Oftentimes, this problem can be solved to a satisfactory degree via model-based measures of variable importance, for example the ones derived from models of the Decision Tree family.

Given the somewhat arbitrary results from these methods though - an interaction effect could be relevant in one model but not in another - it always makes sense to look at a multitude of different importance measures. As it turns out, the Distance Correlation can easily be adapted to quantify such interactions.

In the univariate regression case, this seems to be achievable by allowing $A$ and $B$ to be of different dimension while keeping the Euclidean distance on both. That means we could take two or more features from our data-table, calculate their joint distance matrix, do the same for our target, and then calculate the Distance Correlation between the two.

A simple example shows quite promising results:

In [3]:
Random.seed!(123)
X = hcat(rand(1000), rand(1000)) .*2 .- 1
Y = sin.(X[:,1]) .* (X[:,2]).*2

dCor1 = calculate_distance_correlation(X,Y)
dCor2 = calculate_distance_correlation(X[:,1],Y)
dCor3 = calculate_distance_correlation(X[:,2],Y)

plot(scatter(X[:,1], X[:,2], Y[:,1], title="Distance Correlation: " * string(dCor1)[1:6], legend=:topleft,
        label="Data sample"),
    scatter(X[:,1], Y[:,1], title="Distance Correlation: " * string(dCor2)[1:6], legend=:topleft,
        label="Data sample"),
    scatter(X[:,2], Y[:,1], title="Distance Correlation: " * string(dCor3)[1:6], legend=:topleft,
        label="Data sample"),

    size=(900,900))
Out[3]:
[Figure: three scatter plots of the data sample - (X[:,1], X[:,2]) vs. Y (Distance Correlation: 0.1755), X[:,1] vs. Y (Distance Correlation: 0.0401), X[:,2] vs. Y (Distance Correlation: 0.0392)]

In a noise-free setting with both variables being relevant, the Distance Correlation coefficient could be quite helpful. Looking at either variable on its own does not make it appear relevant at all, which is also reflected in its individual Distance Correlation with the target (around 0.04 in both cases). However, once both variables are considered together, the joint coefficient increases significantly (to about 0.18).
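In practice, this suggests a simple (if brute-force) screening procedure: compute the joint Distance Correlation of the target with every pair of features and rank the pairs. The sketch below is my own illustration rather than part of the original analysis; features is assumed to be an $n\times p$ feature matrix and target a length-$n$ target vector, and the loop scales quadratically in the number of features.

# Hypothetical helper: rank all feature pairs by their joint Distance Correlation
# with the target, highest first.
function rank_feature_pairs(features, target)
    p = size(features, 2)
    scores = [(i, j, calculate_distance_correlation(features[:, [i, j]], target))
              for i in 1:p for j in (i+1):p]
    sort(scores, by = s -> s[3], rev = true)
end

rank_feature_pairs(X, Y)    # for the toy data above there is only the single pair (1, 2)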

Another example shows what happens in a noisy setting where one of the variables is completely irrelevant:

In [4]:
Random.seed!(123)
X = hcat(rand(1000), rand(1000)) .*2 .- 1
Y = X[:,1] .^ 2 .+ randn(1000) .* 0.1

dCor1 = calculate_distance_correlation(X,Y)
dCor2 = calculate_distance_correlation(X[:,1],Y)
dCor3 = calculate_distance_correlation(X[:,2],Y)

plot(scatter(X[:,1], X[:,2], Y[:,1], title="Distance Correlation: " * string(dCor1)[1:6], legend=:topleft,
        label="Data sample"),
    scatter(X[:,1], Y[:,1], title="Distance Correlation: " * string(dCor2)[1:6], legend=:topleft,
        label="Data sample"),
    scatter(X[:,2], Y[:,1], title="Distance Correlation: " * string(dCor3)[1:6], legend=:topleft,
        label="Data sample"),
    size = (900,900))
Out[4]:
[Figure: scatter plots for the noisy example - (X[:,1], X[:,2]) vs. Y (Distance Correlation: 0.0951)]