We use a simple example to illustrate that causal inference by regression is unreliable in realistic cases where measurement noise is present.

Introduction

Causal inference from observed data offers a cost-effective alternative to randomized control trials. One method, described by Peters et al. (2017) and Molak (2023), regresses one variable on the other and tests whether the regression residuals are independent of the predictor.

Example Illustration

Consider the data generated by Equation ($\ref{eqn:example}$):

\[\begin{array}{rcl} x & \sim & N(0,1) \\ y & = & x^3 + N(0,\sigma_y^2) \end{array} \label{eqn:example}\]

where $N(\mu,\sigma^2)$ denotes a Gaussian random variable with mean $\mu$ and variance $\sigma^2$.

To infer causality, a Generalized Additive Model (GAM) regression of $y$ on $x$ is conducted, and the independence between the regression residual and $x$ is tested. The analysis is then repeated in the opposite direction: regression of $x$ on $y$. The Hilbert-Schmidt Independence Criterion (HSIC) assesses this independence (see this blog post), with a small p-value indicating dependence. The direction whose residual is independent of the predictor is taken as the causal direction.
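
For concreteness, the two-direction test can be sketched in a few lines of R. This is a minimal sketch, assuming the mgcv and dHSIC packages used in the full code below; the random seed is chosen here only to make the example reproducible.

set.seed(1)            # illustrative seed, not part of the original analysis
n <- 100
x <- rnorm(n)          # x ~ N(0, 1)
y <- x^3 + rnorm(n)    # y = x^3 + N(0, 1), i.e. sigma_y = 1

# direction x -> y: regress y on x, test the residual against x
fit_yx <- mgcv::gam(y ~ s(x))
dHSIC::dhsic.test(x, fit_yx$residuals)$p.value   # expected to be large: residual independent of x

# direction y -> x: regress x on y, test the residual against y
fit_xy <- mgcv::gam(x ~ s(y))
dHSIC::dhsic.test(y, fit_xy$residuals)$p.value   # expected to be small: residual depends on y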

Figure 1 displays the result for $100$ data samples, generated by Equation ($\ref{eqn:example}$) with $\sigma_y=1$, confirming that $x$ causes $y$.

Figure 1. Causal inference analysis for data generated by Equation (1) with σy = 1. The scatter plot on the left displays the relationship between x and y. The middle plot shows the residual vs. x of the GAM regression of y on x, indicating independence between the residual and x with the HSIC test p-value being 0.62; hence, x is inferred to cause y. The plot on the right demonstrates the residual vs. y from the GAM regression of x on y. Here, the HSIC test p-value is 0.01, signifying dependence between the residual and y; consequently, y is not considered the cause of x. The inferred causal direction is x → y.

Effect of Noise

Real-world scenarios involve measurement noise in both $x$ and $y$. The impact of noise on causal inference is evaluated using Equation ($\ref{eqn:noise}$) with $f(x)=x^3$ as before, varying $\sigma_x$ while keeping $\sigma_y=1$.

\[\begin{array}{rcl} x_0 & \sim & N(0,1) \\ y_0 & = & f(x_0) \\ x & = & x_0 + N(0,\sigma_x^2) \\ y & = & y_0 + N(0, \sigma_y^2) \end{array} \label{eqn:noise}\]

The following R code performs the calculations. For each specified value of $\sigma_x$, it generates $100$ data points for both $x$ and $y$, then computes the p-values of the HSIC tests assessing independence between the predictors and the regression residuals. This analysis is repeated $100$ times for each value of $\sigma_x$.

regress <- function(func, n, sigma_x, sigma_y){
    # Generate data y = func(x) with Gaussian noise added to both x and y,
    # then regress y on x and x on y, and test the independence between
    # the predictor and the regression residuals with the HSIC test.
    #
    # Inputs:
    #    func: a univariate function
    #    n: number of data samples to be generated
    #    sigma_x: standard deviation of the Gaussian noise added to x
    #    sigma_y: standard deviation of the Gaussian noise added to y
    #
    # Returns a list with the generated data and the HSIC test results
    # for both regression directions.

    x0 <- rnorm(n)
    y0 <- func(x0)
    y <- y0 + rnorm(n)*sigma_y
    x <- x0 + rnorm(n)*sigma_x

    # regression y ~ x
    y_on_x <- mgcv::gam(y~s(x))
    pv.y_on_x <- dHSIC::dhsic.test(x, y_on_x$residuals)
    # regression x ~ y
    x_on_y <- mgcv::gam(x~s(y))
    pv.x_on_y <- dHSIC::dhsic.test(y, x_on_y$residuals)

    return(list(data = tibble::tibble(x=x, y=y),
                pv_yx = pv.y_on_x,
                pv_xy = pv.x_on_y))
}

n_boots <- 100
n_samples <- 100
sigma_y <- 1
p_y_x <- c()
p_x_y <- c()
index <- c()
sigma_x <- c()

for(s in c(0, 0.1, 0.25, 0.5, 0.75, 1)){
    for(i in 1:n_boots){
        r <- regress(function(.x) .x^3, n_samples, s, sigma_y)
        index <- append(index, i)
        p_y_x <- append(p_y_x, r$pv_yx$p.value)
        p_x_y <- append(p_x_y, r$pv_xy$p.value)
        sigma_x <- append(sigma_x, s)
    }
}

result <- tibble::tibble(
    sigma_x = sigma_x,
    index = index,
    p_yx = p_y_x,  # p-value for the y ~ x direction
    p_xy = p_x_y)  # p-value for the x ~ y direction

Figure 2 illustrates the significant influence of noise on causal inference. It depicts the distribution of $\Delta = pv_{y\sim x} - pv_{x\sim y}$, where $pv_{y\sim x}$ is the p-value of the HSIC test between $x$ and the residual of the regression of $y$ on $x$, and $pv_{x\sim y}$ is the p-value of the HSIC test between $y$ and the residual of the regression of $x$ on $y$.

For small $\sigma_x$, $\Delta>0$, suggesting a causal direction of $x \rightarrow y$. Conversely, when $\sigma_x$ is large, $\Delta < 0$, indicating a causal direction of $y\rightarrow x$. Notably, at $\sigma_x = 0.25$, the distribution of $\Delta$ centers around 0, signifying an inability to infer a causal direction clearly.
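
The distribution shown in Figure 2 can be summarized directly from the result tibble built above. The snippet below is a small sketch that assumes the dplyr package (not used elsewhere in this post); the chosen summary statistics are illustrative.

library(dplyr)

# Delta = p_yx - p_xy for each run; summarize per sigma_x
delta_summary <- result %>%
    mutate(delta = p_yx - p_xy) %>%
    group_by(sigma_x) %>%
    summarise(median_delta = median(delta),   # sign indicates the favored direction
              frac_x_to_y = mean(delta > 0))  # fraction of runs favoring x -> y

print(delta_summary)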

Figure 2. Effect of noise on the causal inference. Each panel corresponds to a specific value of σx, with σy held constant at 1 for all cases. Δ > 0 indicates x causing y, while Δ < 0 implies y causing x.

Figures 3 and 4 show examples for $\sigma_x=0.25$ and $\sigma_x=1$, respectively.

Figure 3. The plots mirror those in Figure 1, but with σx set to 0.25 and σy set to 1. In this instance, no causal direction can be inferred.
Figure 4. The plots mirror those in Figure 1, but with σx and σy both set to 1. The inferred causal direction is y → x.

Discussion and Summary

When both $x$ and $y$ contain noise, their relationship is equivalent to a Directed Acyclic Graph (DAG) with a confounder, as depicted in Figure 5. Both $x$ and $y$ are driven by the confounder $x_0$, with independent noise components $N_x$ and $N_y$ added to each, respectively.

Figure 5. Equivalent DAG with a confounder.

Intuitively, if the inverse function $f^{-1}$ exists, the two directions are symmetric and no causal direction between $x$ and $y$ can be distinguished. When $f^{-1}$ does not exist, however, a causal direction from $x$ to $y$ emerges. For example:

\[\begin{array}{rcl} x_0 & \sim & N(0,1) \\ y_0 & = & x_0^2 \\ x & = & x_0 + N(0,\sigma_x^2) \\ y & = & y_0 + N(0, \sigma_y^2) \end{array}\]

In Figure 6, the function that maps $x$ to $y$ is not invertible: $y$ as a function of $x$ yields a single value for each $x$, while $x$ as a function of $y$ takes two values for the same $y$. This suggests a causal direction of $x\rightarrow y$ based on function invertibility.

Figure 6. The function that maps x to y (the red curve) is not invertible, suggesting the causal direction of x → y.
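
As a rough check of this intuition, the regress() function defined above can be reused with the non-invertible function f(x) = x². This is only a sketch: the noise levels σx = σy = 1 are an assumption made here for illustration, since the exact settings behind Figure 6 are not stated.

set.seed(1)   # illustrative seed
r2 <- regress(function(.x) .x^2, 100, sigma_x = 1, sigma_y = 1)
r2$pv_yx$p.value   # HSIC p-value for the residual of y ~ x against x
r2$pv_xy$p.value   # HSIC p-value for the residual of x ~ y against y

A comparatively larger p-value in the y ~ x direction would point to x → y, in line with the invertibility argument above.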

Despite the theoretical interest in causal inference by regression, its robustness is compromised when measurement noise is present, as demonstrated in this study. The causal inference results are significantly influenced by noise, underscoring the need to incorporate knowledge of the data-generating mechanism for more robust causal inference.

References

Peters, J., Janzing, D., Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.

Molak, A. (2023). Causal Inference and Discovery in Python. Packt Publishing.