This article explores how regression can be used to determine the dependence between variables in three types of directed acyclic graphs (DAGs): pipe, confounder, and collider. The theoretical analysis of these graphs can be found in the linked blog post.

Pipe DAG

In the pipe DAG, variables $X$ and $Z$ are independent when conditioned on $Y$; mathematically, this can be expressed as

\[P(X,Z \mid Y) = P(X\mid Y) P(Z\mid Y).\]
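
For reference, this factorization follows in one line from the graph structure. Assuming the pipe is ordered $X \to Y \to Z$ (the ordering used for the simulated data in this section), the joint distribution factorizes as $P(X,Y,Z) = P(X)\,P(Y\mid X)\,P(Z\mid Y)$, so

\[P(X,Z \mid Y) = \frac{P(X)\,P(Y\mid X)\,P(Z\mid Y)}{P(Y)} = P(X\mid Y)\,P(Z\mid Y),\]

where the last step uses Bayes' rule, $P(X)\,P(Y\mid X)/P(Y) = P(X\mid Y)$.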

We verify this using regression by regressing Z on both X and Y. If the dependence of Z on X is flat, then X and Z are independent conditioned on Y.

We demonstrate this using a data set with non-linear dependencies and create three models: linear regression, gradient boosting, and GAM (generalized additive models).

import numpy as np

n = 500
noise = 0.2
X = np.random.randn(n)
Y = X**2 + noise*np.random.rand(n)  # pipe: X -> Y
Z = np.sin(Y/2) + noise*np.random.rand(n)  # pipe: Y -> Z

Figure 1 shows that without conditioning on $Y$, $X$ and $Z$ are dependent with a nonlinear relationship.

Figure 1. Pairwise correlation plot among X, Y, and Z. It is clear that X and Z have a nonlinear and nonmonotonic relationship.

import numpy as np
import matplotlib.pyplot as plt
from alibi.explainers import ALE, plot_ale
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from pygam import LinearGAM, s

# One model per row of the figure: linear regression, gradient boosting, GAM.
# Gradient boosting hyperparameters are not given in the source; defaults assumed.
models = [LinearRegression(),
          GradientBoostingRegressor(),
          LinearGAM(s(0)+s(1), max_iter=1000, tol=1e-7)]
features = np.stack((X, Y), axis=1)  # columns: X, Y
for m in models:
    m.fit(features, Z)  # regress Z on both X and Y

fig, axes = plt.subplots(3, 2, sharey=True, figsize=(6, 10))
for description, m, ax in zip(['Linear Regression',
                               'Gradient Boosting',
                               'GAM'], models, axes):
    ale = ALE(m.predict, feature_names=['X', 'Y'], target_names=['Z'])
    ale_exp = ale.explain(features, min_bin_points=10)
    plot_ale(ale_exp, ax=ax, n_cols=2)  # one row of axes per model
    ax[0].set_title(description)

The ALE (accumulated local effects) plots in Figure 2 show that, once Y is included in the regression, the nonlinear models find no dependence of Z on X. The linear model still shows a dependence of Z on X because it cannot account for the nonlinearity in the data.

Figure 2. ALE plots of Z's dependence on X and Y, for three regression models: linear regression (top), Gradient Boosting Regressor (middle) and GAM (bottom). For the nonlinear regression models X and Z are independent, indicated by the flat line, whereas the linear model shows dependence between X and Z because it is incapable of handling the nonlinearity in the data.
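
The same conclusion can be cross-checked without ALE plots. The sketch below is not the method used above: it swaps the GAM and gradient boosting models for plain polynomial least squares in NumPy and compares in-sample $R^2$ for nested models. The helper names `poly_features` and `r_squared`, the polynomial degrees, the sample size, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility
n = 2000                        # larger than the article's n = 500, for stability
noise = 0.2
X = rng.standard_normal(n)
Y = X**2 + noise * rng.random(n)           # pipe: X -> Y
Z = np.sin(Y / 2) + noise * rng.random(n)  # pipe: Y -> Z

def poly_features(v, degree):
    """Columns v, v**2, ..., v**degree (no intercept)."""
    return np.column_stack([v**k for k in range(1, degree + 1)])

def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()

fx = poly_features(X, 4)  # flexible function of X
fy = poly_features(Y, 5)  # flexible function of Y

r2_x = r_squared(fx, Z)                          # Z on X alone
r2_y = r_squared(fy, Z)                          # Z on Y alone
r2_xy = r_squared(np.column_stack([fy, fx]), Z)  # Z on Y and X together

print(f"R^2, Z ~ f(X):        {r2_x:.3f}")
print(f"R^2, Z ~ g(Y):        {r2_y:.3f}")
print(f"R^2, Z ~ g(Y) + f(X): {r2_xy:.3f}")
```

If the pipe structure holds, X alone should explain a substantial share of the variance in Z, while adding the X terms to a model that already contains Y should change $R^2$ only marginally. That is the regression analogue of a flat ALE curve for X.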

Confounder DAG

In the DAG with a confounder, $X$ and $Y$ are independent when conditioned on $Z$; mathematically, this is

\[P(X,Y \mid Z) = P(X\mid Z) P(Y\mid Z).\]

Again we generate some data. The relationships among $X$, $Y$ and $Z$ are linear.

n = 500
noise = 0.2
Z = np.random.randn(n)  # confounder: Z -> X and Z -> Y
X = 3*Z + noise*np.random.randn(n)
Y = -Z + noise*np.random.rand(n)

Figure 3 shows that without conditioning on the confounder $Z$, $X$ and $Y$ are dependent.

Figure 3. Pairwise correlation plot among X, Y, and Z. It is clear that X and Y have a linear relationship.

We regress $Y$ on $X$ and $Z$. As shown in Figure 4, when conditioned on the confounder $Z$, $X$ and $Y$ are independent.

Figure 4. ALE plots of Y's dependence on X and Z, for three regression models: linear regression (top), Gradient Boosting Regressor (middle) and GAM (bottom). For all regression models X and Y are independent, indicated by the flat line.
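
Because the relationships here are linear, the effect of conditioning on the confounder can also be read directly from ordinary least-squares coefficients. The following is a minimal NumPy sketch rather than the ALE-based analysis above. It regenerates data in the same spirit as the simulation, but with Gaussian noise throughout, a larger sample, and a fixed seed; these choices and the helper `ols` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility
n = 5000                        # larger than the article's n = 500, for stability
noise = 0.2
Z = rng.standard_normal(n)                  # confounder
X = 3 * Z + noise * rng.standard_normal(n)  # Z -> X
Y = -Z + noise * rng.standard_normal(n)     # Z -> Y

def ols(features, target):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

# Without conditioning on Z: Y regressed on X alone.
slope_marginal = ols([X], Y)[1]

# Conditioning on Z: Y regressed on X and Z together.
_, coef_x, coef_z = ols([X, Z], Y)

print(f"Y ~ X:     slope on X = {slope_marginal:.3f}")
print(f"Y ~ X + Z: coef on X = {coef_x:.3f}, coef on Z = {coef_z:.3f}")
```

The slope of Y on X alone should be clearly negative, reflecting the dependence induced by Z. Once Z enters the regression, the coefficient on X should shrink toward zero while the coefficient on Z stays near the generating value of -1.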

Collider DAG

Finally, in the DAG with a collider, $X$ and $Y$ are dependent when conditioned on $Z$; mathematically expressed as

\[P(X,Y \mid Z) \neq P(X\mid Z) P(Y\mid Z).\]

We generate some data with linear relationships among $X$, $Y$ and $Z$.

n = 500
noise = 0.2
X = np.random.randn(n)
Y = np.random.randn(n)
Z = X + 2*Y + noise*np.random.randn(n)

Figure 5 shows that without conditioning on the collider $Z$, $X$ and $Y$ are independent.

Figure 5. Pairwise correlation plot among X, Y, and Z. It is clear that X and Y are independent.

We regress $Y$ on $X$ and $Z$. As shown in Figure 6, when conditioned on the collider $Z$, $X$ and $Y$ are dependent.

Figure 6. ALE plots of Y's dependence on X and Z, for three regression models: linear regression (top), Gradient Boosting Regressor (middle) and GAM (bottom). For all regression models X and Y are dependent when conditioned on the collider Z.
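
The collider effect can likewise be checked with plain least squares, since the simulated relationships are linear. The sketch below is a NumPy-only illustration, not the ALE analysis above; the helper `ols`, the fixed seed, and the larger sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for reproducibility
n = 5000                        # larger than the article's n = 500, for stability
noise = 0.2
X = rng.standard_normal(n)
Y = rng.standard_normal(n)
Z = X + 2 * Y + noise * rng.standard_normal(n)  # collider: X -> Z <- Y

def ols(features, target):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

# Without conditioning on Z: X and Y are generated independently.
corr_xy = np.corrcoef(X, Y)[0, 1]

# Conditioning on Z: Y regressed on X and Z together.
_, coef_x, coef_z = ols([X, Z], Y)

print(f"corr(X, Y)          = {corr_xy:.3f}")
print(f"Y ~ X + Z: coef on X = {coef_x:.3f}, coef on Z = {coef_z:.3f}")
```

The marginal correlation should be near zero, while the coefficient on X in the joint regression should be clearly negative. Rearranging the generating equation to $Y = (Z - X - \text{noise})/2$ suggests a value close to $-0.5$: for a fixed Z, a larger X must be offset by a smaller Y.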

In this note, regression models and ALE plots were used to analyze variable dependence in DAGs, with simulation results aligned with theoretical analysis.