In this note, we estimate regression prediction intervals using various conformal prediction methods. The regression model is a Gaussian Process regressor, and we compare the intervals produced by conformal prediction with the confidence intervals obtained directly from Gaussian Process regression. Without conformal prediction, accurate intervals require accurate estimates of the observation noise variance and the variance of the predicted mean, which in turn requires optimizing the kernel parameters and the noise variance to maximize the marginal likelihood. With conformal prediction, this requirement disappears.

Gaussian Process Regression

A radial basis or squared exponential kernel is used in the Gaussian Process regression:

\[K(\mathbf{x}_i, \mathbf{x}_j) = \alpha \exp\left(-\frac{\lVert \mathbf{x}_i-\mathbf{x}_j\rVert^2}{2 \ell^2}\right). \notag\]

When there is noise in the observation data, the covariance matrix becomes:

\[\Sigma = K(\mathbf{x}, \mathbf{x}) + \sigma^2 \mathbf{I}, \notag\]

where $\mathbf{x}$ is the observation data, $\sigma^2$ is the variance of the noise, and $\mathbf{I}$ is the identity matrix.

The kernel hyperparameters $\alpha$ and $\ell$ and the noise variance $\sigma^2$ can be optimized by maximizing the marginal likelihood of the observed data. Equivalently, we minimize the negative log-likelihood (dropping an additive constant and an overall factor of two, which do not change the minimizer):

\[L = \log(\mathrm{det}(\Sigma)) + (\mathbf{y}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{y}-\mathbf{\mu}),\]

where $\mathbf{y}$ is the observed data and $\mathbf{\mu}$ is the corresponding mean, which is assumed to be $0$ in this study.

In this study, we generate data from $y = x \sin x$ and then add Gaussian noise with $\sigma^2=1$.

import numpy as np
# training data
n = 100
noise = 1
rng = np.random.RandomState(1)
X_train = rng.random(size=n)*10
y_train = np.sin(X_train)*X_train + rng.normal(size=n)*noise
X_train = X_train.reshape(-1,1)
# new data
X_test = np.linspace(0, 10, 1000)
y_test = X_test*np.sin(X_test)

Optimizing Model Parameters by Marginal Likelihood Maximization

Next, we find the values of $\alpha$, $\ell$ and $\sigma^2$ that maximize the marginal likelihood:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from numpy.linalg import det, inv
from scipy.optimize import minimize

def nll(θ, x, y):
    # negative log-likelihood (up to constants) as a function of the kernel
    # amplitude alpha, length scale ell, and noise standard deviation sigma
    alpha, ell, sigma = θ
    Σ = alpha*RBF(length_scale=ell)(x) + sigma**2*np.eye(len(x))
    return np.log(det(Σ)) + y @ inv(Σ) @ y

# minimize the per-sample negative log-likelihood with Nelder-Mead
opt = minimize(lambda theta, x, y: nll(theta, x, y)/len(x), [1, 1, 1],
               method='Nelder-Mead', args=(X_train, y_train))
opt
final_simplex: (array([[64.09115007,  2.08012245,  0.92224139],
       [64.09122573,  2.0801224 ,  0.92224147],
       [64.09108209,  2.08012175,  0.92224139],
       [64.09119223,  2.08012239,  0.92224131]]), array([1.25425296, 1.25425296, 1.25425296, 1.25425296]))
           fun: 1.2542529573068577
       message: 'Optimization terminated successfully.'
          nfev: 219
           nit: 124
        status: 0
       success: True
             x: array([64.09115007,  2.08012245,  0.92224139])

The optimal values are: $\alpha=64.1$, $\ell=2.08$ and $\sigma=0.922$.

Confidence Intervals for the Mean and for Individual Data Points

The optimized values of $\alpha$, $\ell$ and $\sigma$ are used to build the Gaussian Process regressor in sklearn and to generate predictions and their confidence intervals.

import matplotlib.pyplot as plt

alpha, ell, sig = opt.x
rbf = ConstantKernel(alpha, constant_value_bounds='fixed')*RBF(ell, length_scale_bounds='fixed')
gp = GaussianProcessRegressor(kernel= rbf, alpha=sig**2)
gp.fit(X_train, y_train)

ypred, spred = gp.predict(X_test.reshape(-1,1), return_std=True)

plt.scatter(X_train,y_train,c='k', label='Training Data')
plt.plot(X_test,y_test, 'C0', label=r"$f=x \sin x$");
plt.plot(X_test, ypred, 'C1', label='Prediction')
plt.fill_between(X_test, ypred-1.96*spred, ypred+1.96*spred, color='C1', alpha=0.25, label='Prediction Intervals of Mean');
plt.legend();
plt.xlabel(r"$x$")
plt.ylabel(r"$f$");
plt.title(rf"$\alpha={alpha:.1f}$; $\ell={ell:.2f}$; $\sigma = {sig:.3f}$");

The confidence interval from the Gaussian Process regressor is the $95\%$ confidence interval of the predicted mean, as shown in Figure 1.

Figure 1. Predicted mean and its 95% confidence interval generated by the Gaussian Process regression. The kernel hyperparameters and the noise variance are optimized to maximize the marginal likelihood.

To generate the confidence interval for individual data points, the variances of the mean prediction and the observation noise need to be combined:

\[\sigma_{\text{individual}}=\sqrt{\sigma^2_{\text{mean}}+\sigma^2},\notag\]

where $\sigma_{\text{mean}}$ is the standard deviation of the predicted mean from the Gaussian Process regression, and $\sigma$ is the standard deviation of the observation noise, estimated here by maximizing the marginal likelihood. The confidence intervals for individual data points are shown in Figure 2.

s = np.sqrt(spred**2 + sig**2)  # combined standard deviation
plt.scatter(X_train,y_train,c='k', label='Training Data')
plt.plot(X_test,y_test, 'C0', label=r"$f=x \sin x$");
plt.plot(X_test, ypred, 'C1', label='Prediction')
plt.fill_between(X_test, ypred-1.96*s, ypred+1.96*s, color='C1', alpha=0.25, label='Prediction Intervals');
plt.plot(X_test, ypred-1.96*noise, ':', color='gray', label='True Confidence Intervals')
plt.plot(X_test, ypred+1.96*noise, ':', color='gray')
plt.legend();
plt.xlabel(r"$x$")
plt.ylabel(r"$f$");
plt.title(rf"$\alpha={alpha:.1f}$; $\ell={ell:.2f}$; $\sigma = {sig:.3f}$");
Figure 2. Predicted mean and 95% confidence interval for individual data points generated by the Gaussian Process regression. The true 95% confidence intervals are shown as dashed gray lines. The observation noise standard deviation σ is estimated from marginal likelihood maximization.

Comparing the confidence intervals of the mean (Figure 1) and of individual data points (Figure 2), it is evident that in this case the observation noise dominates. Obtaining an accurate estimate of the observation noise variance $\sigma^2$ is therefore crucial for the intervals in Figure 2.

Observation Noise Estimate From Regression Residuals

In addition to marginal likelihood optimization, another way to estimate the standard deviation of the observation noise is from the residuals of the regression. The standard deviation of the residuals on the training data gives an estimate of $\sigma = 0.885$, compared with $\sigma = 0.922$ from likelihood maximization; the true noise level is $\sigma = 1$. The residual method therefore yields slightly optimistic (narrower) confidence intervals, as shown in Figure 3, where the estimated interval (orange band) is narrower than in Figure 2.
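
The exact code behind this residual-based estimate is not shown above; a minimal sketch, reusing the fitted gp regressor and the predictive standard deviations spred from the previous section (the choice of ddof is an assumption):

resid = y_train - gp.predict(X_train)        # training residuals
sig_resid = np.std(resid, ddof=1)            # residual-based noise estimate (reported as 0.885 above)
s_resid = np.sqrt(spred**2 + sig_resid**2)   # combined std for individual data points, as in Figure 2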

Figure 3. Predicted mean and 95% confidence interval for individual data points generated by the Gaussian Process regression. The true 95% confidence intervals are shown as dashed gray lines. The observation noise standard deviation σ is estimated from the residuals of the regression.

Conformal Prediction for Regression

Conformal prediction is a model-agnostic method for confidence interval estimation. The table below lists some key properties of the methods used in this study. In the coverage columns, $\alpha$ sets the target miscoverage level (not to be confused with the kernel amplitude $\alpha$ above); for example, $\alpha=0.05$ means the confidence interval has $95\%$ coverage.

| Method | Data for Prediction | Data for Conformity Score | Theoretical Coverage | Typical Coverage |
|---|---|---|---|---|
| Naive | Entire training data | Entire training data | No guarantee | $<1-\alpha$ |
| Jackknife | Entire training data | Leave-one-out | No guarantee | $\simeq 1-\alpha$ |
| Jackknife+ | Leave-one-out | Leave-one-out | $\geq 1-2\alpha$ | $\simeq 1-\alpha$ |
| Jackknife-minmax | Leave-one-out | Leave-one-out | $\geq 1-\alpha$ | $>1-\alpha$ |
| CV | Leave one fold out | Leave one fold out | No guarantee | $\simeq 1-\alpha$ |
| CV+ | Leave one fold out | Leave one fold out | $\geq 1-2\alpha$ | $\gtrsim 1-\alpha$ |
| CV-minmax | Leave one fold out | Leave one fold out | $\geq 1-\alpha$ | $> 1-\alpha$ |
| Jackknife-aB+ | Bootstraps of training data | Complement sets of the bootstraps used for training | $\geq 1-2\alpha$ | $\gtrsim 1-\alpha$ |

Various conformal prediction methods in the MAPIE package are used, and the prediction confidence intervals are shown in Figure 4. These intervals are more conservative than the estimation from Gaussian Process regression results shown in Figure 2 and Figure 3. In Figure 4, the Gaussian Process regression uses the kernel parameters and noise variance that maximize the marginal likelihood: $\alpha = 64.1$, $\ell=2.08$ and $\sigma=0.922$.

from mapie.regression import MapieRegressor
from mapie.subsample import Subsample

# Gaussian Process regressor with kernel parameters and noise variance that
# maximize the marginal likelihood
alpha, ell, sig = opt.x
rbf = ConstantKernel(alpha, constant_value_bounds='fixed')*RBF(ell, length_scale_bounds='fixed')
gp = GaussianProcessRegressor(kernel= rbf, alpha=sig**2)

methods = {
    "naive": dict(method="naive"),
    "jackknife": dict(method="base", cv=-1),
    "jackknife_plus": dict(method="plus", cv=-1),
    "jackknife_minmax": dict(method="minmax", cv=-1),
    "cv": dict(method="base", cv=10),
    "cv_plus": dict(method="plus", cv=10),
    "cv_minmax": dict(method="minmax", cv=10),
    "jackknife_plus_ab": dict(method="plus", cv=Subsample(n_resamplings=50))  
}

y_pred, y_pis = {}, {}
for method, params in methods.items():
    print(f"====={method}=======")
    mapie = MapieRegressor(gp, **params)
    mapie.fit(X_train, y_train)
    y_pred[method], y_pis[method] = mapie.predict(X_test.reshape(-1,1), alpha=0.05)
    
fig, axes = plt.subplots(4,2, figsize=(8,14), dpi=300, sharex=True, sharey=True)
for ax, method in zip(axes.flatten(), y_pred.keys()):
    y = y_pred[method]
    s = y_pis[method]
    ax.scatter(X_train, y_train, c='k', s=5,  label='Training Data')
    ax.plot(X_test, y_test, 'C0', label=r"$f=x \sin x$")
    ax.plot(X_test, y, 'C1', label='Prediction')
    ax.fill_between(X_test, s[:,0,0], s[:,1,0], color='C1', alpha=0.25, label='Prediction Intervals');
    ax.plot(X_test, y-1.96*noise, ':', color='gray', label='True Confidence Intervals')
    ax.plot(X_test, y+1.96*noise, ':', color='gray')
    ax.set_title(method)
    ax.legend(fontsize=9)
plt.tight_layout();
Figure 4. 95% prediction confidence intervals generated by various methods of conformal prediction. The true 95% confidence intervals are shown as dashed gray lines.
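
One way to sanity-check the "Typical Coverage" column of the table empirically (this check is not part of the original analysis) is to draw a fresh noisy sample at the test points and count how often it falls inside each method's interval; y_test_noisy is an illustrative name:

# noisy targets at the test points, using the same noise level as the training data
y_test_noisy = y_test + rng.normal(size=len(X_test))*noise
for method in y_pred:
    lower, upper = y_pis[method][:,0,0], y_pis[method][:,1,0]
    coverage = np.mean((y_test_noisy >= lower) & (y_test_noisy <= upper))
    print(f"{method}: empirical coverage = {coverage:.3f}")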

Advantages of Conformal Prediction

In the preceding sections, Gaussian Process regression with a radial basis kernel is used, and the kernel parameters and observation noise variance are set to the values that maximize the marginal likelihood. When these parameters are not optimal, however, the confidence intervals estimated from the Gaussian Process regression alone become less accurate.

For example, with $\alpha=1$ and $\ell=1.5$ for the kernel and $\sigma=10^{-6}$, the predicted mean and its confidence intervals are shown in Figure 5. The confidence intervals for the mean are almost zero.
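
The code used for Figure 5 is not shown; a sketch of how such a deliberately misspecified regressor could be set up (gp_bad and related names are illustrative):

# fixed, non-optimal kernel parameters (alpha = 1, ell = 1.5) and sigma = 1e-6
rbf_bad = ConstantKernel(1.0, constant_value_bounds='fixed')*RBF(1.5, length_scale_bounds='fixed')
gp_bad = GaussianProcessRegressor(kernel=rbf_bad, alpha=(1e-6)**2)  # alpha here is the noise variance
gp_bad.fit(X_train, y_train)
ypred_bad, spred_bad = gp_bad.predict(X_test.reshape(-1,1), return_std=True)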

Figure 5. Predicted mean and its 95% confidence interval generated by the Gaussian Process regression. The kernel hyperparameters and the noise variance do not maximize the marginal likelihood.

The observation noise is estimated from the regression residuals, and the confidence intervals for individual data points are calculated accordingly, as shown in Figure 6. It is clear that the estimate is over-optimistic.
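
A sketch of how the Figure 6 intervals could be computed, combining the residual-based noise estimate with the near-zero predictive standard deviation of the misspecified model (reusing the illustrative gp_bad and spred_bad names from the sketch above):

resid_bad = y_train - gp_bad.predict(X_train)    # residuals of the misspecified fit
sig_bad = np.std(resid_bad, ddof=1)              # residual-based noise estimate
s_bad = np.sqrt(spred_bad**2 + sig_bad**2)       # combined std for individual points, as in Figure 2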

Figure 6. Predicted mean and 95% confidence interval for individual data points generated by the Gaussian Process regression. The observation noise standard deviation σ is estimated from the residuals of the regression. The kernel parameters do not optimize the marginal likelihood.

With conformal prediction, there is no need to estimate the observation noise variance directly. As shown in Figure 7, all conformal prediction methods generate reasonable confidence intervals that are close to the true intervals.
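
As an illustration (not necessarily the exact code behind Figure 7), the MAPIE loop above can simply be rerun with the misspecified regressor; a minimal sketch for the jackknife+ method:

mapie_bad = MapieRegressor(gp_bad, method="plus", cv=-1)  # jackknife+ with the misspecified GP
mapie_bad.fit(X_train, y_train)
y_pred_bad, y_pis_bad = mapie_bad.predict(X_test.reshape(-1,1), alpha=0.05)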

Figure 7. 95% prediction confidence intervals generated by various methods of conformal prediction. The kernel parameters of the Gaussian Process regression are not optimized to maximize the marginal likelihood.

References

Molnar, C. (2023). Introduction To Conformal Prediction With Python. Independently published.

MAPIE - Model Agnostic Prediction Interval Estimator Documentation