Regression Uncertainty Estimation with Conformal Prediction
In this note, we estimate regression prediction intervals using several conformal prediction methods. The regression model is a Gaussian Process regressor, and we compare the intervals produced by conformal prediction with those from Gaussian Process regression. Without conformal prediction, it is crucial to accurately estimate the variances of the observation noise and of the predicted mean, which requires optimizing the kernel parameters and the noise variance to maximize the marginal likelihood. With conformal prediction, this requirement disappears.
Gaussian Process Regression
A radial basis or squared exponential kernel is used in the Gaussian Process regression:
\[K(\mathbf{x}_i, \mathbf{x}_j) = \alpha \exp\left({-\frac{\lVert \mathbf{x}_i-\mathbf{x}_j\rVert^2}{2 \ell^2}}\right). \notag\]When there is noise in the observation data, the covariance matrix becomes:
\[\Sigma = K(\mathbf{x}, \mathbf{x}) + \sigma^2 \mathbf{I}, \notag\]where $\mathbf{x}$ is the observation data and $\sigma^2$ is the variance of the noise, and $\mathbf{I}$ is the identity matrix.
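To make the two expressions above concrete, here is a minimal sketch that builds $\Sigma$ with sklearn's RBF kernel; the variable names and values are purely illustrative:
import numpy as np
from sklearn.gaussian_process.kernels import RBF
amp, length_scale, noise_var = 1.0, 1.5, 0.5      # illustrative amplitude, length scale, noise variance
x_demo = np.linspace(0, 10, 5).reshape(-1, 1)     # a few example inputs
# Sigma = alpha * K(x, x) + sigma^2 * I
Sigma = amp*RBF(length_scale=length_scale)(x_demo) + noise_var*np.eye(len(x_demo))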
The kernel hyperparameters $\alpha$, $\ell$ and the noise variance $\sigma^2$ can be optimized by maximizing the marginal likelihood of the observed data. Equivalently, we minimize the negative log-likelihood:
\[L = \log(\det\Sigma) + (\mathbf{y}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{y}-\boldsymbol{\mu}),\]where $\mathbf{y}$ is the observed data and $\boldsymbol{\mu}$ is the corresponding mean, assumed to be $0$ in this study (constant terms and the overall factor of $1/2$ are dropped since they do not affect the optimum).
In this study, we generate data from $y = x \sin x$ and then add Gaussian noise with $\sigma^2=1$.
import numpy as np
# training data
n = 100
noise = 1                                 # standard deviation of the observation noise
rng = np.random.RandomState(1)
X_train = rng.random(size=n)*10           # inputs uniform on [0, 10)
y_train = np.sin(X_train)*X_train + rng.normal(size=n)*noise
X_train = X_train.reshape(-1,1)
# new data
X_test = np.linspace(0, 10, 1000)
y_test = X_test*np.sin(X_test)
Optimizing Model Parameters by Marginal Likelihood Maximization
Next, we find the values of $\alpha$, $\ell$ and $\sigma^2$ that maximize the marginal likelihood:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from numpy.linalg import det, inv
from scipy.optimize import minimize
def nll(θ, x, y):
    # negative log marginal likelihood (up to constants): log(det(Σ)) + y.T @ inv(Σ) @ y
    alpha, ell, sigma = θ                  # kernel amplitude, length scale, noise std
    Σ = alpha*RBF(length_scale=ell)(x) + sigma**2*np.eye(len(x))
    return np.log(det(Σ)) + y @ inv(Σ) @ y
# minimize the per-sample negative log-likelihood with Nelder-Mead, starting from [1, 1, 1]
opt = minimize(lambda theta, x, y: nll(theta, x, y)/len(x), [1, 1, 1],
               method='Nelder-Mead', args=(X_train, y_train))
opt
final_simplex: (array([[64.09115007, 2.08012245, 0.92224139],
[64.09122573, 2.0801224 , 0.92224147],
[64.09108209, 2.08012175, 0.92224139],
[64.09119223, 2.08012239, 0.92224131]]), array([1.25425296, 1.25425296, 1.25425296, 1.25425296]))
fun: 1.2542529573068577
message: 'Optimization terminated successfully.'
nfev: 219
nit: 124
status: 0
success: True
x: array([64.09115007, 2.08012245, 0.92224139])
The optimal values are: $\alpha=64.1$, $\ell=2.08$ and $\sigma=0.922$.
Confidence Intervals for the Mean and for Individual Data Points
The optimized values of $\alpha$, $\ell$ and $\sigma$ are used to build the Gaussian Process regressor in sklearn and to generate predictions and their confidence intervals.
import matplotlib.pyplot as plt
alpha, ell, sig = opt.x
# fixed kernel with the optimized amplitude and length scale; the regressor's
# alpha argument adds the noise variance to the diagonal of the kernel matrix
rbf = ConstantKernel(alpha, constant_value_bounds='fixed')*RBF(ell, length_scale_bounds='fixed')
gp = GaussianProcessRegressor(kernel=rbf, alpha=sig**2)
gp.fit(X_train, y_train)
ypred, spred = gp.predict(X_test.reshape(-1,1), return_std=True)
plt.scatter(X_train,y_train,c='k', label='Training Data')
plt.plot(X_test,y_test, 'C0', label=r"$f=x \sin x$");
plt.plot(X_test, ypred, 'C1', label='Prediction')
plt.fill_between(X_test, ypred-1.96*spred, ypred+1.96*spred, color='C1', alpha=0.25, label='Prediction Intervals of Mean');
plt.legend();
plt.xlabel(r"$x$")
plt.ylabel(r"$f$");
plt.title(rf"$\alpha={alpha:.1f}$; $\ell={ell:.2f}$; $\sigma = {sig:.3f}$");
The confidence interval from the Gaussian Process regressor is the $95\%$ confidence interval of the predicted mean, as shown in Figure 1.
To generate the confidence interval for individual data points, the variances of the mean prediction and the observation noise need to be combined:
\[\sigma_{\text{individual}}=\sqrt{\sigma^2_{\text{mean}}+\sigma^2},\notag\]where $\sigma_{\text{mean}}$ is the standard error of the predicted mean from the Gaussian Process regression and $\sigma$ is the standard deviation of the observation noise, estimated here by maximizing the marginal likelihood. The confidence intervals for individual data points are shown in Figure 2.
s = np.sqrt(spred**2 + sig**2)  # combined standard deviation
plt.scatter(X_train,y_train,c='k', label='Training Data')
plt.plot(X_test,y_test, 'C0', label=r"$f=x \sin x$");
plt.plot(X_test, ypred, 'C1', label='Prediction')
plt.fill_between(X_test, ypred-1.96*s, ypred+1.96*s, color='C1', alpha=0.25, label='Prediction Intervals');
plt.plot(X_test, ypred-1.96*noise, ':', color='gray', label='True Confidence Intervals')
plt.plot(X_test, ypred+1.96*noise, ':', color='gray')
plt.legend();
plt.xlabel(r"$x$")
plt.ylabel(r"$f$");
plt.title(rf"$\alpha={alpha:.1f}$; $\ell={ell:.2f}$; $\sigma = {sig:.3f}$");
Comparing the confidence intervals of the mean (Figure 1) and of individual data points (Figure 2), it is evident that the observation noise dominates in this case. An accurate estimate of the observation noise variance $\sigma^2$ is therefore crucial for the intervals in Figure 2.
Observation Noise Estimate From Regression Residuals
In addition to marginal likelihood optimization, the standard deviation of the observation noise can also be estimated from the regression residuals. The standard deviation of the residuals on the training data gives $\sigma = 0.885$, compared with $\sigma = 0.922$ from likelihood maximization; the true noise level is $\sigma = 1$. The residual method therefore produces a slightly optimistic (narrower) confidence interval, as shown in Figure 3, where the estimated interval (orange band) is narrower than that in Figure 2.
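The exact residual computation is not shown here; the following is a minimal sketch that takes the residuals from the fitted gp regressor above (whether it reproduces $\sigma = 0.885$ exactly depends on the estimator used):
# estimate the noise level from the training residuals of the fitted GP
resid = y_train - gp.predict(X_train)
sig_resid = resid.std(ddof=1)                 # residual-based estimate of sigma
s_resid = np.sqrt(spred**2 + sig_resid**2)    # combined std for individual data points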
Conformal Prediction for Regression
Conformal prediction is a model-agnostic method for confidence interval estimation. The table below lists some key properties of the methods used in this study. In the coverage columns, $\alpha$ is the miscoverage level that sets the target coverage; for example, $\alpha=0.05$ corresponds to a $95\%$ confidence interval. A minimal sketch of the naive method is given after the table.
| Method | Data for Prediction | Data for Conformity Score | Theoretical Coverage | Typical Coverage |
|---|---|---|---|---|
| Naive | Entire training data | Entire training data | No guarantee | $<1-\alpha$ |
| Jackknife | Entire training data | Leave-one-out | No guarantee | $\simeq 1-\alpha$ |
| Jackknife+ | Leave-one-out | Leave-one-out | $\geq 1-2\alpha$ | $\simeq 1-\alpha$ |
| Jackknife-minmax | Leave-one-out | Leave-one-out | $\geq 1-\alpha$ | $>1-\alpha$ |
| CV | Leave-one-fold-out | Leave-one-fold-out | No guarantee | $\simeq 1-\alpha$ |
| CV+ | Leave-one-fold-out | Leave-one-fold-out | $\geq 1-2\alpha$ | $\gtrsim 1-\alpha$ |
| CV-minmax | Leave-one-fold-out | Leave-one-fold-out | $\geq 1-\alpha$ | $> 1-\alpha$ |
| Jackknife+ after bootstrap (J+aB) | Bootstraps of training data | Complement sets of the bootstraps used for training | $\geq 1-2\alpha$ | $\gtrsim 1-\alpha$ |
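As a concrete illustration of the naive method in the first row, here is a minimal sketch for intuition (not MAPIE's implementation); it reuses the gp regressor defined above:
level = 0.05                                    # miscoverage level (alpha in the table)
gp.fit(X_train, y_train)
scores = np.abs(y_train - gp.predict(X_train))  # conformity scores on the training data
q = np.quantile(scores, 1 - level)              # (1 - alpha) quantile = interval half-width
y_hat = gp.predict(X_test.reshape(-1, 1))
naive_lower, naive_upper = y_hat - q, y_hat + q # naive prediction interval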
Various conformal prediction methods from the MAPIE package are used, and the resulting prediction intervals are shown in Figure 4. These intervals are more conservative than the estimates from the Gaussian Process regression shown in Figure 2 and Figure 3. In Figure 4, the Gaussian Process regression uses the kernel parameters and noise variance that maximize the marginal likelihood: $\alpha = 64.1$, $\ell=2.08$ and $\sigma=0.922$.
from mapie.regression import MapieRegressor
from mapie.subsample import Subsample
# Gaussian Process regressor with kernel parameters and noise variance that
# maximize the marginal likelihood
alpha, ell, sig = opt.x
rbf = ConstantKernel(alpha, constant_value_bounds='fixed')*RBF(ell, length_scale_bounds='fixed')
gp = GaussianProcessRegressor(kernel= rbf, alpha=sig**2)
methods = {
    "naive": dict(method="naive"),
    "jackknife": dict(method="base", cv=-1),              # cv=-1: leave-one-out
    "jackknife_plus": dict(method="plus", cv=-1),
    "jackknife_minmax": dict(method="minmax", cv=-1),
    "cv": dict(method="base", cv=10),                     # 10-fold cross-validation
    "cv_plus": dict(method="plus", cv=10),
    "cv_minmax": dict(method="minmax", cv=10),
    "jackknife_plus_ab": dict(method="plus", cv=Subsample(n_resamplings=50))  # jackknife+ after bootstrap
}
y_pred, y_pis = {}, {}
for method, params in methods.items():
    print(f"====={method}=======")
    mapie = MapieRegressor(gp, **params)
    mapie.fit(X_train, y_train)
    # alpha=0.05 requests 95% prediction intervals
    y_pred[method], y_pis[method] = mapie.predict(X_test.reshape(-1,1), alpha=0.05)
fig, axes = plt.subplots(4,2, figsize=(8,14), dpi=300, sharex=True, sharey=True)
for ax, method in zip(axes.flatten(), y_pred.keys()):
    y = y_pred[method]
    s = y_pis[method]
    ax.scatter(X_train, y_train, c='k', s=5, label='Training Data')
    ax.plot(X_test, y_test, 'C0', label=r"$f=x \sin x$")
    ax.plot(X_test, y, 'C1', label='Prediction')
    ax.fill_between(X_test, s[:,0,0], s[:,1,0], color='C1', alpha=0.25, label='Prediction Intervals')
    ax.plot(X_test, y-1.96*noise, ':', color='gray', label='True Confidence Intervals')
    ax.plot(X_test, y+1.96*noise, ':', color='gray')
    ax.set_title(method)
    ax.legend(fontsize=9)
plt.tight_layout();
Advantages of Conformal Prediction
In the preceding sections, Gaussian Process regression with a radial basis kernel is used, and the kernel parameters and observation noise variance are set to the values that maximize the marginal likelihood. When these parameters are not optimal, however, the confidence intervals estimated from the Gaussian Process regression alone become less accurate.
For example, with $\alpha=1$ and $\ell=1.5$ for the kernel and $\sigma=10^{-6}$ for the noise, the predicted mean and its confidence intervals are shown in Figure 5. The confidence intervals for the mean have almost zero width.
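These values can be plugged into the same construction used earlier; a sketch, where the `_bad` names are purely for illustration:
# deliberately non-optimal kernel parameters and a near-zero noise variance
alpha_bad, ell_bad, sig_bad = 1.0, 1.5, 1e-6
rbf_bad = ConstantKernel(alpha_bad, constant_value_bounds='fixed')*RBF(ell_bad, length_scale_bounds='fixed')
gp_bad = GaussianProcessRegressor(kernel=rbf_bad, alpha=sig_bad**2)
gp_bad.fit(X_train, y_train)
ypred_bad, spred_bad = gp_bad.predict(X_test.reshape(-1, 1), return_std=True)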
The observation noise is estimated from the regression residuals, and the confidence intervals for individual data points are calculated accordingly, as shown in Figure 6. It is clear that the estimate is over-optimistic.
With conformal prediction, there is no need to estimate the observation noise variance directly. As shown in Figure 7, all conformal prediction methods generate reasonable confidence intervals that are close to the true intervals.
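Figure 7 can be reproduced by reusing the earlier MAPIE loop with the mis-specified regressor `gp_bad` from the sketch above; shown here for a single method (jackknife+):
# conformal intervals do not require an explicit estimate of the noise variance
mapie_bad = MapieRegressor(gp_bad, method="plus", cv=-1)   # jackknife+
mapie_bad.fit(X_train, y_train)
y_pred_bad, y_pis_bad = mapie_bad.predict(X_test.reshape(-1, 1), alpha=0.05)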
References
Molnar, C. (2023). Introduction To Conformal Prediction With Python. Independently published.
MAPIE - Model Agnostic Prediction Interval Estimator Documentation