Data Smoothing with P-splines: An Implementation with scikit-learn and PyMC
This note uses P-splines (penalized splines) for data smoothing. Shrinking the differences between the coefficients of the spline basis functions makes the fit smoother. The smoothness control is implemented in two ways: 1) adding the differences between consecutive coefficients as a regularization term to the least-squares minimization in scikit-learn; and 2) modeling the coefficients as a Gaussian random walk in PyMC, a probabilistic programming library.
Spline Fit
A spline fit represents data as a linear combination of piecewise polynomials, or basis splines. Figure 1 shows spline fits to the data with basis functions of various degrees.
B-spline
A B-spline, or basis spline, is a spline function with minimal support for a given degree, smoothness, and domain partition (https://en.wikipedia.org/wiki/B-spline). We use B-spline basis functions in this note.
Python packages like patsy and scikit-learn can generate B-spline basis functions. We use scikit-learn in this note, and the following code generates and plots the cubic B-spline basis functions with 20 knots.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import SplineTransformer

n_knots = 20
degree = 3
splines = SplineTransformer(n_knots=n_knots, degree=degree, extrapolation='continue')
splines.fit(x.reshape(-1, 1))  # calculate knots; x is the 1-D data generated in the Data section below
x_plot = np.linspace(x.min(), x.max(), 1000)
b_splines = splines.transform(x_plot.reshape(-1, 1))  # basis functions

# plot bases: overlaid (left) and vertically offset for visibility (right)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
fig.tight_layout()
shift = 0
offset = 0.7
for b in b_splines.T:
    axes[0].plot(x_plot, b)
    axes[1].plot(x_plot, b + shift)
    shift += offset
At any location, the sum of the basis functions is 1.
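A quick numerical check of this property, reusing the basis values computed above:

row_sums = b_splines.sum(axis=1)  # sum of all basis functions at each plotting location
print(np.allclose(row_sums, 1.0))  # True inside the data range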
Coefficients of Spline Fit
Consider data points $(x_i, y_i)$, $i=1,2, \ldots, m$, where $m$ is the number of data points. The values of the spline basis functions at the data locations $x_i$ are represented by the matrix:
\[B = \left[ \begin{array}{cccc} b_1(x_1) & b_2(x_1) & \cdots & b_n(x_1)\\ b_1(x_2) & b_2(x_2) & \cdots & b_n(x_2)\\ \vdots & \vdots & \ddots & \vdots\\ b_1(x_m) & b_2(x_m) & \cdots & b_n(x_m) \end{array} \right], \notag\]where $b_i(x_j)$ is the $i$-th basis function evaluated at the location $x_j$, and $n$ is the number of spline basis functions.
The spline fit solves the following linear equation:
\[y = B\alpha, \notag\]where
\[y = \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots\\ y_m \end{array}\right], \notag\]and the coefficients
\[\alpha = \left[\begin{array}{c} \alpha_1 \\ \alpha_2 \\ \vdots\\ \alpha_n \end{array}\right]. \notag\]$\alpha$ is solved by minimizing $\lVert y-B\alpha\rVert^2=(y-B\alpha)^T(y-B\alpha)$.
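In code, the unpenalized coefficients can be obtained with an ordinary least-squares solve. A minimal sketch, assuming `x` and `y` are the data arrays (generated in the Data section below) and `splines` is the fitted transformer from above:

B = splines.transform(x.reshape(-1, 1))  # basis matrix at the data locations
alpha, *_ = np.linalg.lstsq(B, y, rcond=None)  # minimize ||y - B @ alpha||^2
y_fit = B @ alpha  # the unpenalized spline fit at the data points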
Control of Smoothness of Spline Fit
Penalized Splines
If the consecutive coefficients $\alpha_k$ and $\alpha_{k+1}$ are close, the spline fit is locally close to a constant, because the sum of the basis functions at any point is 1. Therefore, when consecutive coefficients are close, the spline fit is smooth.
To obtain a smoother spline fit, a penalty on the differences between neighboring coefficients, $\alpha_k$ and $\alpha_{k+1}$, is added to the residual of the spline fit:
\[(y-B\alpha)^T(y-B\alpha) +\lambda \lVert D\alpha\rVert^2, \label{eqn_objective}\]where $\lambda \lVert D\alpha\rVert^2$ is the penalty term, $\lambda$ is the smoothing factor, and $D$ is a discrete difference operator. For example, for $n=4$, the first-order difference operator matrix is
\[D = \left[ \begin{array}{cccc} -1 & 1 & 0 & 0\\ 0 & -1 & 1 & 0\\ 0 & 0 & -1 & 1 \end{array} \right], \notag\]and the second-order difference operator is
\[D = \left[ \begin{array}{cccc} 1 & -2 & 1 & 0\\ 0 & 1 & -2 & 1 \end{array} \right]. \notag\]
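Both operators can be built by differencing an identity matrix with `np.diff`, the same construction used in the fitting code below:

I = np.eye(4)  # n = 4 coefficients
D1 = np.diff(I, n=1, axis=0)  # first-order difference operator, shape (3, 4)
D2 = np.diff(I, n=2, axis=0)  # second-order difference operator, shape (2, 4)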
Minimizing Equation ($\ref{eqn_objective}$) with respect to $\alpha$, by setting its gradient to zero, yields the linear equation to be solved for the penalized spline fit:
\[(B^T B + \lambda D^T D)\alpha = B^Ty. \label{eqn_pspline}\]
Bayesian Regression
Alternatively, the smoothness of the fit is controlled by setting priors on the coefficients in a Bayesian linear regression model.
An Example
We generate one-dimensional data and use P-splines and Bayesian regression to fit the data.
Data
First, let us generate some data and plot them in Figure 3.
import numpy as np
from scipy import stats

n_points = 100
np.random.seed(seed=42)
x = np.linspace(0, 1.8*np.pi, n_points) + 2*stats.norm.rvs(size=n_points)  # noisy predictor locations
y = np.sin(x)*x + 2*stats.norm.rvs(size=n_points)  # noisy response
P-splines
The following function uses Equation ($\ref{eqn_pspline}$) to calculate the coefficients of the spline fit and predicts at new predictor values.
def spline_fit(X, Y, n_knots, degree, lam, x_new):
    """Spline fit from training data, then predict at new data.
    Inputs:
        X: 1D array of x, used for the spline fit
        Y: 1D array of y, used for the spline fit
        n_knots: number of knots
        degree: degree of the spline polynomial
        lam: smoothing parameter
        x_new: 1D array of new x values to predict y for
    Output:
        the predicted values of y for x_new
    """
    from sklearn.preprocessing import SplineTransformer
    from sklearn.linear_model import LinearRegression
    spline = SplineTransformer(n_knots=n_knots,
                               degree=degree,
                               extrapolation='continue')
    bases = spline.fit_transform(X.reshape(-1, 1))
    # second-order difference operator D
    D = np.eye(bases.shape[1])
    D = np.diff(D, n=2, axis=0)
    # solve (B^T B + lam * D^T D) alpha = B^T y
    BtB = np.matmul(bases.T, bases)
    DtD = np.matmul(D.T, D)
    Bty = np.matmul(bases.T, Y)
    alpha = (LinearRegression(fit_intercept=False)
             .fit((BtB + lam*DtD), Bty)
             .coef_)  # least-squares solution of the square linear system
    # prediction at the new locations
    pred_bases = spline.transform(x_new.reshape(-1, 1))
    pred = np.matmul(pred_bases, alpha)
    return pred
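A usage sketch with the data generated above; the choice $\lambda = 10$ anticipates the cross-validation result below:

x_new = np.linspace(x.min(), x.max(), 200)  # prediction grid
y_pred = spline_fit(x, y, n_knots=20, degree=3, lam=10, x_new=x_new)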
Spline fits of varying smoothness, controlled by the parameter $\lambda$, are shown in Figure 4. As $\lambda$ becomes larger, the fit becomes smoother.
Figure 5 shows the data, spline fit, B-spline basis functions, and fit coefficients in the same plot. The spline fit (black line) is the sum of the weighted basis functions (colored lines).
Optimal Smoothness
The parameter $\lambda$ in Equation ($\ref{eqn_objective}$) controls the smoothness of a spline fit. But what is the optimal value of $\lambda$?
Like other regression models, the goodness of the spline fit is measured by how well it predicts unseen data. Therefore, we split the data into training and validation sets, then generate the spline fit with the training data and evaluate the prediction performance with the validation data.
We use 5-fold cross-validation for the data split and RMSE (root mean squared error) as the prediction performance metric.
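A sketch of the search, assuming the `spline_fit` function above; the grid of candidate $\lambda$ values is illustrative:

from sklearn.model_selection import KFold

lambdas = [0.1, 1, 10, 100, 1000]  # candidate smoothing factors
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for lam in lambdas:
    fold_rmse = []
    for train_idx, val_idx in kf.split(x):
        pred = spline_fit(x[train_idx], y[train_idx],
                          n_knots=20, degree=3, lam=lam, x_new=x[val_idx])
        fold_rmse.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))  # RMSE on the validation fold
    print(lam, np.mean(fold_rmse))  # average RMSE over the 5 folds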
Figure 6 shows the average RMSE of the 5-fold cross-validation for various values of $\lambda$. The optimal $\lambda \approx 10$, where the RMSE is smallest.
Bayesian Regression with PyMC
The spline fit is obtained with Bayesian linear regression:
\[y = B\alpha +\mathcal{N}(0, \sigma_y), \notag\]where $B$ is the matrix of the values of the spline bases at the data points $x$.
To restrict the differences among the coefficients $\alpha$, we model the components of $\alpha$ as a Gaussian random walk, so they are correlated and similar:
\[\alpha_i \sim \mathcal{N}(\alpha_{i-1}, \sigma). \notag\]Further, the standard deviation of the random walk, $\sigma$, is controlled by its prior distribution, an exponential distribution in this study. A single parameter $\lambda$ controls the exponential distribution, whose PDF is
\[\lambda \exp(-\lambda x). \notag\]Figure 7 shows the PDF for various values of $\lambda$. A larger $\lambda$ leads to a tighter distribution.
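For reference, SciPy parameterizes the exponential distribution with `scale` $= 1/\lambda$, so the PDF above can be evaluated as:

from scipy import stats
lam = 5
xs = np.linspace(0, 2, 100)
pdf = stats.expon(scale=1/lam).pdf(xs)  # equals lam * exp(-lam * xs)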
The code for spline fit using the PyMC package is listed below.
import pymc as pm

splines = SplineTransformer(n_knots=20, degree=3)  # spline transformer
splines.fit(x.reshape(-1, 1))  # calculate knots
spline_bases = splines.transform(x.reshape(-1, 1))  # spline bases
n = spline_bases.shape[-1]  # number of spline bases
λ = 1
with pm.Model() as model:
    σ = pm.Exponential('σ', λ)  # prior for sigma of the Gaussian random walk
    α = pm.GaussianRandomWalk('α', mu=0, sigma=σ,
                              init_dist=pm.Normal.dist(0, 10),
                              shape=n)  # coefficients
    σ_y = pm.Exponential('σ_y', 1)  # prior for the observation noise
    mu = pm.math.dot(spline_bases, α)  # spline fit
    obs = pm.Normal('obs', mu=mu, sigma=σ_y, observed=y)
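Sampling the model and forming the fit from the posterior of $\alpha$. A minimal sketch; that Figure 8 uses the posterior mean of the coefficients is an assumption:

with model:
    idata = pm.sample()  # draw posterior samples
alpha_mean = idata.posterior['α'].mean(dim=('chain', 'draw')).values  # posterior mean coefficients
y_fit = spline_bases @ alpha_mean  # spline fit at the data locations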
The spline fit result is shown in Figure 8. A larger value of $\lambda$ leads to a tighter distribution of the spline fit coefficients $\alpha$, resulting in a smoother fit.