Data Smoothing with P-splines: An Implementation with scikit-learn and PyMC
This note uses P-splines (penalized splines) for data smoothing. Shrinking the differences between the coefficients of the spline basis functions makes the fit smoother. The smoothness control is implemented in two ways: 1) adding the differences between consecutive coefficients as a regularization term to the least-squares minimization in scikit-learn; and 2) modeling the coefficients as a Gaussian random walk in PyMC, a probabilistic programming library.
Spline Fit
A spline fit represents data as a linear combination of piecewise polynomials, or basis splines. Figure 1 shows spline fits to the data with basis functions of various degrees.
B-spline
A B-spline, or basis spline, is a spline function with minimal support for a given degree, smoothness, and domain partition (https://en.wikipedia.org/wiki/B-spline). We use B-spline basis functions in this note.
Python packages like patsy and scikit-learn can generate B-spline basis functions. We use scikit-learn in this note, and the following code generates and plots the cubic B-spline basis functions with 20 knots.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import SplineTransformer

n_knots = 20
degree = 3
splines = SplineTransformer(n_knots=n_knots, degree=degree, extrapolation='continue')
splines.fit(x.reshape(-1, 1))  # calculate knots; x is the 1-D data generated in the Data section below
x_plot = np.linspace(x.min(), x.max(), 1000)
b_splines = splines.transform(x_plot.reshape(-1, 1))  # basis functions

# plot bases: overlaid (left) and vertically offset for visibility (right)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
fig.tight_layout()
shift = 0
offset = 0.7
for b in b_splines.T:
    axes[0].plot(x_plot, b)
    axes[1].plot(x_plot, b + shift)
    shift += offset
At any location, the sum of the basis functions is 1.
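A quick numerical check of this property, reusing the basis values computed above:

row_sums = b_splines.sum(axis=1)  # sum of all basis functions at each plotting location
print(np.allclose(row_sums, 1.0))  # True inside the data range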
Coefficients of Spline Fit
Consider data points $(x_i, y_i)$, $i=1,2, \ldots, m$, where $m$ is the number of data points. The values of the spline basis functions at the data locations $x_i$ are represented by the matrix:
\[B = \left[ \begin{array}{cccc} b_1(x_1) & b_2(x_1) & \cdots & b_n(x_1)\\ b_1(x_2) & b_2(x_2) & \cdots & b_n(x_2)\\ \vdots & \vdots & \ddots & \vdots\\ b_1(x_m) & b_2(x_m) & \cdots & b_n(x_m) \end{array} \right], \notag\]where $b_i(x_j)$ is the $i$-th basis function evaluated at the location $x_j$, and $n$ is the number of spline basis functions.
The spline fit solves the following linear equation:
\[y = B\alpha, \notag\]where
\[y = \left[\begin{array}{c} y_1 \\ y_2 \\ \vdots\\ y_m \end{array}\right], \notag\]and the coefficients
\[\alpha = \left[\begin{array}{c} \alpha_1 \\ \alpha_2 \\ \vdots\\ \alpha_n \end{array}\right]. \notag\]$\alpha$ is solved by minimizing $\lVert y-B\alpha\rVert^2=(y-B\alpha)^T(y-B\alpha)$.
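In code, the unpenalized coefficients can be obtained with an ordinary least-squares solve. A minimal sketch, assuming `x` and `y` are the data arrays (generated in the Data section below) and `splines` is the fitted transformer from above:

B = splines.transform(x.reshape(-1, 1))  # basis matrix at the data locations
alpha, *_ = np.linalg.lstsq(B, y, rcond=None)  # minimize ||y - B @ alpha||^2
y_fit = B @ alpha  # the unpenalized spline fit at the data points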
Control of Smoothness of Spline Fit
Penalized Splines
If the consecutive coefficients $\alpha_k$ and $\alpha_{k+1}$ are close, the spline fit is locally close to a constant, because the sum of the basis functions at any point is 1. Therefore, when consecutive coefficients are close, the spline fit is smooth.
To obtain a smoother spline fit, a penalty on the differences between neighboring coefficients, $\alpha_k$ and $\alpha_{k+1}$, is added to the residual of the spline fit:
\[(y-B\alpha)^T(y-B\alpha) +\lambda \lVert D\alpha\rVert^2, \label{eqn_objective}\]where $\lambda \lVert D\alpha\rVert^2$ is the penalty term, $\lambda$ is the smoothing factor, and $D$ is a discrete difference operator. For example, for $n=4$, the first-order difference operator matrix is
\[D = \left[ \begin{array}{cccc} -1 & 1 & 0 & 0\\ 0 & -1 & 1 & 0\\ 0 & 0 & -1 & 1 \end{array} \right], \notag\]and the second-order difference operator is
\[D = \left[ \begin{array}{cccc} 1 & -2 & 1 & 0\\ 0 & 1 & -2 & 1 \end{array} \right]. \notag\]
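Both operators can be built by differencing an identity matrix with `np.diff`, the same construction used in the fitting code below:

I = np.eye(4)  # n = 4 coefficients
D1 = np.diff(I, n=1, axis=0)  # first-order difference operator, shape (3, 4)
D2 = np.diff(I, n=2, axis=0)  # second-order difference operator, shape (2, 4)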
Minimizing Equation ($\ref{eqn_objective}$) with respect to $\alpha$, by setting its gradient to zero, yields the linear equation to be solved for the penalized spline fit:
\[(B^T B + \lambda D^T D)\alpha = B^Ty. \label{eqn_pspline}\]
Bayesian Regression
Alternatively, the smoothness of the fit is controlled by setting priors on the coefficients in a Bayesian linear regression model.
An Example
We generate one-dimensional data and use P-splines and Bayesian regression to fit the data.
Data
First, let us generate some data and plot them in Figure 3.
import numpy as np
from scipy import stats

n_points = 100
np.random.seed(seed=42)
x = np.linspace(0, 1.8*np.pi, n_points) + 2*stats.norm.rvs(size=n_points)  # noisy predictor locations
y = np.sin(x)*x + 2*stats.norm.rvs(size=n_points)  # noisy response
P-splines
The following function uses Equation ($\ref{eqn_pspline}$) to calculate the coefficients of the spline fit and predicts at new predictor values.
def spline_fit(X, Y, n_knots, degree, lam, x_new):
    """Spline fit from training data, then predict at new data.
    Inputs:
        X: 1D array of x, used for the spline fit
        Y: 1D array of y, used for the spline fit
        n_knots: number of knots
        degree: degree of the spline polynomial
        lam: smoothing parameter
        x_new: 1D array of new x values to predict y for
    Output:
        the predicted values of y for x_new
    """
    from sklearn.preprocessing import SplineTransformer
    from sklearn.linear_model import LinearRegression
    spline = SplineTransformer(n_knots=n_knots,
                               degree=degree,
                               extrapolation='continue')
    bases = spline.fit_transform(X.reshape(-1, 1))
    # second-order difference operator D
    D = np.eye(bases.shape[1])
    D = np.diff(D, n=2, axis=0)
    # solve (B^T B + lam * D^T D) alpha = B^T y
    BtB = np.matmul(bases.T, bases)
    DtD = np.matmul(D.T, D)
    Bty = np.matmul(bases.T, Y)
    alpha = (LinearRegression(fit_intercept=False)
             .fit((BtB + lam*DtD), Bty)
             .coef_)  # least-squares solution of the square linear system
    # prediction at the new locations
    pred_bases = spline.transform(x_new.reshape(-1, 1))
    pred = np.matmul(pred_bases, alpha)
    return pred
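A usage sketch with the data generated above; the choice $\lambda = 10$ anticipates the cross-validation result below:

x_new = np.linspace(x.min(), x.max(), 200)  # prediction grid
y_pred = spline_fit(x, y, n_knots=20, degree=3, lam=10, x_new=x_new)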
Spline fits of varying smoothness, controlled by the parameter $\lambda$, are shown in Figure 4. As $\lambda$ becomes larger, the fit becomes smoother.
Figure 5 shows the data, spline fit, B-spline basis functions, and fit coefficients in the same plot. The spline fit (black line) is the sum of the weighted basis functions (colored lines).
Optimal Smoothness
The parameter $\lambda$ in Equation ($\ref{eqn_objective}$) controls the smoothness of a spline fit. But what is the optimal value of $\lambda$?
Like other regression models, the goodness of the spline fit is measured by how well it predicts unseen data. Therefore, we split the data into training and validation sets, then generate the spline fit with the training data and evaluate the prediction performance with the validation data.
We use 5-fold cross-validation for the data split and RMSE (root mean squared error) as the prediction performance metric.
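A sketch of the search, assuming the `spline_fit` function above; the grid of candidate $\lambda$ values is illustrative:

from sklearn.model_selection import KFold

lambdas = [0.1, 1, 10, 100, 1000]  # candidate smoothing factors
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for lam in lambdas:
    fold_rmse = []
    for train_idx, val_idx in kf.split(x):
        pred = spline_fit(x[train_idx], y[train_idx],
                          n_knots=20, degree=3, lam=lam, x_new=x[val_idx])
        fold_rmse.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))  # RMSE on the validation fold
    print(lam, np.mean(fold_rmse))  # average RMSE over the 5 folds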
Figure 6 shows the average RMSE of the 5-fold cross-validation for various values of $\lambda$. The optimal $\lambda \approx 10$, where the RMSE is smallest.
Bayesian Regression with PyMC
The spline fit is obtained with Bayesian linear regression:
\[y = B\alpha +\mathcal{N}(0, \sigma_y), \notag\]where $B$ is the matrix of the values of the spline bases at the data points $x$.
To restrict the differences among the coefficients $\alpha$, we model the components of $\alpha$ as a Gaussian random walk, so they are correlated and similar:
\[\alpha_i \sim \mathcal{N}(\alpha_{i-1}, \sigma). \notag\]Further, the standard deviation of the random walk, $\sigma$, is controlled by its prior distribution, an exponential distribution in this study. A single parameter $\lambda$ controls the exponential distribution, whose PDF is
\[\lambda \exp(-\lambda x). \notag\]Figure 7 shows the PDF for various values of $\lambda$. A larger $\lambda$ leads to a tighter distribution.
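For reference, SciPy parameterizes the exponential distribution with `scale` $= 1/\lambda$, so the PDF above can be evaluated as:

from scipy import stats
lam = 5
xs = np.linspace(0, 2, 100)
pdf = stats.expon(scale=1/lam).pdf(xs)  # equals lam * exp(-lam * xs)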
The code for spline fit using the PyMC package is listed below.
import pymc as pm

splines = SplineTransformer(n_knots=20, degree=3)  # spline transformer
splines.fit(x.reshape(-1, 1))  # calculate knots
spline_bases = splines.transform(x.reshape(-1, 1))  # spline bases
n = spline_bases.shape[-1]  # number of spline bases
λ = 1
with pm.Model() as model:
    σ = pm.Exponential('σ', λ)  # prior for sigma of the Gaussian random walk
    α = pm.GaussianRandomWalk('α', mu=0, sigma=σ,
                              init_dist=pm.Normal.dist(0, 10),
                              shape=n)  # coefficients
    σ_y = pm.Exponential('σ_y', 1)  # prior for the observation noise
    mu = pm.math.dot(spline_bases, α)  # spline fit
    obs = pm.Normal('obs', mu=mu, sigma=σ_y, observed=y)
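Sampling the model and forming the fit from the posterior of $\alpha$. A minimal sketch; that Figure 8 uses the posterior mean of the coefficients is an assumption:

with model:
    idata = pm.sample()  # draw posterior samples
alpha_mean = idata.posterior['α'].mean(dim=('chain', 'draw')).values  # posterior mean coefficients
y_fit = spline_bases @ alpha_mean  # spline fit at the data locations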
The spline fit result is shown in Figure 8. A larger value of $\lambda$ leads to a tighter distribution of the spline fit coefficients $\alpha$, resulting in a smoother fit.