Student's t Mixture Model with PyMC
In this note, we compare a Gaussian mixture model and a Student's t mixture model on two-dimensional data with an unbalanced proportion of clusters, as shown in Figure 1. The results demonstrate that the Student's t mixture model performs much better.
Dimension Reduction
We first reduce the problem from two dimensions to one using a spline fit, since the variation between the clusters is mainly contained in the residuals of the fit. We use the CSAPS package for Python to calculate the smoothing spline. The one-dimensional array of residuals is then used to separate the groups with a mixture model.
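A minimal sketch of this step, assuming the two-dimensional data are in arrays x and y with x sorted in ascending order (the array names and the smoothing parameter are illustrative, not taken from the original analysis):

import numpy as np
from csaps import csaps

# Smoothing spline fit y(x); smooth=0.85 is only a placeholder value
y_fit = csaps(x, y, x, smooth=0.85)

# One-dimensional residuals used by the mixture models below
residual = y - y_fit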
Mixture Models
Gaussian Mixture Model
The Gaussian mixture model is widely used and assumes that the data are generated from a mixture of Gaussian distributions. We use GaussianMixture in scikit-learn with two covariance types: each component has its own covariance matrix ('full'), and all components share the same covariance matrix ('tied').
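A minimal sketch of the two fits, assuming the spline residuals are in a one-dimensional array y (the array name and random_state are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

X = y.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

# One covariance matrix per component
gmm_full = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

# A single covariance matrix shared by all components
gmm_tied = GaussianMixture(n_components=2, covariance_type='tied', random_state=0).fit(X)

# Means, sigmas, and fractions like those reported in the tables below
print(gmm_full.means_.ravel(), np.sqrt(gmm_full.covariances_).ravel(), gmm_full.weights_)
print(gmm_tied.means_.ravel(), np.sqrt(gmm_tied.covariances_).ravel(), gmm_tied.weights_)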
Full Covariance Type
The result looks poor: it overestimates the difference in sigma but underestimates the difference in mean between the components. The sigma in one component is so large that the component is fragmented. The proportion of the smaller component is also too large.
Component | Mean | Sigma | Fraction |
---|---|---|---|
1 | -0.5 | 4.4 | 16.6% |
2 | 0.29 | 1.2 | 83.4% |
Tied Covariance Type
The result is improved by requiring the same covariance for all components: the fragmentation is gone. However, the result still underestimates the difference in means, and the proportion of the smaller component is still too large.
Component | Mean | Sigma | Fraction |
---|---|---|---|
1 | -0.95 | 2.1 | 16.2% |
2 | 0.46 | 2.1 | 83.8% |
Student’s t Mixture Model
To overcome the shortcomings of the Gaussian mixture model, we use Student’s t distribution (Equation ($\ref{eqn:student}$)) to model the components in the mixture model, described by Equation ($\ref{eqn:mixture}$).
\[p(x\mid\theta)=\sum_{k=1}^K \pi_k f(x\mid\mu_k, \sigma_k, \nu_k). \label{eqn:mixture}\]

Here $f$ is the Student's t distribution:

\[f(x\mid \mu, \sigma, \nu)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\frac{1}{\sqrt{\pi\nu}\sigma}\left[1+\frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}. \label{eqn:student}\]

The Bayesian inference code with the PyMC3 package is listed below. Weakly informative priors are used.
import numpy as np
import pymc3 as pm

def MixtureModel(data, nmode):
    """Inputs:
        data: 1-D array of values
        nmode: integer, number of modes in the distribution of the data
    Output:
        model: PyMC3 model
    """
    ndata = len(data)  # sample size
    with pm.Model() as model:
        # mixture weights
        p = pm.Dirichlet('p', a=np.ones(nmode), shape=nmode)
        # ordered component means to avoid label switching
        mean = pm.Normal('mean', mu=np.ones(nmode), sigma=100, shape=nmode,
                         transform=pm.transforms.ordered, testval=np.arange(nmode))
        sigma = pm.Exponential('sigma', 1, shape=nmode)
        nu = pm.Exponential('nu', 1, shape=nmode)
        dist = pm.StudentT.dist(nu=nu, mu=mean, sigma=sigma, shape=nmode)
        obs = pm.Mixture('obs', w=p, comp_dists=dist, observed=data)
    return model

## Bimodal model, y is the residual array
m2 = MixtureModel(y, 2)
with m2:
    trace2 = pm.sample(1000, return_inferencedata=True)

# Trimodal model
m3 = MixtureModel(y, 3)
with m3:
    trace3 = pm.sample(1000, tune=2000, target_accept=0.95, return_inferencedata=True)
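A sketch of the model comparison step, assuming ArviZ is used for the PSIS-LOO computation (the dictionary labels are illustrative; the trace variables follow the code above):

import arviz as az

# Pareto-smoothed importance sampling LOO comparison of the two models
comparison = az.compare({'2 components': trace2, '3 components': trace3}, ic='loo')
print(comparison)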
The model with two components is the best according to Pareto-smoothed importance sampling LOO cross-validation (Figure 5). As shown in Figure 6, the separation of the two components looks much better than with the Gaussian mixture model. The components' locations, spreads, and proportions are summarized in Figure 7 and the table below; they all have reasonable values.
Component $k$ | Mean $\mu_k$ | Scale $\sigma_k$ | Degrees of freedom $\nu_k$ | Fraction $\pi_k$ |
---|---|---|---|---|
1 | -4.97 | 0.99 | 2.8 | 2.5% |
2 | 0.13 | 1.23 | 9.1 | 97.5% |
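The posterior estimates in the table can be read off from a summary of the trace, for example (a sketch assuming ArviZ):

import arviz as az

# Posterior means and credible intervals for location, scale, degrees of freedom, and weights
print(az.summary(trace2, var_names=['mean', 'sigma', 'nu', 'p']))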
To summarize, the Student’smixture model is more robust than the Gaussian mixture model to handle imbalanced components and distributions with long tails. MCMC Bayesian inference using probabilistic programming packages like PyMC3 is straightforward to implement, and the result provides an uncertainty estimate through the posterior distributions.