In this note, we compare the Gaussian mixture model and the Student’s t mixture model on two-dimensional data with an unbalanced proportion of clusters, as shown in Figure 1. The result demonstrates that the Student’s t mixture model performs much better.

Figure 1. The two-dimensional data points appear to have two clusters: one main population on the top and a much smaller group below.

Dimension Reduction

We first reduce the problem from two dimensions to one using a spline fit, since the variation between the clusters is mainly contained in the residuals of the fit. We use the CSAPS package for Python to compute the spline fit. The one-dimensional array of residuals is then used to separate the groups with a mixture model.
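For reference, the fit step might look like the sketch below; the variable names and the smoothing parameter are illustrative assumptions, not the values used for Figure 2.

import numpy as np
from csaps import csaps

# x, y: coordinates of the two-dimensional data points;
# csaps requires strictly increasing x, so sort first (assumes no duplicate x values)
order = np.argsort(x)
x_s, y_s = x[order], y[order]

# smooth=0.85 is a placeholder choice for the smoothing parameter
y_fit = csaps(x_s, y_s, x_s, smooth=0.85)

# 1-D residual array used for the mixture models below
residuals = y_s - y_fit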

Figure 2. Spline fit (left) and the distribution of residuals (right).

Mixture Models

Gaussian Mixture Model

The Gaussian mixture model is widely used and assumes that the data are generated from a mixture of Gaussian distributions. We use GaussianMixture in scikit-learn with two covariance types: each component has its own covariance matrix (‘full’), or all components share the same covariance matrix (‘tied’).
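A minimal sketch of the two fits, assuming the spline-fit residuals are in a 1-D array y; n_components=2 matches the two clusters in Figure 1.

import numpy as np
from sklearn.mixture import GaussianMixture

X = y.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

gmm_full = GaussianMixture(n_components=2, covariance_type='full').fit(X)
gmm_tied = GaussianMixture(n_components=2, covariance_type='tied').fit(X)

labels = gmm_full.predict(X)                       # hard component assignments
print(gmm_full.means_.ravel(), gmm_full.weights_)  # component means and fractions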

Full Covariance Type

The result looks poor: it overestimates the difference in sigma but underestimates the difference in mean between the components. The sigma of one component is so large that the component is fragmented across the sample, and the proportion of the smaller component is also too large.

Figure 3. Result of a Gaussian mixture model with the full covariance type: each component has its own covariance matrix. The distribution of residuals (left) and the component assignments of the original data points (right).
| Component | Mean | Sigma | Fraction |
|-----------|------|-------|----------|
| 1 | -0.5 | 4.4 | 16.6% |
| 2 | 0.29 | 1.2 | 83.4% |

Tied Covariance Type

The result is improved by requiring the same covariance for all components: the fragmentation is gone. However, the result still underestimates the mean difference, and the proportion of the smaller component is still overestimated.

Figure 4. Result of a Gaussian mixture model with the tied covariance type: all components share the same covariance matrix. The distribution of residuals (left) and the cluster assignments of the original data points (right).
| Component | Mean | Sigma | Fraction |
|-----------|------|-------|----------|
| 1 | -0.95 | 2.1 | 16.2% |
| 2 | 0.46 | 2.1 | 83.8% |

Student’s t Mixture Model

To overcome the shortcomings of the Gaussian mixture model, we use Student’s t distribution (Equation ($\ref{eqn:student}$)) to model the components in the mixture model, described by Equation ($\ref{eqn:mixture}$).

\[p(x\mid\theta)=\sum_{k=1}^K \pi_k f(x\mid\mu_k, \sigma_k, \nu_k). \label{eqn:mixture}\]

Student’s t distribution:

\[f(x\mid \mu, \sigma, \nu)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\frac{1}{\sqrt{\pi\nu}\sigma}\left[1+\frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}. \label{eqn:student}\]
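As a quick sanity check (a sketch with arbitrary test values), the density in Equation ($\ref{eqn:student}$) is the location-scale Student’s t and agrees with scipy.stats.t when df=$\nu$, loc=$\mu$, and scale=$\sigma$:

import numpy as np
from scipy.special import gammaln
from scipy.stats import t

def student_t_pdf(x, mu, sigma, nu):
    """Density from the equation above, computed in log space for stability."""
    log_norm = (gammaln((nu + 1) / 2) - gammaln(nu / 2)
                - 0.5 * np.log(np.pi * nu) - np.log(sigma))
    return np.exp(log_norm - (nu + 1) / 2 * np.log1p(((x - mu) / sigma) ** 2 / nu))

x = np.linspace(-5, 5, 11)
assert np.allclose(student_t_pdf(x, mu=0.5, sigma=1.2, nu=3.0),
                   t.pdf(x, df=3.0, loc=0.5, scale=1.2))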

The Bayesian inference code with the PyMC3 package is listed below. Weakly informative priors are used.

import numpy as np
import pymc3 as pm


def MixtureModel(data, nmode):
    """Inputs:
        data: 1-D array of values
        nmode: integer, number of modes in the distribution of the data
       Output:
        model: PyMC3 model
    """
    with pm.Model() as model:

        # mixture weights
        p = pm.Dirichlet('p', a=np.ones(nmode), shape=nmode)

        # ordered means, to prevent label switching between components
        mean = pm.Normal('mean', mu=np.ones(nmode), sigma=100, shape=nmode,
                         transform=pm.transforms.ordered, testval=np.arange(nmode))

        # component scales and degrees of freedom
        sigma = pm.Exponential('sigma', 1, shape=nmode)
        nu = pm.Exponential('nu', 1, shape=nmode)

        dist = pm.StudentT.dist(nu=nu, mu=mean, sigma=sigma, shape=nmode)

        obs = pm.Mixture('obs', w=p, comp_dists=dist, observed=data)

    return model


## Bimodal model, y is the residual array
m2 = MixtureModel(y, 2)
with m2:
    trace2 = pm.sample(1000, return_inferencedata=True)

## Trimodal model
m3 = MixtureModel(y, 3)
with m3:
    trace3 = pm.sample(1000, tune=2000, target_accept=0.95, return_inferencedata=True)

The model with two components is the best according to LOO cross-validation with Pareto-smoothed importance sampling (Figure 5). As shown in Figure 6, the separation of the two components looks much better than with the Gaussian mixture model. The components’ locations, spreads, and proportions are summarized in Figure 7 and the table below; they all have reasonable values.
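The comparison in Figure 5 can be produced with ArviZ along these lines (a sketch; the unimodal model m1/trace1 is assumed to be built and sampled the same way as m2 and m3):

import arviz as az

# trace1 is assumed: trace1 = pm.sample(...) within m1 = MixtureModel(y, 1)
comparison = az.compare({'unimodal': trace1, 'bimodal': trace2, 'trimodal': trace3},
                        ic='loo')
print(comparison)            # the bimodal model ranks first by PSIS-LOO
az.plot_compare(comparison)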

Figure 5. Model comparison with leave-one-out (LOO) cross-validation using Pareto-smoothed importance sampling. The bimodal model is the best among the models with one (unimodal), two (bimodal), and three (trimodal) modes.
Figure 6. Result of a Student's t mixture model with two components. The distribution of residuals (left) and the cluster assignments of the original data points (right).
Figure 7. Posterior distributions of the Student's t mixture model with two components.
| Component $k$ | Mean $\mu_k$ | Scale $\sigma_k$ | Degrees of freedom $\nu_k$ | Fraction $\pi_k$ |
|---------------|--------------|------------------|----------------------------|------------------|
| 1 | -4.97 | 0.99 | 2.8 | 2.5% |
| 2 | 0.13 | 1.23 | 9.1 | 97.5% |
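The values in the table can be read off the posterior, for example with ArviZ (a sketch):

import arviz as az

az.summary(trace2, var_names=['mean', 'sigma', 'nu', 'p'])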

To summarize, the Student’s t mixture model is more robust than the Gaussian mixture model in handling imbalanced components and long-tailed distributions. MCMC Bayesian inference using probabilistic programming packages like PyMC3 is straightforward to implement, and the result provides uncertainty estimates through the posterior distributions.