This is a brief derivation showing that maximum likelihood estimation approximates the minimization of the Kullback-Leibler (KL) distance between the true data distribution and the model.

Suppose the observed data $\mathbf{x}$ come from a probability distribution $P(\mathbf{x})$. Our model of this distribution is $Q(\mathbf{x}\vert\theta)$, where $\theta$ is the model parameter. The Kullback-Leibler distance from the true distribution to the model is

\[K(\theta) = \int P(\mathbf{x})\log \frac{P(\mathbf{x})}{Q(\mathbf{x}\vert \theta)}\, d\mathbf{x}.\]
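As a concrete illustration of this definition, here is a minimal sketch of its discrete analogue, $K = \sum_x P(x)\log\frac{P(x)}{Q(x)}$, in NumPy; the two distributions below are arbitrary example values, not taken from any real data.

```python
import numpy as np

# Discrete analogue of the KL distance above: K = sum_x P(x) * log(P(x) / Q(x)).
# The distributions are made-up example values.
p = np.array([0.2, 0.5, 0.3])      # true distribution P(x)
q = np.array([0.25, 0.45, 0.30])   # model distribution Q(x | theta)

kl = np.sum(p * np.log(p / q))
print(f"KL(P || Q) = {kl:.6f}")    # non-negative, and zero only when P == Q
```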

An estimate of $\theta$ is obtained by minimizing $K(\theta)$:

\[\hat{\theta} = \mathrm{argmin}_\theta K(\theta) = \mathrm{argmax}_\theta \int P(\mathbf{x})\log Q(\mathbf{x}\vert \theta)\, d\mathbf{x}. \label{eqn:argmax}\]

The last step follows because the $P(\mathbf{x})\log P(\mathbf{x})$ term does not depend on $\theta$ and can be dropped.
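Spelling this out, the KL distance splits into two terms,

\[K(\theta) = \int P(\mathbf{x})\log P(\mathbf{x})\, d\mathbf{x} - \int P(\mathbf{x})\log Q(\mathbf{x}\vert \theta)\, d\mathbf{x},\]

and the first integral is a constant with respect to $\theta$, so minimizing $K(\theta)$ amounts to maximizing the second integral.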

Finally, approximating the expectation with a sample average (justified by the Law of Large Numbers), Equation ($\ref{eqn:argmax}$) becomes

\[\begin{align} \notag \hat{\theta} &= \mathrm{argmax}_\theta \int P(\mathbf{x})\log Q(\mathbf{x}\vert \theta)\, d\mathbf{x} \\ \notag & \approx \mathrm{argmax}_\theta\frac{1}{N}\sum_{i=1}^{N} \log Q(\mathbf{x}_i\vert \theta) \\ \notag & = \mathrm{argmax}_\theta \log \prod_{i=1}^{N}Q(\mathbf{x}_i\vert \theta) \\ \notag & = \mathrm{argmax}_\theta \prod_{i=1}^{N}Q(\mathbf{x}_i\vert \theta)\\ \notag & = \mathrm{argmax}_\theta L(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\vert \theta), \notag \end{align}\]

where $N$ is the number of observed data points and $L(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\vert \theta)$ is the likelihood of the observations, assumed independent and identically distributed. This derivation shows that, by approximating the true distribution with the observed sample, maximum likelihood estimation is equivalent to minimizing the KL distance.
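As a quick numerical sanity check, here is a small sketch with a made-up Gaussian example (not part of the derivation above): the true distribution is $N(2, 1)$, the model is $Q(\mathbf{x}\vert\theta) = N(\theta, 1)$, and the maximizer of the sample average log-likelihood lands next to the exact KL minimizer $\theta = 2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: the true distribution P is N(2, 1) and the model
# Q(x | theta) is N(theta, 1), for which the exact KL distance is
# K(theta) = (theta - 2)^2 / 2, minimized at theta = 2.
mu_true = 2.0
x = rng.normal(loc=mu_true, scale=1.0, size=10_000)   # observed data x_1, ..., x_N

thetas = np.linspace(0.0, 4.0, 401)

# Sample average of log Q(x_i | theta), dropping the constant -0.5 * log(2 * pi).
avg_loglik = np.array([-0.5 * np.mean((x - t) ** 2) for t in thetas])
exact_kl = 0.5 * (thetas - mu_true) ** 2

print("argmax of average log-likelihood:", thetas[np.argmax(avg_loglik)])  # close to 2.0
print("argmin of exact KL distance:", thetas[np.argmin(exact_kl)])         # exactly 2.0
# The small gap between the two is the sampling error in the
# Law of Large Numbers approximation.
```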