Maximum Likelihood Estimation (MLE)
Derivation
\(\sX=\{\vx^{(1)},\dots,\vx^{(m)}\}\) is a set of \(m\) examples drawn independently from the true but unknown data-generating distribution \(\pdata(\vx)\). \(\ptrain\) denotes the empirical distribution defined by the training set \(\sX\), and \(\pt\) denotes the model distribution, a parametric family of probability distributions over the same space indexed by \(\theta\).
\[
\begin{aligned}
\theta_{\mathrm{ML}} &= \argmax_\theta\ \pt(\sX) \\
&= \argmax_\theta\ \pt(\vx^{(1)},\dots,\vx^{(m)}) \\
&= \argmax_\theta\ \prod_{i=1}^m \pt(\vx^{(i)}) \\
&= \argmax_\theta\ \log\prod_{i=1}^m \pt(\vx^{(i)}) \\
&= \argmax_\theta\ \sum_{i=1}^m\log\pt(\vx^{(i)}) \\
&= \argmax_\theta\ \E_{\rvx\sim\ptrain(\vx)}[\log\pt(\rvx)] \\
&= \argmin_\theta\ \E_{\rvx\sim\ptrain(\vx)}[-\log\pt(\rvx)] \\
\end{aligned}
\]
Since \(\sum_{i=1}^m \log\pt(\vx^{(i)}) = m\,\E_{\rvx\sim\ptrain(\vx)}[\log\pt(\rvx)]\), and rescaling by the positive constant \(1/m\) does not change the \(\argmax\), the sum over examples becomes an expectation under the empirical distribution. The last step flips the sign and swaps \(\argmax\) for \(\argmin\).
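The derivation above can be checked numerically. As a minimal sketch (assuming a Gaussian model family, for which the MLE has the closed form of the sample mean and sample standard deviation), the average NLL evaluated at the MLE should be no larger than at nearby parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # training set X

def avg_nll(mu, sigma, x):
    """Average negative log-likelihood of N(mu, sigma^2) over the data x."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (x - mu) ** 2 / (2 * sigma**2))

# Closed-form Gaussian MLE: sample mean and (biased) sample std.
mu_ml = x.mean()
sigma_ml = x.std()

# The MLE attains a lower average NLL than perturbed parameters.
print(avg_nll(mu_ml, sigma_ml, x))
print(avg_nll(mu_ml + 0.3, sigma_ml, x))
```

The same comparison holds for any perturbation of \((\mu,\sigma)\), since the average NLL is exactly the objective \(L(\theta)\) with the expectation taken under \(\ptrain\).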
Optimization
Maximizing the log-likelihood, $$ \theta_{\mathrm{ML}} = \argmax_\theta\ \E_{\rvx\sim\ptrain(\vx)}[\log\pt(\rvx)], $$
is equivalent to minimizing the Negative Log-Likelihood (NLL) loss function: $$ L(\theta) = \E_{\rvx\sim\ptrain(\vx)}[-\log\pt(\rvx)], $$
which can be optimized with gradient-based methods using: $$ \nabla_\theta L(\theta) = \E_{\rvx\sim\ptrain(\vx)}[-\nabla_\theta\log\pt(\rvx)] $$ In practice the expectation is estimated on minibatches of training examples, which yields stochastic gradient descent on the NLL.
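As a small gradient-descent sketch of this objective (assuming a Bernoulli model with \(p = \sigma(\theta)\), a reparameterization chosen here so that \(\theta\) is unconstrained; for this model \(\nabla_\theta[-\log\pt(x)] = \sigma(\theta) - x\)):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=5_000).astype(float)  # samples from Bernoulli(0.7)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = 0.0
lr = 0.5
for _ in range(200):
    p = sigmoid(theta)
    # Gradient of E[-log p_theta(x)] w.r.t. theta for the Bernoulli/sigmoid
    # parameterization: E[sigmoid(theta) - x].
    grad = np.mean(p - x)
    theta -= lr * grad

# Gradient descent on the NLL recovers the MLE, p = mean(x).
print(sigmoid(theta), x.mean())
```

Here full-batch gradients are used for simplicity; replacing `np.mean(p - x)` with the mean over a random minibatch gives the stochastic variant described above.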