Likelihood


The likelihood function, or simply likelihood, measures how well a set of population parameters, applied to a statistical model, describes a sample. Mathematically, it is a particular reading of the joint distribution function that the sample's random variables follow: instead of fixing the parameters and varying the sample, one fixes the sample and varies the parameters.

If we consider an iid random sample $\mathbf{X}=(X_{1},\ldots,X_{N})$ of size $N$ and a model with $M$ parameters $\boldsymbol{\theta}=(\theta_{1},\ldots,\theta_{M})$, the probability $P$ of observing a realization $\mathbf{x}=(x_{1},\ldots,x_{N})$ of $\mathbf{X}$ given a specific set of parameters $\boldsymbol{\theta}^{*}$ is given by the joint distribution function, which maps (see footnote 1)

$$\mathbf{x} \mapsto P(\mathbf{x}|\boldsymbol{\theta}^{*})$$

We can invert the roles of the arguments to obtain the likelihood $\mathcal{L}$ of parameters $\boldsymbol{\theta}$ given a specific observation $\mathbf{x}^{*}$:

$$\boldsymbol{\theta} \mapsto P(\mathbf{x}^{*}|\boldsymbol{\theta})\equiv \mathcal{L}(\boldsymbol{\theta}|\mathbf{x}^{*})$$

Strictly speaking, these two functions are identical. What changes is which arguments we fix and which we allow to vary, which changes the information that the function gives. In particular, the likelihood is not a probability distribution over $\boldsymbol{\theta}$. In essence, it asks the question: "Given our specific observation $\mathbf{x}^{*}$, how well supported is the claim that the process that generated it has parameters $\boldsymbol{\theta}$ under our model?"
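The two readings of the same function can be compared numerically. The sketch below assumes a binomial model with a made-up sample size and made-up values for $\boldsymbol{\theta}^{*}$ and $\mathbf{x}^{*}$; it is only an illustration, not part of the definition. Summed over samples at fixed parameter, the function totals 1. Integrated over the parameter at a fixed sample, it generally does not.

```python
from math import comb

def binom_pmf(x, n, theta):
    """P(x | theta): probability of x successes in n Bernoulli(theta) trials."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

n = 10  # hypothetical number of trials

# Reading 1: fix theta, vary the sample x (a probability distribution over x).
theta_star = 0.3
total_over_x = sum(binom_pmf(x, n, theta_star) for x in range(n + 1))

# Reading 2: fix the observation x, vary theta (the likelihood).
x_star = 3
grid = [(k + 0.5) / 10_000 for k in range(10_000)]  # midpoints on (0, 1)
area_over_theta = sum(binom_pmf(x_star, n, t) for t in grid) / len(grid)

print(total_over_x)     # close to 1
print(area_over_theta)  # close to 1 / (n + 1), not 1
```

For this model the integral over $\theta$ equals $1/(n+1)$ for every observation, which is why the likelihood is read as a relative measure of support rather than as a density over parameters.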

For an iid sample, we can call the probability density function of the common distribution $f_{X}(x;\boldsymbol{\theta})$. Then, since the likelihood is the joint density function, which for an iid sample is just the product of the individual PDFs, we can state

$$\mathcal{L}(\boldsymbol{\theta};\mathbf{x})=\prod_{i=1}^{N} f_{X}(x_{i};\boldsymbol{\theta})$$

Thus evaluating the likelihood amounts to evaluating a product of probability density functions.
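As a minimal sketch of this product form, assume the common distribution is normal with $\boldsymbol{\theta}=(\mu,\sigma)$. The model choice, the function names and the sample values are all made up for illustration.

```python
from math import exp, pi, prod, sqrt

def normal_pdf(x, mu, sigma):
    """Density f_X(x; theta) of a normal distribution with theta = (mu, sigma)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def likelihood(theta, sample):
    """Likelihood of theta for an iid sample: the product of the individual PDFs."""
    mu, sigma = theta
    return prod(normal_pdf(x, mu, sigma) for x in sample)

sample = [4.8, 5.1, 5.3, 4.9, 5.4]  # made-up observations

# Parameters near the data receive a larger likelihood than parameters far away.
print(likelihood((5.0, 0.3), sample))
print(likelihood((6.0, 0.3), sample))
```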

More generally, given some statistical model $f(\mathbf{x};\boldsymbol{\theta})$, its likelihood is formally defined as

$$\begin{align} \mathcal{L}:\Theta&\to \mathbb{R}^{+} \\ \boldsymbol{\theta}&\mapsto c_{\mathbf{x}}f(\mathbf{x};\boldsymbol{\theta}) \end{align}$$

where $\Theta$ is the space of all possible parameters and $c_{\mathbf{x}}$ is some constant for the given $\mathbf{x}$. This function allows for comparison between the credibility of different sets of parameters. Higher likelihood means higher credibility. Notably, the ratio of two likelihoods, $\mathcal{L}(\boldsymbol{\theta}_{1})/\mathcal{L}(\boldsymbol{\theta}_{2})$, provides a relative comparison in which the constant $c_{\mathbf{x}}$ drops out. A formal justification of this credibility interpretation is given by the Wald inequality.
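To illustrate how the constant cancels in a ratio, the sketch below assumes a binomial model and compares two versions of its likelihood: one with the binomial coefficient and one without it (here the dropped coefficient plays the role of the constant). The function names and the data are hypothetical.

```python
from math import comb, isclose

def kernel(theta, x, n):
    """Likelihood with the theta-independent binomial coefficient dropped."""
    return theta**x * (1 - theta)**(n - x)

def full(theta, x, n):
    """Full binomial probability of x successes in n trials."""
    return comb(n, x) * kernel(theta, x, n)

x, n = 7, 10  # made-up observation: 7 successes in 10 trials

ratio_kernel = kernel(0.7, x, n) / kernel(0.5, x, n)
ratio_full = full(0.7, x, n) / full(0.5, x, n)

# Both ratios agree: the constant multiplies numerator and denominator alike.
print(ratio_kernel, ratio_full, isclose(ratio_kernel, ratio_full))
```

A ratio above 1 says the first parameter value is better supported by the observation than the second, whichever version of the likelihood is used.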

Log-likelihood

It is common to instead use the log-likelihood, which is simply the logarithm of the likelihood: $\ell(\boldsymbol{\theta};\mathbf{x})\equiv \log \mathcal{L}(\boldsymbol{\theta};\mathbf{x})$. The base of the logarithm is usually $e$; another base, such as 10, only rescales $\ell$ by a constant factor. This form greatly reduces the range of numbers that are observed in practice, which helps with numerical stability. Analytically, it also turns products into sums, so that for an iid sample $\ell$ is a sum of log-densities.
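The numerical-stability point can be seen in a small experiment. The sketch assumes a standard normal model and a simulated sample of 2000 points (the seed and the sample size are arbitrary choices): the direct product of densities underflows to zero in double precision, while the sum of log-densities stays finite.

```python
import random
from math import exp, log, pi, prod, sqrt

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]  # simulated data

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def normal_logpdf(x, mu, sigma):
    # Logarithm of normal_pdf, written out so no tiny number is ever formed.
    return -0.5 * ((x - mu) / sigma) ** 2 - log(sigma * sqrt(2 * pi))

lik = prod(normal_pdf(x, 0.0, 1.0) for x in sample)        # product of PDFs
loglik = sum(normal_logpdf(x, 0.0, 1.0) for x in sample)   # sum of log-PDFs

print(lik)     # underflows to 0.0
print(loglik)  # a finite, large negative number
```

Every standard normal density is below 0.4, so a product of 2000 of them is far smaller than the smallest positive double, whereas its logarithm is an ordinary number.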

The gradient of the log-likelihood is sometimes called the score function $U(\boldsymbol{\theta})\equiv(\partial_{\theta_{1}}\log \mathcal{L},\ldots,\partial_{\theta_{M}}\log \mathcal{L})$. The negative Hessian is called the observed information matrix $J_{ij}(\boldsymbol{\theta})\equiv- \frac{ \partial ^{2}\log \mathcal{L} }{ \partial \theta_{i}\partial \theta_{j} }$.

These functions possess some interesting properties, provided they are sufficiently regular:

  1. The expected score is zero: $\text{E}_{\boldsymbol{\theta}}[U(\boldsymbol{\theta})]=0$.
  2. The second Bartlett identity holds: $\text{cov}_{\boldsymbol{\theta}}(U(\boldsymbol{\theta}))=\text{E}_{\boldsymbol{\theta}}[J(\boldsymbol{\theta})]=\mathcal{I}(\boldsymbol{\theta})$. The function $\mathcal{I}(\boldsymbol{\theta})$ (the covariance of the score) is called the Fisher information matrix (or expected information matrix). This notation is shorthand for each combination of $\theta_{i},\theta_{j}$ in the covariance.
  3. The Cramér-Rao inequality holds. In one dimension, $\mathcal{I}(\theta)=\text{E}[J(\theta)]$ and so $\text{var}_{\theta}(\hat{\theta})\geq 1/\mathcal{I}(\theta)$ for any unbiased estimator $\hat{\theta}$. In multiple dimensions, this generalizes to the condition that $\text{cov}_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})-\mathcal{I}^{-1}(\boldsymbol{\theta})$ is positive semidefinite.
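These properties can be checked exactly in a simple case. The sketch below assumes a single Bernoulli($\theta$) observation, for which the expectations reduce to two-term sums; the helper names, the value of $\theta$ and the sample size `N` are illustrative choices. It also uses the fact that the Fisher information of $N$ iid observations is $N$ times that of one.

```python
from math import isclose

def score(theta, x):
    """U(theta): derivative of log(theta^x (1 - theta)^(1 - x)) w.r.t. theta."""
    return x / theta - (1 - x) / (1 - theta)

def observed_info(theta, x):
    """J(theta): negative second derivative of the same log-likelihood."""
    return x / theta**2 + (1 - x) / (1 - theta) ** 2

def expect(g, theta):
    """Exact expectation of g(theta, X) for X ~ Bernoulli(theta)."""
    return (1 - theta) * g(theta, 0) + theta * g(theta, 1)

theta = 0.3
mean_score = expect(score, theta)                                  # property 1
var_score = expect(lambda t, x: score(t, x) ** 2, theta) - mean_score**2
fisher = expect(observed_info, theta)                              # property 2

# Property 3: the sample mean of N iid draws is unbiased for theta, and its
# variance theta (1 - theta) / N equals the Cramer-Rao bound 1 / (N * I(theta)).
N = 50
var_sample_mean = theta * (1 - theta) / N
cramer_rao_bound = 1 / (N * fisher)

print(mean_score, var_score, fisher)
print(isclose(var_sample_mean, cramer_rao_bound))
```

In this model the sample mean attains the bound, so the inequality holds with equality; in general an unbiased estimator can only match the bound or exceed it.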

Applications

The likelihood function is generally used in maximum likelihood estimation, which attempts to find the global maximum of $\mathcal{L}$ in order to find the most plausible parameters $\boldsymbol{\theta}$ for a model. It is also a key component of Bayes' theorem, where it is multiplied by the prior and then normalized to obtain the posterior.
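Both uses can be sketched on a grid of candidate parameters. The example assumes a Bernoulli model, made-up coin-flip data and a hypothetical Beta(2, 2)-shaped prior; a real application would normally use an optimizer or a closed-form solution rather than a brute-force grid.

```python
from math import exp, log

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # made-up coin flips: 7 heads out of 10

def log_likelihood(theta, data):
    heads = sum(data)
    tails = len(data) - heads
    return heads * log(theta) + tails * log(1 - theta)

grid = [k / 1000 for k in range(1, 1000)]  # candidate values in (0, 1)

# Maximum likelihood estimation: the grid point with the largest log-likelihood.
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))

def prior(theta):
    """Unnormalized Beta(2, 2) density (a hypothetical prior for illustration)."""
    return theta * (1 - theta)

# Bayes' theorem on the grid: posterior is proportional to likelihood x prior.
unnormalized = [exp(log_likelihood(t, data)) * prior(t) for t in grid]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]
theta_map = grid[max(range(len(grid)), key=posterior.__getitem__)]

print(theta_mle)  # the observed fraction of heads
print(theta_map)  # pulled slightly toward 0.5 by the prior
```

The maximum of the likelihood sits at the observed proportion of heads, while the posterior mode is shifted toward the centre of the prior.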

Footnotes

  1. Actually, this is a bit tricky with continuous variables. For discrete variables, the joint distribution function (JDF) gives a probability directly. For continuous ones, the JDF gives a probability density, and it is the integral of the JDF that gives a probability. Either way, the likelihood interpretation is the same.