Suppose that we have access to some data set as follows: \(x^{(1)},x^{(2)},x^{(3)},\cdots x^{(m)}\). Our main question of interest is: can we find a random variable \(X\) and a distribution \(p(X)\) such that our data is well modeled by this distribution? In general, this is a hard problem. If we consider the class of all possible distributions, there is no meaningful way to single out a best fit to the data.

Instead, the common strategy is to choose our distribution from a certain family of distributions \(p(X;θ)\), parameterized by \(θ\), and have our goal be to find the parameters \(θ\) that fit the data best. Even here, multiple different approaches are possible, but at the very least this gives us a more concrete problem that lets us better attack the underlying question.

In this set of notes, we'll first answer this estimation problem by appealing to the maximum likelihood estimation (MLE) procedure.

Maximum likelihood estimation

Given some parameterized distribution \(p(X;θ)\) and a collection of independent samples \(x^{(1)},x^{(2)},x^{(3)},\cdots x^{(m)}\), we can compute the probability of observing this set of samples under the distribution, which is simply given by

\[
p(x^{(1)},x^{(2)},x^{(3)},\cdots x^{(m)};θ)=\prod_{i=1}^m p(x^{(i)};θ),
\]
where the product form follows from the assumption that the samples are independent. The basic idea of maximum likelihood estimation is that we want to pick the parameters \(θ\) that maximize the probability of the observed data; in other words, we want to choose \(θ\) to solve the optimization problem

\[
\underset{θ}{\text{maximize}} \;\; \prod_{i=1}^m p(x^{(i)};θ)
\]

or equivalently (because maximizing a function is equivalent to maximizing the log of that function, and because we can scale the objective by any positive constant)

\[
\underset{θ}{\text{maximize}} \;\; \frac{1}{m}\sum_{i=1}^m \log p(x^{(i)};θ).
\]

The term we are maximizing above is so common that it usually has its own name: the log-likelihood of the data, written as

\[
ℓ(θ) = \frac{1}{m}\sum_{i=1}^m \log p(x^{(i)};θ),
\]

where we explicitly write \(ℓ\) as a function of \(θ\) to emphasize the fact that the likelihood depends on the parameters.
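As a minimal illustration (the function names and data here are invented for the example), we can evaluate the log-likelihood \(ℓ\) directly in code. The sketch below does so for a Bernoulli distribution with parameter \(ϕ\), the family treated in the example that follows:

```python
import math

def bernoulli_log_pdf(x, phi):
    # log p(x; phi) for a Bernoulli distribution: p(1) = phi, p(0) = 1 - phi
    return x * math.log(phi) + (1 - x) * math.log(1 - phi)

def log_likelihood(samples, phi):
    # l(phi) = (1/m) * sum_i log p(x^(i); phi)
    return sum(bernoulli_log_pdf(x, phi) for x in samples) / len(samples)

samples = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 3 ones out of 10 samples

# The log-likelihood at the empirical proportion (0.3) exceeds that at 0.5,
# consistent with the maximum likelihood result derived below.
print(log_likelihood(samples, 0.3))
print(log_likelihood(samples, 0.5))
```

Larger (less negative) values of the log-likelihood indicate parameters under which the observed data is more probable.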

This procedure may seem “obvious” when stated like this (of course we want to find the parameters that make the data as likely as possible), but there are actually a number of other estimators that are equally valid or reasonable in many situations. We’ll consider some of these when we discuss hypothesis testing and then later probabilistic modeling, but for now, maximum likelihood estimation will serve as a nice principle for how we fit parameters of distributions to data.

Example: Bernoulli distribution

Let’s take a simple example as an illustration of this point for the Bernoulli distribution. Recall that a Bernoulli distribution \(p(X;ϕ)\) is a simple binary distribution over random variables taking values in \(\{0,1\}\), parameterized by \(ϕ\), which is just the probability of the random variable being equal to one. Now suppose we have some data \(x^{(1)},x^{(2)},x^{(3)},\cdots x^{(m)}\) with \(x^{(i)}∈\{0,1\}\); what would be a good estimate of the Bernoulli parameter \(ϕ\)? For example, maybe we flipped a coin 100 times and 30 of these times it came up heads; what would be a good estimate for the probability that this coin comes up heads?

The “obvious” answer here is that we just estimate \(ϕ\) to be the proportion of 1’s in the data

\[
ϕ = \frac{\#\ \text{1's}}{\#\ \text{Total}} = \frac{\sum_{i=1}^m x^{(i)}}{m}.
\]

But why is this the case? If we flip the coin just once, for example, would we expect that we should estimate \(ϕ\) to be either zero or one? Maybe some other estimators exist that can better handle our expectation that the coin “should” be unbiased, i.e., have \(ϕ=1/2\).

While this is certainly true, in fact the maximum likelihood estimate of \(ϕ\) is just the equation above: the number of ones divided by the total number of samples. So this gives some rationale that, at least under the principles of maximum likelihood estimation, we should believe that this is a good estimate. However, showing that this is in fact the maximum likelihood estimator is a little more involved than you might expect. Let’s go through the derivation to see how this works.

First, recall that our objective is to choose \(ϕ\) to maximize the likelihood, or equivalently the log-likelihood, of the observed data \(x^{(1)},x^{(2)},x^{(3)},\cdots x^{(m)}\). This can be written as the optimization problem

\[
\underset{ϕ}{\text{maximize}} \;\; \sum_{i=1}^m \log p(x^{(i)};ϕ).
\]

Recall that the probability under a Bernoulli distribution is just \(p(X=1;ϕ)=ϕ\) and \(p(X=0;ϕ)=1−ϕ\), which we can write compactly as

\[
p(X=x;ϕ) = ϕ^{x}(1−ϕ)^{(1−x)}
\]

It is easy to see that this equals \(ϕ\) for \(x=1\) and \(1−ϕ\) for \(x=0\). Plugging this into our maximum likelihood optimization problem, we have

\[
\underset{ϕ}{\text{maximize}} \;\; \sum_{i=1}^m \left( x^{(i)}\log ϕ + (1−x^{(i)})\log(1−ϕ) \right).
\]

In order to maximize this expression, let’s take the derivative and set it equal to zero (though we won’t show it, it turns out this function has just a single maximum, which must therefore have derivative zero, and so we can find it in this manner). Via some basic calculus we have

\[
\begin{aligned}
\frac{d}{dϕ} \sum_{i=1}^m \left( x^{(i)}\log ϕ + (1−x^{(i)})\log(1−ϕ) \right)
&= \sum_{i=1}^m \frac{d}{dϕ} \left( x^{(i)}\log ϕ + (1−x^{(i)})\log(1−ϕ) \right) \\
&= \sum_{i=1}^m \left( \frac{x^{(i)}}{ϕ} − \frac{1−x^{(i)}}{1−ϕ} \right).
\end{aligned}
\]

Setting this term equal to zero we have

\[
\begin{aligned}
\sum_{i=1}^m \left( \frac{x^{(i)}}{ϕ} − \frac{1−x^{(i)}}{1−ϕ} \right) = 0
&\;\Longrightarrow\; \frac{\sum_{i=1}^m x^{(i)}}{ϕ} − \frac{\sum_{i=1}^m (1−x^{(i)})}{1−ϕ} = 0 \\
&\;\Longrightarrow\; (1−ϕ)\sum_{i=1}^m x^{(i)} − ϕ\sum_{i=1}^m (1−x^{(i)}) = 0 \\
&\;\Longrightarrow\; \sum_{i=1}^m x^{(i)} = ϕ\sum_{i=1}^m \left( x^{(i)} + (1−x^{(i)}) \right) \\
&\;\Longrightarrow\; \sum_{i=1}^m x^{(i)} = ϕ m \\
&\;\Longrightarrow\; ϕ = \frac{\sum_{i=1}^m x^{(i)}}{m}.
\end{aligned}
\]

And there we have it: the surprisingly long proof of the fact that if we want to pick \(ϕ\) to maximize the likelihood of the observed data, we need to choose it to be equal to the empirical proportion of ones. Of course, the objections we raised at the beginning of this section remain valid: this is perhaps not the best estimate of \(ϕ\) if we have very little data, or if we have some prior information about what values \(ϕ\) should take. But it is the estimate of \(ϕ\) that maximizes the probability of the observed data, and if this is a bad estimate, then that reflects more on the underlying problem with this procedure than on the proof above. Nonetheless, in the presence of a lot of data there is actually good reason to use the maximum likelihood estimator, and it is extremely common in practice.
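As a quick numerical sanity check of this derivation (a sketch with invented helper names, using the 30-heads-in-100-flips data from the coin example above), a simple grid search over candidate values of \(ϕ\) recovers the empirical proportion as the maximizer of the log-likelihood:

```python
import math

def log_likelihood(samples, phi):
    # sum_i [ x^(i) log(phi) + (1 - x^(i)) log(1 - phi) ]
    return sum(x * math.log(phi) + (1 - x) * math.log(1 - phi) for x in samples)

# 30 heads out of 100 flips, as in the coin example above
samples = [1] * 30 + [0] * 70

# Grid search over candidate values of phi in (0, 1)
grid = [k / 1000 for k in range(1, 1000)]
phi_best = max(grid, key=lambda phi: log_likelihood(samples, phi))

print(phi_best)  # 0.3, the empirical proportion of ones
```

The grid maximizer agrees with the closed-form answer \(ϕ = \sum_i x^{(i)} / m = 30/100\), as the derivation predicts.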