Here θ represents the probability of heads. An appropriate likelihood function summarizes the data
in a way that facilitates inference and prediction. In
this case, the summarization implies that the order
of heads and tails in the 12 tosses doesn’t matter
for estimating the probability of heads on the next
toss—the number of heads contains all of the relevant information. While this particular example
is elementary, specifying appropriate likelihood
models is, generally speaking, no simple task. In
realistic problems, an appropriate model is not as
obvious as it is here, and an iterative process of data
visualization and model testing is advisable.
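The claim that the order of the tosses does not matter can be checked numerically. The sketch below assumes independent tosses with a constant probability of heads, so that the likelihood of a sequence with h heads in n tosses is θ^h (1 − θ)^(n − h). The two sequences and the function name `bernoulli_likelihood` are hypothetical illustrations, not data from the example; they simply share the same count of 3 heads in 12 tosses.

```python
import math

def bernoulli_likelihood(tosses, theta):
    """Probability of one specific sequence of tosses (1 = heads, 0 = tails),
    assuming independent tosses with a constant probability of heads theta."""
    prob = 1.0
    for toss in tosses:
        prob *= theta if toss == 1 else (1.0 - theta)
    return prob

# Two hypothetical 12-toss sequences with 3 heads each, in different orders.
seq_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
seq_b = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

for theta in (0.1, 0.25, 0.5, 0.9):
    like_a = bernoulli_likelihood(seq_a, theta)
    like_b = bernoulli_likelihood(seq_b, theta)
    # The two values agree up to floating-point rounding for every theta.
    print(theta, like_a, like_b, math.isclose(like_a, like_b, rel_tol=1e-12))
```

Because the two likelihoods coincide for every value of θ tried, any inference about θ based on the likelihood is the same for both sequences, which is what it means for the number of heads to carry all of the relevant information.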
Perhaps unsurprisingly, Bayesians and frequentists view likelihood functions somewhat
differently. For frequentists, the data is the randomly varying quantity in the above expression—what
we observed is merely a single random draw from
a hypothetical infinite population. The frequentist
therefore computes the value of θ that results in
the highest probability being assigned to the data
actually observed. In this example, the resulting
maximum-likelihood estimate of the probability
of heads is very intuitive: three-twelfths, or one-quarter, the observed proportion of heads.
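The maximum-likelihood estimate can be reproduced with a simple grid search. This is a minimal sketch, assuming 3 heads in 12 independent tosses and a likelihood proportional to θ^3 (1 − θ)^9; the function name `likelihood` and the grid resolution are illustrative choices, not part of the original example.

```python
def likelihood(theta, heads=3, tosses=12):
    """Likelihood of theta, up to a constant factor that does not depend on theta."""
    return theta ** heads * (1.0 - theta) ** (tosses - heads)

# Evaluate the likelihood on an evenly spaced grid over [0, 1]
# and keep the value of theta with the highest likelihood.
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=likelihood)

print(theta_hat)  # 0.25
```

The grid search is used here only because it makes the idea of "the value of θ that assigns the highest probability to the observed data" explicit; for this model the maximum can also be found in closed form as heads divided by tosses.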
In contrast, the Bayesian views the data as a fixed
fact about the world and considers the unknown
parameter θ to be the randomly varying quantity. This
point is absolutely fundamental and closely related
to the Bayesian premise that probabilities represent
degrees of belief rather than relative frequencies. As
with the cancer example, so with the possibly biased coin: In each example, the truth is unknown.
Therefore we must model it with a probability function. The likelihood in itself does not constitute a
complete model; a prior probability distribution is
also needed to model the unknown quantity θ. But
where does this model come from?
We can answer this question with another question. Where does the likelihood function or, for that
matter, any scientific model come from? The analyst uses his or her judgment, often in an iterative
fashion, to specify a model that best captures the
salient aspects of the situation being modeled. The
appropriate prior probability distribution depends
on the available background information and how
much of this information the analyst chooses to reflect in his or her model.
Scenario 1 (no background information): In this
particular case, we know that our statistician friend
has a history of tossing biased coins. But we don’t
know whether the present coin is biased or, if it is,
what that bias might be. In situations such as this,
it is common practice to use a “non-informative”
prior that in effect treats all possible values of θ the
same way. In this way, Bayesian statistics allows us
to plead ignorance about quantities that we have
little or no prior information about.
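For a coin, one common way to express such a flat prior is the uniform density on the interval from 0 to 1, which is the Beta(1, 1) distribution. The sketch below assumes the standard conjugate update for this model (add the number of heads to the first parameter and the number of tails to the second) and applies it to 3 heads in 12 tosses. The helper names `update_beta`, `beta_mode`, and `beta_mean` are hypothetical and introduced only for illustration.

```python
def update_beta(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior combined with coin-toss data
    gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails

def beta_mode(a, b):
    """Location of the peak of a Beta(a, b) density (valid when a > 1 and b > 1)."""
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (1, 1)  # flat, non-informative prior over theta
posterior = update_beta(*prior, heads=3, tails=9)

print(posterior)              # (4, 10)
print(beta_mode(*posterior))  # 0.25
print(beta_mean(*posterior))  # roughly 0.286
```

The same function can be applied again whenever further tosses are observed, using the current posterior as the new prior.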
Figure 1 illustrates the use of conditional probability to update from the flat, non-informative prior
probability distribution to the posterior distribution that peaks at one-quarter. (Note
to mavens: the prior and posterior are Beta(1, 1)
and Beta(4, 10), respectively. See the fourth edition
of Herzog’s Introduction to Credibility Theory for
details.) Now suppose the statistician tosses the
coin another 40 times, 10 of which land heads.
Our probability is again updated accordingly and