Likelihood Methods for Measuring Statistical Evidence
Science looks to Statistics for an objective measure of the strength of evidence in a given body of observations. The Law of Likelihood (LL) provides an answer; it explains how data should be interpreted as statistical evidence for one hypothesis over another. Although likelihood ratios figure prominently in both Frequentist and Bayesian methodologies, they are neither the focus nor the endpoint of either methodology. A defining characteristic of the Likelihood paradigm is that the measure of the strength of evidence is decoupled from the probability of observing misleading evidence. As a result, controllable probabilities of observing misleading or weak evidence provide quantities analogous to the Type I and Type II error rates of hypothesis testing, but do not themselves represent the strength of the evidence in the data.
The Law of Likelihood is often confused with the Likelihood Principle. I discuss the likelihood principle (LP) and its relation to the third evidential metric - false discovery rates - on this page.
The Law of Likelihood
The Law of Likelihood is an axiom for the interpretation of data as statistical evidence under a probability model (Royall 1997, Hacking 1965).
The Law of Likelihood (LL): If the first hypothesis, H1, implies that the probability that a random variable X takes the value x is P(x|H1), while the second hypothesis, H2, implies that the probability is P(x|H2), then the observation X=x is evidence supporting H1 over H2 if and only if P(x|H1)>P(x|H2), and the likelihood ratio, P(x|H1)/P(x|H2), measures the strength of that evidence.
Here P(x|H) is the probability of observing x under hypothesis H. The ratio of conditional probabilities, P(x|H1)/P(x|H2), is the likelihood ratio (LR). LL reasons that the hypothesis that more accurately predicted the observed data is the hypothesis that is better supported by the data. This is an intuitive proposition and is already routine reasoning in the sciences.
For the purpose of interpreting and communicating the strength of evidence, it is useful to divide the likelihood ratio scale into descriptive categories, such as "weak", "moderate", and "strong" evidence. A LR of 1 indicates neutral evidence, and the benchmarks k = 8 and k = 32 are used to mark the boundaries between weak, moderate, and strong evidence.
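As a minimal sketch, a likelihood ratio for two simple hypotheses can be computed directly and placed on the benchmark scale. The data here (7 events in 50 trials) and the two hypothesized event probabilities are hypothetical choices for illustration, not from the text:

```python
import math

def binom_lik(p, x, n):
    """Binomial likelihood: probability of x events in n trials at event rate p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def describe(lr):
    """Place a likelihood ratio on the k = 8 and k = 32 benchmark scale."""
    if lr < 1:
        lr = 1 / lr  # report strength for the better-supported hypothesis
    if lr < 8:
        return "weak"
    if lr < 32:
        return "moderate"
    return "strong"

# Hypothetical data (not from the text): 7 events in 50 trials.
x, n = 7, 50
lr = binom_lik(0.10, x, n) / binom_lik(0.30, x, n)
print(round(lr, 1), describe(lr))  # evidence for p = 0.10 over p = 0.30
```

Note that the binomial coefficient cancels in the ratio, so only the parts of the likelihood that depend on the hypothesized probability matter.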
LL only applies to 'simple' hypotheses that specify a single numeric value for the probability of observing x. While the reason for the precondition is obvious - the LR is not computable if P(x|H) is undefined - it does exclude 'composite' hypotheses that specify a set, or range, of probabilities. For example, if x~N(mu,1), probability statements such as P(x|mu>0), P(x|mu=1 or mu=-1), and P(x|mu=sample mean) are undefined.
This problem is not unique to LL; every statistical method that depends on a likelihood ratio - and that is the majority of them - has the same problem. The general solution is to pick one simple hypothesis to represent the set and then treat the chosen representative as if it were the pre-specified hypothesis. Examples include hypothesis testing, where the maximum is chosen as the representative, and Bayesian methods, where the prior-weighted average is chosen as the representative. Each has its pros and cons. The obvious concerns are that the maximum will exaggerate the evidence against the null and that the average will be sensitive to the choice of the prior.
Essentially, the Frequentist and Bayesian solution is to change the model/hypotheses so that the LR is computable. Likelihoodists, on the other hand, have sought to display and report all the evidence. They do this in the same way that power curves - another statistical quantity that is undefined for composite hypotheses - are displayed and summarized.
Likelihood Functions and Support Intervals
The graph below shows the likelihood function from a fictitious study where 10 out of 205 people had a cardiovascular event (black curve). The likelihood function is scaled to a maximum of 1 so it is easier to display. The x-axis is the hypothesis space: the numerical possibilities for the probability of having a cardiovascular event. All of the hypotheses under the crest of the curve (probabilities of CV event near 0.05) are well supported by the data.
Data were collected from three sites, and the grey lines represent the site-specific likelihoods (data were 0 out of 30, 2 out of 70 and 7 out of 105). Judging by the maxima alone, some sites appear to be doing better than others. However, this inference ignores the sampling variability. A better approach is to assess the overlap in the base of the curves, where it is clear the evidence is too weak to distinguish sites based on their performance alone.
The best supported hypothesis is the maximum likelihood estimate of 0.049 (= 10/205). The relative support for H1: prob = 0.06 over H2: prob = 0.03 is weak, with a likelihood ratio of only 2.24 (= L(0.06)/L(0.03) = 0.784/0.35). In fact, any two hypotheses under the blue line will have a likelihood ratio of 8 or less, and any two hypotheses under the red line will have a likelihood ratio of 32 or less. These sets of hypotheses are called support intervals (SI), as they contain the best supported hypothesis up to the stated level of evidence. Hypotheses in a 1/k SI are inferentially equivalent in the sense that the data do not favor any hypothesis in the set over another by a factor of k or more. The blue interval is the 1/8 support interval; the red is the 1/32 support interval.
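The support intervals can be recovered numerically from the study's likelihood (10 events in 205 people): a 1/k SI is the set of probabilities whose likelihood is within a factor of k of the maximum. The grid resolution below is an arbitrary choice for this sketch:

```python
import math

x, n = 10, 205   # events and sample size from the fictitious study
phat = x / n     # maximum likelihood estimate, about 0.049

def log_lik(p):
    """Binomial log-likelihood (constant term omitted; it cancels in ratios)."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def support_interval(k):
    """All p on a grid whose likelihood is within a factor k of the maximum."""
    grid = [i / 10000 for i in range(1, 10000)]
    cutoff = log_lik(phat) - math.log(k)
    inside = [p for p in grid if log_lik(p) >= cutoff]
    return min(inside), max(inside)

si8 = support_interval(8)    # the 1/8 support interval (blue in the figure)
si32 = support_interval(32)  # the 1/32 support interval (red in the figure)
print(si8, si32)
```

As expected, the intervals are nested: every hypothesis in the 1/8 SI also lies in the 1/32 SI, and both contain the maximum likelihood estimate.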
Under a normal model, support intervals and unadjusted confidence intervals (CI) have the same mathematical form. A 95% CI corresponds to a 1/6.8 SI, a 96% CI corresponds to a 1/8 SI (blue line), and a 99% CI corresponds to a 1/32 SI (red line). The SI is defined relative to the likelihood function, while the CI is defined relative to its coverage probability. Recall that the coverage probability of a CI can change even when the data, and hence the likelihood function, do not. The natural interpretation of an unadjusted CI as the 'collection of hypotheses that are best supported by the data' is justified by the Law of Likelihood.
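The correspondence is easy to check: under a normal model the scaled likelihood z standard errors from the maximum is exp(-z^2/2), so the 1/k SI extends out to z = sqrt(2*ln(k)) and its coverage follows from the normal CDF. A short verification of the quoted pairings:

```python
import math
from statistics import NormalDist

def coverage_from_k(k):
    """Coverage probability of the 1/k support interval under a normal model.

    The scaled likelihood z standard errors from the maximum is exp(-z^2/2),
    so the 1/k SI reaches out to z = sqrt(2*ln(k)).
    """
    z = math.sqrt(2 * math.log(k))
    return 2 * NormalDist().cdf(z) - 1

for k in (6.8, 8, 32):
    print(f"1/{k} SI ~ {100 * coverage_from_k(k):.1f}% CI")
```

This reproduces the stated correspondences to rounding: k = 6.8 gives about 95% coverage, k = 8 about 96%, and k = 32 about 99%.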
The Probability of Observing Misleading Evidence
An important aspect of the likelihood paradigm is how it controls the probabilities of observing misleading and weak evidence. Misleading evidence is defined as strong evidence in favor of the incorrect hypothesis over the correct hypothesis. Once the data have been collected, the likelihood ratio determines the strength of the evidence, and its numerical value makes clear whether the evidence is weak or strong. Whether the evidence is misleading, however, remains unknown, because that depends on which hypothesis is actually true.
One nice feature of likelihoods is that they are seldom misleading. The probability of observing misleading evidence of strength k is always bounded above by 1/k, the so-called universal bound (Royall 1997, Blume 2002). LL also has strong ties to classical hypothesis testing: it yields the solution that minimizes the average error rate (Cornfield, 1966), and it allows both error rates to converge to 0 as the sample size grows. It is not necessary to hold the Type I error rate fixed to maintain good frequentist properties.
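A quick Monte Carlo illustrates the universal bound for two simple normal hypotheses. The sample size, the value of k, and the two means are arbitrary choices for this sketch:

```python
import math
import random

random.seed(1)
k, n, reps = 8, 10, 20000
mu_true, mu_alt = 0.0, 1.0   # H1 (true) and H2 under a Normal(mu, 1) model

misleading = 0
for _ in range(reps):
    xs = [random.gauss(mu_true, 1.0) for _ in range(n)]
    # log LR for H2 over H1: sum of (mu2 - mu1)*x - (mu2^2 - mu1^2)/2 terms
    loglr = sum((mu_alt - mu_true) * x - (mu_alt**2 - mu_true**2) / 2 for x in xs)
    if loglr >= math.log(k):
        misleading += 1   # strong evidence for the wrong hypothesis

print(misleading / reps, "<=", 1 / k)
```

The simulated frequency of k-strength misleading evidence falls well below the universal bound of 1/k = 0.125; for these settings the exact probability is closer to 0.01.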
The universal bound applies to both fixed sample size and sequential study designs. In a sequential study, the probability of observing misleading evidence does increase with the number of looks at the data. However, the amount by which it increases shrinks to zero as the sample size grows. As a result, the overall probability can remain bounded (Blume 2007). The probability of observing misleading evidence in a fixed sample size study is less than the corresponding probability in a sequential study, and both are less than 1/k. This is shown below, where the subscript on P denotes the "true" hypothesis.
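The sequential case can be sketched the same way: monitor the LR after every observation and record whether it ever reaches k while the first hypothesis is generating the data. All settings below (k, the maximum number of looks, the two means) are illustrative:

```python
import math
import random

random.seed(2)
k, n_max, reps = 8, 200, 5000
crossed = 0

for _ in range(reps):
    loglr = 0.0
    for _ in range(n_max):
        x = random.gauss(0.0, 1.0)   # data generated under H1: mu = 0
        loglr += x - 0.5             # log LR increment for H2: mu = 1 over H1
        if loglr >= math.log(k):     # a "look" after every observation
            crossed += 1
            break

print(crossed / reps, "<=", 1 / k)
```

Even with a look after every single observation, the frequency of ever observing k-strength misleading evidence stays below 1/k, consistent with the claim that repeated looks inflate the probability only by amounts that shrink toward zero.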