Measuring the Strength of Statistical Evidence

Strength of Evidence

The Propensity for an Evidential Metric to be Misleading or Inconclusive
The probability that observed evidence is mistaken

The first evidential metric is the measure of the strength of evidence in a given body of data; it is the researcher’s essential tool for understanding what the data say. Such measures are typically justified on an axiomatic or intuitive basis. They are usually single-number summaries, but they do not have to be.

Let's say we choose to use metric M. After collecting data, we compute and report M. Examples of metrics that indicate relative support for one hypothesis over another are likelihood ratios, Bayes factors, and divergence measures. Examples of metrics that indicate absolute support for or against a single hypothesis are p-values and posterior probabilities.
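As a rough illustration of a relative-support metric, the sketch below computes a likelihood ratio for two simple hypotheses about the mean of normally distributed data. The normal model, the known standard deviation, and the function name `likelihood_ratio` are assumptions made for this example only; they do not come from the text above.

```python
from math import exp

def likelihood_ratio(data, mu_num, mu_den, sigma):
    """Likelihood ratio L(mu_num) / L(mu_den) for normal data with known sigma.

    Values above 1 favor the numerator hypothesis; values below 1 favor
    the denominator hypothesis. (Illustrative model, not from the source.)
    """
    # Log-likelihood difference; the normalizing constants cancel.
    log_lr = sum((x - mu_den) ** 2 - (x - mu_num) ** 2 for x in data)
    log_lr /= 2 * sigma ** 2
    return exp(log_lr)

# Hypothetical observations, used only to show the call.
observations = [0.2, -0.1, 0.4, 0.0, 0.3]
lr_a_vs_b = likelihood_ratio(observations, mu_num=0.0, mu_den=1.0, sigma=1.0)
```

Here `lr_a_vs_b` plays the role of M: it summarizes how strongly these particular observations favor a mean of 0 over a mean of 1.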

Sometimes the data, as interpreted through M, support a false hypothesis or turn out to be inconclusive. Neither outcome is desired. Studies are designed around, and evaluated by, how often they generate misleading or weak evidence.

Accordingly, the second evidential metric is the propensity to observe undesirable outcomes such as misleading or weak evidence. These are often called "error rates", but the name is a misnomer. The analyst does not make a mistake by interpreting data according to some prescribed evidential metric. Rather, it is the data themselves that are misleading.

For example, suppose we have two competing hypotheses A and B. The classical frequency properties of a study design would be

  • P(M supports Hypothesis A | Hypothesis B)

  • P(M supports Hypothesis B | Hypothesis A)

  • P(M indicates weak evidence about either hypothesis | Hypothesis B)

  • P(M indicates weak evidence about either hypothesis | Hypothesis A)

A good study makes these probabilities as small as possible, and there are many different statistical strategies for doing this. These probabilistic measures describe properties of the experiment.
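The four design probabilities listed above can be computed in closed form for a simple case. The sketch below assumes normal data with known standard deviation, two simple hypotheses about the mean, and a likelihood ratio as M, with "support" declared when the ratio reaches a benchmark `k`. The benchmark, the model, and the function name `design_properties` are all assumptions for illustration, not something the text prescribes.

```python
from math import log, sqrt
from statistics import NormalDist

def design_properties(mu_a, mu_b, sigma, n, k):
    """Frequency properties of a likelihood-ratio metric, assuming mu_b > mu_a.

    M supports B when L(B)/L(A) >= k and supports A when L(B)/L(A) <= 1/k;
    anything in between counts as weak evidence. (Illustrative setup.)
    """
    se = sigma / sqrt(n)                    # standard error of the sample mean
    delta = mu_b - mu_a
    mid = (mu_a + mu_b) / 2
    shift = sigma ** 2 * log(k) / (n * delta)
    upper = mid + shift                     # sample mean >= upper: supports B
    lower = mid - shift                     # sample mean <= lower: supports A
    under_a = NormalDist(mu_a, se)          # sampling distribution if A is true
    under_b = NormalDist(mu_b, se)          # sampling distribution if B is true
    return {
        "supports_A_given_B": under_b.cdf(lower),
        "supports_B_given_A": 1 - under_a.cdf(upper),
        "weak_given_A": under_a.cdf(upper) - under_a.cdf(lower),
        "weak_given_B": under_b.cdf(upper) - under_b.cdf(lower),
    }

# Hypothetical design: 20 observations, benchmark k = 8.
props = design_properties(mu_a=0.0, mu_b=1.0, sigma=1.0, n=20, k=8)
```

Changing `n` in this sketch shows how a design choice moves all four probabilities at once, which is the sense in which they are properties of the experiment rather than of any one data set.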

Once the data are collected, we compute and report M. We will know if the evidence is strong or weak (inconclusive) from the observed value of M. However, we will never know if M is misleading or not. (Only strong evidence can be misleading. Weak evidence, by definition, is not impactful enough to mislead.)

The third evidential metric is the propensity for observed evidence to be misleading. It is an essential complement to the first evidential metric. Ideally, one would report the observed metric to describe the observed strength of the evidence, along with the chance that the observed results are mistaken. This third evidential metric is known as a false discovery rate; it is a property of the observed data.

Suppose we have two competing hypotheses A and B. The false discovery rates are

  • P(Hypothesis B | M supported Hypothesis A)

  • P(Hypothesis A | M supported Hypothesis B)

These probabilities often require the use of Bayes' theorem in order to be computed, and that presents special problems. Once data are observed, it is the false discovery rates that are the relevant assessments of uncertainty. The original frequency properties of the study design - the error rates - are no longer relevant. Failure to distinguish between these evidential metrics leads to circular reasoning and irresolvable confusion about the interpretation of results as statistical evidence.
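To make the role of Bayes' theorem concrete, the sketch below computes P(Hypothesis B | M supported Hypothesis A) from the two design probabilities and a prior probability for A. It assumes that A and B are the only two possibilities, so that P(B) = 1 - P(A); the function name and the numbers in the example are hypothetical.

```python
def false_discovery_rate(p_support_if_false, p_support_if_true, prior_true):
    """P(supported hypothesis is false | M supported it), via Bayes' theorem.

    p_support_if_false: P(M supports A | B), a design "error rate"
    p_support_if_true:  P(M supports A | A)
    prior_true:         P(A) before seeing data (an assumed input)
    Assumes A and B are exhaustive, so P(B) = 1 - P(A).
    """
    prior_false = 1 - prior_true
    numerator = p_support_if_false * prior_false
    denominator = numerator + p_support_if_true * prior_true
    return numerator / denominator

# Same design probabilities, two different priors (made-up values).
fdr_even_prior = false_discovery_rate(0.05, 0.80, prior_true=0.5)
fdr_skeptical = false_discovery_rate(0.05, 0.80, prior_true=0.1)
```

With these made-up inputs the design probabilities are identical in both calls, yet the false discovery rate is about 0.06 under the even prior and 0.36 under the skeptical one. The dependence on a prior is one reason the calculation cannot be read off the error rates alone.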

False Discovery Rates

Error Rates