Statistical Evidence | Error Rate and FDR

Introduction

Error Rates vs. False Discovery Rates

Understanding the difference between an ‘error’ rate and a ‘false discovery’ rate is essential. Error rates are properties of the study design; false discovery rates are properties of the observed results. Error rates are used to improve the rigor of an experiment; false discovery rates are used to assess the reliability of an observed result.

While this is a relatively new idea to statisticians, it has long been understood by physicians. The performance of a diagnostic test is evaluated by its sensitivity and specificity (the complements of the ‘error’ rates), while the reliability of an observed test result is given by the test’s positive predictive value and negative predictive value (the complements of the false discovery rates).

Example

Illustration by Simulation

Suppose we collect 14 observations from a normal distribution where the standard deviation is known to be 1. The null hypothesis is that the population mean is 0 and the alternative hypothesis is that the population mean is ½.

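To perform a hypothesis test, we need a rejection rule. Let’s reject the null when the sample mean is greater than ½. I repeated the test 1,000,000 times drawing data according to the alternative hypothesis and another 1,000,000 times drawing data according to the null hypothesis. Each repetition consisted of drawing 14 observations and then checking whether the sample mean was greater than ½. Here are the results, as counts of repetitions:

Sample mean > ½ (reject the null): 30,668 null repetitions; 499,350 alternative repetitions; 530,018 in total
Sample mean ≤ ½ (do not reject): 969,332 null repetitions; 500,650 alternative repetitions; 1,469,982 in total
All repetitions: 1,000,000 null; 1,000,000 alternative; 2,000,000 in total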

The simulated Type I Error rate is 0.031 (= 30,668 / 1,000,000), which matches the theoretical error rate 0.031 = P(sample mean > ½ | population mean is 0). Likewise, the simulated power is 0.499 (= 499,350 / 1,000,000), which matches the theoretical power 0.5 = P(sample mean > ½ | population mean is ½). This close agreement is not surprising given the large number of repetitions. So far, so good.
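The comparison above can be reproduced on a small scale. The following is a minimal Python sketch, not the R code linked at the bottom of the page; the names (`rejection_rate`, `CUTOFF`, and so on), the seed, and the reduced number of repetitions are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

N_OBS = 14                    # observations per repetition (from the text)
SIGMA = 1.0                   # known standard deviation (from the text)
CUTOFF = 0.5                  # reject the null when the sample mean exceeds this
MU_NULL, MU_ALT = 0.0, 0.5    # population means under the two hypotheses


def rejection_rate(mu, reps, rng):
    """Fraction of repetitions whose sample mean exceeds CUTOFF."""
    rejections = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(mu, SIGMA) for _ in range(N_OBS)) / N_OBS
        if sample_mean > CUTOFF:
            rejections += 1
    return rejections / reps


# Theoretical rates: the sample mean is normal with sd SIGMA / sqrt(N_OBS).
se = SIGMA / sqrt(N_OBS)
theory_type1 = 1 - NormalDist(MU_NULL, se).cdf(CUTOFF)
theory_power = 1 - NormalDist(MU_ALT, se).cdf(CUTOFF)

rng = random.Random(2024)     # arbitrary seed, assumed here for reproducibility
reps = 20_000                 # assumption: far fewer than the 1,000,000 in the text
sim_type1 = rejection_rate(MU_NULL, reps, rng)
sim_power = rejection_rate(MU_ALT, reps, rng)

print(f"Type I error: simulated {sim_type1:.3f}, theoretical {theory_type1:.3f}")
print(f"Power:        simulated {sim_power:.3f}, theoretical {theory_power:.3f}")
```

With fewer repetitions the simulated rates will wander a little more around the theoretical values than the figures quoted above.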

A false discovery rate (FDR) is the propensity for an observed result to be mistaken. For example, out of all the repetitions that led to a rejection of the null hypothesis, what fraction came from a null simulation? From the table we see this is 0.0578 (= 30,668 / 530,018), which matches the theoretical FDR: 0.058 = P(population mean is 0 | sample mean > ½) = 1 / (1 + 0.5 / 0.031). Likewise, the simulated false confirmation rate (FCR) is 0.341 (= 500,650 / 1,469,982), which matches its theoretical probability: 0.340 = P(population mean is ½ | sample mean ≤ ½) = 1 / (1 + (1 - 0.031) / (1 - 0.5)). Bayes’ theorem gives these probabilities (their formulae are at the bottom), but it is just as easy to pull them off the table. Note that the table reflects the assumed 50-50 prevalence. In practice, computing an FDR means making an assumption about the prior probabilities of the hypotheses (more on this in a bit). Nevertheless, this challenge should not diminish the FDR’s important inferential role.
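To make the Bayes’ theorem step concrete, here is a small Python sketch that computes the FDR and FCR from the two error rates and an assumed prior probability of the null, then compares them with the counts quoted above. The function name `fdr_and_fcr` and its parameters are hypothetical, chosen only for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def fdr_and_fcr(type1, power, prior_null=0.5):
    """Return (FDR, FCR) via Bayes' theorem for a given prior P(null)."""
    prior_alt = 1 - prior_null
    reject_and_null = type1 * prior_null            # false rejections
    reject_and_alt = power * prior_alt              # true rejections
    retain_and_null = (1 - type1) * prior_null      # true non-rejections
    retain_and_alt = (1 - power) * prior_alt        # false non-rejections
    fdr = reject_and_null / (reject_and_null + reject_and_alt)
    fcr = retain_and_alt / (retain_and_alt + retain_and_null)
    return fdr, fcr


# Error rates for the example: n = 14, sd = 1, reject when sample mean > 1/2.
se = 1 / sqrt(14)
type1 = 1 - NormalDist(0.0, se).cdf(0.5)
power = 1 - NormalDist(0.5, se).cdf(0.5)
theory_fdr, theory_fcr = fdr_and_fcr(type1, power)   # 50-50 prevalence assumed

# The same rates read directly off the simulation counts quoted in the text.
table_fdr = 30_668 / (30_668 + 499_350)
table_fcr = 500_650 / 1_469_982

print(f"FDR: theoretical {theory_fdr:.4f}, from counts {table_fdr:.4f}")
print(f"FCR: theoretical {theory_fcr:.4f}, from counts {table_fcr:.4f}")
```

Passing a different `prior_null` shows how strongly both rates depend on the assumed prevalence.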

While the error rates themselves do not reflect the tendency of observed results to be mistaken, they certainly impact that tendency. Even though the error rates are no longer interpretable after collecting data, their influence is reflected in the FDRs.

To drive home the different roles of these rates, I’ve graphed the simulated data in two different ways. The graph on the left is the traditional representation. It shows the sampling distributions of the sample mean under the null and alternative hypotheses. The area to the right of the rejection line is the simulated Type I Error rate and power.

This classic representation (left) presupposes knowledge of which sample mean goes with which hypothesis. Of course, in practice, we never know this. As a result, this graph is not very useful for illustrating statistical inference from observed data.

The graph on the right is a better representation of the problem at hand. The histogram displays all the simulated sample means as a single group; the black line is a smoothed density estimator for the mixture distribution (it’s a mixture distribution because the null and alternative repetitions are mixed together). The purple area represents all the instances where the sample mean was greater than ½ (530,018 of them). The blue and red lines represent the competing hypotheses.

From this graph, we can see why the FDR is the relevant reliability measure for observed sample means. Of all the sample means greater than ½, only 0.0578 = (30,668 / 2,000,000) / (530,018 / 2,000,000) came from the null distribution.

This is the essence of the false discovery rate; it conditions on the observed data and infers back to the hypotheses in question. Error rates do exactly the opposite. Note that the FDR is the ratio of purple areas under the blue and black lines.

We can also see why sample means less than ½ are not as reliable. It turns out that 34.1% = (500,650 / 2,000,000) / (1,469,982 / 2,000,000) of them were generated under the alternative hypothesis. This false confirmation rate is the ratio of white areas under the red and black lines.

How the results are reported also makes a difference to the false discovery rate. The example has so far treated the result of the hypothesis test as the summary of the data. But what if more information were reported?

Suppose we observed a sample mean of 0.6. While the test still rejects, we now have additional information and the FDR will change. The new false discovery rate is the probability that the population mean is zero given the observed sample mean of 0.6. This probability is represented by the ratio of curve heights (blue line at 0.6 to black line at 0.6) instead of a ratio of areas. The FDR increases to 0.079 (from 0.0578). This false discovery rate has a special name: the local false discovery rate. The adjective ‘local’ is meant to emphasize that the rate pertains to a specific result, namely sample means equal to 0.6.

We can verify the local rate using our simulated data. There were 60,348 sample means between 0.58 and 0.62, of which 4,811 came from null simulations (none were exactly 0.6, so we need to consider a small range). The simulated FDR was 0.080 (= 4,811 / 60,348), which matches its theoretical value of 0.079 = P(population mean is 0 | sample mean = 0.6) = 1 / (1 + 11.588).

Where does the 11.588 come from? It is the likelihood ratio for the alternative hypothesis versus the null, based on the observed sample mean. Mathematically, it is LR = P(sample mean = 0.6 | population mean is ½) / P(sample mean = 0.6 | population mean is 0) = 11.588. A close approximation can be obtained from our simulated data: P(0.58 < sample mean < 0.62 | population mean is ½) / P(0.58 < sample mean < 0.62 | population mean is 0) = (55,537 / 1,000,000) / (4,811 / 1,000,000) = 11.54.
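The likelihood ratio and the local FDR can be checked numerically. This is a minimal Python sketch under the example’s assumptions (14 observations, known standard deviation 1, equal prior probabilities by default); the helper names `likelihood_ratio` and `local_fdr` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

se = 1 / sqrt(14)                      # standard error of the sample mean
null_dist = NormalDist(0.0, se)        # sampling distribution under the null
alt_dist = NormalDist(0.5, se)         # sampling distribution under the alternative


def likelihood_ratio(sample_mean):
    """Density under the alternative divided by density under the null."""
    return alt_dist.pdf(sample_mean) / null_dist.pdf(sample_mean)


def local_fdr(sample_mean, prior_null=0.5):
    """P(null | observed sample mean) for an assumed prior P(null)."""
    prior_odds_alt = (1 - prior_null) / prior_null
    return 1 / (1 + prior_odds_alt * likelihood_ratio(sample_mean))


lr = likelihood_ratio(0.6)
lfdr = local_fdr(0.6)

# Approximation from the simulated counts quoted in the text (0.58 to 0.62 window).
approx_lr = 55_537 / 4_811
approx_lfdr = 4_811 / 60_348

print(f"LR at 0.6:        exact {lr:.3f}, from counts {approx_lr:.2f}")
print(f"Local FDR at 0.6: exact {lfdr:.3f}, from counts {approx_lfdr:.3f}")
```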

Lastly, to illustrate the role of the prevalence in computing the FDR, I ran the simulation again but with only 200,000 null simulations. Now 83% (10/12) of the repetitions are from alternative simulations. The histogram displays these results. Clearly the red alternative hypothesis better fits the mixture distribution of the observed sample means.
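The effect of prevalence can also be worked out without rerunning the simulation. The Python sketch below uses the theoretical error rates and the two simulation sizes mentioned above; the function name `fdr_at_prevalence` is hypothetical, and the values it computes are expected rates, not figures reported in the text.

```python
from math import sqrt
from statistics import NormalDist

se = 1 / sqrt(14)
type1 = 1 - NormalDist(0.0, se).cdf(0.5)   # P(reject | null)
power = 1 - NormalDist(0.5, se).cdf(0.5)   # P(reject | alternative)


def fdr_at_prevalence(n_null, n_alt):
    """Expected share of rejections that come from null repetitions."""
    expected_false = type1 * n_null    # expected rejections from null runs
    expected_true = power * n_alt      # expected rejections from alternative runs
    return expected_false / (expected_false + expected_true)


fdr_balanced = fdr_at_prevalence(1_000_000, 1_000_000)    # original 50-50 mix
fdr_mostly_alt = fdr_at_prevalence(200_000, 1_000_000)    # only 200,000 null runs

print(f"FDR with 50-50 prevalence:           {fdr_balanced:.4f}")
print(f"FDR with 10/12 alternative prevalence: {fdr_mostly_alt:.4f}")
```

The error rates are identical in both scenarios; only the mix of hypotheses changes, and the FDR changes with it.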

Miscellanea

Formulae and Code
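The Bayes’ theorem calculations used in the example can be written compactly as follows. The notation is an assumption made here for illustration: \alpha is the Type I Error rate, 1 - \beta is the power, \pi_0 is the prior probability of the null hypothesis, and LR is the likelihood ratio for the alternative versus the null at the observed sample mean.

```latex
\begin{align*}
\mathrm{FDR} &= P(H_0 \mid \text{reject})
  = \frac{\alpha\,\pi_0}{\alpha\,\pi_0 + (1-\beta)(1-\pi_0)}
  = \left[\,1 + \frac{1-\beta}{\alpha}\cdot\frac{1-\pi_0}{\pi_0}\right]^{-1} \\[6pt]
\mathrm{FCR} &= P(H_1 \mid \text{do not reject})
  = \frac{\beta\,(1-\pi_0)}{\beta\,(1-\pi_0) + (1-\alpha)\,\pi_0}
  = \left[\,1 + \frac{1-\alpha}{\beta}\cdot\frac{\pi_0}{1-\pi_0}\right]^{-1} \\[6pt]
\text{local FDR} &= P(H_0 \mid \bar{x})
  = \left[\,1 + \mathrm{LR}(\bar{x})\cdot\frac{1-\pi_0}{\pi_0}\right]^{-1}
\end{align*}
```

With \pi_0 = ½, \alpha = 0.031, and \beta = 0.5, these reduce to the expressions 1 / (1 + 0.5 / 0.031), 1 / (1 + (1 - 0.031) / (1 - 0.5)), and 1 / (1 + 11.588) used in the example.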

Github page with R code for simulations