Multiple comparisons problem

In statistics, the multiple comparisons problem occurs when one subjects a number of independent observations to the same acceptance criterion that would be used when considering a single event.

Typically, an acceptance criterion for a single event takes the form of a requirement that the observed data be highly unlikely under a default assumption (null hypothesis). As the number of independent applications of the acceptance criterion grows, it becomes increasingly likely that one will observe data that satisfy the acceptance criterion by chance alone (even if the default assumption is true in all cases). These errors are considered false positives because they positively identify a set of observations as satisfying the acceptance criterion when the data in fact arise under the null hypothesis. Many mathematical techniques have been developed to control the false positive error rate associated with making multiple statistical comparisons.

Flipping coins

For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the likelihood that a fair coin would come up heads at least 9 out of 10 times is (10 + 1)/2¹⁰ = 11/2¹⁰ ≈ 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would declare that the null hypothesis should be rejected, i.e. that the coin is unfair.

A multiple comparisons problem arises if one wants to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine testing 100 fair coins by this method. Given that the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, seeing a particular (i.e. pre-selected) coin do so would still be very unlikely, but seeing some coin, no matter which one, behave that way would be more likely than not. Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)¹⁰⁰ ≈ 0.34, so the likelihood that at least one of them is identified as unfair is about 0.66. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would more likely than not falsely identify at least one fair coin as unfair.
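The figures in the coin example can be checked with a few lines of Python. This is only an illustrative sketch: the function name tail_probability and the variable names are chosen here and are not part of any standard library.

```python
from math import comb

def tail_probability(flips: int, min_heads: int) -> float:
    """P(at least `min_heads` heads in `flips` tosses of a fair coin)."""
    favourable = sum(comb(flips, k) for k in range(min_heads, flips + 1))
    return favourable / 2 ** flips

p_single = tail_probability(10, 9)   # 11 / 2**10 for one fair coin
p_all_pass = (1 - p_single) ** 100   # none of 100 fair coins is flagged
p_any_flagged = 1 - p_all_pass       # at least one fair coin is flagged

print(f"{p_single:.4f} {p_all_pass:.2f} {p_any_flagged:.2f}")
# prints: 0.0107 0.34 0.66
```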

Formalism

Technically, the problem of multiple comparisons (also known as the multiple testing problem) can be described as the potential increase in Type I error that occurs when statistical tests are used repeatedly. If n independent comparisons are performed, each at the same per-comparison significance level, the experiment-wide significance level α (alpha) is given by

\alpha =1-\left(1-\alpha _{\mathrm {per\ comparison} }\right)^{\mbox{number of comparisons}}

and it increases as the number of comparisons increases. If we do not assume that the comparisons are independent, then we can still say:

\alpha \leq \alpha _{\mathrm {per\ comparison} }\times {\mbox{number of comparisons}}.
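As a rough numerical illustration of the two formulas above, the following sketch tabulates the experiment-wide level for independent comparisons next to the general upper bound. The function names are hypothetical, and capping the bound at 1 is an addition made here because a probability cannot exceed 1.

```python
def familywise_alpha(alpha_per_comparison: float, n: int) -> float:
    """Experiment-wide level for n independent comparisons."""
    return 1 - (1 - alpha_per_comparison) ** n

def familywise_bound(alpha_per_comparison: float, n: int) -> float:
    """Upper bound that holds without assuming independence (capped at 1)."""
    return min(1.0, alpha_per_comparison * n)

for n in (1, 5, 10, 20, 100):
    exact = familywise_alpha(0.05, n)
    bound = familywise_bound(0.05, n)
    print(f"n={n:3d}  independent: {exact:.3f}  bound: {bound:.3f}")
```

For a per-comparison level of 0.05 the experiment-wide level already exceeds one half at n = 20.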

Methods

In order to retain the same overall rate of false positives (rather than a higher rate) in a test involving more than one comparison, the standards for each comparison must be more stringent. Intuitively, dividing the allowable error (alpha) for each comparison by the number of comparisons will result in an overall alpha which does not exceed the desired limit, and this can be mathematically proved to be true. (In other words, 1 − (1 − α/n)ⁿ < α for n > 1.) For instance, to obtain the usual alpha of 0.05 with ten comparisons, we can use an alpha of 0.005 for each comparison to result in an overall alpha of about 0.049.
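The adjustment described above (dividing alpha by the number of comparisons) can be sketched as follows. The function names and the p-values are hypothetical and serve only to illustrate the arithmetic.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag each p-value that is at most alpha / (number of comparisons)."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def overall_alpha(per_comparison_alpha, n):
    """Overall level when n independent comparisons share one per-comparison level."""
    return 1 - (1 - per_comparison_alpha) ** n

n = 10
print(f"{0.05 / n:.3f} {overall_alpha(0.05 / n, n):.3f}")
# prints: 0.005 0.049

# Hypothetical p-values from ten comparisons; only those <= 0.005 are flagged.
p_values = [0.001, 0.004, 0.012, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9]
print(bonferroni_reject(p_values))
```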

Not compensating for multiple comparisons can have important real-world consequences. For instance, if a new drug is compared to an existing drug or placebo over a range of 20 possible side-effects, it could easily happen by chance that the new drug appears to be worse for some side-effect when it is actually not worse for this side-effect. Such a false positive may result in failure to approve a drug which is in fact superior to existing drugs, thereby both depriving the world of an improved therapy and causing the drug company to lose the substantial investment in research and development up to that point.

Some simple techniques for compensating for multiple comparisons can be too conservative. For this reason, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily. Such methods can be divided into general categories:

Methods where total alpha can be proved to never exceed 0.05 (or some other chosen value) under any conditions. These methods provide "strong" control against Type I error, in all conditions including a partially correct null hypothesis.
Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions.
Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA/Tukey range test before proceeding to multiple comparisons. These methods have "weak" control of Type I error.
Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.

The advent of computerized resampling methods, such as bootstrapping and Monte Carlo simulations, has given rise to many techniques in the last category. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.
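As one possible illustration of a resampling-based approach, the sketch below uses a single-step max-statistic permutation adjustment for a two-group comparison across several variables. The choice of this particular technique, the use of the absolute difference in means as the statistic, the random (rather than exhaustive) permutations, and all names and data are assumptions made here for illustration.

```python
import random
from statistics import mean

def max_stat_adjusted_p(group_a, group_b, n_perm=1000, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum statistic.

    group_a, group_b: lists of rows, each row holding one value per variable.
    """
    rng = random.Random(seed)
    rows = group_a + group_b
    n_a = len(group_a)
    n_vars = len(rows[0])

    def abs_mean_diffs(sample):
        # Absolute difference in means between the first n_a rows and the rest.
        first, second = sample[:n_a], sample[n_a:]
        return [abs(mean(r[j] for r in first) - mean(r[j] for r in second))
                for j in range(n_vars)]

    observed = abs_mean_diffs(rows)
    exceed = [0] * n_vars
    for _ in range(n_perm):
        shuffled = rows[:]
        rng.shuffle(shuffled)            # reassign rows to the two groups at random
        max_stat = max(abs_mean_diffs(shuffled))
        for j in range(n_vars):
            if max_stat >= observed[j]:
                exceed[j] += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return [(count + 1) / (n_perm + 1) for count in exceed]

# Hypothetical data: variable 0 differs strongly between groups, variable 1 does not.
group_a = [[0.1 * i, 0.5 * (i % 3)] for i in range(8)]
group_b = [[10 + 0.1 * i, 0.5 * ((i + 1) % 3)] for i in range(8)]
adjusted = max_stat_adjusted_p(group_a, group_b)
print(adjusted)
```

Because every variable is compared against the same distribution of maxima, correlation between variables is taken into account automatically, which is the adaptive behaviour described for the empirical methods.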

Post hoc testing of ANOVAs

Multiple comparison procedures are commonly used after obtaining a significant omnibus test, like the ANOVA F-test. The significant ANOVA result suggests rejecting the global null hypothesis H₀ = "means are the same". Multiple comparison procedures are then used to determine which means are different from which.

Comparing K means involves K(K − 1)/2 pairwise comparisons.
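The count K(K − 1)/2 can be confirmed by enumerating the pairs directly. In this short sketch the group labels are placeholders.

```python
from itertools import combinations
from math import comb

def pairwise_comparisons(groups):
    """All unordered pairs of groups, i.e. every pairwise comparison."""
    return list(combinations(groups, 2))

groups = ["A", "B", "C", "D"]     # hypothetical group labels, K = 4
pairs = pairwise_comparisons(groups)
k = len(groups)

print(len(pairs), k * (k - 1) // 2, comb(k, 2))
# prints: 6 6 6
print(pairs)
```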

The non-parametric Friedman test can be used as the omnibus test when the assumptions of ANOVA are not met.

The Nemenyi test is similar to the ANOVA Tukey test.

The Bonferroni-Dunn test allows each group to be compared with a control.

Large-scale multiple testing

For large-scale multiple testing (for example, as is very common in genomics when using technologies such as DNA microarrays) one can instead control the false discovery rate (FDR), defined to be the expected proportion of false positives among all significant tests.
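One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The text above does not prescribe a specific method, so the sketch below is only an illustration under that assumption; the function name, the target level q, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, p-values sorted ascending)
    with p_(k) <= q * k / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))
```

Unlike a per-comparison cut-off, the threshold grows with the rank of the p-value, so a p-value may be rejected even if a smaller one misses its own threshold, provided some larger-ranked p-value meets its threshold.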
