Why p = 0.05 doesn't always mean p = 0.05
In scientific research, seeing p < 0.05 is often celebrated as evidence of a “statistically significant” finding. However, this common interpretation can be misleading, especially when multiple comparisons are made or the model is misspecified. Here’s why the familiar p-value threshold doesn’t always guarantee that the Type I error rate (the probability of falsely rejecting a true null hypothesis) is actually capped at 0.05.
Imagine you’re testing the effect of a treatment across 20 different outcomes. If you apply a p < 0.05 threshold to each test individually, the chance of finding at least one false positive (a Type I error) increases dramatically. This is because the more comparisons you make, the greater the likelihood that one of them will produce a statistically significant result purely by chance.
In fact, the probability of making at least one Type I error across 20 independent tests is not 0.05 but about 64% (1 − 0.95^20 ≈ 0.64). Researchers can control this inflation by adjusting the per-test threshold (e.g., with a Bonferroni correction), but many fail to do so. Without adjustment, claiming significance at p < 0.05 under multiple comparisons grossly understates the familywise error rate.
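The arithmetic above can be sketched in a few lines. This is a minimal illustration (the choices of 20 tests and alpha = 0.05 simply mirror the example in the text):

```python
# Familywise error rate (FWER) across m independent tests, each at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
m = 20

fwer = 1 - (1 - alpha) ** m
print(f"Unadjusted FWER over {m} tests: {fwer:.3f}")  # ~0.642, not 0.05

# Bonferroni correction: test each hypothesis at alpha / m instead
bonferroni_alpha = alpha / m
fwer_adjusted = 1 - (1 - bonferroni_alpha) ** m
print(f"Per-test threshold after Bonferroni: {bonferroni_alpha:.4f}")
print(f"FWER with Bonferroni: {fwer_adjusted:.3f}")  # stays below 0.05
```

The Bonferroni correction is conservative (it guarantees the familywise rate stays at or below alpha even under dependence between tests), which is why less strict alternatives such as false-discovery-rate procedures are also common.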
Even when only one hypothesis is tested, the assumption that p < 0.05 limits the Type I error rate to 0.05 only holds if the statistical model is properly specified. This means that all relevant factors are accounted for and the functional form of the model aligns with the underlying data-generating process. If the model is misspecified—whether by omitting important variables, mis-modeling relationships, or assuming inappropriate distributions—the nominal p-value no longer reflects the true Type I error rate.
An improperly specified model can distort the results, producing spurious findings in which the p-value suggests significance even though the assumptions behind it are violated. In such cases, p < 0.05 may appear to indicate a well-controlled Type I error rate, but the actual rate can be much higher.
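A short simulation makes this concrete. Below is a hypothetical sketch of one classic form of misspecification, an omitted confounder: the outcome y depends only on an unobserved variable z, the "treatment" x is correlated with z but has no effect on y (so the null is true), yet a regression of y on x alone rejects the null far more than 5% of the time. The data-generating choices (effect sizes, sample size) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims, alpha = 100, 2000, 0.05
false_positives = 0

for _ in range(n_sims):
    z = rng.normal(size=n)          # unobserved confounder
    x = z + rng.normal(size=n)      # "treatment": correlated with z
    y = 2 * z + rng.normal(size=n)  # outcome: depends on z, NOT on x

    # Misspecified model: regress y on x, omitting z
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    t_stat = beta[1] / se
    if abs(t_stat) > 1.984:  # two-sided critical value, t with 98 df, alpha = 0.05
        false_positives += 1

print(f"Empirical Type I error rate: {false_positives / n_sims:.2f}")
```

Under this setup the empirical rejection rate is far above the nominal 0.05, even though x truly has no effect: the omitted confounder biases the estimated coefficient, and the p-values from the misspecified model are meaningless as error-rate guarantees.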
While the threshold of p < 0.05 is a useful convention, it is crucial to recognize its limitations. When multiple comparisons are made without correction, or when models are misspecified, the supposed control over Type I errors can vanish. As scientists, particularly in fields like epidemiology, it’s vital to approach p-values with caution and ensure rigorous checks on both the number of comparisons and model validity.
Statistical significance is just one piece of the puzzle—robust science requires understanding the context behind the numbers.
