Scientific Decision-Making

Reinagel P (2023) Is N-Hacking Ever OK? The consequences of collecting more data in pursuit of statistical significance. PLoS Biol 21(11): e3002345. Code:

It might seem that rats deciding what they are seeing in a split second has little in common with scientists deciding about hypotheses on the timescale of months or years. But both of these require deciding how much evidence to sample and how or when to stop sampling and commit to an action or conclusion. In both cases it is important to be efficient (minimize the cost or delay of gathering evidence) but also reliable (minimize errors).

For the case of scientific decisions, the null hypothesis significance test (NHST) framework dominates experimental biology and psychology. This paradigm requires deciding in advance how many samples will be collected, collecting exactly that many, performing a statistical test, and then either rejecting or failing to reject a null hypothesis (e.g., no difference between treatment and control). If preliminary experiments are used to decide whether the hypothesis is worth testing (a quantitative or qualitative judgement about the likelihood that there is a difference), or to estimate how many samples will be required (a power analysis based on the estimated effect size), those preliminary data are excluded from the final statistical test. A common violation of that rule is for researchers who get an "almost" significant (but not significant) result to add more samples and re-test; this practice is a form of "P-hacking", or more specifically "N-hacking".
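As a concrete illustration of that fixed-N workflow, here is a minimal MATLAB sketch (not from the paper; the effect size, power target, and the use of sampsizepwr and ttest2 from the Statistics and Machine Learning Toolbox are illustrative assumptions):

```matlab
% Standard fixed-N NHST workflow: choose N in advance from a power analysis,
% collect exactly N samples per group, test once, and stop.
% (Illustrative sketch; all numbers are hypothetical.)
effectSize  = 0.5;    % standardized effect size estimated from preliminary data
alpha       = 0.05;   % significance criterion
targetPower = 0.80;   % desired statistical power

% Power analysis: per-group sample size for a two-sample t-test.
N = sampsizepwr('t2', [0 1], effectSize, targetPower);

% Collect exactly N samples per group (simulated here), run one test, stop.
treatment = effectSize + randn(N,1);
control   = randn(N,1);
[~, p] = ttest2(treatment, control);
fprintf('N per group = %d, p = %.3f, reject H0 = %d\n', N, p, p < alpha);
```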

We used simulations to explore the impact of "N-hacking" on the trustworthiness of the conclusions reached this way, with some surprising results. For example, suppose you start with an initial planned sample size Ninit (ranging from 2 to 128, horizontal axis) and set a significance criterion of α = 0.05. You collect the planned sample and then perform your statistical test. But whenever P is "almost" but not quite significant, you add Nincr more samples and re-test (Nincr ranging from 1 to 128, colors). Presumably there is a practical limit to the total number of samples you can collect, perhaps a fixed cap (e.g. 256 total) or a multiple of the initial sample (e.g. 5×Ninit).
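Here is a minimal sketch of that incremental sampling rule for a single experiment (not the paper's code; the "promising" p-value window and all parameter values are illustrative assumptions, and the data are drawn under a true null):

```matlab
% One simulated experiment under the N-hacking rule described above.
% Both groups are drawn from the same distribution, so the null is true.
rng('shuffle');
Ninit      = 8;        % initial planned sample size per group
Nincr      = 4;        % samples added per group after an "almost" significant result
Nmax       = 5*Ninit;  % practical cap on the total sample per group
alpha      = 0.05;     % significance criterion
pPromising = 0.10;     % re-test only when alpha <= P < pPromising (assumed window)

n = Ninit;
x = randn(n,1);        % "treatment" group
y = randn(n,1);        % "control" group
[~, p] = ttest2(x, y);
% Keep adding Nincr samples per group and re-testing while the result is
% "almost" significant and the cap has not been reached.
while p >= alpha && p < pPromising && n < Nmax
    x = [x; randn(Nincr,1)]; %#ok<AGROW>
    y = [y; randn(Nincr,1)]; %#ok<AGROW>
    n = n + Nincr;
    [~, p] = ttest2(x, y);
end
fprintf('Final n per group = %d, final p = %.3f\n', n, p);
```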

Suppose you don't apply any correction for sequential sampling or multiple comparisons, and if a test yields a "significant" finding at any step, you reject the null hypothesis. Classic N-hacking. As the graph shows: (1) the false positive rate is higher than 0.05; (2) the larger your initial sample or the smaller your incremental sample between tests, the worse it is; (3) nevertheless, it asymptotes at a value not much more than α even as Ninit gets large and Nincr goes to 1. The story gets even more interesting when you take into account the power and positive predictive value of the experiments.

To simulate the effect of N-hacking on the false positive rate yourself, try this MATLAB code:
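As a stand-in, here is a minimal Monte Carlo sketch of such a simulation (not the paper's published code; the number of simulated experiments, the "promising" p-value window, and the sample cap are illustrative assumptions). Varying Ninit and the Nincr values reproduces the qualitative trends described above.

```matlab
% Monte Carlo estimate of the false positive rate under uncorrected N-hacking.
% The null is true in every simulated experiment, so any "significant" result
% is a false positive. (Illustrative sketch; all parameter values are assumptions.)
rng(1);                        % for reproducibility
nExperiments = 10000;          % simulated experiments per condition
Ninit        = 8;              % initial planned sample size per group
NincrList    = [1 2 4 8];      % increments to compare (cf. colors in the figure)
Nmax         = 5*Ninit;        % practical cap on total samples per group
alpha        = 0.05;           % nominal significance criterion
pPromising   = 0.10;           % re-test only when alpha <= P < pPromising

for Nincr = NincrList
    falsePos = 0;
    for k = 1:nExperiments
        n = Ninit;
        x = randn(n,1);        % "treatment" group, drawn under the null
        y = randn(n,1);        % "control" group
        [~, p] = ttest2(x, y);
        % Uncorrected N-hacking: keep adding samples and re-testing while the
        % result is "almost" significant and the cap has not been reached.
        while p >= alpha && p < pPromising && n < Nmax
            x = [x; randn(Nincr,1)]; %#ok<AGROW>
            y = [y; randn(Nincr,1)]; %#ok<AGROW>
            n = n + Nincr;
            [~, p] = ttest2(x, y);
        end
        falsePos = falsePos + (p < alpha);   % significant at some step => reject
    end
    fprintf('Ninit = %d, Nincr = %d: false positive rate = %.4f\n', ...
            Ninit, Nincr, falsePos/nExperiments);
end
```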