It might seem that rats deciding what they are seeing in a split second has little in common with
scientists deciding about hypotheses on the timescale of months or years. But both of these
require deciding how much evidence to sample and how or when to stop sampling and commit to an action
or conclusion. In both cases it is important to be efficient (minimize the cost or delay of gathering
evidence) but also reliable (minimize errors).
For the case of scientific decisions, the null hypothesis significance test (NHST)
framework dominates experimental biology and psychology. This paradigm requires deciding in advance
how many samples will be collected, collecting exactly that many, performing a statistical test and
then either rejecting or failing to reject a null hypothesis (e.g. no difference between treatment and
control). If preliminary experiments are used to decide if the hypothesis is worth testing (a
quantitative or qualitative judgement about the likelihood that there is a difference), or to
estimate how many samples will be required (power analysis from estimated effect size), these
preliminary data are excluded from use in the NHST experiment. A common violation of that rule
occurs when researchers get an "almost" significant (but not quite significant) result in the NHST,
then add more samples and re-test. This practice is called "N-hacking".
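To make the fixed-sample protocol concrete, here is a minimal MATLAB sketch of a planned experiment; the pilot effect size, power target, and simulated groups are assumptions for illustration, not any particular study.

% Hypothetical MATLAB sketch of the fixed-sample NHST protocol described above.
% The pilot effect size, desired power, and simulated data are illustrative
% assumptions; sampsizepwr and ttest2 require the Statistics and Machine Learning Toolbox.
effectSize = 0.5;                                        % standardized effect size estimated from a pilot study
Nplanned   = sampsizepwr('t2', [0 1], effectSize, 0.80); % samples per group for 80% power at alpha = 0.05

treatment = randn(Nplanned, 1) + effectSize;             % simulated treatment group
control   = randn(Nplanned, 1);                          % simulated control group

[h, p] = ttest2(treatment, control);                     % test once, at the planned N
% h == 1: reject the null hypothesis; h == 0: fail to reject. No further testing.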
We used simulations to explore the impact of "N-hacking" on the trustworthiness of the conclusions
reached thereby, with some surprising results. For example, suppose you start with an initial
planned sample size of Ninit (ranging from 2 to 128, horizontal axis) and set a significance criterion of α = 0.05.
You collect the planned sample and then perform your statistical test. But then, whenever P is "almost" but not
quite significant, you add Nincr more samples and retest (Nincr ranging from 1 to 128, colors).
Presumably there is a practical limit to the total number of samples you can collect, perhaps a numeric cap (e.g. 256 total) or
a multiple of the initial sample (e.g. 5 × Ninit).
Suppose you don't do any correction for sequential sampling or
multiple comparisons: whenever a test yields a "significant" finding at any step, you reject the null hypothesis.
Classic N-hacking. As the graph shows: (1) the false positive rate is higher than 0.05; (2) the larger
your initial sample or the smaller your incremental sample between tests, the worse it is; (3) nevertheless it
asymptotes at some value not much more than α as Ninit gets large and Nincr goes to 1.
The story gets even more interesting when
you take into account the power and positive predictive value of experiments. Spoiler alert: it turns out that the
way we make rapid sensory decisions (by sequential sampling) is smart, and in some parameter regimes N-hacking is an approximation
of sequential sampling.
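For readers who have not met sequential sampling before, the sketch below implements one classic version, Wald's sequential probability ratio test (SPRT), for two assumed simple hypotheses; the error rates and effect size are arbitrary choices meant only to show the accumulate-evidence-until-threshold logic.

% Minimal sketch of a sequential probability ratio test (SPRT); the hypotheses
% H0: x ~ N(0,1) and H1: x ~ N(0.5,1), and the error rates, are illustrative assumptions.
alpha = 0.05;  beta = 0.20;          % tolerated false positive / false negative rates
A = log((1 - beta) / alpha);         % upper threshold: stop and accept H1
B = log(beta / (1 - alpha));         % lower threshold: stop and accept H0
mu1 = 0.5;                           % mean under H1 (the true state in this simulation)

logLR = 0;  n = 0;
while logLR > B && logLR < A
    x = randn + mu1;                 % draw one more observation
    logLR = logLR + (mu1*x - mu1^2/2);   % log likelihood ratio increment for N(mu1,1) vs N(0,1)
    n = n + 1;
end
if logLR >= A
    fprintf('Accept H1 after %d samples\n', n);
else
    fprintf('Accept H0 after %d samples\n', n);
end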
To simulate the effect of N-hacking on the false positive rate yourself, try this
MATLAB code or run it on
CodeOcean.
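In case those links are unavailable, the sketch below runs the same kind of simulation under the null hypothesis; the particular Ninit, Nincr, cap, "almost significant" window, and the use of a two-sample t-test are assumptions you can change, and this is not the code behind the figure.

% Estimate the false positive rate of uncorrected N-hacking by simulation.
% Parameter names follow the text above; the particular values, the "almost
% significant" window, and the two-sample t-test are illustrative assumptions.
rng(1);                              % for reproducibility
Ninit  = 12;                         % initial planned sample size per group
Nincr  = 4;                          % samples added per group after a near miss
Nmax   = 5 * Ninit;                  % practical cap on total samples per group
alpha  = 0.05;                       % significance criterion
pNear  = 0.10;                       % "almost significant": alpha <= P < pNear
nSims  = 20000;                      % simulated experiments, all with the null true

falsePos = 0;
for s = 1:nSims
    x = randn(Ninit, 1);             % both groups drawn from the same distribution
    y = randn(Ninit, 1);
    while true
        [~, p] = ttest2(x, y);
        if p < alpha
            falsePos = falsePos + 1; % declared "significant": a false positive
            break
        elseif p < pNear && numel(x) < Nmax
            x = [x; randn(Nincr, 1)];   % near miss: add Nincr more samples per group
            y = [y; randn(Nincr, 1)];
        else
            break                    % give up: fail to reject the null
        end
    end
end
fprintf('False positive rate with N-hacking: %.3f (nominal alpha = %.2f)\n', ...
    falsePos / nSims, alpha);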