
Improving reproducibility in animal research by splitting the study population into several ‘mini-experiments’
In light of the hotly discussed ‘reproducibility crisis’, a rethinking of current methodologies appears essential. Implementing multi-laboratory designs has been shown to enhance the external validity and hence the reproducibility of findings from animal research. We here aimed to propose a new experimental strategy that transfers this logic into a single-laboratory setting. We systematically introduced heterogeneity into our study population by splitting an experiment into several ‘mini-experiments’ spread over different time points a few weeks apart. We hypothesised that such a ‘mini-experiment’ design would yield better reproducibility than a conventionally standardised design, in which all animals are tested at one specific point in time. By comparing both designs across independent replicates, we could indeed show that the ‘mini-experiment’ design improved the reproducibility and accurate detection of exemplary treatment effects (behavioural and physiological differences between four mouse strains) in about half of all investigated strain comparisons. Thus, we successfully implemented and empirically validated an easy-to-handle strategy for tackling poor reproducibility in single-laboratory studies. Since many experiments across the life sciences share the main characteristics of the investigation reported here, such studies are likely to benefit from this approach as well.
However, multi-laboratory studies are logistically challenging and not yet suited to replace the broad mass of single-laboratory studies. Therefore, solutions are urgently needed to tackle the problem of poor reproducibility at the level of single-laboratory studies. Against this background, the overall idea of the present study was to design an experimental strategy that transfers the multi-laboratory logic into a single-laboratory setting while at the same time offering a high degree of practical relevance.
For the successful implementation of such a strategy, two requirements have to be met. First, following the logic of the multi-laboratory approach, factors need to be identified that inevitably vary between experiments, just as conditions vary between laboratories in a multi-laboratory setting, and that are not part of the study question. A promising and repeatedly highlighted candidate in this respect is the time of testing throughout the year (referred to as ‘batch’, for example, by Refs.29,36,37). This factor has not only been shown to substantially influence the phenotype of mice tested in the same laboratory38,39, but can also be regarded as a kind of ‘umbrella factor’ covering many uncontrollably varying known and unknown background factors (e.g. changing personnel, noise, temperature). By covering a diverse spectrum of background heterogeneity, variation of this factor automatically enhances the representativeness of the study population, just as variation of the laboratory environment does in the multi-laboratory approach. Second, a successful implementation critically relies on the feasibility of the approach and its potential to introduce the necessary variation in a systematic and controlled way. Again, the time of testing throughout the year appears a promising factor: it can easily be varied in a systematic and controlled way, and the implementation appears feasible as it simply requires collecting data over time. Against this background, we here propose and validate an experimental strategy that builds on systematically varying the time of testing by splitting an experiment into several independent ‘mini-experiments’ conducted at different time points throughout the year (referred to as the ‘mini-experiment’ design in the following; see Fig. 1). In light of the conceptual framework presented above, we hypothesise that the reproducibility of research findings will be higher in such a mini-experiment design than in a conventionally standardised design, in which all animals are tested at one specific point in time (referred to as the conventional design in the following; see Fig. 1).
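The rationale behind this hypothesis can be illustrated with a small simulation (our sketch, not part of the study): if the apparent treatment effect drifts between testing time points (a ‘treatment-by-batch’ interaction), then a replicate experiment that averages over three time points estimates the overall effect more consistently than a replicate confined to a single time point. All parameter values below are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)

TRUE_DIFF = 1.0   # hypothetical true strain difference (assumption)
INTER_SD = 0.6    # SD of the strain-by-batch interaction (assumption)
NOISE_SD = 0.5    # animal-to-animal noise within a batch (assumption)
N_PER_STRAIN = 9  # animals per strain per replicate, as in the study

def replicate_estimate(n_batches):
    """Estimate the strain difference from one replicate experiment
    whose animals are spread over n_batches testing time points."""
    per_batch = N_PER_STRAIN // n_batches
    batch_diffs = []
    for _ in range(n_batches):
        # The apparent strain difference drifts between time points
        # (strain-by-batch interaction); it averages out over batches.
        local_diff = TRUE_DIFF + random.gauss(0, INTER_SD)
        a = [random.gauss(0, NOISE_SD) for _ in range(per_batch)]
        b = [random.gauss(local_diff, NOISE_SD) for _ in range(per_batch)]
        batch_diffs.append(statistics.mean(b) - statistics.mean(a))
    return statistics.mean(batch_diffs)

# Spread of the estimated difference across many simulated replicates:
conventional = [replicate_estimate(1) for _ in range(2000)]
mini = [replicate_estimate(3) for _ in range(2000)]
spread_conventional = statistics.stdev(conventional)
spread_mini = statistics.stdev(mini)  # smaller: estimates cluster
                                      # closer to the overall effect
```

Under these assumptions, the between-replicate spread of the estimated difference is smaller in the three-batch design, which is exactly the sense in which heterogenization is expected to improve reproducibility.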
Concept of the study. (a) Transfer of the multi-laboratory approach into a single-laboratory situation. In the multi-laboratory situation, the integration of different laboratory environments in one study results in a heterogeneous study population. In the single-laboratory approach, the animals are tested in the same laboratory, but in different mini-experiments spread over three time points (t1–t3). Between mini-experiments, uncontrollable factors of the laboratory environment may vary in the same way as they may vary between laboratories in the multi-laboratory approach. Thereby, the heterogeneity of the study population is enhanced, resembling the logic of the multi-laboratory approach. (b) Overview of the study design: Strain differences were repeatedly investigated in four independent replicate experiments in both a conventional (Con, red) and a mini-experiment design (Mini, blue). In the conventional design, all animals of one replicate experiment (e.g. Con 1) were tested at one specific point in time. In the mini-experiment design, by contrast, one replicate experiment (e.g. Mini 1) was split into three mini-experiments (Mini 1a, Mini 1b, and Mini 1c), all organised in the same way. Please note that whenever mice of a conventional replicate experiment were tested, one mini-experiment of the corresponding mini-experiment replicate was conducted as well, to control for potential time-point-specific background effects. Experimental phase: EPM Elevated Plus Maze, OF Open Field test, NC Novel Cage test, Barrier Barrier test, PB Puzzle Box test, FCMs + FTMs collection of faecal samples for assessment of corticosterone and testosterone metabolites, SC Sucrose Consumption test, NT Nest test, HCB Home cage behaviour. In addition, body weights were taken.
The examination described above was carried out in two experimental designs, namely a mini-experiment design and a conventional design, which were then compared with respect to their effectiveness in terms of reproducibility. More precisely, to examine the reproducibility of the strain differences in both designs, a total of four conventional and four mini-experiment replicate experiments were conducted successively over a period of 1.5 years (conventional design: Con 1–Con 4; mini-experiment design: Mini 1–Mini 4; Fig. 1b). Within each conventional replicate experiment and each mini-experiment replicate experiment, 9 mice per strain were tested. In the mini-experiment design, however, each replicate experiment was split into three mini-experiments (Mini 1a, Mini 1b, Mini 1c, Fig. 1b), each comprising a reduced number of animals at one specific point in time (i.e. 3 per strain). Thus, whereas all mice of one conventional replicate experiment (i.e. 9 per strain) were delivered and tested at one specific point in time, in the mini-experiment design they were delivered and tested in three mini-experiments conducted at different time points throughout the year (i.e. 3 mice per strain per mini-experiment). Consequently, factors such as the age of the animals at delivery and at testing were the same across all mini-experiments and conventional replicate experiments. Furthermore, with regard to factors such as temperature, personnel, and type of bedding, conditions were kept as constant as possible within mini-experiments, whereas between mini-experiments they were allowed to vary (for details see Supplementary Table S1). This approach was taken to reflect the usual fluctuations in conditions between independent studies.
Both designs were organised according to a randomised block design. In the mini-experiment design, one ‘block’ corresponded to one mini-experiment and in the conventional design, one ‘block’
subsumed cages of mice positioned within the same rack and tested consecutively in the battery of tests (for details see Supplementary Fig. S1).
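The allocation scheme described above can be sketched in a few lines. This is an illustrative sketch under the sample sizes given in the text (9 mice per strain per replicate, 3 per strain per mini-experiment); the helper `allocate_mini_replicate` is a hypothetical name, not study code.

```python
import random

random.seed(42)

STRAINS = ["C57BL/6J", "DBA/2N", "BALB/cN", "B6D2F1N"]
N_PER_STRAIN = 9     # animals per strain per replicate experiment
N_MINI = 3           # mini-experiments per replicate (t1, t2, t3)

def allocate_mini_replicate(replicate):
    """Split one replicate experiment into N_MINI mini-experiments,
    assigning an equal number of mice per strain (here 3) to each
    testing time point; each mini-experiment acts as one block."""
    plan = []
    for mini in range(N_MINI):
        block = [(strain, f"Mini {replicate}{'abc'[mini]}")
                 for strain in STRAINS
                 for _ in range(N_PER_STRAIN // N_MINI)]
        # Randomise the order of animals within the block.
        random.shuffle(block)
        plan.extend(block)
    return plan

plan = allocate_mini_replicate(1)
# 36 animals in total: 3 per strain in each of the 3 mini-experiments.
counts = {}
for strain, mini in plan:
    counts[(strain, mini)] = counts.get((strain, mini), 0) + 1
```

The same structure, with all 9 animals per strain assigned to a single time point, would describe a conventional replicate experiment.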
Reproducibility was compared between the conventional and the mini-experiment design on the basis of behavioural and physiological differences between the four mouse strains (yielding 6
strain comparisons in total). In particular, two different approaches were taken to tackle the issue of reproducibility from a statistical perspective: (I) Consistency of the strain effects
across replicate experiments and (II) Estimation of how often and how accurately the replicate experiments predict the overall effect.
The consistency of the strain effect across replicate experiments is statistically reflected in the interaction term of the strain effect with the replicate experiments (‘strain-by-replicate experiment’-interaction). To assess this interaction term for all outcome measures of each strain comparison, we applied a univariate linear mixed model (LMM; for details please see the “Methods” section) to both designs. Usually, the contribution of a fixed effect to a model is determined by examining the F-values, as they reflect the variance explained by the term relative to the total variance of the data. However, as the ‘strain-by-replicate experiment’-interaction was modelled as a random effect in this LMM, for which F-values cannot be assessed, we used the p-value of the interaction term as a proxy for the F-test (for details see the “Methods” section; analysis adapted from Ref.32, see also Ref.44). A higher p-value of the interaction term indicates less impact of the replicate experiments on the consistency of the strain effect (i.e. better reproducibility). Therefore, the p-values of the interaction term for 16 representative outcome measures per strain comparison were compared between the designs using a one-tailed Wilcoxon signed-rank test (the selected measures covered all paradigms used; for details please see the “Methods” section).
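The final comparison step can be sketched as follows, assuming the 16 interaction p-values per design have already been extracted from the LMM fits. The arrays `p_conventional` and `p_mini` below are invented illustrative numbers, not values from the study; only the call to `scipy.stats.wilcoxon` mirrors the described test.

```python
from scipy.stats import wilcoxon

# Hypothetical interaction p-values for the 16 outcome measures of one
# strain comparison (made-up values, chosen so p_mini > p_conventional).
p_conventional = [0.02, 0.10, 0.04, 0.30, 0.01, 0.15, 0.05, 0.20,
                  0.08, 0.12, 0.03, 0.25, 0.06, 0.18, 0.09, 0.07]
p_mini = [0.40, 0.35, 0.50, 0.45, 0.30, 0.60, 0.25, 0.55,
          0.20, 0.65, 0.33, 0.70, 0.28, 0.48, 0.38, 0.52]

# Paired, one-tailed test: are the interaction p-values systematically
# higher in the mini-experiment design (i.e. better reproducibility)?
stat, p = wilcoxon(p_mini, p_conventional, alternative="greater")
```

A significant result here would indicate, as in the study, that the ‘strain-by-replicate experiment’-interaction is weaker in the mini-experiment design.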
The p-values of the ‘strain-by-replicate experiment’-interaction term were significantly higher in the mini-experiment than in the conventional design in 3 out of 6 strain comparisons,
demonstrating improved reproducibility among replicate experiments in the mini-experiment design in half of all investigated exemplary treatment effects (Fig. 2a, c, f; Wilcoxon signed-rank
test (paired, one-tailed, n = 16): Comparison 1 ‘C57BL/6J–DBA/2N’: V = 21, p-value = 0.047; Comparison 3 ‘C57BL/6J–B6D2F1N’: V = 11, p-value = 0.009; Comparison 6 ‘BALB/cN–B6D2F1N’: V = 5,
p-value = 0.003). For the remaining three strain comparisons, however, no significant differences were found (Fig. 2b, d, e; Wilcoxon signed-rank test (paired, one-tailed, n = 16):
Comparison 2 ‘C57BL/6J–BALB/cN’: V = 35.5, p-value = 0.429; Comparison 4 ‘DBA/2N–BALB/cN’: V = 59.5, p-value = 0.342; Comparison 5 ‘DBA/2N–B6D2F1N’: V = 48, p-value = 0.257). Here, both
designs were characterised by a high median p-value of the interaction term, reflecting a rather good reproducibility independent of the experimental design (see Fig. 2b, d, e).
Consistency of the strain effect across replicate experiments for all strain comparisons of both the conventional (red) and the mini-experiment (blue) design. Shown are p-values of the ‘strain-by-replicate experiment’-interaction term across all 16 outcome measures. Data are presented as boxplots showing medians, 25th and 75th percentiles, and 5th and 95th percentiles. Black dots represent single p-values for each outcome measure in both designs. Statistics: Wilcoxon signed-rank test (paired, one-tailed, n = 16), *p ≤ 0.05.
Looking at single outcome measures for all strain comparisons of both designs, we found four significant strain-by-replicate experiment-interactions, but only in the conventional design (a
p-value of