Improving reproducibility in animal research by splitting the study population into several ‘mini-experiments’


ABSTRACT In light of the hotly discussed ‘reproducibility crisis’, a rethinking of current methodologies appears essential. Implementing multi-laboratory designs has been shown to enhance


the external validity and hence the reproducibility of findings from animal research. Here, we propose a new experimental strategy that transfers this logic into a


single-laboratory setting. We systematically introduced heterogeneity into our study population by splitting an experiment into several ‘mini-experiments’ spread over different time points a


few weeks apart. We hypothesised that reproducibility would be improved in such a ‘mini-experiment’ design compared with a conventionally standardised design, in which all animals


are tested at one specific point in time. By comparing both designs across independent replicates, we could indeed show that the use of such a ‘mini-experiment’ design improved the


reproducibility and accurate detection of exemplary treatment effects (behavioural and physiological differences between four mouse strains) in about half of all investigated strain


comparisons. Thus, we successfully implemented and empirically validated an easy-to-handle strategy to tackle poor reproducibility in single-laboratory studies. Since other experiments


within different life science disciplines share the main characteristics of the investigation reported here, these studies are likely to also benefit from this approach.




INTRODUCTION Concerns about the credibility of scientific results have become a major issue in recent years (e.g. Refs.1,2,3,4). This is aptly reflected by a recent survey of the


Nature Publishing Group, which reported that over 90% of the surveyed researchers were convinced that science currently faces a ‘reproducibility crisis’5. Further evidence for this


impression comes from several systematic replication studies in the fields of biomedicine and psychology, where replication failed to an alarming extent6,7,8,9. Based on these studies, it was


estimated that about 50–90% of the published findings are in fact irreproducible10,11,12. The reasons associated with poor reproducibility are numerous, ranging from fallacies in the


experimental design and statistical analyses (e.g. p-hacking13 and HARKing14) to a lack of information in the published literature10. Building on this discussion, specific guidelines, such


as the TOP (Transparency and Openness Promotion)15, ARRIVE16,17, or PREPARE18 guidelines have been developed. Furthermore, to increase overall transparency in animal research and to


counteract the problem of publication bias, the pre-registration of studies has been encouraged19,20,21,22. All of these attempts are promising strategies to improve the planning, analysis


and reporting of studies23. However, as demonstrated by a seminal study published more than 20 years ago, the use of thoroughly planned and well-reported protocols does not automatically guarantee


reproducibility24. In this study, three different laboratories simultaneously conducted the same animal experiment under highly standardised conditions. More precisely, they compared eight


inbred mouse strains in a battery of six behavioural tests in each laboratory. Surprisingly, the three laboratories found remarkably different results, reflecting a typical example of what


we call poor reproducibility (i.e. failure to obtain the same results when replicating a study with a new study population). This observation was most likely due to the influence of many


uncontrollable environmental background factors that affected the outcome of the experiment differently (e.g. noise25, microbiota26, or personnel27). To embrace this kind of unavoidable


variation within a single study and thereby to increase reproducibility, the idea of implementing multi-laboratory study designs in animal research has recently been proposed28: In a simulation


approach, 50 independent studies on the effect of therapeutic hypothermia on infarct volume in rodent models of stroke were used to compare the reproducibility of treatment effects between


multi-laboratory and single-laboratory designs. And indeed, by re-analysing these data, the authors demonstrated that multi-laboratory studies produced much more consistent results than


single-laboratory studies28. However, multi-laboratory studies are logistically challenging and are not yet suitable to replace the vast majority of single-laboratory studies. Therefore,


solutions are urgently needed to tackle the problem of poor reproducibility at the level of single-laboratory studies. Against this background, the overall idea of the present study was to


design an experimental strategy that transfers the multi-laboratory logic into a single-laboratory setting, and at the same time offers a high degree of practical relevance. The essential


element of the above described multi-laboratory approach is the inclusion of heterogeneity within the study population. In particular, by mimicking the inevitably existing between-laboratory


variation within one study, the representativeness of the study population is enhanced. For instance, actively integrating background factors that are usually not the focus of the study, such as changing personnel or temperature, is supposed to render the results more generalisable. In this way, the external validity is increased, leading to better reproducibility of


research outcomes across independent replicate experiments29,30. In line with this assumption, there is accumulating theoretical and empirical evidence that systematic and controlled


heterogenisation of experimental conditions within single laboratories also increases the reproducibility of research outcomes in comparison to rigorously standardised


experiments31,32,33,34. However, an effective and at the same time easy-to-apply heterogenisation strategy is still missing29,33,35. For the successful implementation of such a strategy, the


following two requirements have to be met: First, following the logic of the multi-laboratory approach, factors need to be identified that are not part of the study question and that inevitably vary between experiments, just as they would vary between laboratories in a multi-laboratory setting. A promising and repeatedly highlighted candidate in this respect is the time of


testing throughout the year (referred to as ‘batch’ for example by Refs.29,36,37). This factor has not only been shown to substantially influence the phenotype of mice tested in the same


laboratory38,39, but can also be regarded as an ‘umbrella factor’ for a multitude of uncontrollably varying known and unknown background factors (e.g. changing personnel, noise,


temperature, etc.). By covering a diverse spectrum of background heterogeneity, variation of this factor thus automatically enhances the representativeness of the study population as the


variation of the laboratory environment does in the multi-laboratory approach. Second, a successful implementation critically relies on the feasibility of the approach and its potential to


introduce the necessary variation in a systematic and controlled way. Again, the time of testing throughout the year appears to be a promising factor: it can easily be varied in a systematic and controlled way, and the implementation appears feasible as it simply means collecting data over time. Against this background, we here propose and validate an experimental strategy that


builds on systematically varying the time of testing by splitting an experiment into several independent ‘mini-experiments’ conducted at different time points throughout the year (referred


to as ‘mini-experiment’ design in the following, see Fig. 1). In light of the conceptual framework presented above, we hypothesise that the reproducibility of research findings will be improved in such a mini-experiment design compared with a conventionally standardised design, in which all animals are tested at one specific point in time (referred to as conventional


design in the following, see Fig. 1). RESULTS In experimental animal research, many studies examine the role of specific genes in the modulation of the phenotype and therefore rely on the


phenotypic characterisation of genetically modified animals. To mimic such an experiment with a typical ‘treatment under investigation’ (i.e. different genotypes), behavioural and


physiological differences between mouse strains were investigated in tests commonly used in such phenotyping studies40. In detail, male mice of the three inbred mouse strains C57BL/6J,


DBA/2N and BALB/cN, and the F1 hybrid strain B6D2F1N were tested in a battery of well-established behavioural and physiological paradigms. The battery included the examination of exploratory


and anxiety-like behaviours, hedonic states, cognitive abilities, nest building, spontaneous home cage behaviour, body weight changes, and hormonal profiles (i.e. corticosterone and


testosterone metabolite concentrations, for details see “Experimental phase” in the “Methods” section). Previous studies have shown that the selected strains differ, for instance, in their


anxiety-like and learning behaviour (C57BL/6 vs BALB/c and DBA/2)24,41,42,43, but not in their exploratory locomotion (C57BL/6 vs BALB/c)24,41. The above described examination was done in


two experimental designs, namely a mini-experiment design and a conventional design, which were then compared with respect to their effectiveness in terms of reproducibility. More precisely,


to examine the reproducibility of the strain differences in both designs, a total of four conventional and four mini-experiment replicate experiments were conducted successively over a time


period of 1.5 years (conventional design: Con 1–Con 4 and mini-experiment design: Mini 1–Mini 4, Fig. 1b). Within each conventional replicate experiment and each mini-experiment replicate


experiment, 9 mice per strain were tested. In the mini-experiment design, however, one replicate experiment was split into three mini-experiments (Mini 1a, Mini 1b, Mini 1c,


Fig. 1b), each comprising a reduced number of animals at one specific point in time (i.e. 3 per strain). Thus, whereas all mice of one conventional replicate experiment (i.e. 9 per strain)


were delivered and tested at one specific point in time, in the mini-experiment design, they were delivered and tested in three mini-experiments that were conducted at different time points


throughout the year (i.e. 3 mice per strain per mini-experiment). Consequently, factors such as the age of the animals at delivery or at testing were the same for all mini-experiments and the conventional replicate experiments. Furthermore, concerning factors such as temperature, personnel, and type of bedding, conditions were kept as


constant as possible within mini-experiments, whereas between mini-experiments conditions were allowed to vary (for details see Supplementary Table S1). This approach was taken to reflect


the usual fluctuations in conditions between independent studies. Both designs were organised according to a randomised block design. In the mini-experiment design, one ‘block’ corresponded


to one mini-experiment and in the conventional design, one ‘block’ subsumed cages of mice positioned within the same rack and tested consecutively in the battery of tests (for details see


Supplementary Fig. S1). Reproducibility was compared between the conventional and the mini-experiment design on the basis of behavioural and physiological differences between the four mouse


strains (yielding 6 strain comparisons in total). In particular, two different approaches were taken to tackle the issue of reproducibility from a statistical perspective: (I) Consistency of


the strain effects across replicate experiments and (II) Estimation of how often and how accurately the replicate experiments predict the overall effect. (I) CONSISTENCY OF THE STRAIN


EFFECTS ACROSS REPLICATE EXPERIMENTS The consistency of the strain effect across replicate experiments is statistically reflected in the interaction term of the strain effect with the


replicate experiments (‘strain-by-replicate experiment’-interaction). To assess this interaction term for all outcome measures of each strain comparison, we applied a univariate linear mixed


model (LMM; for details please see the “Methods” section) to both designs. Usually, the contribution of a fixed effect to a model is determined by examining the F-values as they return the


relative variance that is explained by the term against the total variance of the data. However, as the ‘strain-by-replicate experiment’-interaction enters this LMM as a random effect, for which F-values cannot be assessed, we used the p-value of the interaction term as a proxy for the F-test (for details see “Methods”


section; analysis adapted from Ref.32, and see also Ref.44). A higher p-value of the interaction term indicates less impact of the replicate experiments on the consistency of the strain


effect (i.e. better reproducibility). Therefore, p-values of the interaction term for 16 representative outcome measures for each strain comparison were compared between the designs using a


one-tailed Wilcoxon signed-rank test (the selected measures covered all paradigms used, for details please see the “Methods” section). The p-values of the ‘strain-by-replicate


experiment’-interaction term were significantly higher in the mini-experiment than in the conventional design in 3 out of 6 strain comparisons, demonstrating improved reproducibility among


replicate experiments in the mini-experiment design in half of all investigated exemplary treatment effects (Fig. 2a, c, f; Wilcoxon signed-rank test (paired, one-tailed, n = 16): Comparison


1 ‘C57BL/6J–DBA/2N’: V = 21, p-value = 0.047; Comparison 3 ‘C57BL/6J–B6D2F1N’: V = 11, p-value = 0.009; Comparison 6 ‘BALB/cN–B6D2F1N’: V = 5, p-value = 0.003). For the remaining three


strain comparisons, however, no significant differences were found (Fig. 2b, d, e; Wilcoxon signed-rank test (paired, one-tailed, n = 16): Comparison 2 ‘C57BL/6J–BALB/cN’: V = 35.5, p-value 


= 0.429; Comparison 4 ‘DBA/2N–BALB/cN’: V = 59.5, p-value = 0.342; Comparison 5 ‘DBA/2N–B6D2F1N’: V = 48, p-value = 0.257). Here, both designs were characterised by a high median p-value of


the interaction term, reflecting a rather good reproducibility independent of the experimental design (see Fig. 2b, d, e). Looking at single outcome measures for all strain comparisons of


both designs, we found four significant strain-by-replicate experiment-interactions, but only in the conventional design (a p-value of < 0.05 was detected in 4 out of 96 interactions,


Fig. 3 and Supplementary Tables S2–S7, for a graphical overview of all single outcome measures please see Supplementary Figs. S2–S7). Such a significant interaction term highlights a strain


effect that is not consistent across replicate experiments and thus indicates hampered reproducibility. Interestingly, as depicted in the forest plots in Fig. 3, some conventional replicate


experiments predicted strain effects in opposite directions, with non-overlapping confidence intervals (Fig. 3a, b; conventional replicate experiment 1 versus 3). Please note that these


events of hampered reproducibility rely on a descriptive comparison of the two designs and do not allow for inferential conclusions. (II) ESTIMATION OF HOW OFTEN AND HOW ACCURATELY THE


REPLICATE EXPERIMENTS PREDICT THE OVERALL EFFECT In the second analysis, reproducibility was investigated by comparing the performance of the designs to predict the overall effect size (see


Ref.28). The overall effect size (i.e. the mean strain difference) and corresponding 95% confidence intervals (CI95) were estimated from the data of replicate experiments of both designs by


conducting a random-effect meta-analysis (see “Methods” for details). This was completed for each outcome measure and each strain comparison. Within each replicate experiment of both


designs, individual effect sizes and CI95 were calculated (Fig. 3). Finally, the individual estimated effects of each replicate experiment were compared to the overall effects using the


following two measurements: First, the coverage probability (Pc) was assessed by counting how often the CI95 of the replicate experiments in each experimental design covered the overall


effect size (replicate experiments marked by ♣ in Supplementary Fig. S8). Second, the proportion of accurate results (Pa) was calculated by counting those replicate experiments that


predicted the overall effect size accurately with respect to their statistical significance (replicate experiments marked by ♦ in Supplementary Fig. S8). Subsequently, Pc and Pa ratios of


all 16 outcome measures were compared between both experimental designs using a one-tailed Wilcoxon signed-rank test. In line with our previous findings, the replicate experiments in the


mini-experiment design covered the overall effect significantly more often than the replicate experiments in the conventional design (higher Pc ratio) for two out of six comparisons (Fig. 


4a, c; Wilcoxon signed-rank test (paired, one-tailed, n = 16): Comparison 1 ‘C57BL/6J–DBA/2N’: V = 4.5, p-value = 0.012; Comparison 3 ‘C57BL/6J–B6D2F1N’: V = 3.5, p-value = 0.035).


Furthermore, focusing on these comparisons, the Pa was significantly higher in the mini-experiment than in the conventional design, again demonstrating better reproducibility in the


mini-experiment design (Fig. 5a, c; Wilcoxon signed-rank test (paired, one-tailed, n = 16): Comparison 1 ‘C57BL/6J–DBA/2N’: V = 10, p-value = 0.03; Comparison 3 ‘C57BL/6J–B6D2F1N’: V = 4.5,


p-value = 0.008). With respect to the remaining four strain comparisons, no significant differences in the Pc and the Pa between both designs could be found (Figs. 4b,d–f and 5b,d–f; Wilcoxon


signed-rank test (paired, one-tailed, n = 16): Pc: Comparison 2 ‘C57BL/6J–BALB/cN’: V = 4, p-value = 0.386; Comparison 4 ‘DBA/2N–BALB/cN’: V = 14, p-value = 0.5; Comparison 5


‘DBA/2N–B6D2F1N’: V = 2.5, p-value = 0.102; Comparison 6 ‘BALB/cN–B6D2F1N’: V = 12, p-value = 0.388; Pa: Comparison 2 ‘C57BL/6J–BALB/cN’: V = 42.5, p-value = 0.201; Comparison 4


‘DBA/2N–BALB/cN’: V = 12, p-value = 0.204; Comparison 5 ‘DBA/2N–B6D2F1N’: V = 10.5, p-value = 0.153; Comparison 6 ‘BALB/cN–B6D2F1N’: V = 20, p-value = 0.411). DISCUSSION In light of the


extensively discussed reproducibility crisis, introducing heterogeneity in the study population by implementing multi-laboratory study designs has been shown to enhance the external validity


and hence the reproducibility of findings from animal research28. We aimed at transferring this logic to single laboratories to likewise introduce heterogeneity into the study population in a


systematic and controlled way. In detail, an experimental strategy using independent mini-experiments spread over three different time points just a few weeks apart was applied and


empirically tested regarding the reproducibility of its results in comparison to a conventionally standardised design. Indeed, we observed improved reproducibility of the results from the


mini-experiment design across independent replicates in about half of all investigated treatment effects. More precisely, improved reproducibility was reflected by a significantly lower


between-replicate experiment variation in three out of six strain comparisons. Furthermore, replicate experiments in the mini-experiment design predicted the overall effect size


significantly more often and more accurately in two out of six strain comparisons. With respect to the remaining strain comparisons, in both experimental designs a rather good


reproducibility of strain differences was observed. For instance, the results of the comparison between the two inbred strains C57BL/6 and BALB/c turned out to be very robust in our study in


both designs. As previous studies found fluctuating results regarding exactly this strain difference24,41, it is likely that no further improvement by the mini-experiment design was


detectable due to a ceiling effect (i.e. nearly no variation of the strain effect across the replicate experiments was detectable in both experimental designs). Besides these overall


effects, we also investigated outcome measures separately and detected events of severely hampered reproducibility. Remarkably, these problems exclusively occurred in the conventional design


with strain differences pointing in opposite directions. Similar to the landmark study of Crabbe et al.24, these observations again represent examples of irreproducibility that are not due


to a lack of planning or reporting standards, but to the high idiosyncrasy of results from rigorously standardised experiments. This problem has been discussed to be particularly acute in


animal research, as animals and other living organisms are highly responsive in their phenotype to environmental changes. This phenotypic plasticity45 can result in an altered response (i.e.


reaction norm46) towards a treatment effect depending on the environmental conditions of an experiment30. Therefore, even subtle and inevitable changing experimental conditions may lead to


completely contradicting conclusions about the treatment under investigation (in our case strain differences). This might even occur when an experiment is replicated in the same laboratory,


but, for example, the animals are purchased from another vendor47, tested by another experimenter48 or at another time of the day34. In line with these examples, 50% of surveyed scientists


stated they have experienced failures in replicating their own experiments5. This prevalence of idiosyncratic findings in the literature highlights the need for a strategy that decreases the


risk of finding results which are only valid under narrowly defined conditions (e.g. one specific experimenter) and therefore not of biological interest. Introducing heterogeneity by means


of a mini-experiment design efficiently reduced the risk of obtaining replicate-specific and hence irreproducible findings in the present study. This is in line with accumulating


theoretical28,31,34 and empirical32,33,49 evidence that systematic heterogeneity in the study population plays an important role for avoiding spurious findings. In contrast to previously


suggested heterogenisation strategies, however, the proposed mini-experiment design does not require the variation of specific, a priori defined experimental factors, such as the


age of the experimental subjects or the housing conditions32,33. Instead, this strategy uses the heterogenisation factor ‘time point throughout the year’ which includes known and unknown


background factors that uncontrollably vary over time and hence differ automatically between mini-experiments. As a consequence, the mini-experiment design utilises in particular those


background factors, which are typically neither controlled for nor systematically investigated as factors of interest (e.g. noise, changing personnel, season). Please note that the extent to


which these background factors vary might differ from study to study. Since the efficiency of the here proposed heterogenisation strategy is linked to the variation of these known and


unknown background factors over time, studies can benefit to different extents. Similar to the logic of the multi-laboratory approach, the mini-experiment design covers the inevitably


existing between-replicate variation within one experiment (i.e. each mini-experiment in the mini-experiment design is analogous to one laboratory in the multi-laboratory design) to make the


results more robust across the variation that we usually observe between independent studies. Whether the mini-experiment design provides a solution to improve the reproducibility of


results not only across independent replicates in the same laboratory, but also across different laboratories, however, needs further validation in a real-life, multi-laboratory situation.


Furthermore, the mini-experiment design stands out for its practical feasibility. Although the time span needed to conduct a study is extended by spreading the experiment across time, it provides


several practical benefits in return. First, in contrast to a conventional experiment, in each mini-experiment a reduced number of animals is tested per time point. This is particularly 


beneficial for studies with genetically modified animals. Transgenic or knockout mouse models, for example, are often characterised by low breeding rates, so that not all experimental


subjects are born at one specific time point, but over the course of several weeks (e.g. phenotyping studies of genetically modified mice50). As a mini-experiment design relies on testing


animals in small successive batches, it provides a systematic solution for ‘collecting’ data of these animals over time. Second, time-demanding experiments, such as complex


learning tasks (e.g. Ref.51), may benefit from a mini-experiment design, because the workload per day is drastically reduced. Despite these conceptual and practical advantages of utilising


more heterogeneous study populations in animal research, rigorously standardised experimental conditions are still regarded as the gold standard, not least for ethical reasons. By reducing


variation in the study population, standardisation is assumed to increase test sensitivity and thereby to reduce the number of animals needed to detect an effect52. Following this logic, one


could assume that more heterogeneous study populations would require more animals to detect a significant effect. Indeed, this argument may hold true for the introduction of uncontrolled


variation in the data (i.e. noisy data)53. However, the mini-experiment design introduces variation in a systematic and controlled way by using a randomised block design where each


mini-experiment corresponds to one ‘block’. In fact, such designs have been suggested to be particularly powerful and of high external validity as long as the random blocking factor is


considered in the statistical analysis54. In line with this, our study demonstrates that the mini-experiment design led to more reproducible results compared to the conventionally


standardised design in about half of all investigated treatment effects, even though the total number of animals used was identical. Moreover, the mini-experiment design led to a higher


proportion of accurate results (Pa) and hence provided more accurate conclusions than the conventional design in one third of the investigated treatment effects. These findings highlight


that the above presented argument of increased test sensitivity through rigorous standardisation might come at the cost of obtaining less accurate and hence idiosyncratic results (see also


Ref.28). In line with this, a comprehensive simulation study found that testing in multiple batches (i.e. mini-experiments) provides more confidence in the results than testing in one ‘big’


batch55. Therefore, arguing from a 3R-perspective56 (i.e. Replace, Reduce, Refine), a mini-experiment design increases the informative value of an experiment without requiring higher sample


sizes than traditional designs. It thus contributes to both the ‘Refinement’ of experiments by enhancing the external validity and reproducibility of findings, and the ‘Reduction’ of animal


numbers by decreasing the need for obsolete replication studies28,57. In conclusion, we here proposed and empirically validated a novel mini-experiment design, which may serve as an


effective and at the same time easy-to-apply heterogenisation strategy for single-laboratory studies. We believe that particularly those studies may benefit from a mini-experiment design


that can be influenced by seasonal changes in the experimental background, as this approach fosters greater generalisability and thereby helps to avoid idiosyncratic (i.e. time-specific)


results. Although tested by the example of behavioural and physiological mouse strain differences, benefits of applying such a design may not be limited to the field of animal research. In


this respect, a study investigating grass-legume mixtures in a simple microcosm experiment has already shown that the introduction of heterogeneity on the basis of genetic and environmental


variation in the experimental design can also enhance the reproducibility of ecological studies49. Therefore, using a mini-experiment design to include heterogeneity systematically in the


study population may also benefit the reproducibility in other research branches of the life sciences. METHODS ANIMALS AND HOUSING CONDITIONS For this study, male mice of three inbred


(C57BL/6J, DBA/2N and BALB/cN) and one F1 hybrid strain (B6D2F1N) were provided by one supplier (Charles River Laboratories). All mice arrived at an age of about 4 weeks (PND 28). Upon


arrival, the animals were housed in same strain groups of three mice per cage until PND 65 ± 2. Thereafter, mice were transferred to single housing conditions to avoid any kind of severe


inter-male aggression within group housing conditions (for ongoing discussions about how to house male mice please see Refs.58,59). In both phases (group and single housing conditions), the


animals were conventionally housed in enriched Makrolon type III cages (38 cm × 22 cm × 15 cm) filled with bedding material (the standard bedding material used in our facility changed over


the course of this study from Allspan, Höveler GmbH & Co.KG, Langenfeld, Germany to Tierwohl, J. Reckhorn GmbH & Co.KG, Rosenberg, Germany) and equipped with a tissue paper as


nesting material, a red transparent plastic house (Mouse House, Tecniplast Deutschland GmbH, Hohenpeißenberg, Germany) and a wooden stick. Food pellets (Altromin 1324, Altromin Spezialfutter


GmbH & Co. KG, Lage, Germany) and tap water were provided ad libitum. Health monitoring took place and cages were cleaned weekly in the group housing phase and fortnightly afterwards.


Enrichment was replaced fortnightly in both phases. Housing rooms were maintained at a 12/12 h light/dark cycle with lights off at 9:00 a.m., a temperature of about 22 °C and a relative


humidity of about 50%. ETHICS STATEMENT All procedures complied with the regulations covering animal experimentation within Germany (Animal Welfare Act) and the EU (European Communities


Council DIRECTIVE 2010/63/EU) and were approved by the local (Gesundheits-und Veterinäramt Münster, Nordrhein-Westfalen) and federal authorities (Landesamt für Natur, Umwelt und


Verbraucherschutz Nordrhein-Westfalen “LANUV NRW”, reference number: 84-02.04.2015.A245). EXPERIMENTAL PHASE To examine differences between the four mouse strains, all mice of all replicate


experiments were subjected to the same experimental testing procedures, including the investigation of exploratory and anxiety-like behaviours, hedonic states, cognitive abilities, nest


building and spontaneous home cage behaviour, bodyweight changes, and hormonal profiles. This was done to reflect data from both behavioural and physiological measurements. Behavioural


paradigms were chosen in accordance to established protocols for the phenotypic characterisation of mice in animal research40. In each replicate experiment, 9 mice per strain (n = 9) were


tested, except for two replicate experiments in the mini-experiment design (Mini 1, Mini 2). Here the sample size of one strain (‘B6D2F1N’) was reduced to n = 8, due to the death of 2 mice


before the start of the experimental phase. The experimental phase started for all animals on postnatal day (PND) 73 ± 1 and lasted for three weeks. Spontaneous home cage behaviour (HCB) was


observed on six days between PND 73 ± 1 and PND 93 ± 1 and faecal samples to determine corticosterone and testosterone metabolite concentrations were collected on PND 87. Additionally, the


bodyweight of the animals on PND 65 ± 2 as well as the weight gain over the test phase, from PND 73 ± 1 to PND 93 ± 1, was measured. Throughout the experimental phase, the following tests


were conducted during the active phase in the same order for all animals: Elevated Plus Maze (EPM) on PND 77 ± 2, Open Field test (OF) on PND 79 ± 2, Novel Cage test (NC) on PND 83, Barrier


test on PND 84, Puzzle Box test (PB) on PND 85 ± 1, Sucrose Consumption test (SC) starting on PND 87 and Nest test (NT) starting on PND 91 (see Fig. 1). The EPM, OF and PB were conducted


under dim light conditions in a testing room separated from the housing room. The NC, Barrier, SC and NT were conducted in the housing room under red light conditions. In all paradigms, the


order of the mice was pseudo-randomised following two rules. First, four mice, one of each strain, were always tested consecutively. Second, the order of these four mice was randomised


with respect to the strains. All mice, regardless of the replicate experiment, were tested by the same experienced experimenter in all test procedures, except for the Nest test. In the


latter, two experimenters scored the behavioural data of all mice simultaneously, regardless of the replicate experiment (for details see the section “Nest test” below). As strains had


different fur colours (C57BL/6J and B6D2F1N mice: black, DBA/2N mice: brown, BALB/cN mice: white), blinding was not possible at the level of the exemplary treatment groups. Therefore, it


cannot be excluded that the experimenter might have unconsciously influenced the animal’s behaviour and thus, the observed strain differences. However, we were not interested in the actual


strain differences, but instead investigated the reproducibility of these strain comparisons. The reproducibility of strain differences across replicate experiments, however, was unlikely to


be influenced by the presence or absence of blinding procedures at the level of strains. Importantly, the crucial level of blinding in this study was based on the experimental design. For


this reason, the experimenters were blind with respect to the allocation of the mice to the experimental design (conventional or mini-experiment), whenever experiments in both designs were


conducted at the same time. This involved four out of twelve mini-experiments (one from each replicate experiment) and all four conventional replicate experiments. In this way, the experimenters were not aware of which animals were tested in the conventionally standardised design at all. In the following, details on the test procedures during the experimental phase are given: ELEVATED


PLUS MAZE TEST The EPM was conducted to examine exploratory locomotion and anxiety-like behaviour of the animals60. The apparatus was elevated 50 cm above the ground and was composed of two


opposing open (30 cm × 5 cm) and two opposing closed arms (30 cm × 5 cm) which were connected via a central square (5 cm × 5 cm). The closed arms were surrounded by 20 cm high walls. The EPM


was illuminated from above (25 lx in the centre square). After spending 1 min in an empty box protected from light, the mouse was placed on the central platform facing a closed arm and was


allowed to freely explore the apparatus for 5 min. During that time, the animal was recorded by a webcam (Webcam Pro 9000, Logitech) in the absence of the experimenter. Outcome measures


taken were the relative amount of entries into and the relative time spent in the open arms [i.e. open arm entries or time/(open arm + closed arm entries or time)]. In addition, the number


of protected head dips (‘mouse lowers its head over the side of an open arm with its ears protruding over the edge, while at least one paw remains in the closed segment or central platform’;


cf.33) was calculated relative to the total amount of head dips shown. OPEN FIELD TEST Similar to the EPM, the OF is a paradigm to assess the exploratory locomotion and anxiety-like


behaviour of mice61. The apparatus consisted of a square arena (80 cm × 80 cm) surrounded by 40 cm high walls and illuminated from above (35 lx). A centre zone was defined as a 40 cm × 40 cm


square area located in 20 cm distance from all walls. After spending 1 min in an empty box protected from light, the animal was placed in one corner of the OF facing the wall and was


allowed to freely explore the arena for 5 min. During that time, the animal was recorded by a webcam (Webcam Pro 9000, Logitech) in the absence of the experimenter. The number of entries


into the centre zone, the distance travelled and the time spent in the centre zone, as well as the total distance travelled in the OF, were automatically analysed by the video tracking software


ANY-maze (Version 4.99 or 5.31, Stoelting Co.). In addition, the number of faecal boli in the OF was counted. NOVEL CAGE TEST In the NC, exploratory locomotion was observed in a new


environment, resembling a cage cleaning routine62. A new empty Makrolon type III cage (standard housing cage) was filled with 1 L of bedding material. Mice were placed into this new


housing cage and the frequency of ‘rearing’ (‘a mouse raises itself on its hindpaws and stretches its snout into the air’) was recorded as a measure of vertical exploration by the


experimenter for a duration of 5 min. BARRIER TEST The Barrier test apparatus consisted of an empty Makrolon cage type III (standard housing cage), which was divided in half by a 3 cm high


acrylic, transparent barrier. The mice were placed in one half facing the wall and the latency to climb over the barrier was measured to assess the exploratory locomotion63. The maximum


duration of the test was set to 5 min. PUZZLE BOX TEST The PB test is a simple task to assess learning and problem-solving ability in mice (based on Ref.64). The rectangular shaped apparatus


had the dimensions 75 cm × 28 cm × 25 cm and was divided into a light (60 cm × 28 cm) and a dark (15 cm × 28 cm) compartment illuminated from above (40 lx). The dark compartment served as a


goal box and was connected to the light compartment via a u-shaped channel (4 cm × 4 cm × 8 cm) and a rectangular shaped doorway (4 cm × 4 cm). Each mouse conducted three consecutive trials,


in which it was initially placed in the light compartment facing the channel and the latency to reach the goal box was measured. While in the first and second trial the channel provided


free entrance into the goal box, in the third trial this channel was filled with 100 ml bedding material to create an obstruction for entering the goal box. Therefore, the latency to reach


the goal box in the third trial represents not just learning, but also the problem-solving ability of an individual by means of overcoming an obstacle it has not encountered before. Between


all trials the used bedding material was removed and the apparatus cleaned thoroughly with 70% ethanol. The parameters taken into account were the change in latency from the first to the


second trial and the latency to reach the goal box in the third trial. FAECAL SAMPLES To determine stress hormone and testosterone levels non-invasively, faecal corticosterone metabolites (FCMs) and


testosterone metabolites (FTMs) were measured. Therefore, on PND 87 mice were transferred for a duration of 3 h (1 p.m.–4 p.m.) into new Makrolon type III cages to collect faecal samples.


These collecting cages were equipped with a thin layer of bedding material and new enrichment (MouseHouse, tissue paper and a wooden stick). After the 3 h collecting phase, all faecal boli


defecated were sampled and frozen at − 20 °C. Samples were dried and homogenised and an aliquot of 0.05 g each was extracted with 1 ml of 80% methanol. Finally, a


5α-pregnane-3β,11β,21-triol-20-one enzyme immunoassay to determine FCMs and a testosterone enzyme immunoassay to determine FTMs was used. Both were established and successfully validated to


measure FCMs and FTMs in mice, respectively (for details see Refs.65,66,67). Intra- and inter-assay coefficients of variation were all below 10%. SUCROSE CONSUMPTION TEST The SC is commonly


used to investigate anhedonia in rodents68. Therefore, the mice’s preference for a sweet sucrose solution in comparison to tap water was examined. In detail, the mice had free access for 72 h to two bottles, one containing tap water and the other filled with a 3% sucrose solution. Parameters measured were the total liquid intake and the relative amount of sucrose solution


intake. Please note that, due to technical reasons (i.e. spilled bottles), data points of 4 animals had to be excluded for this test. NEST TEST To assess the nest building ability of the animals,


one hour prior to the onset of the dark phase, the shelter and tissue paper were removed from the cages for 24 h and a cotton nestlet (NES3600, Ancare) was provided. The quality of the nests


was scored independently by two experimenters after 5 and 24 h. The definition of scores was adopted from Deacon69 and ranged from 1 to 5. For each time point the assigned scores of the two


experimenters were averaged. HOME CAGE BEHAVIOUR Spontaneous (i.e. undisturbed) home cage behaviour was recorded on six days (PND 73 ± 1, PND 80 ± 1, PND 81 ± 1, PND 86, PND 92 and PND 93 ±


 1) during the experimental test phase in the housing room. Observations were conducted during the active phase (9.15 a.m.–8.15 p.m.) under red light conditions by an experienced observer.


On each of the 6 days, one observation session took place and lasted for 60 min. During each session, all mice of one experiment/mini-experiment were observed consecutively. The six


observation sessions were conducted at different times of the day. They were evenly distributed across the active phase to enhance the generalisability of the observed behaviour. The


behaviour of each mouse was recorded by instantaneous, focal animal sampling at intervals of 6 min, i.e. 10 times per session, thus resulting in 60 observations per mouse in total. The


observed outcome measures were the relative amount of active observations (i.e. activity level) and the percentage of active observations, in which an animal was either observed ‘climbing’


(i.e. locomotion) or ‘drinking/feeding’ (i.e. maintenance behaviour). For definitions of observed behaviours see Supplementary Table S8. Please note, due to technical reasons one observation


session of one mouse had to be excluded, resulting in 50 instead of 60 observations for this animal in total. STATISTICS The analyses described in the following were conducted using 16 out


of 23 outcome measures derived from 10 experimental test procedures. This selection was necessary to avoid any dependencies between several outcome measures derived from one test procedure.


The selected 16 outcome measures (see Supplementary Tables S2–S7) had pairwise correlation coefficients < 0.5 and were therefore assumed to be independent (cf.49). The whole selection process was


completed by an experimenter blind to the specific outcome measures so that any biases in the selection process could be avoided. Outcome measures for exclusion were determined in a way


that as few outcome measures as possible had to be excluded to warrant non-dependency. Whenever only two outcome measures were correlated with each other, it was randomly chosen which one


was excluded. The whole selection process was done before analyses I and II (see below) were conducted.
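The correlation screen itself is simple to reproduce. The following is a minimal sketch of its logic in R, assuming a hypothetical matrix ‘outcomes’ with one column per candidate outcome measure and one row per animal; the original correlation testing was carried out in SPSS (see the software note at the end of this section).

```r
## Minimal sketch of the correlation screen (hypothetical 'outcomes' matrix,
## 23 columns = candidate outcome measures, one row per animal).
r <- cor(outcomes, use = "pairwise.complete.obs")

# Pairs with a correlation coefficient >= 0.5 are treated as dependent and
# flagged so that one measure of each pair can be excluded
dependent_pairs <- which(abs(r) >= 0.5 & upper.tri(r), arr.ind = TRUE)
```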


For the main analysis, the reproducibility of the strain comparisons was assessed and compared between both experimental designs using the following two approaches: (I) Consistency of the strain effect across replicate experiments and (II) Estimation of how often and how


accurately the replicate experiments predict the overall effect. (I) CONSISTENCY OF THE STRAIN EFFECT ACROSS REPLICATE EXPERIMENTS To assess the ‘strain-by-replicate experiment’-interaction


as a measurement for reproducibility, the following linear mixed model (LMM, Eqs. 1a and 1b) was applied to both designs (conventional and mini-experiment):

$$y_{ijmk} = \mu + a_{i} + b_{j} + c_{ij} + d_{m} + f_{im} + \varepsilon_{ijmk},$$ (1a)

where $i = 1, \dots, n_S$, $j = 1, \dots, n_R$, $m = 1, \dots, n_B$ and $k = 1, \dots, n_{ijm}$. $a_i$ indicates the main effect of the $i$th level of strain (treatment); $b_j$ represents replicate experiment as a random effect where $b_j \sim N(0, \sigma_b^2)$; $c_{ij}$ represents the strain-by-replicate experiment-interaction as a random effect where $c_{ij} \sim N(0, \sigma_c^2)$; $d_m$ represents block as a random effect where $d_m \sim N(0, \sigma_d^2)$; $f_{im}$ represents the strain-by-block-interaction as a random effect where $f_{im} \sim N(0, \sigma_f^2)$; and the error term $\varepsilon_{ijmk} \sim N(0, \sigma_e^2)$. Or written in layman terms:

$$y = \text{`strain'} + \text{`replicate experiment'} + \text{`strain} \times \text{replicate experiment'} + \text{`block'} + \text{`block} \times \text{strain'},$$ (1b)

where ‘strain’ was included as a fixed factor and ‘replicate experiment’, the ‘strain-by-replicate experiment’-interaction, ‘block’ and the ‘block-by-strain’-interaction as random factors. The factor ‘block’ was included in accordance with the


randomised block design used, in which mice sharing the same micro-environment were treated as one ‘block’ (i.e. same housing rack and testing time window, see Supplementary Fig. S1). In the


mini-experiment design, each ‘block’ also corresponded to one mini-experiment. To meet the assumptions of parametric analysis, residuals were graphically examined for normal distribution,


homoscedasticity, and the Shapiro–Wilk test was applied. When necessary, raw data were transformed using square root, inverse or logarithmic transformations (see Supplementary Tables S2–S7).
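As an illustration, the LMM in Eq. (1) might be fitted as sketched below. This is a minimal sketch only: the paper does not state which mixed-model implementation was used, so the lme4 package and the column names (‘y’, ‘strain’, ‘replicate’, ‘block’) are assumptions.

```r
## Minimal sketch (assumed package and column names): fit the LMM in Eq. (1)
## for one outcome measure and assess the random 'strain-by-replicate
## experiment' interaction via a likelihood ratio test.
library(lme4)

# 'dat' is assumed to contain one outcome measure ('y') plus the design
# columns 'strain', 'replicate' (replicate experiment) and 'block'
full <- lmer(y ~ strain + (1 | replicate) + (1 | strain:replicate) +
               (1 | block) + (1 | strain:block),
             data = dat, REML = FALSE)

# Reduced model without the strain-by-replicate experiment interaction
reduced <- update(full, . ~ . - (1 | strain:replicate))

# Chi-square likelihood ratio test; its p-value is the proxy described below
lrt <- anova(reduced, full)
p_interaction <- lrt[["Pr(>Chisq)"]][2]
```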


Typically, the contribution of a fixed effect to a model is assessed by examining the F-values as they return the relative variance that is explained by the term against the total variance of


the data. For a random effect, however, the F-values cannot be determined. Since the sample size in all replicate experiments was the same, the p-values of the ‘strain-by-replicate


experiment’-interaction term were used as a proxy for the F-values. The p-values are a function of the chi-square value of a Likelihood Ratio test assessing the random effect and the degrees


of freedom in the model. The degrees of freedom can be assumed to be the same in the analysis of both designs, since in both experimental designs the same sample size is used and the models


have the same structure regarding the applied factor levels. Concerning the interaction term, higher p-values indicate more consistency of the strain effect across replicate experiments and


thus better reproducibility. Subsequently, the p-values of the ‘strain-by-replicate experiment’-interaction term of all 16 outcome measures were compared between the conventional and


mini-experiment design using the Wilcoxon signed-rank test (paired, one-tailed; statistical methodology adapted from the analysis of Ref.32).
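A minimal sketch of this design comparison is given below. It assumes the interaction p-values for the 16 outcome measures have already been collected per design into the hypothetical vectors ‘p_mini’ and ‘p_conventional’ (matched by outcome measure); the one-tailed direction follows the hypothesis that p-values are higher in the mini-experiment design.

```r
## Minimal sketch (hypothetical vectors): paired, one-tailed Wilcoxon
## signed-rank test comparing the interaction p-values of both designs.
## 'greater' tests whether p-values tend to be higher (i.e. the strain
## effect is more consistent) in the mini-experiment design.
wilcox.test(p_mini, p_conventional,
            paired = TRUE, alternative = "greater")
```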


(II) ESTIMATION OF HOW OFTEN AND HOW ACCURATELY THE REPLICATE EXPERIMENTS PREDICT THE OVERALL EFFECT In the second analysis, the performance of each experimental design to predict the overall effect size was assessed by the


coverage probability (Pc) and the proportion of accurate results (Pa). First, the overall effect size of each outcome measure and strain comparison was estimated by conducting a


random-effect meta-analysis on the data of all replicate experiments independent of the experimental design. In detail, individual strain effect sizes and corresponding standard errors were


calculated by applying the following linear mixed model (Eqs. 2a and 2b) to the data of each replicate experiment, separately:

$$y_{imk} = \mu + a_{i} + d_{m} + f_{im} + \varepsilon_{imk},$$ (2a)

where $i = 1, \dots, n_S$, $m = 1, \dots, n_B$ and $k = 1, \dots, n_{ijm}$. $a_i$ indicates the main effect of the $i$th level of strain (treatment); $d_m$ represents block as a random effect where $d_m \sim N(0, \sigma_d^2)$; $f_{im}$ represents the strain-by-block-interaction as a random effect where $f_{im} \sim N(0, \sigma_f^2)$; and the error term $\varepsilon_{imk} \sim N(0, \sigma_e^2)$. Or written in layman terms:

$$y = \text{`strain'} + \text{`block'} + \text{`block} \times \text{strain'},$$ (2b)

where ‘strain’ was included as fixed effect and ‘block’ and the ‘block-by-strain’-interaction as random factors to account for the structure of the randomised block design in each replicate experiment (for details see section above and Supplementary Fig.


S1). The random-effect meta-analysis was based on the individual strain effect sizes and standard errors of all replicate experiments. It was conducted using the R-package ‘metafor’70


(Version 2.1.0) to return the overall effect sizes and corresponding CI95 of each outcome measure and strain comparison using the following mixed-effect model (Eqs. 3a and 3b):

$$S_{i} = \mu + f_{i} + \varepsilon_{i},$$ (3a)

where $i = 1, \dots, n_R$. $S_i$ represents the estimated strain effect sizes, $f_i$ indicates replicate experiment as a random effect, and the error term $\varepsilon_i \sim N(0, \sigma_e^2)$. Or written in layman terms:

$$y = \text{`replicate experiment'}.$$ (3b)
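The pooling step might look as follows; this is a minimal sketch assuming that the strain difference (‘estimate’) and its standard error (‘se’) have already been extracted from the per-replicate model in Eq. (2) for one outcome measure and one strain comparison, collected in a hypothetical data frame ‘per_replicate’.

```r
## Minimal sketch of the random-effect meta-analysis in Eq. (3) using the
## 'metafor' package; 'per_replicate' is a hypothetical data frame with one
## row per replicate experiment (columns 'estimate' and 'se').
library(metafor)

meta <- rma(yi = estimate, sei = se, data = per_replicate, method = "REML")

overall_effect <- coef(meta)               # pooled strain difference
overall_ci <- c(meta$ci.lb, meta$ci.ub)    # corresponding CI95
```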


Following this step, individual mean strain differences and CI95 were computed based on the LMM in Eqs. (2a and 2b) using the R-package ‘lsmeans’71 (Version 2.30.0) for each replicate experiment of both designs, separately. In a next step, the Pc and Pa were assessed


for each design. The Pc was calculated by counting how often the CI95 of the replicate experiments covered the overall effect size, whereas the Pa was determined by how often the replicate


experiments in each design predicted the overall effect accurately concerning its statistical significance. For the latter, it was examined whether the CI95 of the overall pooled effect


overlapped with 0 (i.e. the overall effect was not significant) or not (i.e. the overall effect was significant). In a next step, the Pa was calculated by counting all replicate experiments of each design


that predicted the overall effect accurately. In detail, two requirements had to be met for a replicate experiment to be counted as predicting the overall effect accurately. The CI95 of the


replicate experiment had to include the overall effect and, if the CI95 of the overall effect included 0, the CI95 of the replicate experiment also had to include 0.
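The two counts might be implemented as in the following minimal sketch, where ‘reps’ is a hypothetical data frame with one row per replicate experiment (CI95 bounds ‘ci_lo’ and ‘ci_hi’) and ‘overall_effect’/‘overall_ci’ are the pooled estimate and CI95 from the meta-analysis sketched above.

```r
## Minimal sketch of the coverage probability (Pc) and the proportion of
## accurate results (Pa) for one outcome measure and one design.
coverage_and_accuracy <- function(reps, overall_effect, overall_ci) {
  # Pc: replicate CI95 covers the overall effect size
  covers <- reps$ci_lo <= overall_effect & reps$ci_hi >= overall_effect

  overall_incl0 <- overall_ci[1] <= 0 & overall_ci[2] >= 0
  replicate_incl0 <- reps$ci_lo <= 0 & reps$ci_hi >= 0

  # Pa: the replicate covers the overall effect and, if the overall CI95
  # includes 0, the replicate's CI95 includes 0 as well
  accurate <- covers & (!overall_incl0 | replicate_incl0)

  c(Pc = mean(covers), Pa = mean(accurate))
}
```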


In the end, similar to the p-values of the interaction term in the first analysis, the Pc and Pa ratios of all 16 outcome measures were compared between both designs using the Wilcoxon signed-rank test (paired,


one-tailed). All statistical analyses were conducted and graphs created using the statistical software package ‘R’72, except for testing the correlation of outcome measures, for which IBM SPSS


Statistics (IBM Version 23) was used. Differences were considered to be significant for p ≤ 0.05. DATA AVAILABILITY The raw and processed data of the current study as well as the code for


the analyses are available in the following Figshare repositories: https://figshare.com/s/f4f219a35128dc70bb68 and https://figshare.com/s/afbc4523e12b58cd8140 and


https://figshare.com/s/f164b25df58a2cb2dde4 REFERENCES * McNutt, M. Reproducibility. _Science_ 343, 229. https://doi.org/10.1126/science.1250475 (2014). Article  ADS  CAS  PubMed  Google


Scholar  * Drucker, D. J. Crosstalk never waste a good crisis: Confronting reproducibility in translational research crosstalk. _Cell Metab._ 24, 348–360 (2016). Article  CAS  PubMed  Google


Scholar  * Reed, W. R. For the student: A primer on the ‘reproducibility crisis’ and ways to fix it. _Aust. Econ. Rev._ 51, 286–300 (2018). Article  Google Scholar  * Samsa, G. &


Samsa, L. A guide to reproducibility in preclinical research. _Acad. Med._ 94, 47–52 (2019). Article  PubMed  Google Scholar  * Baker, M. 1,500 scientists lift the lid on reproducibility.


_Nature_ 533, 452–454 (2016). Article  ADS  CAS  PubMed  Google Scholar  * Begley, C. G. & Ellis, L. M. Raise standards for preclinical cancer research. _Nature_ 483, 531–533 (2012).


Article  ADS  CAS  PubMed  Google Scholar  * Nosek, B. A. & Errington, T. M. Reproducibility in cancer biology: Making sense of replications. _Elife_ 6, e23383 (2017). Article  PubMed 


PubMed Central  Google Scholar  * Open Science Collaboration. Estimating the reproducibility of psychological science. _Science_ 349, aac4716 (2015). Article  CAS  Google Scholar  * Prinz,


F., Schlange, T. & Asadullah, K. Believe it or not: How much can we rely on published data on potential drug targets? _Nat. Publ. Gr._ https://doi.org/10.1038/nrd3439-c1 (2011).


Article  Google Scholar  * Begley, C. G. & Ioannidis, J. P. A. Reproducibility in science: Improving the standard for basic and preclinical research. _Circ. Res._ 116, 116–126 (2015).


Article  CAS  PubMed  Google Scholar  * Branch, M. N. The “Reproducibility Crisis”: Might the methods used frequently in behavior-analysis research help? _Perspect. Behav. Sci._ 42,


77–89 (2019). Article  PubMed  Google Scholar  * Freedman, L. P., Cockburn, I. M. & Simcoe, T. S. The economics of reproducibility in preclinical research. _PLoS Biol._ 13(6), 1–9.


https://doi.org/10.1371/journal.pbio.1002165 (2015). Article  CAS  Google Scholar  * Head, M. L., Holman, L., Lanfear, R., Kahn, A. T. & Jennions, M. D. The extent and consequences of


P-hacking in science. _PLoS Biol._ 13(3), 1–15. https://doi.org/10.1371/journal.pbio.1002106 (2015). Article  CAS  Google Scholar  * Kerr, N. L. HARKing: Hypothesizing after the results are


known. _Personal. Soc. Psychol. Rev._ 2, 196–217 (1998). Article  CAS  Google Scholar  * Nosek, B. A. _et al._ Promoting an open research culture. _Science_ 348, 1422–1425 (2015). Article 


ADS  CAS  PubMed  PubMed Central  Google Scholar  * Kilkenny, C., Browne, W., Cuthill, I. C., Emerson, M. & Altman, D. G. Animal research: Reporting in vivo experiments: The ARRIVE


guidelines. _Br. J. Pharmacol._ 160, 1577–1579. https://doi.org/10.1111/j.1476-5381.2010.00872.x (2010). Article  CAS  PubMed  PubMed Central  Google Scholar  * Percie du Sert N, Hurst V,


Ahluwalia A, Alam S, Avey MT, Baker M, et al. The ARRIVE guidelines 2.0: Updated guidelines for reporting animal research. _PLoS Biol._ 18 (7), e3000410.


https://doi.org/10.1371/journal.pbio.3000410 (2020) Article  CAS  PubMed  PubMed Central  Google Scholar  * Smith, A. J., Clutton, R. E., Lilley, E., Hansen, K. E. A. & Brattelid, T.


PREPARE: Guidelines for planning animal research and testing. _Lab. Anim._ 52, 135–141 (2018). Article  CAS  PubMed  Google Scholar  * Nosek, B. A. & Lakens, D. Editorial registered


reports. _Soc. Psychol._ 45, 137–141 (2014). Article  Google Scholar  * Center for Open Science https://osf.io/ (2020). * Wharton University of Pennsylvania, Credibility Lab, AsPredicted


https://aspredicted.org/ (2020). * German Federal Institute for Risk Assessment, Animal Study Registry https://www.animalstudyregistry.org/ (2020). * NPQIP Collaborative Group. Did a change


in Nature journals’ editorial policy for life sciences research improve reporting?. _BMJ Open Sci._ https://doi.org/10.17605/OSF.IO/HC7FK (2019). Article  Google Scholar  * Crabbe, J. C.,


Wahlsten, D. & Dudek, B. C. Genetics of mouse behavior: Interactions with laboratory environment. _Science_ 284, 1670–1672 (1999). Article  ADS  CAS  PubMed  Google Scholar  *


Castelhano-Carlos, M. J. & Baumans, V. The impact of light, noise, cage cleaning and in-house transport on welfare and stress of laboratory rats. _Lab. Anim._ 43, 311–327 (2009). Article


  CAS  PubMed  Google Scholar  * Leystra, A. A. & Clapper, M. L. Gut microbiota influences experimental outcomes in mouse models of colorectal cancer. _Genes_ 10, 900 (2019). Article 


CAS  PubMed Central  Google Scholar  * Sorge, R. E. _et al._ Olfactory exposure to males, including men, causes stress and related analgesia in rodents. _Nat. Methods_ 11, 629–632 (2014).


Article  CAS  PubMed  Google Scholar  * Voelkl, B., Vogt, L., Sena, E. S. & Würbel, H. Reproducibility of preclinical animal research improves with heterogeneity of study samples. _PLoS


Biol._ 16, 1–13. https://doi.org/10.1371/journal.pbio.2003693 (2018). Article  CAS  Google Scholar  * Richter, S. H. Systematic heterogenization for better reproducibility in animal


experimentation. _Lab. Anim. (NY)_ 46, 343 (2017). Article  Google Scholar  * Voelkl, B. _et al._ Reproducibility of animal research in light of biological variation. _Nat. Rev. Neurosci._


21, 384–393. https://doi.org/10.1038/s41583-020-0313-3 (2020). Article  CAS  PubMed  Google Scholar  * Richter, S. H., Garner, J. P. & Würbel, H. Environmental standardization: Cure or


cause of poor reproducibility in animal experiments?. _Nat. Methods_ 6, 257–261 (2009). Article  CAS  PubMed  Google Scholar  * Richter, S. H., Garner, J. P., Auer, C., Kunert, J. &


Würbel, H. Systematic variation improves reproducibility of animal experiments. _Nat. Methods_ 7, 167–168 (2010). Article  CAS  PubMed  Google Scholar  * Richter, S. H. _et al._ Effect of


population heterogenization on the reproducibility of mouse behavior: A multi-laboratory study. _PLoS ONE_ 6, e16461 (2011). Article  ADS  CAS  PubMed  PubMed Central  Google Scholar  *


Bodden, C. _et al._ Heterogenising study samples across testing time improves reproducibility of behavioural data. _Sci. Rep._ 9, 1–9. https://doi.org/10.1038/s41598-019-44705-2 (2019).


Article  CAS  Google Scholar  * Richter, S.H., von Kortzfleisch, V. It is time for an empirically informed paradigm shift in animal research. _Nat. Rev. Neurosci._ 1, 1.


https://doi.org/10.1038/s41583-020-0369-0 (2020). Article  Google Scholar  * Bailoo, J. D., Reichlin, T. S. & Würbel, H. Refinement of experimental design and conduct in laboratory


animal research. _ILAR J._ 55, 383–391 (2014). Article  CAS  PubMed  Google Scholar  * Paylor, R. Questioning standardization in science Footprints by deep sequencing. _Nat. Methods_ 6,


253–254 (2009). Article  CAS  PubMed  Google Scholar  * Chesler, E. J., Wilson, S. G., Lariviere, W. R., Rodriguez-Zas, S. L. & Mogil, J. S. Identification and ranking of genetic and


laboratory environment factors influencing a behavioral trait, thermal nociception, via computational analysis of a large data archive. _Neurosci. Biobehav. Rev._ 26, 907–923 (2002). Article


  PubMed  Google Scholar  * Karp, N. A. _et al._ Impact of temporal variation on design and analysis of mouse knockout phenotyping studies. _PLoS ONE_ 9, e111239 (2014). Article  ADS  PubMed


  PubMed Central  CAS  Google Scholar  * Lad, H. V. _et al._ Physiology and behavior behavioural battery testing: Evaluation and behavioural outcomes in 8 inbred mouse strains. _Physiol.


Behav._ 99, 301–316 (2010). Article  CAS  PubMed  Google Scholar  * Mandillo, S. _et al._ Reliability, robustness, and reproducibility in mouse behavioral phenotyping: A cross-laboratory


study. _Physiol. Genom._ 34, 243–255. https://doi.org/10.1152/physiolgenomics.90207.2008 (2008). Article  Google Scholar  * Brooks, S. P., Pask, T., Jones, L. & Dunnett, S. B.


Behavioural profiles of inbred mouse strains used as transgenic backgrounds II: Cognitive tests. _Genes Brain Behav._ 4, 307–317. https://doi.org/10.1111/j.1601-183X.2004.00109.x (2005).


Article  CAS  PubMed  Google Scholar  * Podhorna, J. & Brown, R. E. Strain differences in activity and emotionality do not account for differences in learning and memory performance


between C57BL/6 and DBA/2 mice. _Genes Brain Behav._ 1, 96–110. https://doi.org/10.1034/j.1601-183X.2002.10205.x (2002). Article  CAS  PubMed  Google Scholar  * Kafkafi, N., Lahav, T. &


Benjamini, Y. What’s always wrong with my mouse. _Proceedings of Measuring Behavior 2014: 9__th_ _International Conference on Methods and Techniques in Behavioral Research_ (Wageningen, The


Netherlands, August 27-29, 2014) 107–109 (2014). * Pigliucci, M. _Phenotypic plasticity: Beyond nature and nurture_ (JHU Press, Baltimore, 2001). Google Scholar  * Voelkl, B. & Würbel,


H. Reproducibility crisis: Are we ignoring reaction norms?. _Trends Pharmacol. Sci._ 37, 509–510 (2016). Article  CAS  PubMed  Google Scholar  * Åhlgren, J. & Voikar, V. Experiments done


in Black-6 mice: What does it mean?. _Lab. Anim._ 48, 171. https://doi.org/10.1038/s41684-019-0288-8 (2019). Article  Google Scholar  * Bohlen, M. _et al._ Experimenter effects on


behavioral test scores of eight inbred mouse strains under the influence of ethanol. _Behav. Brain Res._ 272, 46–54. https://doi.org/10.1016/j.bbr.2014.06.017 (2014). Article  CAS  PubMed 


PubMed Central  Google Scholar  * Milcu, A. _et al._ Genotypic variability enhances the reproducibility of an ecological study. _Nat. Ecol. Evol._ 2, 279–287 (2018). Article  PubMed  Google


Scholar  * Karp, N. A., Melvin, D., Mouse, S., Project, G. & Mott, R. F. Robust and sensitive analysis of mouse knockout phenotypes. _PLoS ONE_ 7, e52410 (2012). Article  ADS  CAS 


Google Scholar  * Krakenberg, V. _et al._ Technology or ecology ? New tools to assess cognitive judgement bias in mice. _Behav. Brain Res._ 362, 279–287 (2019). Article  PubMed  Google


Scholar  * Beynen, A. C., Gärtner, K. & Van Zutphen, L. F. M. Standardization of animal experimentation. _Princ. Lab. Anim. Sci. A Contrib. to Hum. Use Care Anim. to Qual. Exp. Results.


2nd edn. Amsterdam Elsevier_ 103–110 (2001). * Festing, M. F. W. Refinement and reduction through the control of variation. _Altern. Lab. Anim._ 32, 259–263 (2004). Article  CAS  PubMed 


Google Scholar  * Festing, M. F. W. Randomized block experimental designs can increase the power and reproducibility of laboratory animal experiments. _ILAR J._ 55, 472–476 (2014). Article 


CAS  PubMed  Google Scholar  * Karp, N. A. _et al._ A multi-batch design to deliver robust estimates of efficacy and reduce animal use—a syngeneic tumour case study. _Sci. Rep._ 10, 1–10.


https://doi.org/10.1038/s41598-020-62509-7 (2020). Article  CAS  Google Scholar  * Russell, W. M. S., Burch, R. L. & Hume, C. W. The principles of humane experimental technique. _Methuen


London_ 238, 64 (1959). Google Scholar  * Würbel, H. Focus on reproducibility more than 3Rs: The importance of scientific validity for harm-benefit analysis of animal research Focus on


Reproducibility. _Nat. Publ. Gr._ 46, 164–166 (2017). Google Scholar  * Kappel, S., Hawkins, P. & Mendl, M. T. To group or not to group? Good practice for housing male laboratory mice.


_Animals_ 7, 88 (2017). Article  PubMed Central  Google Scholar  * Melotti, L. _et al._ Can live with ‘em, can live without ‘em: Pair housed male C57BL/6J mice show low aggression and


increasing sociopositive interactions with age, but can adapt to single housing if separated. _Appl. Anim. Behav. Sci._ 214, 79–88 (2019). Article  Google Scholar  * Lister, R. G. The use of


a plus-maze to measure anxiety in the mouse. _Psychopharmacology_ 92, 180–185 (1987). CAS  PubMed  Google Scholar  * Crawley, J. N. Exploratory behavior models of anxiety in mice.


_Neurosci. Biobehav. Rev._ 9, 37–44 (1985). Article  CAS  PubMed  Google Scholar  * Fuss, J. _et al._ Are you real ? Visual simulation of social housing by mirror image stimulation in single


housed mice. _Behav. Brain Res._ 243, 191–198 (2013). Article  PubMed  Google Scholar  * Chourbaji, S. _et al._ Nature vs nurture: Can enrichment rescue the behavioural phenotype of BDNF


heterozygous mice?. _Behav. Brain Res._ 192, 254–258 (2008). Article  CAS  PubMed  Google Scholar  * O’Connor, A. M., Burton, T. J., Leamey, C. A. & Sawatari, A. The use of the puzzle


box as a means of assessing the efficacy of environmental enrichment. _JoVE J. Vis. Exp. _94, e52225 (2014). Google Scholar  * Touma, C., Sachser, N., Erich, M. & Palme, R. Effects of


sex and time of day on metabolism and excretion of corticosterone in urine and feces of mice. _Gen. Comp. Endocrinol._ 130, 267–278 (2003). Article  CAS  PubMed  Google Scholar  * Touma, C.,


Palme, R. & Sachser, N. Analyzing corticosterone metabolites in fecal samples of mice: A noninvasive technique to monitor stress hormones. _Horm. Behav._ 45, 10–22 (2004). Article  CAS


  PubMed  Google Scholar  * Auer, K. E. _et al._ Measurement of fecal testosterone metabolites in mice: Replacement of invasive techniques. _Animals_ 10, 1–17 (2020). Article  Google Scholar


  * Strekalova, T., Spanagel, R., Bartsch, D., Henn, F. A. & Gass, P. Stress-induced anhedonia in mice is associated with deficits in forced swimming and exploration.


_Neuropsychopharmacology_ 29, 2007–2017. https://doi.org/10.1038/sj.npp.1300532 (2017). Article  Google Scholar  * Deacon, R. M. J. Assessing nest building in mice. _Nat. Protoc._ 1,


1117–1119 (2006). Article  PubMed  Google Scholar  * Viechtbauer, W. Conducting meta-analyses in R with the metafor package. _J. Stat. Softw._ 36, 1–48 (2010). Article  Google Scholar  *


Lenth, R. & Lenth, M. R. Package ‘lsmeans’. _Am. Stat._ 34, 216–221 (2018). Google Scholar  * R Core Team. R: A Language and Environment for Statistical Computing. Download references


ACKNOWLEDGEMENTS The authors thank Viktoria Krakenberg for excellent help with the experiments and very useful comments on revising the manuscript, as well as Karina Handen, Edith Ossendorf and Edith Klobetz-Rassam for excellent technical assistance.

FUNDING Open Access funding enabled and organized by Projekt DEAL. This work was supported by a grant from the German Research Foundation (DFG) to S.H.R. (RI 2488/3-1).

AUTHOR INFORMATION AUTHORS AND AFFILIATIONS
* Department of Behavioural Biology, University of Münster, Badestraße 13, Münster, Germany: Vanessa Tabea von Kortzfleisch, Sylvia Kaiser, Norbert Sachser & S. Helene Richter
* Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Münster, Münster, Germany: Vanessa Tabea von Kortzfleisch, Sylvia Kaiser, Norbert Sachser & S. Helene Richter
* Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK: Natasha A. Karp
* Department of Biomedical Sciences, University of Veterinary Medicine, Vienna, Austria: Rupert Palme

AUTHORS Vanessa Tabea von Kortzfleisch, Natasha A. Karp, Rupert Palme, Sylvia Kaiser, Norbert Sachser & S. Helene Richter

CONTRIBUTIONS S.H.R. acquired the funding of the study. S.H.R., N.S., and S.K. conceived the study. S.H.R., N.S., S.K., and V.K. designed the experiments. V.K. carried out the experiments. R.P. determined the hormonal data. V.K. conducted the statistical analysis of the data with the help of N.K. S.H.R. supervised the project. V.K. visualised the data and wrote the initial draft of the manuscript. S.H.R. edited the initial draft, and N.S., S.K., N.K. and R.P. revised the manuscript critically for important intellectual content.

CORRESPONDING AUTHORS Correspondence to Vanessa Tabea von Kortzfleisch or S. Helene Richter.

ETHICS DECLARATIONS COMPETING INTERESTS VK, RP, SK, NS and SR declare that they have no competing interests. NK is an employee of AstraZeneca and declares no conflicts of interest with regard to the subject matter or materials discussed in this manuscript.

ADDITIONAL INFORMATION PUBLISHER'S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION Supplementary Information.

RIGHTS AND PERMISSIONS OPEN ACCESS


This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE CITE THIS ARTICLE von Kortzfleisch, V.T., Karp, N.A., Palme, R. _et al._ Improving reproducibility in animal research by splitting the study population into several ‘mini-experiments’. _Sci Rep_ 10, 16579 (2020). https://doi.org/10.1038/s41598-020-73503-4
* Received: 07 April 2020
* Accepted: 11 September 2020
* Published: 06 October 2020
* DOI: https://doi.org/10.1038/s41598-020-73503-4

