Continuing with the replication theme, I’ve heard there’s a potential solution to the replication problem.

If I had to advocate for a single change to practice, [internal replication] would be it. In my lab we never do just one study on a topic, unless there are major constraints of cost or scale that prohibit that second study. Because one study is never decisive.* Build your argument cumulatively, using the same paradigm, and include replications of the key effect along with negative controls. This cumulative construction provides “pre-registration” of analyses – you need to keep your analytic approach and exclusion criteria constant across studies. It also gives you a good intuitive estimate of the variability of your effect across samples. The only problem is that it’s harder and slower than running one-off studies. Tough. The resulting quality is orders of magnitude higher.

Unfortunately, the person who mentioned the idea to me only did so to poke holes in it.

If you do not know how to read these plots, don’t worry. Just focus on this key comparison. 17 internally replicated effects could also be replicated by an independent team (1 effect was below p = .055 and is not counted here). The number is exactly the same for effects which were not internally replicated. So, the picture couldn’t be any clearer: internal replications are not the solution to Psychology’s replication problem according to this data set.

(That data set being the one from the psychology study I discussed last time.)

I believe that internal replications do not prevent many questionable research practices which lead to low replication rates, e.g., sampling until significant and selective effect reporting. To give you just one infamous example which was not part of this data set: in 2011 Daryl Bem showed his precognition effect 8 times. Even with 7 internal replications I still find it unlikely that people can truly feel future events. Instead I suspect that questionable research practices and pure chance are responsible for the results. Needless to say, independent research teams were unsuccessful in replication attempts of Bem’s psi effect (Ritchie et al., 2012; Galak et al., 2012). There are also formal statistical reasons which make papers with many internal replications even less believable than papers without internal replications (Schimmack, 2012).

Huh, Daryl Bem? Small world. But let’s focus on Schimmack’s study, as I think it deserves a wider airing (even if it does conflate p-values with Type I error rates on page 5; they are not the same thing). Schimmack’s work mentions a number of problems with multi-study approaches: segmenting the sample set into multiple studies dilutes the statistical power and makes spurious results more likely; getting multiple positives nearly forces researchers to engage in questionable practices; non-positive results can be dropped under the excuse of “pilot studies,” “poor procedure,” or even “bad luck,” and still result in a publishable work. He goes on to give a solid argument for being suspicious of most multi-study papers.

it is helpful to consider a concrete example. Imagine a multiple study article with 10 studies with an average observed effect size of d = .5 and 84 participants in each study (42 in two conditions, total N = 840) and all studies produced a significant result. At first sight these 10 studies seem to provide strong support against the null-hypothesis. However, a post-hoc power analysis with the average effect size of d = .5 as estimate of the true effect size reveals that each study had only 60% power to obtain a significant result. That is, even if the true effect size were d = .5, only 6 out of 10 studies should have produced a significant result. […]

From the perspective of binomial probability theory, the scenario is analogous to an urn problem with replacement with six green balls (significant) and four red balls (non-significant). The binomial probability to draw at least 1 red ball in 10 independent draws is 99.4%. … That is, 994 out of 1000 multiple study articles with 10 studies and 60% average power should have produced at least one non-significant result in one of the 10 studies. It is therefore incredible if an article reports 10 significant results because only 6 out of 1000 attempts would have produced this outcome simply due to chance alone.
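If you want to check that arithmetic yourself, here is a minimal sketch in Python. The function names are my own, and the power calculation uses a plain normal approximation rather than whatever method Schimmack used, so expect it to land near his 60% figure rather than exactly on it. The binomial part is just 0.6 raised to the 10th power, which comes to roughly 0.006.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison.

    Assumption: a normal approximation stands in for the exact
    noncentral t calculation, so the result is only a rough estimate.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = d * sqrt(n_per_group / 2)      # expected test statistic under d
    upper = 1 - z.cdf(z_crit - shift)      # significant in the expected direction
    lower = z.cdf(-z_crit - shift)         # significant in the wrong direction
    return upper + lower


def prob_all_significant(power, n_studies):
    """Chance that every one of n independent studies is significant."""
    return power ** n_studies


# Numbers from the quoted example: d = .5, 42 per condition, 10 studies.
approx_power = approx_power_two_sample(0.5, 42)
p_all = prob_all_significant(0.6, 10)      # using the quoted 60% power
p_any_miss = 1 - p_all                     # at least one non-significant study

print(f"approximate power per study: {approx_power:.2f}")
print(f"P(all 10 significant):       {p_all:.4f}")
print(f"P(at least one miss):        {p_any_miss:.4f}")
```

Plugging in other study counts makes the point quickly: the probability of a clean sweep shrinks geometrically with every extra study, which is why a long run of uniformly significant results at modest power should raise eyebrows.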

I’ll give Schimmack the final word, here.

It is important that psychologists use the current crisis as an opportunity to fix problems in the way research is being conducted and reported. The proliferation of eye-catching claims based on biased or fake data can have severe negative consequences for a science. A recent New Yorker article (Lehrer, 2010) warned the public that “all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable.” If students who read psychology textbooks and the general public lose trust in the credibility of psychological science, psychology loses its relevance because objective empirical data are the only feature that distinguishes psychological science from other approaches to the understanding of human nature and behavior. It is therefore hard to exaggerate the seriousness of doubts about the credibility of research findings published in psychological journals.

[1] Schimmack, Ulrich. “The ironic effect of significant results on the credibility of multiple-study articles.” Psychological Methods 17.4 (2012): 551.