If there’s anything more embarrassing than learning someone else couldn’t replicate your results, it must be failing to replicate your own.

The neurohormone oxytocin (OT) has been one of the most studied peptides in the behavioral sciences over the past two decades. Many studies have suggested that OT could increase trusting behaviors. A previous study based on the “Envelope Task” paradigm, in which trust is assessed by the degree of openness of an envelope containing a participant’s confidential information, showed that OT increases trusting behavior and reported one of the most powerful effects of OT on a behavioral variable. In this paper we present two failed replications of this effect, despite sufficient power to replicate the original large effect.

If you’re expecting me to launch into a diatribe about poor study design, you’re going to be disappointed. These researchers used a sample large enough to achieve the statistical power they needed. Indeed, they even mention power in their analysis, unlike most authors!

The first hypothesis suggests that OT’s effect in the ET paradigm exists and that the original study’s effect size is the “true” one. This hypothesis is, however, unlikely, as only 14 participants would have been needed to replicate such a large effect size (d = 2.29). Even if we take the lower bound of a large effect size according to Cohen’s conventions (i.e., d = 0.8), 84 participants should have been enough to replicate a large effect (we had 95 participants in Study 1). Therefore, we can exclude a priori the idea that our studies were underpowered to replicate a large effect.
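The quoted sample sizes can be checked with a standard power calculation. The sketch below assumes a two-sided, two-sample t-test at alpha = .05 with 95% power — the paper doesn’t spell out those settings, but they reproduce its “84 participants” figure for d = 0.8.

```python
# Power check for a two-sided, two-sample t-test, using the exact
# noncentral t distribution. Alpha and power settings are assumptions
# (they recover the paper's 84-participant figure for d = 0.8).
import math
from scipy import stats

def power_two_sample(n_per_group: int, d: float, alpha: float = 0.05) -> float:
    """Power of a two-sided two-sample t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    ncp = d * math.sqrt(n_per_group / 2)      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Upper rejection tail only; the lower tail's contribution is negligible.
    return stats.nct.sf(t_crit, df, ncp)

def n_needed(d: float, target_power: float = 0.95) -> int:
    """Smallest per-group n reaching the target power."""
    n = 2
    while power_two_sample(n, d) < target_power:
        n += 1
    return n

print("total N for d = 0.80:", 2 * n_needed(0.8))    # paper: 84
print("total N for d = 2.29:", 2 * n_needed(2.29))   # paper: 14
```

With 95 participants in Study 1, the replication comfortably clears either threshold, which is the quoted passage’s point.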

And yes, they do use p-values, which I’m not a big fan of, and yes, they confused Type I error rates with p-values… but because the sample size is relatively small and the original effect is so huge, a Bayesian analysis of the original study wouldn’t come to a radically different conclusion. I quickly whipped up a naive one, and the Bayes factor was so large I never bothered to count the digits. d = 2.29 is nearly unheard of, and I’d hate to see the prior that could nullify that Bayes factor.
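The post doesn’t say which Bayesian analysis was used, but one genuinely “naive” option is Wagenmakers’ (2007) BIC approximation to the Bayes factor, which needs only the t statistic and the sample size. The group size below is hypothetical, since the original study’s n isn’t given here; the point is only that a d of 2.29 yields an astronomical BF10 for any plausible n.

```python
# Naive Bayes factor via the BIC approximation (Wagenmakers, 2007):
# BF01 ~= sqrt(N) * (1 - R^2)^(N/2), where R^2 = t^2 / (t^2 + df).
# The group size is hypothetical -- the original study's n isn't
# reported in this post.
import math

def bf10_from_t(t: float, n1: int, n2: int) -> float:
    """Approximate BF10 for a two-sample t-test via the BIC shortcut."""
    n_total = n1 + n2
    df = n_total - 2
    r2 = t**2 / (t**2 + df)          # variance explained by group
    log_bf01 = 0.5 * math.log(n_total) + (n_total / 2) * math.log(1 - r2)
    return math.exp(-log_bf01)

d, n_per_group = 2.29, 30            # hypothetical group size
t = d * math.sqrt(n_per_group / 2)   # t statistic implied by d
print(f"BF10 ~ {bf10_from_t(t, n_per_group, n_per_group):.3g}")
```

Even this crude approximation lands in the billions for d = 2.29 with 30 per group, consistent with not bothering to count the digits.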

I don’t think the original study was a failure, though. And the researchers did the right thing by publishing this replication instead of sitting on it. Indeed, despite the lousy statistical techniques on display, they’re much more on the ball than most of their peers.

… “fishing” practices leading to post-hoc identification of potential moderators should be avoided. Indeed, multiple hypothesis testing might increase the false discovery rate. If a study nonetheless reports an interesting post-hoc moderating effect, it should also report the other moderators that were tested but did not yield a significant effect, as this would greatly help to identify which variables do or do not moderate OT’s effects. The reviewing process could encourage such disclosure.

Meanwhile, the OT literature should be more open to non-significant results and keep in mind that paradigms that have worked once might not always keep their promise. Whenever possible, a replication of the results should be attempted before publication, and in any case the OT literature should promote the publication of failed replications.

So I come not to bury this paper, but to praise it. This is more of what we need in science, if we’re to overcome the replication crisis. But would it kill these researchers to learn a bit of Bayesian stats? Oy.