Ever thought you were going to have an increased workload, only to have multiple events conspire to amplify it to obscene levels and turn it vaguely threatening?

Just curious.

But if there was one thing guaranteed to drag me back to blogging, it was p-values and the replication crisis. I’ve written about both extensively, but there’s more to add.

[W]e observe that much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices, such as the one taken by the editors of Basic and Applied Social Psychology, who decided to ban p-values (null hypothesis significance testing) (Trafimow and Marks, 2015). Misunderstanding or misuse of statistical inference is only one cause of the “reproducibility crisis” (Peng, 2015), but to our community, it is an important one.
When the ASA Board decided to take up the challenge of developing a policy statement on p-values and statistical significance, it did so recognizing this was not a lightly taken step. The ASA has not previously taken positions on specific matters of statistical practice.
Said statement took many months to hash out (no wonder, as “Several expressed doubt about whether agreement could be reached…”), and involved many heated exchanges. The end result is pretty good, and it’s great to see a statistical body admit that “By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” I think “P-values can indicate how incompatible the data are with a specified statistical model” is a bit misleading; rare events do happen, and since a p-value looks at only one hypothesis in isolation, it cannot distinguish between a poor model and a fluke.
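To see why a small p-value alone can’t tell a poor model from a fluke, here’s a minimal sketch in plain Python (a hypothetical one-sample z-test with the standard deviation assumed known): when the null is exactly true, p-values are uniformly distributed, so roughly 5% of experiments come out “significant” on luck alone.

```python
import math
import random

random.seed(0)

def p_value(sample):
    """Two-sided p-value for a one-sample z-test of mean 0, sd assumed to be 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Simulate 10,000 experiments in which the null is TRUE (effect exactly 0).
pvals = [p_value([random.gauss(0, 1) for _ in range(20)]) for _ in range(10000)]

# Under the null, p-values are uniform on [0, 1]: roughly 5% dip below
# 0.05 by luck alone -- a small p can be a fluke, not a bad model.
frac_sig = sum(p < 0.05 for p in pvals) / len(pvals)
print(f"fraction 'significant' under a true null: {frac_sig:.3f}")
```

A p < .05 here tells you the data are unusual *if* the model is right; it cannot by itself say whether the model is wrong or you simply drew an unlucky sample.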
At any rate, if you thought I was being hyperbolic or doom-and-gloomy when I said it was time to panic, you might want to know that Everything is Crumbling.

We all have a limited supply of willpower, and it decreases with overuse. […] That simple idea—perhaps intuitive for nonscientists, but revolutionary in the field—turned into a research juggernaut. In the years that followed, Baumeister and Tice’s lab, as well as dozens of others, published scores of studies using similar procedures. […]

A paper now in press, and due to publish next month in the journal Perspectives on Psychological Science, describes a massive effort to reproduce the main effect that underlies this work. Comprising more than 2,000 subjects tested at two-dozen different labs on several continents, the study found exactly nothing. A zero-effect for ego depletion: No sign that the human will works as it’s been described, or that these hundreds of studies amount to very much at all.

This is the rumour I was talking about last time, now becoming less of one. While the Open Science Collaboration took aim at dozens of hypotheses, this result poured a substantial effort into a decade-old one that’s been backed up by hundreds of follow-up papers. It’s not quite as bad as discrediting Newtonian Mechanics, but it’s not far off.

“All of a sudden it felt like everything was crumbling,” says [Evan] Carter, now 31 years old and not yet in a tenure-track position. “I basically lost my compass. Normally I could say, all right there have been 100 published studies on this, so I can feel good about it, I can feel confident. And then that just went away.” […]

“At some point we have to start over and say, This is Year One,” says [Michael] Inzlicht, referring not just to the sum total of ego depletion research, but to how he sometimes feels about the entire field of social psychology.

All the old methods are in doubt. Even meta-analyses, which once were thought to yield a gold standard for evaluating bodies of research, now seem somewhat worthless. “Meta-analyses are fucked,” Inzlicht warned me. If you analyze 200 lousy studies, you’ll get a lousy answer in the end. It’s garbage in, garbage out.
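Inzlicht’s “garbage in, garbage out” point is easy to demonstrate numerically. A toy sketch, assuming a deliberately crude publication filter (only positive results with p < .05 get “published”; effect sizes, sample sizes, and the filter itself are all illustrative assumptions): an inverse-variance meta-analysis of 200 such studies of a truly zero effect confidently recovers a nonzero “effect.”

```python
import math
import random

random.seed(1)

def one_study(n=20, true_effect=0.0):
    """One tiny study of the (zero) effect; returns (mean, se, two-sided p)."""
    xs = [random.gauss(true_effect, 1) for _ in range(n)]
    mean = sum(xs) / n
    se = 1 / math.sqrt(n)
    z = mean / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return mean, se, p

# Publication bias: only positive, "significant" studies make it into print.
published = []
while len(published) < 200:
    mean, se, p = one_study()
    if p < 0.05 and mean > 0:
        published.append((mean, se))

# Fixed-effect meta-analysis: inverse-variance weighted mean of the
# published estimates. Garbage in, garbage out.
weights = [1 / se ** 2 for _, se in published]
pooled = sum(w * m for (m, _se), w in zip(published, weights)) / sum(weights)
print(f"pooled 'effect' from 200 biased studies of a zero effect: {pooled:.2f}")
```

The meta-analysis is arithmetically flawless; the problem is that the literature it summarizes was filtered before it arrived.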

The situation has gotten so bad it’s threatening to spawn another crisis.

This stuff is all fairly complex, relying as it does on questions of statistical tomfoolery, hidden biases, and so on. And it got even more complex last week with the publication of a “technical comment” also in Science. In that article, a team led by Daniel Gilbert, a renowned psychologist at Harvard, argues that the OSC’s effort was itself riddled with methodological flaws. There’s no reason to be so pessimistic, in other words — there’s less evidence of a reproducibility crisis than Nosek and his colleagues would have you believe.

That article was published alongside a response from Nosek et al., which was itself followed, naturally, by a response to the response by Gilbert et al. Meanwhile, skin-in-the-game observers like the social psychologist Sanjay Srivastava and the statistician Andrew Gelman have written really useful blog posts for those hoping to dive deeper into this controversy.

Not surprisingly, a number of senior authors are arguing there isn’t actually a crisis after all. We’re in danger of entering a replication crisis crisis, where the debate over whether there is a replication crisis itself becomes a crisis and divides the scientific community.

The back-and-forths between the Nosek and Gilbert camps aren’t over, and a lot of them are going to be really technically complex. It would be a shame for people to get dazed by all the numbers and claims and counterclaims being tossed about, and to lose sight of the underlying fact that the original Science article made a powerful argument that psychologists can and must do a better, more transparent job — and helped show them how to do so.

The debate, and the debate on the debate, carry serious consequences.

I feel like the ground is moving from underneath me and I no longer know what is real and what is not.

I edited an entire book on stereotype threat, I have signed my name to an amicus brief to the Supreme Court of the United States citing stereotype threat, yet now I am not as certain as I once was about the robustness of the effect. I feel like a traitor for having just written that; like I’ve disrespected my parents, a no-no according to Commandment number 5. But, a meta-analysis published just last year suggests that stereotype threat, at least for some populations and under some conditions, might not be so robust after all. P-curving some of the original papers is also not comforting. Now, stereotype threat is a politically charged topic and there is a lot of evidence supporting it. That said, I think a lot more painstaking work needs to be done on basic replications, and until then, I would be lying if I said that doubts have not crept in. Rumor has it that an RRR [Registered Replication Report] of stereotype threat is in the works.
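The p-curving Inzlicht mentions can be sketched with a toy simulation (plain Python; the effect size, sample size, and cutoffs are illustrative assumptions, not anyone’s actual data). A p-curve looks at the shape of the *significant* p-values in a literature: a real effect piles them up near zero, while a true null leaves them flat, with only about a fifth of significant p-values falling below .01.

```python
import math
import random

random.seed(2)

def sig_pvals(true_effect, n=20, k=2000):
    """Collect the significant (p < .05) p-values from k simulated studies."""
    out = []
    for _ in range(k):
        xs = [random.gauss(true_effect, 1) for _ in range(n)]
        z = (sum(xs) / n) * math.sqrt(n)
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < 0.05:
            out.append(p)
    return out

def right_skew(pvals):
    """Fraction of significant p-values below .01: high = real effect,
    near 0.2 = what a true null produces by chance alone."""
    return sum(p < 0.01 for p in pvals) / len(pvals)

print("real effect:", round(right_skew(sig_pvals(0.5)), 2))  # heavily right-skewed
print("true null:  ", round(right_skew(sig_pvals(0.0)), 2))  # roughly flat
```

So a flat p-curve in a published literature is a warning sign: the “significant” findings look like what selective reporting of a null effect would generate, which is why p-curving the original stereotype threat papers being “not comforting” matters.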

I’ve looked at that meta-analysis, and come away less gloomy than Inzlicht… but stay tuned. Everything seems to be crumbling.