The Replication Crisis and What It Means for IS Research

The Open Science Collaboration (2015) is widely cited as documenting that a substantial portion of psychology findings did not replicate when independent teams tried to reproduce them. The exact percentages vary depending on how you define replication success: effect size similarity, significance in the same direction, or some other threshold. But the directional finding is clear and widely accepted: a lot of what got published did not hold up when tested again in different samples and different labs.

I want to talk about what that means for IS research, because the first reaction in our field was more or less "that's a psychology problem." It is not.

The mechanisms that produced the replication crisis in psychology are structural, and IS research shares all of them. Publication bias is the one most people know: journals historically preferred studies with statistically significant results. This means that if I ran the same study five times and got one significant result, that result would get published and the other four would sit in a file drawer somewhere. The published literature then shows a "real" effect that may be nothing more than noise. Small samples amplify this problem. If you have 80 participants in a student sample, and your effect is real but small, your power to detect it is low. When you do find significance, it is usually because sampling variability pushed the estimate upward, so the effect that gets published looks larger than the true one.
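To make the file-drawer logic concrete, here is a minimal simulation sketch. Everything specific in it is my assumption for illustration, not taken from any real study: a two-group design with 40 participants per group (80 total), a true standardized effect of 0.2, unit variance treated as known so a simple z-test can stand in for the usual t-test, and the function name simulate_file_drawer itself.

```python
import random
from statistics import NormalDist, mean


def simulate_file_drawer(true_d=0.2, n_per_group=40, n_studies=5000,
                         alpha=0.05, seed=1):
    """Run many small two-group studies and 'publish' only the significant ones.

    Assumptions (illustrative only): outcomes are normal with unit variance,
    and the variance is treated as known, so a z-test is used.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    se = (2 / n_per_group) ** 0.5                  # SE of a difference in means
    all_effects, published = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        d_hat = mean(treated) - mean(control)      # estimated effect
        all_effects.append(d_hat)
        if abs(d_hat / se) > z_crit:               # "significant" -> published
            published.append(d_hat)
    return {
        "power": len(published) / n_studies,       # share of studies published
        "mean_all": mean(all_effects),             # what the full record shows
        "mean_published": mean(published) if published else float("nan"),
    }


if __name__ == "__main__":
    for name, value in simulate_file_drawer().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed settings, only about one study in seven reaches significance, the average across all studies sits near the true effect of 0.2, and the average across the "published" studies is more than double that. The point is not the specific numbers, which depend entirely on the assumptions, but that filtering on significance alone is enough to inflate the published effect.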

Then there is p-hacking, which is the practice of running multiple analyses on the same data until something hits the p < .05 threshold. This can happen consciously or unconsciously. You collect data on ten variables, run all pairwise correlations, find three that are significant, build a story around those three, and submit the paper. The story sounds coherent after the fact. There is no way for reviewers to know that you started with a different theoretical model and pivoted when the data did not cooperate.
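The ten-variable example is worth putting numbers on. The sketch below is a back-of-the-envelope calculation, not a formal analysis: it assumes there are no true relationships in the data and treats the pairwise tests as independent, which they are not strictly, so read the last figure as a rough approximation. The function name chance_findings is mine.

```python
from math import comb


def chance_findings(n_variables=10, alpha=0.05):
    """Count pairwise tests and the false positives expected by chance alone.

    Assumes every null hypothesis is true and, as an approximation,
    that the tests are independent of one another.
    """
    n_tests = comb(n_variables, 2)                  # number of distinct pairs
    expected_false_positives = n_tests * alpha      # expected chance "hits"
    p_at_least_one = 1 - (1 - alpha) ** n_tests     # independence approximation
    return n_tests, expected_false_positives, p_at_least_one


if __name__ == "__main__":
    tests, expected, p_any = chance_findings()
    print(f"{tests} tests, {expected:.2f} expected false positives, "
          f"P(at least one) = {p_any:.2f}")
```

With ten variables there are 45 pairwise correlations, so at the .05 threshold you would expect a little over two of them to come up significant even if nothing is going on, and the chance of getting at least one is roughly 90 percent under the independence approximation. Three significant correlations out of 45 is about what noise alone would give you.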

IS research does all of this. The field runs heavily on Likert-scale surveys, often with student samples, often in a single organization, collected at a single point in time. I have reviewed papers where the entire sample was MBA students at one university using a newly launched course management system. The authors found the hypothesized relationships, published, and acknowledged the limitations in a paragraph at the end. Then other papers built on that one. If the original result was not real, the whole chain of downstream work rests on a fragile foundation.

The student sample issue is worth dwelling on. There is a real logic to using students: they are relatively homogeneous, which reduces noise, and they are available. In some domains, like studying how people learn with educational technology, students are the right population. But a lot of IS research uses students as a convenient stand-in for "users in general," and that assumption is almost never tested. Students are younger, more educated than average, and more comfortable with technology in specific ways, and they respond to survey instruments in a context where they want to appear competent. That is a very specific and strange population to generalize from.

The single-organization case study creates a different but related problem. Case studies are valuable and I am not criticizing them here. But a case study of one organization implementing ERP in 2018 may not tell you much about ERP implementations in general. The organization's history, industry, leadership, and a dozen other factors shape the outcome. When the case produces a set of "factors for success," those factors are specific to that case. Turning them into general theory requires more work than most papers do.

What is notable about the replication crisis is that it was not primarily about fraud. Most of the original studies in psychology were conducted honestly. The researchers believed their hypotheses, ran legitimate statistical tests, and reported what they found. The problem was structural: the incentive system rewarded novelty and significance, small samples were treated as adequate, and the norms of the field did not require pre-registering hypotheses or sharing data. When all of those things are true at once, even honest researchers produce unreliable results at scale.

IS has made some moves toward addressing this. Open data and materials sharing are slowly growing. Some journals are accepting registered reports, where you pre-register your hypotheses and methods before collecting data, and the journal commits to publishing regardless of the outcome. Pre-registration is the most direct fix for p-hacking and outcome-switching, because the hypotheses are locked in before anyone sees the results.

But the field has a long way to go. Most IS journals still do not require data sharing. Replication studies are rare and not particularly valued. When I look at the citation patterns in high-impact IS journals, I see a lot of papers extending existing models with new moderators or new contexts, but very few papers asking whether the original models held up in new samples. The literature accumulates but does not self-correct the way it should.

I think the harder conversation is about what we treat as a contribution. A study that finds a new moderator of technology acceptance, using a student sample in a single country, with no replication check, might be publishable. A replication study that tests whether an established finding generalizes to a different population might be harder to place, even though the replication study is arguably more useful for building cumulative knowledge. That incentive structure is the real problem. And changing it requires journal editors, program chairs, and tenure committees to decide that reproducibility is a scholarly value, not just a methodological nicety.

The psychology replication crisis was uncomfortable for that field. The equivalent reckoning in IS research has not really happened yet. That does not mean the field is immune. It means the conversation has not started in earnest.