When IS researchers compare survey results across groups, they assume the constructs mean the same thing to everyone. Most papers never test this. The test is called measurement invariance.
There is an assumption buried in almost every IS paper that compares groups, and almost nobody checks it explicitly. When you run a study that includes men and women, or experienced and novice users, or respondents from the US and China, and you compare their scores on constructs like "perceived usefulness" or "system trust," you are assuming that those constructs mean the same thing to both groups, and that the survey items work the same way for both groups. That assumption has a name: measurement invariance. And if it does not hold, your group comparisons may be meaningless.
Let me explain what it actually means for an item to "work the same way" across groups. In a reflective measurement model, each survey item is an indicator of the underlying construct. The item "Using this system increases my productivity" is supposed to reflect perceived usefulness. With an unstandardized factor loading of, say, 0.78, we assume that a one-unit increase in the underlying construct produces a 0.78-unit increase in the expected item response. Metric invariance means those factor loadings are the same across groups. If the productivity item loads at 0.78 among experienced users but at 0.45 among novice users, then the item has a different relationship to the construct depending on who is answering it. Comparing those two groups on "perceived usefulness" is not comparing the same thing.
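To make the loading example tangible, here is a minimal simulation sketch in Python. It assumes a simple linear item model (item = intercept + loading × construct + error), and the intercept, noise level, sample size, and function names are all made up for illustration. Unlike in real survey data, the simulation gets to observe the latent construct directly, which is what lets us recover the loading with an ordinary regression slope.

```python
import random

def simulate_item(loading, n, rng, intercept=4.0, noise_sd=0.6):
    """Draw (construct, item) pairs from item = intercept + loading * construct + error."""
    pairs = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)  # latent perceived usefulness (hypothetical scale)
        x = intercept + loading * eta + rng.gauss(0.0, noise_sd)
        pairs.append((eta, x))
    return pairs

def ols_slope(pairs):
    """Slope of the item regressed on the construct, i.e. the recovered loading."""
    n = len(pairs)
    mean_eta = sum(e for e, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    cov = sum((e - mean_eta) * (x - mean_x) for e, x in pairs)
    var = sum((e - mean_eta) ** 2 for e, _ in pairs)
    return cov / var

rng = random.Random(42)
experienced = simulate_item(0.78, 20000, rng)  # loading from the example in the text
novice = simulate_item(0.45, 20000, rng)

slope_experienced = ols_slope(experienced)
slope_novice = ols_slope(novice)
print(f"recovered loading, experienced: {slope_experienced:.2f}")
print(f"recovered loading, novice:      {slope_novice:.2f}")
```

The same one-unit shift in the construct moves the item by very different amounts in the two simulated groups, which is exactly what a metric invariance test is designed to detect.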
Scalar invariance is even more demanding. It requires not just that the loadings are equal but that the item intercepts are equal across groups. The intercept is the baseline level of the item response when the underlying construct is at zero. If experienced users have a systematically higher baseline response to the productivity item, independent of their actual perceived usefulness, then even if the loadings match, you are still comparing apples to oranges. Mean comparisons across groups, the kind you do when you claim "experienced users report higher perceived usefulness than novices," require scalar invariance to be defensible.
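One way to see why intercepts matter for mean comparisons is to write out the expected item response in each group. The notation below is my own shorthand, not taken from any particular source: $\tau_g$ is the item intercept in group $g$, $\lambda$ is the loading (assumed equal across groups, so metric invariance holds), and $\kappa_g$ is the latent mean in group $g$.

```latex
\begin{align}
x_{ig} &= \tau_g + \lambda\,\eta_{ig} + \varepsilon_{ig},
  \qquad \mathbb{E}[\eta_{ig}] = \kappa_g,\quad \mathbb{E}[\varepsilon_{ig}] = 0 \\
\mathbb{E}[x \mid g] &= \tau_g + \lambda\,\kappa_g \\
\mathbb{E}[x \mid A] - \mathbb{E}[x \mid B]
  &= (\tau_A - \tau_B) + \lambda\,(\kappa_A - \kappa_B)
\end{align}
```

The observed mean difference only reflects the latent mean difference when $\tau_A = \tau_B$. If the intercepts differ, the first term contaminates the comparison even when the two groups have identical latent means.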
Chen (2007) and Vandenberg and Lance (2000) are widely cited in the management and IS methods literature for laying out the sequence of tests and the criteria for what counts as invariance. The standard approach uses a series of increasingly constrained confirmatory factor analysis (CFA) models. You start with configural invariance, which tests whether the same factor structure (same items loading on the same factors) holds in both groups. Then you constrain the factor loadings to be equal across groups and test whether model fit degrades significantly. That is the metric invariance test. Then you constrain the item intercepts to be equal and test again. That is the scalar invariance test.
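The stepwise logic can be sketched as a small decision function. This is an illustration under stated assumptions, not a substitute for fitting the models: it takes CFI values you have already obtained from your own multi-group CFA software, it assumes the configural model itself fits acceptably, and it uses a change-in-CFI rule of thumb with an assumed default cutoff of 0.01. Check the cutoff, and whether you should also look at other fit indices or a chi-square difference test, against the criteria in the sources you cite. The function name and the CFI numbers are hypothetical.

```python
def invariance_level(cfi_by_step, max_cfi_drop=0.01):
    """Return the highest invariance level supported by a sequence of nested models.

    cfi_by_step maps "configural", "metric", and "scalar" to the CFI of each model.
    A step is rejected when CFI drops by more than max_cfi_drop relative to the
    previous, less constrained model (an assumed rule of thumb).
    """
    steps = ["configural", "metric", "scalar"]
    reached = "configural"  # assumes the configural model already fits acceptably
    for previous, current in zip(steps, steps[1:]):
        drop = cfi_by_step[previous] - cfi_by_step[current]
        if drop > max_cfi_drop:
            break  # adding these equality constraints degraded fit too much
        reached = current
    return reached

# Made-up fit values for illustration only.
example_fit = {"configural": 0.962, "metric": 0.958, "scalar": 0.931}
supported = invariance_level(example_fit)
print(f"highest supported level: {supported}")
```

In this made-up example, constraining the loadings barely changes fit, but constraining the intercepts does, so the function stops at metric invariance.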
If you fail scalar invariance but have metric invariance, your comparisons are limited. You can compare latent variances, covariances, and structural paths across groups, but you cannot compare latent means. A lot of moderation analyses in IS papers implicitly require at least metric invariance to be valid, and scalar invariance as well whenever group means are also compared, and they do not test for either. The moderation just gets estimated and reported. The paper says "group membership moderates the effect of perceived ease of use on behavioral intention," and the reviewers do not ask whether the measurement of perceived ease of use functioned equivalently across groups in the first place.
I want to be concrete about why this matters more now than it might have twenty years ago. IS research has become genuinely global. It is normal to collect survey data in multiple countries and compare constructs across cultural contexts. The Technology Acceptance Model was originally developed and validated in North American samples. Constructs like "perceived usefulness" and "social influence" may not mean the same thing in collectivist cultures as they do in individualist ones. Not because the people are incapable of understanding the items, but because the items invoke different webs of meaning, different reference points, and different social contexts depending on who is reading them.
A study might find that social influence is a stronger predictor of adoption among users in China than among users in the US. That is a theoretically interesting finding if it means that social norms actually play a different causal role in the two contexts. But if the social influence items function differently in the two samples, with different loadings and different intercepts, then you cannot tell whether the observed difference is real or an artifact of measurement non-equivalence. You might be measuring the same latent variable with different precision in the two groups and calling the resulting difference a cross-cultural effect.
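Here is a rough simulation sketch of that artifact. Both simulated groups have exactly the same structural effect of the latent construct on the outcome (beta = 0.5), but the items load differently in the two groups. All parameter values, group labels, and function names are invented for illustration, and the analysis mimics the common shortcut of regressing the outcome on an averaged composite score rather than fitting a latent variable model.

```python
import random

def simulate_group(loading, n, rng, beta=0.5, n_items=3,
                   item_noise_sd=0.6, outcome_noise_sd=0.8):
    """Return (composite, outcome) rows; beta is the true latent effect on the outcome."""
    rows = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)  # latent social influence
        items = [loading * eta + rng.gauss(0.0, item_noise_sd) for _ in range(n_items)]
        composite = sum(items) / n_items  # averaged scale score
        outcome = beta * eta + rng.gauss(0.0, outcome_noise_sd)
        rows.append((composite, outcome))
    return rows

def ols_slope(rows):
    """Slope of the outcome regressed on the composite score."""
    n = len(rows)
    mean_c = sum(c for c, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((c - mean_c) * (y - mean_y) for c, y in rows)
    var = sum((c - mean_c) ** 2 for c, _ in rows)
    return cov / var

rng = random.Random(2024)
group_a = simulate_group(loading=0.8, n=20000, rng=rng)  # items load strongly
group_b = simulate_group(loading=0.5, n=20000, rng=rng)  # items load weakly

slope_a = ols_slope(group_a)
slope_b = ols_slope(group_b)
print(f"observed slope, group A: {slope_a:.2f}")
print(f"observed slope, group B: {slope_b:.2f}")
```

The two observed slopes differ noticeably even though the underlying effect is identical by construction. A moderated regression on these composites would report a group difference that comes entirely from the measurement model.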
The practical barrier to doing this properly is that it requires a reasonably large sample in each group and some familiarity with CFA modeling. Multi-group CFA is not complicated once you have done it a few times, but it adds a step that many IS researchers skip because reviewers do not routinely ask for it. That is the real driver here. If the norm is to test and report measurement invariance, researchers will do it. If the norm is to collect data across groups and run a moderated regression without ever specifying your measurement model, the assumption stays invisible and unchecked.
There is also a reasonable question about what to do when you detect partial non-invariance. Full scalar invariance is often not achieved in real IS survey data, especially across cultural groups. The more interesting question is which items are non-invariant and whether that non-invariance is theoretically meaningful or a translation artifact. If one item in a five-item construct shows non-invariant intercepts across countries, you can still make limited comparisons using the invariant items. Partial invariance does not necessarily kill the comparison; it just means you need to be more careful about what you claim. Some researchers drop the non-invariant items and proceed with the remaining items. Whether that is methodologically appropriate depends on whether the dropped items were contributing to construct validity or whether they were outliers.
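A small simulation sketch shows what is at stake with a single non-invariant item. Two groups have identical latent means, but the fifth item of a five-item scale has a higher intercept in group B. Every number and name here is an assumption for illustration, and comparing composite means is a simplification: in practice you would usually free the non-invariant intercept in the multi-group CFA rather than average items by hand.

```python
import random

LOADINGS = [0.7, 0.7, 0.7, 0.7, 0.7]  # equal across groups (metric invariance holds)

def simulate_responses(intercepts, n, rng, noise_sd=0.6):
    """Simulate five item responses per respondent; latent mean is 0 in every group."""
    data = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)
        row = [tau + lam * eta + rng.gauss(0.0, noise_sd)
               for tau, lam in zip(intercepts, LOADINGS)]
        data.append(row)
    return data

def composite_mean(data, item_indices):
    """Group mean of the scale score built from the chosen items."""
    per_person = [sum(row[i] for i in item_indices) / len(item_indices) for row in data]
    return sum(per_person) / len(per_person)

rng = random.Random(7)
group_a = simulate_responses([4.0, 4.0, 4.0, 4.0, 4.0], 40000, rng)
group_b = simulate_responses([4.0, 4.0, 4.0, 4.0, 4.5], 40000, rng)  # item 5 shifted

all_items = [0, 1, 2, 3, 4]
invariant_items = [0, 1, 2, 3]

gap_all = composite_mean(group_b, all_items) - composite_mean(group_a, all_items)
gap_invariant = (composite_mean(group_b, invariant_items)
                 - composite_mean(group_a, invariant_items))
print(f"mean gap using all five items:       {gap_all:.3f}")
print(f"mean gap using four invariant items: {gap_invariant:.3f}")
```

With all five items, the groups appear to differ even though their latent means are the same. Restricting the comparison to the invariant items removes the spurious gap, at the cost of a narrower measure of the construct.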
The broader point is that cross-group comparisons are not free. They carry an assumption, and the assumption is testable. IS research would be better served by making this test routine rather than exceptional.