Survey Response Scales and Why Your Five Points Might Not Mean What You Think

I sat through a dissertation proposal defense a while back where the committee asked one question that the student had no answer for: why five points and not seven? The student had built a twenty-item survey to measure three constructs. Every item used a five-point Strongly Disagree to Strongly Agree scale. The scales worked, statistically. Cronbach's alpha was fine. But no one had thought about why five. The choice was inherited from prior work in the same area, and prior work had probably inherited it from something older still.

That kind of unreflective inheritance is everywhere in IS survey research. The Likert-type scale is the default. It shows up in almost every positivist IS study that measures attitudes, perceptions, or intentions. And most papers treat the scale design as a detail, not a decision. I think this is a mistake.

Rensis Likert developed his original summated rating scales in 1932 as a way to measure attitudes by aggregating responses across multiple items. The key word is "summated." The idea was to combine many items that all point to the same underlying construct, so that idiosyncratic responses to individual items cancel out. A single item measuring "trust" is not a Likert scale in the original sense. It is a single rating. IS research often treats the two as equivalent. They are not. When Davis (1989) operationalized Perceived Usefulness in TAM using six items, each rated on a seven-point scale, the aggregation across six items gave the construct score some measurement depth. A one-item "how useful is this system" question measured on a five-point scale is a different animal, even if both are called Likert items.
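To make "summated" concrete, here is a minimal Python sketch. The response matrix is invented for illustration (it is not Davis's data) and the function names are my own. It sums six hypothetical seven-point items into one score per respondent and computes Cronbach's alpha with the standard formula: k/(k-1) times one minus the ratio of summed item variances to the variance of the total score.

```python
from statistics import variance

def summated_scores(responses):
    """Sum each respondent's item ratings into one scale score."""
    return [sum(row) for row in responses]

def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix of ratings."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance(summated_scores(responses))
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 respondents, 6 items, each rated 1-7.
responses = [
    [6, 7, 6, 5, 6, 7],
    [3, 4, 3, 3, 2, 4],
    [5, 5, 6, 5, 5, 6],
    [2, 2, 3, 1, 2, 2],
    [7, 6, 7, 6, 7, 7],
]

print("scale scores:", summated_scores(responses))
print("alpha:", round(cronbach_alpha(responses), 3))
```

With a single item there is nothing to sum and alpha is undefined, which is the practical side of the distinction between a rating and a scale.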

The choice between five and seven points matters more than it seems. More response points give respondents more room to express gradations in their attitudes, which tends to increase variance in the data and can improve the discriminant validity of constructs that are otherwise hard to separate. But more points also increase cognitive load. When a respondent has to choose between "Agree" and "Strongly Agree," they are making a judgment about intensity that is genuinely difficult. Seven-point scales ask respondents to make finer distinctions than five-point scales, and whether respondents can reliably make those distinctions is an empirical question, not a given. In practice, many respondents compress a seven-point scale to three or four effective options anyway.
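One rough way to check for compression in your own data is to count how many distinct response options each respondent actually used. The sketch below assumes a seven-point scale, uses made-up ratings, and the threshold of four options is my own illustrative choice rather than an established cutoff.

```python
def distinct_options_used(ratings):
    """Number of different scale points one respondent selected."""
    return len(set(ratings))

def share_compressing(respondents, max_options=4):
    """Fraction of respondents who used at most `max_options` scale points."""
    compressed = [r for r in respondents if distinct_options_used(r) <= max_options]
    return len(compressed) / len(respondents)

# Hypothetical ratings: 3 respondents, 8 items, seven-point scale.
respondents = [
    [4, 4, 5, 4, 6, 5, 4, 5],
    [1, 3, 5, 7, 2, 6, 4, 4],
    [6, 6, 7, 6, 7, 7, 6, 6],
]

for ratings in respondents:
    print(distinct_options_used(ratings), "options used")
print("share using four or fewer:", share_compressing(respondents))
```

This is descriptive only. A respondent who uses two options may simply hold consistent attitudes across the items, so a low count is a prompt to look closer, not proof of compression.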

The deeper problem is the ordinal versus interval debate. Likert-type data is ordinal. We know that a response of 4 indicates more agreement than a response of 3, but we cannot assume that the psychological distance from 3 to 4 equals the psychological distance from 4 to 5. This matters because the statistical methods most commonly used in IS survey research, including regression and structural equation modeling, assume interval-level measurement: they treat the data as if those distances were equal. Using ordinal data in methods that require interval data is a known limitation, and the field has largely adopted a convention of treating the data as interval while acknowledging that this is an assumption, not a fact. My study-hub notes from the day 2 validity section put it plainly: Likert scales are usually analyzed as interval data, but this is an analytical convention, not an inherent property of the data.
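A small Python sketch shows why the equal-interval assumption is not harmless. Both groups and both codings are invented for illustration. The second coding assumes, hypothetically, that "Strongly Agree" sits further from "Agree" than the other steps do. Both codings preserve the order of the categories, yet the comparison of means reverses.

```python
from statistics import mean, median

# Hypothetical five-point responses from two groups.
group_a = [4, 4, 4, 4, 4]
group_b = [2, 2, 5, 5, 5]

# Two order-preserving numeric codings of the same five labels.
equal_spacing = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched_top = {1: 1, 2: 2, 3: 3, 4: 4, 5: 6}  # assumed wider top step

def recode(ratings, coding):
    """Map each response label to its numeric value under a coding."""
    return [coding[r] for r in ratings]

def compare(coding):
    """Means and medians of both groups under one coding."""
    a = recode(group_a, coding)
    b = recode(group_b, coding)
    return {
        "mean_a": mean(a), "mean_b": mean(b),
        "median_a": median(a), "median_b": median(b),
    }

for name, coding in [("equal", equal_spacing), ("stretched", stretched_top)]:
    print(name, compare(coding))
```

Under equal spacing group A has the higher mean. Once the top category is assumed to sit further away, group B does. The medians rank B above A under both codings, because the median depends only on order.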

Acquiescence bias adds another layer of complexity, especially for IS research that involves international samples. Acquiescence is the tendency for some respondents to agree with survey items regardless of content. It affects all Likert-type scales, but research in cross-cultural psychology has found that the strength of the effect varies across cultural contexts. If your IS study surveys users in multiple countries and reports country-level comparisons, scale-level acquiescence differences can masquerade as real construct differences. This is a problem I almost never see addressed in IS papers that use multi-country survey data.
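One simple diagnostic, sketched below in Python, works only if the instrument includes a balanced set of positively and negatively worded items for the same construct. That is an assumption about the survey design, not something every IS instrument has. On raw ratings (before reverse-scoring), agreement driven by content should roughly cancel across the two wordings, so a mean well above the midpoint suggests agreement regardless of content. The country labels, ratings, and function names are hypothetical.

```python
from statistics import mean

MIDPOINT = 3  # midpoint of an assumed five-point scale

def acquiescence_index(raw_ratings):
    """Mean raw rating over a balanced item set, minus the scale midpoint.

    `raw_ratings` must be the ratings as given, before reverse-scoring,
    with equal numbers of positively and negatively worded items.
    """
    return mean(raw_ratings) - MIDPOINT

def index_by_group(rows):
    """Average the per-respondent index within each group label."""
    groups = {}
    for group, ratings in rows:
        groups.setdefault(group, []).append(acquiescence_index(ratings))
    return {group: mean(values) for group, values in groups.items()}

# Hypothetical rows: (country, ratings on 2 positive then 2 negative items).
rows = [
    ("Country A", [5, 4, 1, 2]),  # agrees with positives, rejects negatives
    ("Country A", [4, 4, 2, 2]),
    ("Country B", [5, 4, 4, 5]),  # agrees with both wordings
    ("Country B", [4, 5, 3, 4]),
]

print(index_by_group(rows))
```

A gap between groups on this index does not prove bias on its own, but it is a reason to be cautious before interpreting country-level differences in construct scores.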

The midpoint question is also worth thinking about. A five-point scale with a "Neither Agree nor Disagree" midpoint gives respondents an explicit neutral option. Some researchers remove the midpoint, forcing a directional response, on the theory that most attitudes are not truly neutral and that the midpoint is a refuge for low-effort respondents. Others keep it on the theory that forcing a direction when respondents are genuinely ambivalent introduces measurement error. The right answer depends on the construct. For attitudes that plausibly have neutral cases, removing the midpoint creates artificial directionality. For constructs where neutrality is theoretically impossible (you either use the system or you do not), a forced-choice design may be more honest.

Gartner's enterprise surveys face all the same problems and make the same kinds of design choices. When Gartner reports that a certain percentage of CIOs plan to increase technology investments in a given category, those percentages come from survey responses measured on some kind of rating or agreement scale. The methodological choices Gartner makes about scale length, labeling, and midpoints shape the statistics that practitioners then quote as if they were objective facts. You can browse the kinds of findings Gartner publishes at the Gartner newsroom, but the methodology behind those surveys is rarely visible. Academic IS surveys are at least expected to report their measurement decisions and face peer review on them.

When I think about what makes measurement decisions defensible, I come back to what Straub, Boudreau, and Gefen (2004) argued for IS specifically: that the credibility of empirical findings depends on transparent, consistent validation of the measurement instruments. Scale design is part of that. Choosing five points instead of seven, removing the midpoint, using single-item measures where multi-item scales exist: these are all decisions that should be justified in the paper, not buried in a footnote or, worse, left unmentioned. The issue connects to the broader construct validity problem I wrote about in my post on common method bias in survey research, where the measurement method itself becomes a source of variance in the results.

The student at the proposal defense eventually landed on a reasonable answer for the committee: prior IS research in the same domain consistently used five-point scales, so comparability with existing studies justified the choice. That is a legitimate argument. It is not the best argument (comparability with a flawed convention is still a flawed convention), but it is an honest one. What I would have liked to hear is what the five-point scale would miss that a seven-point scale might capture, and whether that mattered for the construct in question.