AI Is Forcing IS Research Methods to Evolve. Here Is What Should Change.

I spent the last month digging into the Next Generation IS Methods literature for my comps preparation, and something kept bothering me. Blohm et al. (2025) argue that IS needs methods for complex and dynamic phenomena. Pieper et al. (2025) introduce micro-randomized trials for time-varying digital interventions. These are real contributions. But when I read them against the standard IS method papers, such as Lee (1991), Klein and Myers (1999), and Eisenhardt (1989), the gap is not just that the world has changed. The assumption about which side of the relationship adapts has flipped. IS research methods were designed for a world where technology is deployed by organizations and adopted by users. The system is stable. The human is the adaptive part. AI turns that assumption around: the technology itself changes post-deployment. A model retrains, adapts, shifts its decision boundary. The treatment changes while you are still measuring its effect.

Consider a pre-post study of an AI clinical decision support system. You measure clinician decision quality before deployment, roll out the AI, and measure again after. If the AI model is static, the pre-post comparison works. If the AI model updates itself based on user interactions, as many real production systems do, the AI between pre and post is not the same AI. The model that made recommendations in week one learned from user feedback and changed its behavior by week twelve. What did you just measure? You measured the effect of a moving target. The pre-post design assumes a stable treatment. AI violates that assumption every day.
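To make the moving-target problem concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the effect sizes, the linear drift, the sample size, and the function names are hypothetical and do not come from any of the cited papers. It shows that the same pre-post design returns a different "effect of the AI" depending on which week the post measurement happens to fall in.

```python
import random
import statistics

BASE_EFFECT = 0.5  # hypothetical effect of the week-one model on decision quality


def effect_in_week(week, drift_per_week):
    """True effect of whatever model is live in a given week (week 1 = launch)."""
    return BASE_EFFECT + drift_per_week * (week - 1)


def pre_post_estimate(post_week, drift_per_week, n=500, seed=0):
    """Naive pre-post estimate: mean(post) - mean(pre) for one measurement week."""
    rng = random.Random(seed)
    # Pre-deployment decision quality, standardized to mean 0.
    pre = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Post-deployment quality reflects only the model that is live in post_week.
    live_effect = effect_in_week(post_week, drift_per_week)
    post = [rng.gauss(live_effect, 1.0) for _ in range(n)]
    return statistics.mean(post) - statistics.mean(pre)


if __name__ == "__main__":
    for week in (1, 6, 12):
        static = pre_post_estimate(week, drift_per_week=0.0)
        drifting = pre_post_estimate(week, drift_per_week=0.1)
        print(f"week {week:2d}: static model {static:.2f}, drifting model {drifting:.2f}")
```

With a static model the estimate stays near the same value whichever week you pick. With a drifting model the estimate tracks the week-twelve model, not the model that was deployed, and the design itself gives no signal that this happened.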

The same logic applies to A/B testing AI features. Suppose you randomize users into a control condition with standard AI recommendations and a treatment condition with explainable AI that shows why each recommendation was made. Halfway through the experiment, the AI in the treatment condition has adapted its explanations based on user engagement patterns. It is now generating different explanations for different user segments. Meanwhile, the control AI has also been learning from user behavior, just without explanations. What is the counterfactual now? The two conditions have diverged in ways that the original A/B design cannot account for, because the AI is not a fixed intervention. It is a co-evolving system.

This is where the standard validity frameworks break down in ways that matter. Cook and Campbell (1979) define internal validity as whether the covariation reflects a causal relationship, threatened by history, maturation, testing, instrumentation, selection, and regression. In an AI experiment, the instrumentation threat is not just measurement decay. The AI model itself changes, and that change is not a confound to be controlled. It is the phenomenon. The question is no longer "Did the treatment cause the effect?" but "What kind of treatment was active at each point, and how did the evolving treatment interact with the evolving human response?"

Bono and McNamara (2011) require three conditions for causal claims: covariation, temporal precedence, and no plausible alternative explanations. When the AI changes during the study, temporal precedence becomes ambiguous. Does the human behavior change because of the AI recommendation received today, or because the AI adapted to yesterday's behavior and the human is responding to that adaptation? The direction of the effect is circular, and standard variance-based methods cannot separate the two directions of the feedback loop.

The standard IS survey approach is even harder to defend. A cross-sectional survey asks users about their perceptions of an AI system at one point in time. But the system the user experienced last month may not be the same system today. If the AI updates its models regularly, then a survey administered at time T captures user perceptions of one snapshot, the system as it was at T, not of the evolving system under study. The construct itself, the AI's behavior, is time-dependent. A static measurement of a dynamic treatment produces a construct validity failure, not a measurement error you can fix with more items.

Pieper et al. (2025) offer micro-randomized trials as a concrete alternative. An MRT repeatedly randomizes treatment options over time, estimating time-varying and context-sensitive effects. This matches the AI reality: the intervention changes, the user state changes, the context changes, and the researcher needs a design that randomizes repeatedly instead of once. The special strength of MRT is that it separates time-varying treatment effects from confounding adaptation. You do not need to pretend the AI is stable. You model the change.
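The core logic of repeated randomization can be sketched in a few lines. This is a toy simulation under my own assumptions, not the estimator Pieper et al. (2025) use: the linearly decaying effect, the 0.5 randomization probability, and all function names are hypothetical, and published MRT analyses typically rely on more elaborate estimators than a per-time difference in means. The point is only that randomizing at every decision point lets you estimate an effect per time point, where a single pooled estimate would average the change away.

```python
import random
import statistics


def true_effect(t, horizon):
    """Hypothetical treatment effect that decays linearly from 1.0 to 0.0."""
    return 1.0 - t / (horizon - 1)


def simulate_mrt(n_users=1000, horizon=10, p_treat=0.5, seed=1):
    """Randomize every user at every decision point; return (user, t, a, y) rows."""
    rng = random.Random(seed)
    rows = []
    for user in range(n_users):
        for t in range(horizon):
            a = 1 if rng.random() < p_treat else 0  # fresh randomization each time
            y = rng.gauss(0.0, 1.0) + true_effect(t, horizon) * a  # proximal outcome
            rows.append((user, t, a, y))
    return rows


def effect_by_time(rows, horizon):
    """Difference in mean outcome, treated minus untreated, per decision point."""
    estimates = []
    for t in range(horizon):
        treated = [y for (_, tt, a, y) in rows if tt == t and a == 1]
        control = [y for (_, tt, a, y) in rows if tt == t and a == 0]
        estimates.append(statistics.mean(treated) - statistics.mean(control))
    return estimates


def pooled_effect(rows):
    """Single estimate that ignores time, as a one-shot design would report."""
    treated = [y for (_, _, a, y) in rows if a == 1]
    control = [y for (_, _, a, y) in rows if a == 0]
    return statistics.mean(treated) - statistics.mean(control)


if __name__ == "__main__":
    rows = simulate_mrt()
    print("per-time estimates:", [round(e, 2) for e in effect_by_time(rows, 10)])
    print("pooled estimate:", round(pooled_effect(rows), 2))
```

In this toy setup the pooled number lands near the average of the trajectory and says nothing about the decay, while the per-time estimates recover it.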

Blohm et al. (2025) go further. They argue that next generation IS methods need to handle complex and dynamic phenomena where feedback loops, user learning, and system updates create non-stationary conditions. This fits the AI research problem exactly. GenAI and agentic AI are not stable tools with one fixed effect. Their outputs vary by prompt, context, user knowledge, data, organizational rule, and time. A user learns how to work with the system. The system changes through updates. The organization adds controls. The platform changes governance rules. A one-time measurement misses all of this.

What does a useful AI method look like? First, I think it has to start with longitudinal logic. You need repeated observation of both the AI output and the human response over time, not a single measurement. If the theory is about changing treatment effects, a one-shot survey or a two-wave pre-post design cannot work. The method has to observe change. Second, you need rollback designs where the researcher can revert the AI to an earlier version to test whether the observed effect depends on the current state of the model. If the human outcome improves after a model update, you cannot tell from that alone whether the model change caused the improvement or whether the human learned independently. Rolling back the model to its previous state and watching the effect reverse is stronger evidence. Third, you need joint measurement of human behavior and AI behavior as co-evolving variables. Most IS studies measure the human and treat the system as a black-box condition. In AI research, the system state is a variable. The design implication is to log the model's predictions alongside the human's response and to model how each shapes the other over time.
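The rollback and joint-logging ideas can be illustrated together with a small simulation. This is a sketch under assumptions I made up for the example: the three-phase schedule (v1, then v2, then back to v1), the effect sizes, the learning rate, and the record fields are all hypothetical. Each log record stores the model version and the AI output next to the human outcome, and the two scenarios differ only in whether the improvement comes from the model or from human learning.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Interaction:
    week: int
    model_version: str    # which model state produced the recommendation
    ai_score: float       # logged AI output (hypothetical quality score)
    human_outcome: float  # logged human decision quality


# Hypothetical schedule: four weeks of v1, four of v2, then roll back to v1.
SCHEDULE = ["v1"] * 4 + ["v2"] * 4 + ["v1"] * 4
PHASES = {"baseline": range(0, 4), "upgrade": range(4, 8), "rollback": range(8, 12)}


def simulate_rollback(model_effect, human_learning_rate, n_per_week=200, seed=2):
    """Log AI state and human outcome jointly for every interaction."""
    rng = random.Random(seed)
    logs = []
    for week, version in enumerate(SCHEDULE):
        for _ in range(n_per_week):
            ai_score = model_effect[version] + rng.gauss(0.0, 0.1)
            # Outcome = what the human learned so far + what the live model adds.
            outcome = human_learning_rate * week + ai_score + rng.gauss(0.0, 0.5)
            logs.append(Interaction(week, version, ai_score, outcome))
    return logs


def phase_means(logs):
    """Mean human outcome in each phase of the rollback design."""
    return {
        name: mean(rec.human_outcome for rec in logs if rec.week in weeks)
        for name, weeks in PHASES.items()
    }


if __name__ == "__main__":
    model_driven = simulate_rollback({"v1": 0.2, "v2": 0.8}, human_learning_rate=0.0)
    learning_driven = simulate_rollback({"v1": 0.2, "v2": 0.2}, human_learning_rate=0.1)
    print("model-driven:   ", phase_means(model_driven))
    print("learning-driven:", phase_means(learning_driven))
```

Both scenarios show an improvement from baseline to upgrade, so a two-phase comparison cannot tell them apart. Only the rollback phase separates them: the outcome falls back when the model was doing the work and keeps rising when the human was.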

This is where I depart from most of the IS method literature that I have read. The paradigm that still dominates IS, as Orlikowski and Baroudi (1991) documented, treats the technology as a stable independent variable and the human as the only dynamic system. AI flips that assumption, and our methods have not caught up. The technology changed. The methods did not.

A field experiment comparing AI adoption across teams needs to track the AI's decision boundary shifts as the system learns from early adopters. A longitudinal survey measuring delegation willingness needs to capture how the AI's behavior changes between measurement waves. A process study of GenAI platform governance needs to follow how the platform's AI policies evolve as complementor behavior shifts in response to algorithmic governance. As I wrote about mixed methods research, integration across qualitative and quantitative strands is harder but more necessary when the phenomenon itself is changing under observation.

Critical realism becomes more useful here than I previously recognized. As I argued in my post on critical realism in IS research, Wynn and Williams (2012) specify that observed regularities may not reveal underlying causal mechanisms, because mechanisms can be activated, suppressed, or counteracted by contextual conditions in open systems. For AI, the mechanism is not just the model. It is the interaction between the model's learning algorithm, the human's adaptation, and the organizational rules that govern both. The stratified ontology of critical realism, the real (mechanisms like feedback loops and learning rates), the actual (events like delegation decisions), and the empirical (data you collect from logs and surveys), fits the AI reality better than the positivist assumption that the treatment is stable and the relationship is linear.

Pieper et al. (2025) and Blohm et al. (2025) give the design direction. What is missing is the paradigm shift that makes those designs feel natural rather than exotic. The standard IS method training still teaches pre-post experiments, cross-sectional surveys, and A/B tests as the default tools for studying technology effects. Those are 1990s tools for 1990s technology. AI is not a stable treatment. It is an adaptive co-actor. If the method does not treat it as one, the method is measuring the wrong thing.

I am not sure the IS field is ready for this shift. Survey-based research is deeply institutionalized. The SEM tradition, as Kline (2016) presents it, assumes a stable theoretical model that the data either confirm or disconfirm. That logic breaks when the system being modeled drifts. But the alternative is to keep running pre-post studies on systems that changed between pre and post, publishing significant results that look good in the literature but tell us nothing about how AI actually behaves in organizations. That is not progress. That is a method that has become a ritual.

The path forward is not to abandon the IS method toolkit. It is to add new tools that match the new phenomenon. Longitudinal panel designs with repeated measurement of both human and AI states. Micro-randomized trials that randomize treatment repeatedly instead of once. Rollback experiments that revert the AI to test for state-dependent effects. Process studies that trace how AI and human co-evolve through feedback loops. The method has to fit the mechanism. If the mechanism involves adaptation on both sides, the method has to observe both sides adapting.