Multi-Cloud Strategy: Architecture or Accident?

I have heard "we have a multi-cloud strategy" in enough enterprise IT conversations that I started asking what people mean by it. The answer is almost always some variation of: we use AWS for our production systems, Azure because we already pay for Microsoft licensing, and Google Cloud because the data science team set something up a few years ago and now it is too embedded to move. That is not a strategy. That is an accident that acquired a name after the fact.

Gartner has noted that multi-cloud has become the default enterprise posture, with most large organizations using more than one cloud provider (see the Gartner newsroom for current research; I am not quoting specific figures because I am working from public reporting rather than a verified report in front of me). The observation is directionally accurate: multi-cloud is indeed the norm in large enterprises. What is much less common is multi-cloud by design, where workload placement across providers reflects deliberate decisions about latency, cost, regulatory compliance, redundancy, or data residency.

The distinction matters because accidental multi-cloud produces a specific set of problems that intentional multi-cloud is supposed to solve, or at least manage. When AWS gets added because one product team chose it, Azure gets added because corporate IT uses Microsoft already, and GCP gets added because data engineers prefer BigQuery, you end up with three separate cost management consoles, three separate identity and access management models, three separate security monitoring setups, and three sets of skills that do not fully transfer between providers. Your engineers know AWS or they know Azure. Knowing both well enough to design and troubleshoot production systems on each is a different and rarer skill set. The skill fragmentation compounds over time, especially in organizations where hiring is done by product team rather than centrally.

Data egress costs are the hidden tax on multi-cloud architectures, whether accidental or intentional. Cloud providers charge you to move data out of their environment. Inbound is typically free. Outbound costs money per gigabyte, and at scale the total adds up quickly. An architecture that generates substantial inter-cloud data movement, like streaming analytics events from AWS to BigQuery on GCP for data science processing, can produce egress costs that dwarf the compute costs. Organizations that did not model egress costs when designing multi-cloud data pipelines often discover them on the bill and then face the choice between continuing to pay them or undertaking a significant repatriation project. I wrote about how egress costs work as a lock-in mechanism in the context of cloud migration, and the multi-cloud case is structurally the same. Each cloud provider wants to be the place where your data lives, because the location of your data creates the most durable form of switching cost.
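To make the egress arithmetic concrete, here is a minimal sketch in Python. Every number in it is a placeholder I made up for illustration: the per-gigabyte rate, the monthly volume, and the compute spend are assumptions, not any provider's published pricing, and real bills usually involve tiered rates that this flat model ignores. The class and field names are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineEstimate:
    """Rough monthly cost model for one cross-cloud data pipeline."""

    monthly_egress_gb: float      # data leaving the source cloud per month
    egress_rate_per_gb: float     # ASSUMED flat rate; real pricing is tiered
    monthly_compute_cost: float   # compute spend for the same pipeline

    @property
    def monthly_egress_cost(self) -> float:
        # Flat-rate model: volume multiplied by the per-gigabyte price.
        return self.monthly_egress_gb * self.egress_rate_per_gb

    @property
    def egress_to_compute_ratio(self) -> float:
        # How many times larger the egress bill is than the compute bill.
        return self.monthly_egress_cost / self.monthly_compute_cost


# Hypothetical inputs: 200 TB of events per month, an assumed 0.09 per GB,
# and an assumed 6,000 per month of compute on the receiving side.
estimate = PipelineEstimate(
    monthly_egress_gb=200_000,
    egress_rate_per_gb=0.09,
    monthly_compute_cost=6_000,
)

print(f"egress per month:  {estimate.monthly_egress_cost:,.0f}")
print(f"compute per month: {estimate.monthly_compute_cost:,.0f}")
print(f"egress is {estimate.egress_to_compute_ratio:.1f}x compute")
```

The point of running a model like this before building the pipeline is that the ratio, not the absolute figure, tells you whether data movement or processing dominates the design. Substitute your provider's current published rates and your own measured volumes before trusting any output.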

The security surface expansion is the risk that I think gets the least rigorous attention in multi-cloud planning. Each additional cloud environment adds its own IAM model, its own network perimeter definitions, its own logging and monitoring tools, and its own API surface. Managing identity consistently across AWS IAM, Microsoft Entra ID (formerly Azure Active Directory), and GCP IAM is not a trivial problem. They have different primitives, different concepts of roles and scopes, and different defaults for what is permitted when a permission is not explicitly configured. An organization that is sophisticated at cloud security on one platform is not automatically sophisticated on a second or third. And the threat surface in multi-cloud is not the sum of the threats on each platform individually. It is larger, because each cross-cloud boundary, each identity federation, and each cross-cloud network connection is a potential attack surface that would not exist in a single-cloud architecture. Zero trust security principles, which I explored in my piece on zero trust security architecture, become more demanding, not less, in a multi-cloud environment.

Kubernetes is often presented as the solution to the multi-cloud coherence problem, and it is genuinely useful for that purpose, with significant caveats. The idea is that Kubernetes provides a consistent abstraction layer for running containerized workloads regardless of which cloud is underneath. Deploy to AWS EKS, Azure AKS, or GCP GKE and the Kubernetes API is the same. In principle, a workload that runs on one should run on the others with minimal modification. In practice, the managed Kubernetes services from different providers are not identical. They have different defaults, different integration patterns with the rest of the provider's ecosystem, and different upgrade cadences. The Kubernetes API is consistent. The surrounding infrastructure is not. A production-grade Kubernetes deployment on AWS that uses AWS Load Balancer Controller, AWS Certificate Manager, and AWS Secrets Manager is not straightforwardly portable to GCP, because each of those integrations is specific to AWS. Making the application genuinely cloud-agnostic requires building additional abstraction layers, and those layers add complexity, maintenance burden, and their own failure modes.
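As an illustration of what one of those additional abstraction layers looks like, here is a minimal Python sketch of a provider-neutral secrets interface. All names in it (SecretStore, EnvSecretStore, InMemorySecretStore, build_database_url, the APP_SECRET_ prefix) are hypothetical and invented for this example. I have deliberately left out the provider-backed implementations: a class wrapping AWS Secrets Manager or a GCP equivalent would expose the same get_secret method, but it would depend on each provider's SDK, and that is exactly the code you end up owning.

```python
import os
from typing import Mapping, Optional, Protocol


class SecretStore(Protocol):
    """Hypothetical interface the application codes against."""

    def get_secret(self, name: str) -> str:
        ...


class EnvSecretStore:
    """Reads secrets from environment variables, e.g. injected by the platform."""

    def __init__(self, environ: Optional[Mapping[str, str]] = None,
                 prefix: str = "APP_SECRET_") -> None:
        self._environ = os.environ if environ is None else environ
        self._prefix = prefix

    def get_secret(self, name: str) -> str:
        key = self._prefix + name.upper()
        try:
            return self._environ[key]
        except KeyError:
            raise LookupError(f"secret {name!r} not found") from None


class InMemorySecretStore:
    """Test double; a provider-backed store would have the same method."""

    def __init__(self, secrets: Mapping[str, str]) -> None:
        self._secrets = dict(secrets)

    def get_secret(self, name: str) -> str:
        try:
            return self._secrets[name]
        except KeyError:
            raise LookupError(f"secret {name!r} not found") from None


def build_database_url(store: SecretStore, host: str) -> str:
    # Application code sees only the interface, never a provider SDK.
    password = store.get_secret("db_password")
    return f"postgresql://app:{password}@{host}/appdb"


store = InMemorySecretStore({"db_password": "example-only"})
print(build_database_url(store, "db.internal"))
```

The application function is portable, but only because someone wrote and now maintains the interface, each backend, and the error handling that papers over differences between them. Multiply that by load balancing, certificates, and every other provider integration, and the maintenance burden described above becomes visible.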

"Cloud-agnostic" is one of those phrases that sounds like a sensible goal and reveals its complications only when you start implementing it. Every cloud-agnostic choice involves a trade-off with cloud-native choices. Using a managed database service from AWS means you get operational simplicity, automatic backups, and AWS-integrated monitoring in exchange for a dependency on AWS's specific database service. Using a cloud-agnostic database deployment, running PostgreSQL yourself on VMs or containers, gives you portability in exchange for the operational complexity of running the database yourself. For most workloads and most teams, the cloud-native choice is the right one. But making the cloud-native choice everywhere means the workload is deeply tied to one provider. The cloud-agnostic aspiration and the operational simplicity aspiration pull in opposite directions, and most engineering teams, under schedule and reliability pressure, choose operational simplicity, which means cloud-native, which means less portable.

The organizations I see get multi-cloud right are the ones that start with a specific problem statement. Not "we should be multi-cloud" but "we need to meet data residency requirements in the EU that one provider cannot cost-effectively satisfy" or "we need active-active redundancy across regions that extends across provider boundaries to survive a cloud provider outage." Those are problems that multi-cloud architecturally solves better than single-cloud. Starting with the architecture and working toward a problem statement tends to produce expensive coherence without clear value. Starting with the problem statement and working toward an architecture that happens to be multi-cloud tends to produce something that actually justifies the additional complexity.