The Data Lakehouse: How We Stopped Copying Data Between Systems

For a long time, enterprise data architecture was built around a clean separation between two very different ideas. Bill Inmon gave us the data warehouse as a "subject-oriented, integrated, time-variant, and non-volatile" store of structured data, purpose-built for analytics. Ralph Kimball took a more pragmatic angle with dimensional modeling, building warehouses around the business processes that mattered. Both approaches required careful schema design upfront, ETL pipelines that moved and transformed data before it could be queried, and expensive specialized hardware from vendors like Teradata. The trade-off was clear: you got strict governance and fast query performance, and you paid for them in money and inflexibility.

The data lake was the reaction to that inflexibility. Cheap cloud object storage (S3, Azure Blob, GCS) made it possible to just dump everything in one place and figure out the schema later. "Schema-on-read" was the rallying cry: you ingest first and impose structure when you query. Hadoop, and later Spark, made it possible to run computation directly against raw files, first on HDFS clusters and then on object storage. This was genuinely useful. You could store clickstream logs, sensor data, and raw JSON from APIs without designing a schema upfront. The cost per terabyte dropped dramatically. For a few years, everyone was building data lakes.
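To make schema-on-read concrete, here is a minimal sketch, assuming a handful of hypothetical clickstream records stored as raw JSON lines. The field names and the read_with_schema helper are invented for illustration. Ingestion keeps whatever arrives, and the reader decides at query time which fields it wants and how to coerce them.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema at ingest).
RAW_LINES = [
    '{"user_id": "42", "page": "/home", "ts": 1700000000}',
    '{"user_id": 43, "page": "/cart", "ts": "1700000060", "referrer": "ad"}',
    '{"userId": 44, "page": "/home"}',  # upstream renamed a field
]

# The schema lives with the reader, not with the storage.
SCHEMA = {"user_id": int, "page": str, "ts": int}


def read_with_schema(lines, schema):
    """Apply a schema at read time; missing or uncastable fields become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            try:
                row[field] = cast(value) if value is not None else None
            except (TypeError, ValueError):
                row[field] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
for row in rows:
    print(row)
```

The third record shows the downside in miniature: nothing rejected the renamed field at ingest, so the reader quietly gets a row with no user_id.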

Then the data lakes started turning into data swamps.

Without transaction guarantees, a data lake has no concept of atomicity. If a write job fails halfway through, you now have partial data sitting in the lake, and nothing stops the next query from reading that partial data. Without ACID (Atomicity, Consistency, Isolation, Durability) semantics, a query that runs while data is being loaded can return inconsistent results. And without schema enforcement, the lake fills up with files in incompatible formats, column names that shifted over time, and data no one can explain. The governance assumption built into the data lake model was essentially "someone will clean this up later," and that assumption turned out to be optimistic.
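The partial-write problem is easy to reproduce. The sketch below is a toy, assuming a local temporary directory stands in for a bucket prefix and that the reader treats every file it finds as table data. The names naive_write and naive_read are hypothetical.

```python
import os
import tempfile


def naive_write(directory, batches, fail_after=None):
    """Write one file per batch straight into the table directory.

    fail_after simulates a job that crashes after writing that many files.
    """
    for i, batch in enumerate(batches):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("job crashed mid-write")
        with open(os.path.join(directory, f"part-{i:05d}.csv"), "w") as f:
            f.write("\n".join(batch) + "\n")


def naive_read(directory):
    """A reader with no transaction layer: every file present counts as data."""
    rows = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as f:
            rows.extend(line.strip() for line in f if line.strip())
    return rows


table_dir = tempfile.mkdtemp()
batches = [["a,1", "b,2"], ["c,3", "d,4"], ["e,5", "f,6"]]

try:
    naive_write(table_dir, batches, fail_after=2)
except RuntimeError:
    pass  # the job is gone, but the two files it already wrote are not

visible = naive_read(table_dir)
print(visible)  # four of the six intended rows
```

Nothing in the directory tells the reader that the load was incomplete. The four rows look exactly like a finished batch.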

The lakehouse emerged as an attempt to get the benefits of both approaches without the drawbacks of either. Databricks popularized the term around 2020, and the architecture they described is now an established design pattern that a number of organizations have adopted. The core idea is to add a metadata and transaction layer on top of object storage, so that the same files sitting in your cheap cloud storage can be treated with warehouse-grade consistency and governance. You keep the cost and flexibility of the lake. You add the reliability and governance that the warehouse promised.

The technology that makes this work is a set of open table formats. Delta Lake came from Databricks. Apache Iceberg originated at Netflix and is now backed by Apple and a number of other companies. Apache Hudi came from Uber. All three are trying to solve the same core problem: how do you give ACID transaction semantics to files sitting in object storage? They do it by maintaining a layer of transactional metadata alongside the actual data files: a commit log in Delta Lake, a timeline in Hudi, and a tree of snapshot metadata in Iceberg. A write first stages its new data files, then publishes them with a single atomic commit to that metadata. If the write fails before the commit, the staged files are never referenced, so readers never see them and all that is left is orphaned files to clean up later. Readers work from the last committed snapshot, so a query that runs during a load still sees a consistent view. If two writers try to commit at the same time, the format detects the conflict and one of them has to retry or fail. The underlying storage is still cheap commodity object storage. The behavior looks like a proper database.
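The general idea can be sketched in a few dozen lines. This is a toy under stated assumptions, not how Delta Lake, Iceberg, or Hudi are actually implemented: a local directory stands in for object storage, the TinyTable class and its methods are hypothetical, and exclusive file creation stands in for whatever atomic put-if-absent primitive a real format relies on.

```python
import json
import os
import tempfile
import uuid


class TinyTable:
    """Toy table: data files plus an ordered commit log in _log/."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        versions = [int(n.split(".")[0]) for n in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def write_data_file(self, rows):
        """Stage a data file. It is invisible until a commit references it."""
        name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, name), "w") as f:
            json.dump(rows, f)
        return name

    def commit(self, read_version, added_files):
        """Try to publish version read_version + 1; fail if someone beat us."""
        path = os.path.join(self.log_dir, f"{read_version + 1:020d}.json")
        try:
            # O_EXCL makes creation fail if that version already exists.
            # (Simplification: a real format must also make the entry's
            # content appear atomically, not just its name.)
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as f:
            json.dump({"add": added_files}, f)
        return True

    def read(self):
        """Replay the log; only files named in commits are part of the table."""
        rows = []
        for entry in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, entry)) as f:
                for name in json.load(f)["add"]:
                    with open(os.path.join(self.root, name)) as data:
                        rows.extend(json.load(data))
        return rows


table = TinyTable(tempfile.mkdtemp())

# Writers A and B both start from the same (empty) snapshot.
base = table.latest_version()
file_a = table.write_data_file([{"id": 1}])
file_b = table.write_data_file([{"id": 2}])
orphan = table.write_data_file([{"id": 99}])  # a job that crashes before commit

a_ok = table.commit(base, [file_a])      # wins version 0
b_first = table.commit(base, [file_b])   # loses: version 0 already exists
b_retry = table.commit(table.latest_version(), [file_b])  # retries as version 1

snapshot = table.read()
print(a_ok, b_first, b_retry, snapshot)
```

The file written by the crashed job sits in the directory but is never part of the table, because no commit names it. The losing writer finds out about the conflict at commit time and retries against the newer version, which is the optimistic-concurrency pattern in its simplest form.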

This matters for enterprise IS in a specific and practical way. The old pattern for most organizations was to run an ETL pipeline that moved data from operational systems into a staging area, transformed it, and then loaded it into a warehouse for reporting. That pipeline was slow (usually nightly), expensive (every transformation cost compute), and brittle (schema changes upstream broke everything downstream). More problematically, you ended up with multiple copies of the same data: one in the operational system, one in the staging area, one in the warehouse, and often another copy in a departmental reporting database or a BI tool's local cache. Every copy was a potential point of divergence and a governance liability.

The lakehouse, at least in theory, removes the need for some of those copies. If your operational data can land in an open table format directly, and your analytics tools can query it there with transactional consistency, you do not need a separate warehouse copy. Every tool queries the same tables. The "single source of truth" that every enterprise data strategy document promises but never quite delivers becomes structurally achievable rather than just aspirational.

I say "at least in theory" because the practice is more complicated. Getting operational systems to write directly to a lakehouse requires changes to those systems, or at minimum to the pipelines that extract data from them. The legacy ERP running your financial close is not going to start writing Iceberg files on its own. The integration layer does not disappear. What changes is that once data is in the lake in an open format, you can query it, transform it, and serve it to different tools without copying it again.

The governance challenge is also real and often underappreciated. The open table formats solve the technical problem of transactional consistency. They do not solve the organizational problem of who is allowed to access which data, what data means, who is responsible for its quality, and who gets notified when it changes. Those questions require people and process, not just technology. A lakehouse without data governance practices is just a better-organized swamp. The data is accessible and consistent, but nobody knows what it means or whether they can trust it.

This is part of why data mesh has gotten attention as an organizational complement to the lakehouse technical pattern. Data mesh, associated with Zhamak Dehghani's writing on the subject, argues that data should be owned and served by the domain teams that produce it, not centralized in a single platform team. Each domain treats its data as a product with defined consumers, documented semantics, and a quality SLA. The lakehouse is the plumbing. Data mesh is the organizational design that prevents the plumbing from filling up with garbage.

Whether data mesh actually works at scale in real enterprises is still an open question. It requires a level of domain maturity and cross-team coordination that most organizations struggle with even without adding new data responsibilities. The technology is the easy part. Getting the finance team and the operations team and the customer team to each maintain a well-governed data product with clear ownership and reliable freshness guarantees is a political and organizational problem. The lakehouse architecture makes it technically feasible. Whether organizations can actually execute on it depends on things that no open table format can fix.