Design fundamentals
A data warehouse consolidates data from multiple sources into a single repository optimised for analytical queries. Unlike your operational databases (designed for fast transactions), a warehouse is designed for complex queries across large datasets. The design determines whether it delivers genuine insights or becomes an expensive maintenance burden.
Separate from operational systems
Analytical queries are resource-hungry. Running a report that scans millions of rows against your production database slows down the transactions your customers and staff depend on. A data warehouse gives analytics its own environment. Your reports don't compete with your sales system for resources.
Historical data
Operational systems typically store current state: a customer's address now, the current price, the latest inventory count. Data warehouses preserve history. That's what enables trend analysis, year-over-year comparisons, and understanding how things changed over time. History is where most of the analytical value lives.
Integrated view
Data from different sources gets cleaned, standardised, and reconciled. Customer IDs map consistently across systems. Dates use the same format. Product categories align. You end up with one version of truth rather than three different spreadsheets that each tell a different story.
Dimensional modelling
Dimensional modelling organises warehouse data in a way that's intuitive for humans to analyse. Two key concepts: facts and dimensions.
Fact tables
Facts contain the measurable, quantitative data: sales amounts, order quantities, call durations, click counts. Each row represents an event or transaction. Fact tables tend to be large and grow continuously over time.
Dimension tables
Dimensions provide the context for facts: the who, what, where, and when of each event. Customer dimension, product dimension, date dimension, location dimension. They contain descriptive attributes used for filtering and grouping: "Show me sales by region for Q3" uses the location and date dimensions.
Star schema
The most common dimensional pattern. A central fact table surrounded by dimension tables, connected by foreign keys. It's called a star because the diagram looks like one. Star schemas are simple, intuitive, and efficient for analytical queries.
Snowflake schema
Dimensions are normalised into multiple related tables (a product dimension split into product, subcategory, and category tables). This reduces data redundancy but increases query complexity. Generally, a star schema is preferred unless you have a compelling reason to normalise. The storage savings rarely justify the added joins.
Design for querying, not writing. Dimensional models optimise for how business users query data: "Show me total sales by product category by month." This is the opposite of transactional database design, which optimises for writes and data integrity.
Modern architecture
Cloud data warehouses
| Platform | Key strength | Best fit |
|---|---|---|
| Snowflake | Separation of compute and storage, per-second compute billing | Multi-cloud, data sharing, variable workloads |
| BigQuery | Serverless, excellent for large-scale analytics | Google Cloud shops, heavy analytics |
| Redshift | Mature, deep AWS integration | AWS-centric organisations |
| Databricks | Lakehouse approach, strong ML integration | Teams that need both analytics and ML |
The data lakehouse pattern
A newer approach combining data lake flexibility (store any data in any format, cheaply) with data warehouse structure (schema enforcement, ACID transactions, predictable performance). Platforms like Databricks, built on open table formats such as Delta Lake, enable this pattern. It's compelling but still maturing. The tooling improves rapidly.
ETL vs ELT
ETL (Extract, Transform, Load)
The traditional approach. Data is extracted from sources, transformed in a separate staging area (cleaned, joined, reformatted), then loaded into the warehouse in its final form. Transformation happens before data enters the warehouse.
ELT (Extract, Load, Transform)
The modern approach. Raw data is loaded directly into the warehouse first, then transformed using the warehouse's own processing power. This works well with cloud warehouses that provide elastic compute. You can throw more resources at the transformation without managing infrastructure.
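The ELT sequence can be sketched concretely: land the raw data untouched, then build typed, cleaned tables using the warehouse's own SQL engine. SQLite stands in for a cloud warehouse here, and all table and column names are illustrative.

```python
import sqlite3

wh = sqlite3.connect(":memory:")

# Load: copy source rows in as-is, everything as text, no cleaning yet.
wh.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    ("42", "2024-01-15", "19.99"),
    ("43", "2024-01-16", " 5.00 "),   # messy source data survives the load step
])

# Transform: the T in ELT runs inside the warehouse, producing a typed table.
wh.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER)   AS order_id,
           order_date,
           CAST(TRIM(amount) AS REAL)  AS amount
    FROM raw_orders
""")
rows = wh.execute("SELECT * FROM orders").fetchall()
print(rows)
```

Because the raw table is preserved, a bug in the transform can be fixed and re-run without going back to the source system.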
The modern data stack
The current generation of data tooling typically uses ELT with specialised components:
- Fivetran, Airbyte: extract and load data from hundreds of sources
- dbt (data build tool): SQL-based transformations that run inside the warehouse
- Snowflake, BigQuery: cloud warehouse with elastic compute
- Looker, Power BI: visualisation and reporting
This stack is modular: each component does one thing well, and you can swap components without rebuilding everything. It's become the de facto standard for mid-market analytics.
Best practices
Source of truth
Define which source system is authoritative for each data element. When sources disagree (and they will), the warehouse applies predefined rules to determine truth. Document these decisions. Future you will thank present you.
Slowly changing dimensions
When a customer moves address or a product changes category, you have a choice: overwrite the old value (Type 1) or keep a historical record (Type 2). Type 2 slowly changing dimensions maintain previous versions, enabling accurate point-in-time analysis. The question "What region was this customer in when they placed this order?" is unanswerable without historical dimensions.
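A Type 2 change can be sketched as "close the current row, insert a new version". SQLite stands in for the warehouse; the `valid_from`/`valid_to`/`is_current` column names are a common convention but illustrative here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER,
        region      TEXT,
        valid_from  TEXT,
        valid_to    TEXT,     -- NULL means this version is still current
        is_current  INTEGER
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (7, 'North', '2023-01-01', NULL, 1)")

def change_region(conn, customer_id, new_region, change_date):
    # Close out the current version rather than overwriting it...
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    # ...then insert the new version, keeping history intact.
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_region, change_date),
    )

change_region(conn, 7, "South", "2024-06-01")

# Point-in-time question: what region was customer 7 in on 2023-09-15?
row = conn.execute("""
    SELECT region FROM dim_customer
    WHERE customer_id = 7
      AND valid_from <= '2023-09-15'
      AND (valid_to IS NULL OR valid_to > '2023-09-15')
""").fetchone()
print(row[0])  # the historical region, not the current one
```

With Type 1 the `UPDATE` would simply overwrite `region`, and that question would become unanswerable.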
Conformed dimensions
Use shared dimensions consistently across all fact tables. The same customer dimension referenced by sales facts, support facts, and marketing facts ensures cross-functional analysis is consistent. If sales and support use different customer definitions, comparing them is meaningless.
Incremental loading
Only load data that has changed since the last run, rather than reloading everything. Massively improves performance and reduces cost in cloud warehouses where you pay for compute. Requires change data capture (CDC) or reliable timestamps in source systems.
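The timestamp-based variant can be sketched as a high-water mark that the pipeline persists between runs. SQLite stands in for the source system, and the table and column names are illustrative.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09"),
])

def incremental_extract(conn, watermark):
    """Return rows changed since the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Persist the new watermark for the next run; unchanged if nothing new.
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

# A previous run loaded everything up to 2024-01-01; this run sees only the delta.
rows, watermark = incremental_extract(source, "2024-01-01")
print(rows)
print(watermark)
```

This only works if the source's `updated_at` is reliably maintained; where it isn't, CDC from the database log is the safer option.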
Data quality checks
Build validation into the pipeline. Verify row counts match expectations. Check for unexpected nulls. Confirm referential integrity. Catch problems before bad data reaches dashboards. It's far easier to fix a pipeline issue than to restore trust after executives see wrong numbers.
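A minimal version of such checks can be sketched as a validation function that runs before a batch is published. The field names and thresholds here are illustrative assumptions.

```python
def validate(rows, expected_min_rows=1):
    """Return a list of problems; an empty list means the batch may proceed."""
    errors = []
    # Row-count check: a suspiciously small batch often means a broken extract.
    if len(rows) < expected_min_rows:
        errors.append(f"row count {len(rows)} below minimum {expected_min_rows}")
    for i, row in enumerate(rows):
        # Null check on a key column.
        if row.get("customer_id") is None:
            errors.append(f"row {i}: unexpected null customer_id")
        # Range check on a measure.
        if row.get("amount") is not None and row["amount"] < 0:
            errors.append(f"row {i}: negative amount")
    return errors

batch = [
    {"customer_id": 1, "amount": 19.99},
    {"customer_id": None, "amount": 5.00},   # should be caught
]
problems = validate(batch)
print(problems)  # non-empty: fail the run before bad data reaches dashboards
```

In a real pipeline the same idea is usually expressed declaratively, for example as dbt tests or Great Expectations suites, rather than hand-written functions.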
Frequently asked questions
Do we need a data warehouse?
If you're a small business with one or two data sources and simple reporting needs, probably not. Power BI connected directly to your database may be sufficient. But once you have multiple data sources, need historical analysis, or your reports are slowing down production systems, a warehouse becomes valuable.
Cloud or on-premise?
Cloud for almost all new deployments. The operational simplicity, elastic scaling, and pay-per-use economics are compelling. On-premise only makes sense for organisations with strict data sovereignty requirements or a significant existing on-premise investment.
How long does it take to build?
An initial data warehouse with 2–3 source systems and core dimensional models: 6–12 weeks. But it's fundamentally iterative. You add sources, build new models, and expand over time. Plan for a first deliverable in weeks, and ongoing development measured in quarters.
Key takeaways
- Data warehouses separate analytics from operational systems. Your reports don't slow your transactions.
- Dimensional modelling (facts and dimensions) makes data intuitive for business users to query and explore.
- Modern cloud warehouses like Snowflake and BigQuery offer elastic compute. Pay for what you use.
- Start with clear business questions. Design dimensional models around business processes, not source systems.