Data Warehouse Design Principles

Building a data warehouse that scales with your business.

Kasun Wijayamanna, Founder & AI Developer, HELLO PEOPLE | HDR Postgraduate Student (Research Interests: AI & RAG), Curtin University

A data warehouse consolidates data from multiple sources into a single repository optimised for analytical queries. Unlike operational databases designed for transactions, data warehouses are designed for fast, complex queries across large datasets. Good design determines whether your warehouse delivers insights or becomes an expensive maintenance burden.

Design Fundamentals

Separate from Operational Systems

Analytical queries are resource-intensive. Running them against operational databases slows transactions and risks performance issues. A data warehouse provides a dedicated environment for analytics without impacting operations.

Historical Data

Operational systems often store only current state. Data warehouses preserve history—enabling trend analysis, year-over-year comparisons, and understanding how things changed over time.

Integrated View

Data from different sources is cleaned, standardised, and integrated. Customer IDs map across systems. Dates are consistent. Categories are aligned. One version of truth.

Dimensional Modelling

Dimensional modelling organises data for intuitive analysis. The two key concepts are facts and dimensions.

Fact Tables

Facts contain measurable, quantitative data—sales amounts, quantities, durations. Each row represents an event or transaction. Fact tables are typically large and grow over time.

Dimension Tables

Dimensions provide context for facts—the who, what, where, when of each event. Customer, product, date, location. Dimensions contain descriptive attributes used for filtering and grouping.

Star Schema

A fact table surrounded by dimension tables, joined by foreign keys. Simple, intuitive, and query-efficient. The most common dimensional design pattern.
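A star schema of this kind can be sketched with in-memory SQLite as a stand-in warehouse. The retail table and column names (`fact_sales`, `dim_product`, and so on) are illustrative, not prescribed:

```python
import sqlite3

# Hypothetical retail star schema: one fact table joined to
# dimension tables by surrogate foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    city TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- e.g. 20250114
    full_date TEXT,
    year INTEGER,
    month INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    amount       REAL
);
""")

# Small illustrative data set.
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme Pty Ltd', 'Perth')")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Electronics")])
conn.execute("INSERT INTO dim_date VALUES (20250114, '2025-01-14', 2025, 1)")
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)",
                 [(1, 1, 20250114, 2, 50.0), (1, 2, 20250114, 1, 120.0)])

# A typical analytical query: filter and group by dimension
# attributes, aggregate the fact measures.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year, p.category
    ORDER BY revenue DESC
""").fetchall()
# rows -> [(2025, 'Electronics', 120.0), (2025, 'Hardware', 50.0)]
```

Note how every query follows the same shape: join the fact to its dimensions, filter and group on dimension attributes, aggregate fact measures. That uniformity is what makes the star schema intuitive for business users.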

Snowflake Schema

Dimensions are normalised into multiple related tables. Reduces data redundancy but increases query complexity. Generally, star schema is preferred unless storage is severely constrained.

Design for querying: Dimensional models optimise for how users query data, not for how applications write it. This is the opposite of transactional database design.

Modern Data Warehouse Architecture

Cloud Data Warehouses

Snowflake: Separation of compute and storage, pay-per-use pricing, strong data sharing capabilities.

BigQuery: Google's serverless data warehouse, excellent for large-scale analytics, tight integration with Google Cloud.

Redshift: AWS's data warehouse, mature and widely used, good integration with AWS ecosystem.

Databricks: Data lakehouse approach, combines warehouse and lake capabilities, strong for ML workloads.

Data Lakehouse

A newer pattern combining data lake flexibility (store any data, any format) with data warehouse structure (schema, ACID transactions, performance). Platforms like Databricks and Delta Lake enable this approach.

Architecture Trade-offs

  • Traditional warehouse: Structured, governed, performant—but inflexible
  • Data lake: Flexible, cheap storage—but ungoverned and slow
  • Lakehouse: Aims for the best of both—but the tooling and ecosystem are still maturing

ETL vs. ELT

ETL (Extract, Transform, Load)

Traditional approach: extract data from sources, transform it in a staging area, then load the transformed data into the warehouse. Transformation happens before data enters the warehouse.

ELT (Extract, Load, Transform)

Modern approach: load raw data into the warehouse first, then transform it using the warehouse's processing power. Works well with cloud warehouses that provide elastic compute.
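A minimal ELT sketch, again using in-memory SQLite as a stand-in warehouse (the table and column names are illustrative): raw, untyped records are loaded first, then cast and standardised with SQL inside the warehouse itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land raw source data as-is, everything as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, order_date TEXT)")
raw = [("A1", "19.99", "2025/01/14"), ("A2", "5.00", "2025/01/15")]
conn.executemany("INSERT INTO raw_orders VALUES (?,?,?)", raw)

# Transform: cast types and standardise date formats using the
# warehouse's own SQL engine, writing to a clean staging table.
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT order_id,
           CAST(amount AS REAL)            AS amount,
           REPLACE(order_date, '/', '-')   AS order_date
    FROM raw_orders
""")
clean = conn.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall()
# clean -> [('A1', 19.99, '2025-01-14'), ('A2', 5.0, '2025-01-15')]
```

Because the raw table survives alongside the staging table, transformations can be re-run or revised without re-extracting from the source—one of ELT's main practical advantages.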

Modern Data Stack

The current generation of data tools often uses ELT with specialised components:

  • Fivetran, Airbyte: Data extraction and loading
  • dbt (data build tool): Transformation using SQL in the warehouse
  • Snowflake, BigQuery: Cloud data warehouse
  • Looker, Power BI: Visualisation and reporting

Design Best Practices

Source of Truth

Define which source is authoritative for each data element. When sources conflict, the warehouse applies rules to determine truth. Document these decisions.
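One way to make those decisions executable is a per-field precedence map. The field-to-source mapping below is hypothetical; in practice it encodes the documented design decisions described above.

```python
# Hypothetical per-field source-of-truth rules: for each data
# element, which source system is authoritative.
AUTHORITATIVE_SOURCE = {"email": "crm", "address": "billing"}

def resolve_field(records_by_source, field):
    """Return the value from the authoritative source for this field,
    falling back to any source that supplies one."""
    preferred = AUTHORITATIVE_SOURCE.get(field)
    if preferred and field in records_by_source.get(preferred, {}):
        return records_by_source[preferred][field]
    for record in records_by_source.values():
        if field in record:
            return record[field]
    return None

sources = {
    "crm":     {"email": "kim@example.com"},
    "billing": {"email": "k.old@example.com", "address": "1 Main St"},
}
email = resolve_field(sources, "email")      # CRM wins for email
address = resolve_field(sources, "address")  # billing wins for address
```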

Slowly Changing Dimensions

When dimension attributes change (customer moves address, product changes category), preserve history. Type 2 slowly changing dimensions maintain historical versions, enabling accurate point-in-time analysis.
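The Type 2 mechanics can be sketched in a few lines: each change closes the current row and opens a new one, so every historical version remains queryable. The column names (`valid_from`, `valid_to`, `is_current`) are conventional but illustrative.

```python
from datetime import date

def apply_scd2(history, customer_id, new_attrs, today):
    """Type 2 update: close the current version (if attributes
    changed) and append a new current version."""
    current = next((r for r in history
                    if r["customer_id"] == customer_id and r["is_current"]), None)
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return history  # no change, nothing to do
    if current:
        current["is_current"] = False
        current["valid_to"] = today
    history.append({"customer_id": customer_id, **new_attrs,
                    "valid_from": today, "valid_to": None, "is_current": True})
    return history

history = []
apply_scd2(history, 42, {"city": "Perth"}, date(2024, 1, 1))
apply_scd2(history, 42, {"city": "Sydney"}, date(2025, 6, 1))  # customer moves
# history now holds two versions: the Perth row is closed,
# the Sydney row is current.
```

A point-in-time query then simply filters on `valid_from`/`valid_to` for the date of interest, rather than on `is_current`.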

Conformed Dimensions

Shared dimensions used consistently across fact tables. The same customer dimension used for sales, support, and marketing ensures consistent analysis across the organisation.

Incremental Loading

Only load changed data rather than reloading everything. Improves performance and reduces costs. Requires tracking changes in source systems (change data capture).
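A simple form of this is a high-watermark extract: pull only rows stamped after the last successful load. The sketch below assumes the source reliably maintains a `modified_at` column—a lightweight alternative to full change data capture, and all names are illustrative.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, modified_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?,?,?)", [
    (1, 10.0, "2025-01-01T09:00:00"),
    (2, 20.0, "2025-01-02T09:00:00"),
    (3, 30.0, "2025-01-03T09:00:00"),
])

def incremental_extract(conn, watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY modified_at", (watermark,)).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Last load finished at midday on 1 Jan: only orders 2 and 3 are pulled.
rows, wm = incremental_extract(src, "2025-01-01T12:00:00")
```

The watermark must be persisted transactionally with the load itself; if the load fails after the watermark advances, the skipped rows are silently lost.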

Data Quality Checks

Build data quality checks into the pipeline. Validate row counts, check for nulls, verify referential integrity. Catch problems before bad data reaches reports.
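Those three checks—row counts, null checks, referential integrity—can be expressed as a small quality gate that runs before data is published. Table and column names here are illustrative; the pattern is what matters.

```python
import sqlite3

def run_checks(conn):
    """Return a list of failure messages; an empty list means the
    load may proceed to publication."""
    failures = []
    # Row count: an empty fact table usually means a broken extract.
    if conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0] == 0:
        failures.append("fact_sales is empty")
    # Null check on a required measure.
    nulls = conn.execute(
        "SELECT COUNT(*) FROM fact_sales WHERE amount IS NULL").fetchone()[0]
    if nulls:
        failures.append(f"{nulls} fact rows with NULL amount")
    # Referential integrity: every fact row must hit a dimension row.
    orphans = conn.execute("""
        SELECT COUNT(*) FROM fact_sales f
        LEFT JOIN dim_product p ON f.product_key = p.product_key
        WHERE p.product_key IS NULL""").fetchone()[0]
    if orphans:
        failures.append(f"{orphans} fact rows with unknown product_key")
    return failures

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, amount REAL)")
conn.execute("INSERT INTO dim_product VALUES (1)")
conn.executemany("INSERT INTO fact_sales VALUES (?,?)",
                 [(1, 9.5), (2, None)])  # one orphan key, one NULL amount
problems = run_checks(conn)  # two failures reported
```

Failing the pipeline on these messages—rather than logging and continuing—is what actually keeps bad data out of reports.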

Summary

Data warehouse design balances analytical performance, ease of use, and maintainability. Dimensional modelling provides intuitive structures for business users. Modern cloud warehouses offer elastic scalability and simplified operations.

Start with clear business requirements—what questions need answering? Design dimensional models around business processes. Build incrementally, adding sources and subject areas over time. Invest in data quality from day one—garbage in, garbage out.