Data quality is the foundation of all analytics. Bad data leads to bad decisions—a report showing the wrong revenue figures, customer segments built on incorrect data, AI models trained on garbage. Most organisations underestimate their data quality problems until a highly visible failure exposes the issue.
The Cost of Poor Data Quality
Impact of poor data quality:
- Bad decisions: Strategies built on incorrect information
- Lost productivity: Staff spend time finding and fixing data instead of analysing it
- Eroded trust: Once people distrust the data, they stop using analytics
- Customer impact: Wrong addresses, duplicate communications, incorrect billing
- Compliance risk: Regulatory reporting with incorrect data
- Rework: Reports need rebuilding when errors are discovered
Research consistently shows that poor data quality costs organisations significant revenue—IBM estimated $3.1 trillion annually for the US economy alone.
Dimensions of Data Quality
Data quality isn't a single measure—it has multiple dimensions:
Accuracy
Does the data correctly represent reality? Is the customer's address their actual address? Is the order amount what was actually charged?
Completeness
Are all required values present? Missing email addresses, null dates, blank customer names. Incomplete data limits analysis.
Consistency
Does the data agree across systems? Does the customer count match between CRM and billing? Do sales figures match between the order system and the data warehouse?
Timeliness
Is data available when needed? Is it current enough for its intended use? Yesterday's inventory data may be useless for real-time ordering decisions.
Uniqueness
Are there duplicates? The same customer entered twice, orders recorded multiple times. Duplicates distort analysis and waste resources.
Validity
Does data conform to expected formats and ranges? Email addresses that aren't emails, dates in the future, negative quantities. Valid format doesn't mean accurate, but invalid format is definitely wrong.
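Validity checks like these reduce to simple predicate functions. A minimal sketch (the specific rules and field names here are illustrative assumptions, not a standard):

```python
import re
from datetime import date

def is_valid_email(value):
    # Loose format check: something@something.tld.
    # Valid format does not guarantee the address is real or accurate.
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value or ""))

def is_valid_order_date(d):
    # Orders cannot be dated in the future
    return d <= date.today()

def is_valid_quantity(q):
    # Quantities must be positive
    return q > 0

print(is_valid_email("alice@example.com"))  # True
print(is_valid_email("not-an-email"))       # False
print(is_valid_quantity(-3))                # False
```

Note the asymmetry the text describes: a record can pass every format check and still be wrong, but a record that fails them is definitely wrong.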
Assessing Data Quality
Data Profiling
Automated analysis of data patterns. Profile columns for: null rates, unique values, value distributions, min/max/average, pattern detection.
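A column profile of this kind can be computed with a few lines of code. A minimal sketch over plain Python lists (real profiling tools add pattern detection and run at scale):

```python
from collections import Counter

def profile_column(values):
    """Profile one column: null rate, distinct count, min/max/mean, top values."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    return {
        "count": len(values),
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": sum(numeric) / len(numeric) if numeric else None,
        "top_values": Counter(non_null).most_common(3),
    }

# Illustrative sample: two nulls and a suspiciously frequent value
ages = [34, 29, None, 29, 41, None, 29]
print(profile_column(ages))
```

Running profiles like this per column quickly surfaces anomalies—unexpected null rates, implausible min/max values, or one value dominating the distribution.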
Business Rule Validation
Check data against business rules. Order date must be before ship date. Customer age must be 18+. Product price must be positive. Violations indicate quality issues.
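The rules above can be expressed as named predicates and applied to each record. A sketch (the rule set and record fields are the hypothetical examples from the text):

```python
from datetime import date

# Hypothetical rules matching the examples in the text
RULES = [
    ("order before ship", lambda r: r["order_date"] <= r["ship_date"]),
    ("customer is adult", lambda r: r["customer_age"] >= 18),
    ("positive price",    lambda r: r["price"] > 0),
]

def validate(record):
    """Return the names of all rules the record violates."""
    return [name for name, check in RULES if not check(record)]

order = {"order_date": date(2024, 5, 2), "ship_date": date(2024, 5, 1),
         "customer_age": 17, "price": 19.99}
print(validate(order))  # ['order before ship', 'customer is adult']
```

Keeping rules as data (name plus check) makes it easy to report which rules fail most often—a direct input to the quality metrics below.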
Cross-System Reconciliation
Compare data across systems that should agree. Row counts, totals, key values. Differences reveal quality problems in one or both systems.
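A reconciliation check typically compares row counts, totals, and key sets in one pass. A minimal sketch, assuming both systems can be read into lists of dicts sharing a key and an amount field:

```python
def reconcile(name, source_rows, target_rows, key, amount):
    """Compare row counts, totals, and key sets between two systems."""
    src_keys = {r[key] for r in source_rows}
    tgt_keys = {r[key] for r in target_rows}
    return {
        "check": name,
        "row_count_diff": len(source_rows) - len(target_rows),
        "total_diff": (sum(r[amount] for r in source_rows)
                       - sum(r[amount] for r in target_rows)),
        "missing_in_target": sorted(src_keys - tgt_keys),
        "unexpected_in_target": sorted(tgt_keys - src_keys),
    }

# Illustrative: order 2 never made it into the warehouse
orders    = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 50.0}]
warehouse = [{"id": 1, "amount": 100.0}]
print(reconcile("orders vs warehouse", orders, warehouse, "id", "amount"))
```

The key-set differences matter as much as the totals: matching totals can mask offsetting errors, while missing or unexpected keys point directly at the records to investigate.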
Quality Metrics to Track
- Percentage of records passing quality rules
- Null rate per critical field
- Duplicate rate
- Cross-system variance
- Data freshness (time since last update)
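Most of these metrics can be computed from a batch of records in one function. A sketch under the assumption that records are dicts, with a caller-supplied list of required fields, a key field for duplicate detection, and a last-load timestamp for freshness:

```python
from datetime import datetime, timezone

def quality_metrics(records, required_fields, key_field, last_load):
    """Compute null rates, duplicate rate, and freshness for a batch."""
    n = len(records)
    null_rates = {
        f: sum(1 for r in records if r.get(f) is None) / n
        for f in required_fields
    }
    keys = [r[key_field] for r in records]
    duplicate_rate = 1 - len(set(keys)) / n
    age = datetime.now(timezone.utc) - last_load
    return {
        "null_rates": null_rates,
        "duplicate_rate": duplicate_rate,
        "freshness_hours": round(age.total_seconds() / 3600, 1),
    }

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},            # duplicate key, missing email
    {"id": 2, "email": "b@example.com"},
]
print(quality_metrics(records, ["email"], "id", datetime.now(timezone.utc)))
```

Tracking these numbers over time matters more than any single snapshot: a null rate that doubles week over week signals a broken upstream process even if the absolute level still looks acceptable.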
Improving Data Quality
Fix at the Source
The best time to ensure quality is at data entry. Input validation, required fields, dropdown selections instead of free text. Preventing bad data is easier than fixing it later.
Data Cleansing
Correcting existing data problems:
- Standardisation: Consistent formats, spellings, abbreviations
- Deduplication: Identifying and merging duplicate records
- Enrichment: Adding missing data from external sources
- Correction: Fixing known errors
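Standardisation and deduplication often work together: records are first normalised to a common form, then merged on the normalised key. A simplified sketch (the phone-digits matching rule is an illustrative assumption; production matching uses fuzzier logic):

```python
import re

def standardise_phone(raw):
    """Standardisation: strip punctuation and keep digits only."""
    return re.sub(r"\D", "", raw or "")

def deduplicate(customers):
    """Deduplication: merge records that share a standardised phone number,
    keeping the first non-null value seen for each field."""
    merged = {}
    for c in customers:
        key = standardise_phone(c["phone"])
        if key not in merged:
            merged[key] = dict(c)
        else:
            # Enrichment within the merge: fill gaps from the duplicate
            for field, value in c.items():
                if merged[key].get(field) is None:
                    merged[key][field] = value
    return list(merged.values())

rows = [
    {"name": "Ann Lee", "phone": "(555) 010-1234", "email": None},
    {"name": "Ann Lee", "phone": "555-010-1234", "email": "ann@example.com"},
]
print(deduplicate(rows))  # one merged record with the email filled in
```

The merge order is a policy decision: here the first record wins for conflicting fields, but real cleansing rules might prefer the most recently updated source.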
Ongoing Monitoring
Data quality degrades over time. Customers move, employees leave, systems change. Continuous monitoring catches issues before they accumulate.
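A monitoring step in a pipeline usually reduces to comparing each batch's metrics against agreed thresholds and raising alerts on breaches. A minimal sketch (the metric names and limits are hypothetical examples):

```python
def monitor(batch_metrics, thresholds):
    """Compare a batch's quality metrics against thresholds; return alerts."""
    alerts = []
    for metric, limit in thresholds.items():
        value = batch_metrics.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value:.2%} exceeds limit {limit:.2%}")
    return alerts

# Hypothetical thresholds agreed with the data owner
thresholds = {"null_rate_email": 0.05, "duplicate_rate": 0.01}
todays_batch = {"null_rate_email": 0.12, "duplicate_rate": 0.004}
print(monitor(todays_batch, thresholds))
```

Wired into a pipeline, a non-empty alert list can fail the load or page the data owner, so degradation is caught on the day it starts rather than months later.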
Data Quality Tools
Open Source: Great Expectations, dbt tests, Apache Griffin
Commercial: Informatica Data Quality, Talend, Microsoft Data Quality Services
Data Quality Governance
Data Ownership
Assign ownership for each critical data domain. The owner is accountable for quality, definitions, and access. Without ownership, nobody is responsible.
Quality Standards
Define acceptable quality levels. 100% accuracy isn't always practical or cost-effective. Set thresholds based on business impact.
Issue Resolution Process
Clear process for reporting and fixing quality issues. Who investigates? Who fixes root causes? How is progress tracked?
Quality Reporting
Regular reporting on quality metrics. Dashboard showing current state, trends, issues. Make quality visible to leadership.
Practical Approach
- Prioritise: Focus on data that matters most—what drives critical decisions?
- Assess: Profile and measure current quality state
- Clean: Address highest-impact issues first
- Prevent: Fix root causes, improve data entry
- Monitor: Continuous quality checks in data pipelines
- Govern: Establish ownership, standards, and processes
Start small: Don't try to fix all data quality at once. Pick one critical data domain, demonstrate improvement, then expand.
Summary
Data quality is not a one-time project—it's an ongoing discipline. Poor quality data undermines every analytics initiative, eroding trust and leading to bad decisions. Quality has multiple dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity.
Effective data quality management combines prevention (fix at source), detection (profiling and monitoring), correction (cleansing), and governance (ownership and standards). Start with the data that matters most and build quality disciplines incrementally.
