Data quality is the foundation of all analytics. Bad data leads to bad decisions—a report showing the wrong revenue figures, customer segments built on incorrect data, AI models trained on garbage. Most organisations underestimate their data quality problems until a highly visible failure exposes the issue.
The Cost of Poor Data Quality
Impact of poor data quality:
- Bad decisions: Strategies built on incorrect information
- Lost productivity: Staff spend time finding and fixing data instead of analysing it
- Eroded trust: Once people distrust the data, they stop using analytics
- Customer impact: Wrong addresses, duplicate communications, incorrect billing
- Compliance risk: Regulatory reporting with incorrect data
- Rework: Reports need rebuilding when errors are discovered
Research consistently shows that poor data quality costs organisations significant revenue—IBM estimated $3.1 trillion annually for the US economy alone.
Dimensions of Data Quality
Data quality isn't a single measure—it has multiple dimensions:
Accuracy
Does the data correctly represent reality? Is the customer's address their actual address? Is the order amount what was actually charged?
Completeness
Are all required values present? Missing email addresses, null dates, blank customer names. Incomplete data limits analysis.
Consistency
Does the data agree across systems? Does the customer count match between CRM and billing? Do sales figures match between the order system and the data warehouse?
Timeliness
Is data available when needed? Is it current enough for its intended use? Yesterday's inventory data may be useless for real-time ordering decisions.
Uniqueness
Are there duplicates? The same customer entered twice, orders recorded multiple times. Duplicates distort analysis and waste resources.
Validity
Does data conform to expected formats and ranges? Email addresses that aren't emails, dates in the future, negative quantities. Valid format doesn't mean accurate, but invalid format is definitely wrong.
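Validity checks like these reduce to simple predicate functions. A minimal sketch (the specific rules and field names here are illustrative assumptions, not a standard):

```python
import re
from datetime import date

def is_valid_email(value):
    # Loose format check: something@something.tld.
    # Valid format does not guarantee the address is real or accurate.
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value or ""))

def is_valid_order_date(d):
    # Orders cannot be dated in the future
    return d <= date.today()

def is_valid_quantity(q):
    # Quantities must be positive
    return q > 0

print(is_valid_email("alice@example.com"))  # True
print(is_valid_email("not-an-email"))       # False
print(is_valid_quantity(-3))                # False
```

Note the asymmetry the text describes: a record can pass every format check and still be wrong, but a record that fails them is definitely wrong.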
Assessing Data Quality
Data Profiling
Automated analysis of data patterns. Profile columns for: null rates, unique values, value distributions, min/max/average, pattern detection.
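A column profile of this kind can be computed with a few lines of code. A minimal sketch over plain Python lists (real profiling tools add pattern detection and run at scale):

```python
from collections import Counter

def profile_column(values):
    """Profile one column: null rate, distinct count, min/max/mean, top values."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    return {
        "count": len(values),
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": sum(numeric) / len(numeric) if numeric else None,
        "top_values": Counter(non_null).most_common(3),
    }

# Illustrative sample: two nulls and a suspiciously frequent value
ages = [34, 29, None, 29, 41, None, 29]
print(profile_column(ages))
```

Running profiles like this per column quickly surfaces anomalies—unexpected null rates, implausible min/max values, or one value dominating the distribution.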
Business Rule Validation
Check data against business rules. Order date must be before ship date. Customer age must be 18+. Product price must be positive. Violations indicate quality issues.
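The rules above can be expressed as named predicates and applied to each record. A sketch (the rule set and record fields are the hypothetical examples from the text):

```python
from datetime import date

# Hypothetical rules matching the examples in the text
RULES = [
    ("order before ship", lambda r: r["order_date"] <= r["ship_date"]),
    ("customer is adult", lambda r: r["customer_age"] >= 18),
    ("positive price",    lambda r: r["price"] > 0),
]

def validate(record):
    """Return the names of all rules the record violates."""
    return [name for name, check in RULES if not check(record)]

order = {"order_date": date(2024, 5, 2), "ship_date": date(2024, 5, 1),
         "customer_age": 17, "price": 19.99}
print(validate(order))  # ['order before ship', 'customer is adult']
```

Keeping rules as data (name plus check) makes it easy to report which rules fail most often—a direct input to the quality metrics below.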
Cross-System Reconciliation
Compare data across systems that should agree. Row counts, totals, key values. Differences reveal quality problems in one or both systems.
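A reconciliation check typically compares row counts, totals, and key sets in one pass. A minimal sketch, assuming both systems can be read into lists of dicts sharing a key and an amount field:

```python
def reconcile(name, source_rows, target_rows, key, amount):
    """Compare row counts, totals, and key sets between two systems."""
    src_keys = {r[key] for r in source_rows}
    tgt_keys = {r[key] for r in target_rows}
    return {
        "check": name,
        "row_count_diff": len(source_rows) - len(target_rows),
        "total_diff": (sum(r[amount] for r in source_rows)
                       - sum(r[amount] for r in target_rows)),
        "missing_in_target": sorted(src_keys - tgt_keys),
        "unexpected_in_target": sorted(tgt_keys - src_keys),
    }

# Illustrative: order 2 never made it into the warehouse
orders    = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 50.0}]
warehouse = [{"id": 1, "amount": 100.0}]
print(reconcile("orders vs warehouse", orders, warehouse, "id", "amount"))
```

The key-set differences matter as much as the totals: matching totals can mask offsetting errors, while missing or unexpected keys point directly at the records to investigate.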
Quality Metrics to Track
- Percentage of records passing quality rules
- Null rate per critical field
- Duplicate rate
- Cross-system variance
- Data freshness (time since last update)
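Most of these metrics can be computed from a batch of records in one function. A sketch under the assumption that records are dicts, with a caller-supplied list of required fields, a key field for duplicate detection, and a last-load timestamp for freshness:

```python
from datetime import datetime, timezone

def quality_metrics(records, required_fields, key_field, last_load):
    """Compute null rates, duplicate rate, and freshness for a batch."""
    n = len(records)
    null_rates = {
        f: sum(1 for r in records if r.get(f) is None) / n
        for f in required_fields
    }
    keys = [r[key_field] for r in records]
    duplicate_rate = 1 - len(set(keys)) / n
    age = datetime.now(timezone.utc) - last_load
    return {
        "null_rates": null_rates,
        "duplicate_rate": duplicate_rate,
        "freshness_hours": round(age.total_seconds() / 3600, 1),
    }

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},            # duplicate key, missing email
    {"id": 2, "email": "b@example.com"},
]
print(quality_metrics(records, ["email"], "id", datetime.now(timezone.utc)))
```

Tracking these numbers over time matters more than any single snapshot: a null rate that doubles week over week signals a broken upstream process even if the absolute level still looks acceptable.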
Improving Data Quality
Fix at the Source
The best time to ensure quality is at data entry. Input validation, required fields, dropdown selections instead of free text. Preventing bad data is easier than fixing it later.
Data Cleansing
Correcting existing data problems:
- Standardisation: Consistent formats, spellings, abbreviations
- Deduplication: Identifying and merging duplicate records
- Enrichment: Adding missing data from external sources
- Correction: Fixing known errors
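Standardisation and deduplication often work together: records are first normalised to a common form, then merged on the normalised key. A simplified sketch (the phone-digits matching rule is an illustrative assumption; production matching uses fuzzier logic):

```python
import re

def standardise_phone(raw):
    """Standardisation: strip punctuation and keep digits only."""
    return re.sub(r"\D", "", raw or "")

def deduplicate(customers):
    """Deduplication: merge records that share a standardised phone number,
    keeping the first non-null value seen for each field."""
    merged = {}
    for c in customers:
        key = standardise_phone(c["phone"])
        if key not in merged:
            merged[key] = dict(c)
        else:
            # Enrichment within the merge: fill gaps from the duplicate
            for field, value in c.items():
                if merged[key].get(field) is None:
                    merged[key][field] = value
    return list(merged.values())

rows = [
    {"name": "Ann Lee", "phone": "(555) 010-1234", "email": None},
    {"name": "Ann Lee", "phone": "555-010-1234", "email": "ann@example.com"},
]
print(deduplicate(rows))  # one merged record with the email filled in
```

The merge order is a policy decision: here the first record wins for conflicting fields, but real cleansing rules might prefer the most recently updated source.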
Ongoing Monitoring
Data quality degrades over time. Customers move, employees leave, systems change. Continuous monitoring catches issues before they accumulate.
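A monitoring step in a pipeline usually reduces to comparing each batch's metrics against agreed thresholds and raising alerts on breaches. A minimal sketch (the metric names and limits are hypothetical examples):

```python
def monitor(batch_metrics, thresholds):
    """Compare a batch's quality metrics against thresholds; return alerts."""
    alerts = []
    for metric, limit in thresholds.items():
        value = batch_metrics.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value:.2%} exceeds limit {limit:.2%}")
    return alerts

# Hypothetical thresholds agreed with the data owner
thresholds = {"null_rate_email": 0.05, "duplicate_rate": 0.01}
todays_batch = {"null_rate_email": 0.12, "duplicate_rate": 0.004}
print(monitor(todays_batch, thresholds))
```

Wired into a pipeline, a non-empty alert list can fail the load or page the data owner, so degradation is caught on the day it starts rather than months later.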
Data Quality Tools
Open Source: Great Expectations, dbt tests, Apache Griffin
Commercial: Informatica Data Quality, Talend, Microsoft Data Quality Services
Data Quality Governance
Data Ownership
Assign ownership for each critical data domain. The owner is accountable for quality, definitions, and access. Without ownership, nobody is responsible.
Quality Standards
Define acceptable quality levels. 100% accuracy isn't always practical or cost-effective. Set thresholds based on business impact.
Issue Resolution Process
Clear process for reporting and fixing quality issues. Who investigates? Who fixes root causes? How is progress tracked?
Quality Reporting
Regular reporting on quality metrics. Dashboard showing current state, trends, issues. Make quality visible to leadership.
Practical Approach
- Prioritise: Focus on data that matters most—what drives critical decisions?
- Assess: Profile and measure current quality state
- Clean: Address highest-impact issues first
- Prevent: Fix root causes, improve data entry
- Monitor: Continuous quality checks in data pipelines
- Govern: Establish ownership, standards, and processes
Start small: Don't try to fix all data quality at once. Pick one critical data domain, demonstrate improvement, then expand.
Summary
Data quality is not a one-time project—it's an ongoing discipline. Poor quality data undermines every analytics initiative, eroding trust and leading to bad decisions. Quality has multiple dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity.
Effective data quality management combines prevention (fix at source), detection (profiling and monitoring), correction (cleansing), and governance (ownership and standards). Start with the data that matters most and build quality disciplines incrementally.
