Integrations & Data · 8 min read

Clean Data for AI: Why Data Quality Matters Before You Automate

Why data quality is the foundation of every AI project — what clean data means in practice, how to assess yours, and what to fix before automating.

What does "clean data" mean?

Clean data isn't perfect data. It's data that's consistent, reasonably current, and structured enough that systems (including AI) can read and use it reliably.

In practical terms, clean data means:

  • Consistent formatting — dates in the same format, phone numbers normalised, addresses structured, categories spelled the same way every time
  • No (or few) duplicates — one record per entity, not three records for the same customer with slight name variations
  • Reasonably current — contact details updated within the last 12–24 months, stale records flagged or archived
  • Required fields populated — the fields that matter for your workflow are actually filled in, not blank
  • Correct relationships — invoices linked to the right customer, products linked to the right category, contacts linked to the right company

That's it. You don't need data science–grade perfectionism. You need operational reliability.

Why it matters for AI

AI doesn't fix bad data. It processes it faster.

If your CRM has three records for the same customer (John Smith, J. Smith, and Smithy John), an AI-powered email triage system will create a fourth. If your product catalogue has inconsistent categories, AI will classify new items inconsistently too.

This is the "garbage in, garbage out" problem — but accelerated. Manual processes have humans who notice anomalies and quietly correct them. Automated processes run at scale without that safety net.

The maths is simple:

  • Manual process with 100 records/week + human quality checking = small, contained errors
  • AI process with 1,000 records/week + no quality baseline = large, compounding errors

Clean your data before you automate, not after.

Common data quality problems

Duplicates

The most common problem. The same entity (customer, supplier, product) appears multiple times with slight variations. This fragments your data — half the invoices are on one record, half on another. AI can't reconcile what you haven't reconciled yourself.

Inconsistent formatting

Dates stored as "15/04/2025", "April 15 2025", and "2025-04-15" in the same column. Phone numbers with and without area codes. Addresses with and without state abbreviations. AI can handle some variation, but inconsistency reduces accuracy.

Missing fields

Required fields left blank — no email address on the contact, no ABN on the supplier, no category on the product. If AI needs a field to route, classify, or match, and it's empty, the automation fails.

Stale records

Records that haven't been updated in years. Former employees still marked as active. Suppliers you haven't used since 2019. Products that were discontinued. Stale records add noise and reduce the signal AI can extract.

Broken relationships

Invoices linked to the wrong customer. Contacts associated with the wrong company. Products in the wrong category. Relationship errors cascade through any system that relies on those connections.

How to assess your data

A data quality assessment doesn't need to be a six-month project. For most small to mid-size businesses, a focused two-week assessment covers it:

  1. Pick the data sets that matter — which data will your AI project use? Customer records? Supplier records? Product catalogue? Invoice history? Focus there.
  2. Check completeness — what percentage of records have the required fields populated? A CRM with 40% of contacts missing email addresses is a problem for any email-related automation.
  3. Check uniqueness — how many duplicate records exist? Run a deduplication analysis using fuzzy matching on key fields (name, email, ABN).
  4. Check consistency — are formats consistent within each field? Sample 100 records and check.
  5. Check currency — when were records last updated? What percentage are older than 12 months with no recent activity?
  6. Document the findings — summarise what's clean, what's fixable, and what's a blocker. This becomes your data remediation plan.
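Steps like the completeness check don't need specialist tooling. As a sketch, over a contact list exported as a list of dicts — field names and records here are invented for illustration:

```python
# Completeness check (step 2): what share of records have each required field?
# A sketch over an export loaded as dicts; field names are illustrative.
REQUIRED_FIELDS = ["name", "email", "category"]

def completeness(records: list[dict]) -> dict[str, float]:
    """Percentage of records with each required field non-blank."""
    total = len(records)
    return {
        field: round(100 * sum(1 for r in records if str(r.get(field) or "").strip()) / total, 1)
        for field in REQUIRED_FIELDS
    }

contacts = [
    {"name": "Acme Pty Ltd", "email": "hello@acme.example", "category": "supplier"},
    {"name": "J. Smith", "email": "", "category": "customer"},
    {"name": "Smithy John", "email": None, "category": "customer"},
]
print(completeness(contacts))  # {'name': 100.0, 'email': 33.3, 'category': 100.0}
```

The same loop, pointed at a real CRM export, gives you the percentages the benchmarks below are measured against.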

Fixing your data

Data remediation follows a practical priority order:

Quick wins (days, not weeks)

  • Archive or flag stale records (no activity in 2+ years)
  • Normalise obvious formatting issues (date formats, phone numbers, state abbreviations)
  • Merge obvious duplicates (exact name + exact email matches)
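The last quick win — finding exact-match duplicates — can be sketched like this. The record shape and field names are assumptions, not a real CRM schema:

```python
from collections import defaultdict

# Quick-win dedup: group on exact (normalised) name + email.
# Anything beyond exact matches goes to fuzzy review, not auto-merge.
def exact_duplicate_groups(records: list[dict]) -> list[list[dict]]:
    """Return groups of records sharing the same normalised name and email."""
    groups = defaultdict(list)
    for r in records:
        key = (r["name"].strip().lower(), r["email"].strip().lower())
        groups[key].append(r)
    return [g for g in groups.values() if len(g) > 1]

records = [
    {"id": 1, "name": "John Smith", "email": "john@example.com"},
    {"id": 2, "name": "john smith ", "email": "John@Example.com"},
    {"id": 3, "name": "J. Smith", "email": "j.smith@example.com"},
]
dupes = exact_duplicate_groups(records)
print([[r["id"] for r in g] for g in dupes])  # [[1, 2]]
```

Note that record 3 ("J. Smith") is deliberately not matched — that's a fuzzy case for the medium-effort pass.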

Medium effort (1–2 weeks)

  • Fuzzy duplicate resolution — review and merge records that are probably duplicates but need human confirmation
  • Field population — fill in missing required fields from source documents, websites, or direct contact
  • Category cleanup — standardise categories, tags, and classifications
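Fuzzy duplicate resolution is the step that needs a human in the loop. A rough sketch using Python's standard-library SequenceMatcher — the 0.8 threshold is a starting point, not a recommendation:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Fuzzy duplicate candidates: pairs above a similarity threshold go to a
# human review queue -- never auto-merged.
def fuzzy_candidates(names: list[str], threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Return (name_a, name_b, score) pairs whose similarity clears the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

names = ["John Smith", "Jon Smith", "Smithy John", "Acme Pty Ltd"]
for a, b, score in fuzzy_candidates(names):
    print(f"{a!r} vs {b!r}: {score}")
```

Character-level similarity catches "Jon Smith" but misses reorderings like "Smithy John" — which is exactly why this output is a review queue, not a merge script.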

Ongoing discipline (permanent)

  • Set validation rules on data entry forms — required fields, format constraints, duplicate checks
  • Schedule periodic data quality audits — quarterly is usually sufficient
  • Assign data ownership — someone is responsible for each data domain
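Validation rules at the point of entry can start as something this simple — a sketch, with the rules and field names invented for illustration:

```python
import re

# Entry-point validation: reject records that fail required-field or format
# rules before they land in the system. Rules here are illustrative.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "postcode": re.compile(r"^\d{4}$"),  # Australian postcodes are four digits
}
REQUIRED = ["name", "email"]

def validate(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = [f"missing required field: {f}"
              for f in REQUIRED if not str(record.get(f) or "").strip()]
    for field, pattern in RULES.items():
        value = str(record.get(field) or "").strip()
        if value and not pattern.fullmatch(value):
            errors.append(f"bad format: {field}={value!r}")
    return errors

print(validate({"name": "Acme Pty Ltd", "email": "hello@acme.example", "postcode": "3000"}))  # []
print(validate({"name": "", "email": "not-an-email", "postcode": "30"}))
```

Most CRMs and form builders let you express the same rules declaratively — the point is that they exist at entry time, not at cleanup time.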

What "good enough" looks like

You don't need perfection. Here are practical benchmarks:

Metric                                 Minimum for AI   Good
Required fields populated              80%              95%+
Duplicate rate                         <10%             <3%
Format consistency                     85%              95%+
Records current (updated <24 months)   70%              90%+

If you're above the minimum thresholds, you can start an AI project and improve data quality in parallel. If you're below, fix the data first — the AI project will fail otherwise.
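Those thresholds can be turned into a simple go/no-go check. A sketch — the metric names and measured values are invented for illustration:

```python
# Go/no-go check against the minimum-for-AI thresholds in the table above.
# Metric values are percentages; duplicate rate is the one where lower is better.
MINIMUMS = {
    "required_fields_populated": 80.0,
    "format_consistency": 85.0,
    "records_current": 70.0,
}
MAX_DUPLICATE_RATE = 10.0  # table says duplicate rate must be below 10%

def ready_for_ai(metrics: dict) -> bool:
    """True if every metric clears its minimum-for-AI threshold."""
    if metrics["duplicate_rate"] >= MAX_DUPLICATE_RATE:
        return False
    return all(metrics[name] >= floor for name, floor in MINIMUMS.items())

measured = {
    "required_fields_populated": 86.0,
    "duplicate_rate": 4.0,
    "format_consistency": 91.0,
    "records_current": 78.0,
}
print(ready_for_ai(measured))  # True
```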

Frequently asked questions

Can AI fix our data for us?

Partially. AI can help with deduplication (identifying probable duplicates), normalisation (standardising formats), and enrichment (filling gaps from external sources). But it can't fix fundamental problems like incorrect data or broken relationships without human oversight.

How long does a data cleanup take?

For a typical small-to-mid business CRM or ERP: 2–4 weeks for the initial cleanup, with ongoing maintenance discipline after that. Larger organisations with multiple systems may need 6–8 weeks.

Do we need a data warehouse?

Not necessarily. For most AI projects, working with data directly from your CRM, ERP, or operational systems is fine. A data warehouse is useful when you need to combine data from many sources or run complex analytics — but it's not a prerequisite for AI automation.

What if we have data in spreadsheets?

Spreadsheets are the most common data source in Australian SMEs. They're fine as a starting point — but migrating to a structured system (CRM, database, or at minimum a consistent spreadsheet format) should be on your roadmap. AI can read spreadsheets, but maintaining data quality in them is harder.

Should we clean everything before talking to an AI vendor?

No. Talk to us early — a good AI partner will help you prioritise what to clean (and what doesn't matter for your specific use case). Some data sets need a lot of work; others are already fine. We'd rather assess your data and give you an honest answer than have you spend months cleaning data that doesn't need it.

Key takeaways

  • AI amplifies your data quality — good data in, good results out. Bad data in, bad results out, faster.
  • Most businesses don't need perfect data. They need data that's consistent, reasonably current, and structured enough for AI to read.
  • The biggest data quality issues are duplicates, inconsistent formatting, stale records, and missing fields — all fixable.
  • Assess your data quality before starting an AI project. A $5K data audit can save you $50K in failed automation.

Ready to discuss your project?

Tell us what you're working on. We'll come back with a practical recommendation and clear next steps.