Peter Piper on the Four Ps of AI Data Quality: Purge, Patch, Push Back, or Pass

How does a data team prevent poor data from poisoning AI when they have piles of raw and imperfect data?

Teams responsible for data used to train AI models (e.g., LLMs) face a persistent problem: piles of raw, imperfect data. Pressure builds to process quickly, publish promptly, and push data into pipelines. But passing problematic data into production-powered models can produce biased predictions, polluted patterns, and poor performance.

Before pipelines proceed, managers must pause and pick a path. In practice, there are four practical options for handling raw data: Pass, Purge, Patch, or Push Back.

Let’s walk through the four options in a pragmatic progression.

1. Pass: The Path of Least Preparation

You can do nothing and simply pass all data to the model. No profiling. No policing. No protection.

This path promises speed and simplicity. It is the least work, but it carries the largest risk. Poorly prepared data propagates problems downstream, where models may:

  • Learn patterns that never persist
  • Produce predictions based on polluted fields
  • Perpetuate problems at production scale

Passing should be a conscious, calculated choice – not a default. 

Before passing, teams should profile data quality and consider the remaining paths.
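
Even a quick profile beats passing blindly. As a minimal sketch in pandas — assuming the raw data sits in a CSV (the file name and columns here are hypothetical, not a prescribed schema) — the following surfaces null rates, types, and cardinality per field:

```python
import pandas as pd

# Hypothetical input file; substitute your own raw extract.
df = pd.read_csv("raw_records.csv")

# Quick profile: how polluted is each field?
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),
})
print(profile.sort_values("null_pct", ascending=False))
```

Even this crude profile turns "pass" from a default into a decision: you know what you are waving through.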

2. Purge: Preventing Poisoned Patterns

When a record contains a clear data quality problem, the most prudent path may be to purge it.

Purge means delete.

Examples include:

  • A date in the future when future values are impossible
  • A required field that is missing
  • A primary key that is null or malformed

Purging prevents polluted records from poisoning patterns learned by AI models. While purging reduces volume, it protects validity and preserves precision.

This is not punishment – it is protection.
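
A minimal purge sketch in pandas, mirroring the three examples above (the column names service_date, provider_name, and record_id are assumptions for illustration). Flagged rows are set aside rather than silently discarded, which helps with auditing and with pushing back later:

```python
import pandas as pd

df = pd.read_csv("raw_records.csv", parse_dates=["service_date"])
today = pd.Timestamp.today().normalize()

# Flag records that violate the three example rules.
bad = (
    (df["service_date"] > today)      # date in the future
    | df["provider_name"].isna()      # required field missing
    | df["record_id"].isna()          # primary key null
)

purged = df[bad]    # retain for auditing and supplier feedback
clean = df[~bad]
print(f"Purged {len(purged)} of {len(df)} records")
```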

3. Patch: Precise, Programmatic Problem-Solving

Sometimes, problems are predictable — and patchable.

If your team knows how to fix an issue safely, patching is powerful.

Examples include:

  • Misspelled fields or values
  • Multiple phrases for the same concept: “MIT” vs. “Massachusetts Institute of Technology”
  • Missing identifiers that can be populated from permitted public sources, e.g., NPI numbers from the National Plan and Provider Enumeration System (hhs.gov)

Patching preserves records while improving precision. It is particularly powerful when:

  • Problems are patterned
  • Fixes are provable
  • Processes are programmable

Patch with purpose – not guesswork.
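
As an illustration, a patch sketch in pandas: a lookup table maps known variants to one canonical value, so every fix is patterned, provable, and programmable. The column names and variants here are assumptions, not a prescribed schema:

```python
import pandas as pd

df = pd.read_csv("raw_records.csv")

# Provable, patterned fix: map known variants to a canonical value.
CANONICAL_SCHOOL = {
    "MIT": "Massachusetts Institute of Technology",
    "Mass. Inst. of Technology": "Massachusetts Institute of Technology",
}
df["school"] = df["school"].replace(CANONICAL_SCHOOL)

# Patterned spelling cleanup, applied only where the fix is unambiguous.
df["state"] = df["state"].str.strip().str.upper().replace({"MASS": "MA"})
```

The lookup table doubles as documentation: every patch the pipeline applies is written down, reviewable, and reversible.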

4. Push Back: Partner Pressure for Proper Data

Sometimes the problem is upstream.

When data comes from providers, platforms, or partners, teams can push back:

  • Send suppliers a precise list of problematic elements
  • Provide proof, percentages, and problem patterns
  • Request correction at the source

You have more leverage when:

  • You pay for the data
  • You have published data quality requirements
  • Your contracts include quality provisions

Pushing back promotes partnership, not punishment. It improves future feeds, reduces repeated patching, and produces more predictable pipelines.
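
To make the push-back concrete, here is a sketch of the kind of evidence package a team might generate — again with a hypothetical feed file and key column — pairing per-field problem percentages with the actual offending rows:

```python
import pandas as pd

df = pd.read_csv("supplier_feed.csv")

# Proof and percentages: per-field problem rates for the supplier.
missing_pct = (df.isna().mean() * 100).round(1)
dup_pct = round(df["record_id"].duplicated().mean() * 100, 1)

print("Missing values by field (%):")
print(missing_pct.sort_values(ascending=False))
print(f"Duplicate primary keys: {dup_pct}%")

# Attach the offending rows so the source can correct them.
df[df["record_id"].duplicated(keep=False)].to_csv(
    "duplicate_keys_for_supplier.csv", index=False
)
```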

Assess Data Quality with DataKitchen TestGen

Before picking an option that requires action, teams must assess data quality.

DataKitchen’s TestGen enables teams to:

  • Profile datasets quickly
  • Pinpoint problematic records
  • Produce precise data quality tests
  • Output detailed issue listings
  • Publish professional reports for partners and providers

TestGen helps teams decide:

  • Which records to patch
  • Which records to purge
  • When to push back
  • When it’s safe to pass

Most importantly: don’t let bad data pass blindly.
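
The four paths can also be wired into the pipeline itself. Below is a minimal, tool-agnostic sketch (this is not TestGen’s API; the rules and field names are invented for illustration) that routes each record to one of the four Ps:

```python
KNOWN_VARIANTS = {"MIT", "Mass. Inst. of Technology"}  # patchable values

def choose_path(record: dict) -> str:
    """Route one record to a path. Rules here are illustrative only."""
    if record.get("record_id") is None:
        return "purge"        # unfixable: no primary key
    if record.get("school") in KNOWN_VARIANTS:
        return "patch"        # a provable, programmable fix exists
    if record.get("npi") is None and record.get("source") == "supplier":
        return "push_back"    # missing value the supplier should fix
    return "pass"             # profiled and judged safe to pass

print(choose_path({"record_id": 7, "school": "MIT"}))  # -> "patch"
```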

Conclusion: Purposeful Preparation Produces Powerful Predictions

Passing poor data produces predictable problems. Purposeful preparation prevents polluted pipelines. By profiling proactively, purging problematic records, patching predictable problems, and pushing back on poor providers, teams gain control, confidence, and credibility.

With precise profiling, principled processes, and practical platforms like TestGen, managers can protect pipelines, promote performance, and produce powerful, polished AI models — not by chance, but by plan.

Gil Benghiat
Gil Benghiat is one of three founders of DataKitchen, a company on a mission to enable analytic teams to deliver value quickly and with high quality. Gil’s career has always been data-oriented and has included positions collecting and displaying network data at AT&T Bell Laboratories (now Alcatel-Lucent), managing data at Sybase (purchased by SAP), collecting and cleaning clinical trial data at PhaseForward (IPO then purchased by Oracle), integrating pharmaceutical sales data at LeapFrogRx (purchased by Model N), and liberating data at Solid Oak Consulting. Gil holds an MS in computer science from Stanford University and a BS in applied mathematics and biology from Brown University.
