Data Quality When You Don’t Understand the Data: Data Quality Coffee With Uncle Chip #3
Let’s be honest—data quality feels impossible when you don’t understand the data. And in large organizations, that’s not a rare problem. It’s the norm. We’ve seen it firsthand: massive data estates maintained by teams who don’t know what the numbers, strings, or categories in their tables really mean. This isn’t because people are lazy or careless. It’s structural. Domain experts and technical folks operate on parallel tracks—outsourced, offshored, remote, and siloed. They often don’t even speak the same language, let alone share the same Slack channels.
That’s the crux of the issue: you can’t test for what you don’t understand. And so, we don’t. We skip it. We assume. Or worse, we wait until something breaks. Until a business leader notices that a dashboard shows the wrong customer segments or a machine learning model recommends absurd products. This reactive, fire-drill culture around data quality is like posting a guard at the front door while the back door is wide open. We’re trying to protect our data, but we’re guarding the wrong entry points—and often doing it way too late.
In most enterprises, documentation is stale or useless. People who once understood the data have moved on. Meanwhile, new projects are spun up rapidly, and datasets are reused for purposes far removed from their original intent. AI, predictive modeling, real-time analytics: none of these cutting-edge initiatives can succeed without clean, reliable data. But how do you build confidence in the data if you have no context, no SME on speed dial, and no time to write detailed tests?
That’s where DataOps Data Quality TestGen comes in.
DataKitchen created TestGen because we were tired of the myth that you must know everything about your data before testing it. That’s not true. You can start with common sense—and scale from there. TestGen is open-source software that runs a series of tests on your data without requiring you to write a single line of YAML, SQL, or Python. It begins with a profiling step: it reads your data column by column, learns its shape, content, and structure, and immediately flags issues obvious to any human, but usually overlooked by automated tools.
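To make the profiling idea concrete, here is a minimal sketch of what column-by-column profiling can look like in plain Python. It is an illustration only, not TestGen's code: the function names `profile_column` and `hygiene_flags`, the 20% missing-value threshold, and the sample data are all assumptions made up for this example.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, type mix."""
    # Treat None and empty strings as missing (an assumption for this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    type_counts = Counter(type(v).__name__ for v in non_null)
    return {
        "rows": len(values),
        "missing": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": dict(type_counts),
        "top_values": Counter(non_null).most_common(3),
    }


def hygiene_flags(profile, max_missing_share=0.2):
    """Turn a profile into common-sense warnings; the threshold is arbitrary."""
    flags = []
    if profile["rows"] and profile["missing"] / profile["rows"] > max_missing_share:
        flags.append("high share of missing values")
    if len(profile["types"]) > 1:
        flags.append("mixed value types")
    return flags


# Hypothetical country-code column with two missing entries and a stray number.
sample = ["US", "US", "CA", None, "", 42, "US"]
profile = profile_column(sample)
flags = hygiene_flags(profile)
print(profile)
print(flags)
```

Nothing here requires knowing what the column means. The checks only ask whether the column is consistent with itself, which is the kind of common-sense starting point described above.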
TestGen covers 27 kinds of data hygiene problems: from the usual suspects like missing values and bad formats to more subtle issues like inconsistent categories, skewed distributions, and internal contradictions. It also flags 24 different types of personally identifiable information (PII). This is especially important in today’s privacy-conscious world, where protecting sensitive information is not optional—it’s mandatory. And let me tell you, most teams miss the obvious stuff. It’s not just about encryption and access control; it’s about knowing what you’ve got in the first place.
What’s powerful about TestGen is how it uses profiling to infer rules. It builds a baseline of your data—what it looks like when it’s healthy—and then checks future data against that baseline. Any deviation gets flagged, even if it’s just a slight shift in category frequency or a sudden drop in data recency. Think about it: if a critical column suddenly has a dozen new codes that no one’s seen before, wouldn’t you want to know? That’s not just hygiene—it’s how you spot systemic breakdowns before they reach the business layer.
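Here is a rough sketch of the baseline idea for a single categorical column, again in plain Python. It is not how TestGen implements it: the helpers `build_baseline` and `compare_to_baseline`, the 10-point `max_shift` tolerance, and the sample codes are hypothetical and chosen only to show the pattern.

```python
from collections import Counter


def build_baseline(values):
    """Record the share of rows held by each category in a known-good sample."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def compare_to_baseline(baseline, new_values, max_shift=0.10):
    """Flag unseen categories and categories whose share moved more than max_shift."""
    current = build_baseline(new_values)
    findings = []
    for value in sorted(set(current) - set(baseline)):
        findings.append(f"new value not in baseline: {value!r}")
    for value, expected in sorted(baseline.items()):
        observed = current.get(value, 0.0)
        if abs(observed - expected) > max_shift:
            findings.append(
                f"frequency shift for {value!r}: {expected:.2f} -> {observed:.2f}"
            )
    return findings


# Hypothetical status codes: a healthy load, then a load with a new code "X".
healthy = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
incoming = ["A"] * 50 + ["B"] * 10 + ["C"] * 20 + ["X"] * 20
findings = compare_to_baseline(build_baseline(healthy), incoming)
for finding in findings:
    print(finding)
```

The baseline comes from the data itself, so nobody has to write down in advance which codes are valid. When someone who knows the domain does show up, the tolerance and the allowed values are easy to tighten.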
We also baked in tests for transactional data freshness, statistical outliers, duplicate detection, uniqueness violations, and more. These tests don’t need a PhD in data science to understand or run. They give you visibility into what’s breaking and why, without needing perfect documentation or tribal knowledge. And that’s the key: you don’t have to be a business expert to start improving data quality. You just have to start.
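None of these checks needs domain knowledge to get started. The sketch below shows simple stand-ins for three of them: freshness, duplicate keys, and outliers. These are illustrations, not TestGen's implementations. The function names, the 24-hour freshness window, and the sample values are assumptions, and the outlier check uses a modified z-score based on the median absolute deviation, a common general-purpose technique that is swapped in here purely as an example.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median


def is_stale(timestamps, now, max_age):
    """True when the newest record is older than max_age (needs at least one row)."""
    return now - max(timestamps) > max_age


def duplicate_keys(keys):
    """Return key values that appear more than once."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)


def outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction data.
now = datetime(2024, 1, 10, 12, 0)
loaded_at = [datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 9, 9, 0)]
order_ids = ["o-1", "o-2", "o-2", "o-3"]
amounts = [10, 12, 11, 13, 12, 11, 500]

stale = is_stale(loaded_at, now, max_age=timedelta(hours=24))
dupes = duplicate_keys(order_ids)
odd_amounts = outliers(amounts)
print(stale, dupes, odd_amounts)
```

Each function answers one plain question: is the data recent, is the key really unique, does any value sit far from the rest. That is enough to start a conversation with whoever owns the source system.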
The truth is, most people working in data are waiting for someone else to fix things. Don’t wait. Be the one who turns on the lights. Start running tests. See what you find. Refine your rules over time. You don’t need to be an expert to make a difference—you just need initiative, and maybe a little help from software built for this messy, complex world.
So here’s our advice: Download TestGen. It’s free, open source, and takes minutes to connect. Start measuring your data’s quality and surfacing the issues hiding in plain sight. You’ll be amazed at what you find. And if you hit a wall or have a really gnarly problem to solve, reach out to DataKitchen!
Because when we stop pretending we understand every data column and acknowledge that we don’t, we can finally begin to see the problems and do something about them. And in our book, that’s where real data quality begins.