I recently met with a data lead who had this reaction to a litany of data inconsistencies and usability problems: “Oh, we don’t have to worry about those. Users know about them. They have workarounds.” It’s a surprisingly common approach on data teams where the sheer bulk of data overwhelms the best intentions. At this company, as at many I’ve observed, it was dangerous in subtle as well as obvious ways.
Our company, DataKitchen, focuses on applying DataOps principles to support data pipelines. We recently released an open-source data quality screening and assessment tool, TestGen, that identifies and documents a wide range of errors, inconsistencies, and usability issues in SQL datasets. One fascinating benefit of the work has been the chance to review the data pipelines of a large number of companies. We’ve developed a set of Impact Dimensions that can pinpoint the level and locus of risk that test failures signify. Along the way, I’ve found some eye-opening trends in the way organizations can talk themselves out of even acknowledging these issues, much less coming to terms with their effects.
These rationalizations are flimsy, but they are also ubiquitous. And the mindset has only become more damaging, as companies start to focus on AI Readiness, tapping more of their data as a strategic resource in unforeseen ways. The immediate result is to sabotage innovators who should be driving insights and data science advances. The ultimate impact is a snowballing series of failures triggering a loss of trust that can seem intractable.
The obvious problem at this company was that the data team’s “users” — data scientists and analysts focused on building data products and answering business questions — were certainly not aware of all the data problems, triggering cascades of inaccuracies. And when analysts did know, they had different ways of addressing them in their own deliverables, spawning different versions of the truth.
Of course, if data engineers were too swamped to deal with quality problems, analysts and data scientists surely weren’t going to resolve them. Data quality wasn’t their priority, and they were not going to manage it effectively. Self-service had evolved to mean, “I get to do someone else’s job along with mine.” Not good.
It’s important to note that some of these data problems involve business context, but many of them do not. These are not classic data quality issues, where data in a system doesn’t match the business activity it tracks. Many usability issues we screen for — like inconsistent representations of categories or missing values, casing problems, invalid characters, obvious duplicates — are artifacts of insufficient validation, processing failures, or integration discrepancies. They don’t require long investigations or business rule deep-dives to fix.
But they do create landmines for downstream users, who need to know how and where to find them and then need to spend extra time to clean them up. And when data is swept into AI systems at scale, the impact may not be evident until too late.
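Checks like these are mechanical enough to automate. Here is a minimal sketch of this kind of usability screening, with hypothetical rules written from scratch (not TestGen's actual checks or API):

```python
import re
from collections import Counter

# Illustrative set of strings commonly used to mean "no data".
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "-"}

def screen_column(values):
    """Return a list of usability issues found in a column of raw strings.

    Hypothetical screens in the spirit of those described in the text;
    not any specific tool's rules.
    """
    issues = []
    present = [v for v in values if v is not None]
    trimmed = [v.strip() for v in present]

    # Inconsistent casing: distinct raw values collapse when case-folded.
    if len({t.casefold() for t in trimmed}) < len(set(trimmed)):
        issues.append("inconsistent casing")

    # Invalid characters: anything outside printable ASCII.
    if any(re.search(r"[^\x20-\x7e]", v) for v in present):
        issues.append("invalid characters")

    # Mixed missing-value markers: more than one way of saying "no data".
    if len({t.casefold() for t in trimmed} & MISSING_MARKERS) > 1:
        issues.append("mixed missing-value markers")

    # Obvious duplicates once whitespace and case are normalized.
    counts = Counter(t.casefold() for t in trimmed)
    if any(n > 1 for n in counts.values()):
        issues.append("likely duplicates after normalization")

    return issues
```

Running this over a messy column such as `["Acme", "ACME ", "N/A", "null"]` flags all of these issues at once; a clean column returns an empty list. The point is that none of the checks needs business context, only consistent representation rules.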
Since this data team hadn’t taken responsibility for identifying these issues, they didn’t have their fingerprints on the result. The problems originated upstream, and failures blew up much later. Not their fault, and not their concern. The result, inevitably, was Usability Debt that someone was going to pay downstream.
Through nearly 40 years in this business, I’ve found usability to be a critical but chronically overlooked data quality dimension. Data can be accurate, valid, and still utterly unusable. It’s never been more important.
What’s really surprising are the side effects. For one thing, data teams can undercut themselves. It’s harder to create effective logical DQ checks when representation is faulty or inconsistent — for exactly the same reason that it’s harder on data consumers later.
But there are unpredictable consequences too. You can’t trace a complex data ecosystem like code. Any enterprise has autonomous agents with their own incentives and goals, unpredictable external changes and unexpected feedback loops. Lapses can have surprising indirect impacts, and that’s exactly what I found.
Consider these emergent effects, reported by stakeholders, that arose not just from the strategy itself, but from analysts’ natural responses. It’s crucial to understand the larger impact of data usability issues on an organization, even if any one issue may seem trivial.
- Analysts felt frustrated and helpless, despite the company’s emphasis on empowerment through self-service. At best, they lost valuable time and productivity. At worst, they were blindsided by data idiosyncrasies. It didn’t always occur to them that these problems could or should have been resolved upstream. They just knew that “data wrangling” was a slog and a minefield.
- Analysts and end-users lost trust — not just in the specific datasets where quality issues existed, but in good datasets, well-managed processes, and the data team’s genuine successes. Bitten enough times, users feared the unknown-unknowns, whether any were lurking or not.
- Analysts with expertise in workarounds outperformed those who excelled in their own specialties. So skilled data scientists couldn’t innovate as quickly, while “old-hands” — using old approaches — remained irreplaceable.
- Facing this information imbalance, employees felt incentivized to hoard knowledge, not share it. Guess what — the problems grew worse.
- Insider tips were shared in informal networks, since official information was scarce. One young analyst complained to me that “who you know” had become more important than “what you know” on the data team.
- Onboarding was slow and fraught with challenges. New employees took much longer to contribute and didn’t get a fair chance to succeed.
- Outside consultants and vendors couldn’t jump in and add value. The company lost access to invaluable expertise. High-visibility projects were hobbled by flurries of small failures.
- The “old-hands” got their job security, but at the cost of work-life balance, complaining they were continually called into crises during evening hours and vacations.
- Savvy data scientists and analysts lost valuable time, drawn inevitably into upstream operational troubleshooting to protect their deliverables. This frustrated them and ultimately drove up turnover.
- Workaround logic was fragmented within deliverables and documents: Tableau formulas, spreadsheets, Jupyter Notebooks. Business rule changes became impossible to implement without breaking something. Data corrections themselves became impossible to make, for fear of breaking downstream workarounds.
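The alternative implied by that last point is to give cleanup logic one shared, versioned home, so a business-rule change lands in exactly one place instead of in every Tableau formula, spreadsheet, and notebook. A minimal sketch, with hypothetical names and rules:

```python
# shared_cleaning.py -- one home for rules that would otherwise be
# re-implemented in Tableau formulas, spreadsheets, and notebooks.
# Hypothetical example; the aliases and rule are purely illustrative.

REGION_ALIASES = {
    "n.a.": "North America",
    "na": "North America",
    "emea": "EMEA",
    "apac": "APAC",
}

def clean_region(value):
    """Map the many observed spellings of a region onto one canonical form."""
    if value is None:
        return None
    key = value.strip().casefold()
    return REGION_ALIASES.get(key, value.strip())
```

When the business rule changes, it changes here once, and every deliverable that imports `clean_region` picks it up — instead of the same fix being re-discovered, or missed, in a dozen scattered workarounds.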
The funny thing is that this was in many ways a fantastic company. Every single person I spoke with was committed to excellence. Stakeholders felt safe enough to do candid self-assessments, a credit to them and their corporate culture. In fact, everyone was doing their job as the organization defined it. I’ve seen so many root-cause analyses become a blame game. In this case, there was nobody to blame. But all the problems had a feeling of inevitability about them, like a natural disaster.
You might say the data team should have risen to the occasion and taken responsibility. But actually, their choices were understandable and rational. It wasn’t their job to master or manage the business meaning of the data. Their primary role was to make sure that the outputs they produced accurately reflected the inputs they received. And no one had built the time, effort, or budget into project plans or operations to come to terms with these gaps.
The feedback loop that might have held them to account didn’t really exist. Downstream analysts often see data cleansing and standardization as their own burden. They literally don’t know what they’re missing, and nothing changes. From the company’s perspective, the result was hidden bottlenecks and hidden costs. And if all these costs are misattributed or ignored, there’s no way for management to track or address them.
The worse it got, the more painful it was to admit the problem at all. Bad data can be corrected. But the feedback does catch up with you, and when it does, it hits hard. The loss of trust in a company’s data products becomes a catastrophic issue. There are always more people ready to cast blame than catalyze prevention. And as every chef knows: you may not have caused the rot, but it’s on you if you didn’t smell the fish.
US auto manufacturers didn’t take quality seriously until foreign competitors took full advantage of their complacency. Today in manufacturing it’s a truism that the cost of prevention can be 100 to 1,000 times less than the cost of a crisis. Funny how, even in manufacturing companies, data work is still artisanal. Standardizing how data is represented sounds as trivial as standardizing the size of bolts and screws. Maybe it’s time to realize how crucial it is.
For minimal cost, a tool like DataKitchen’s TestGen can define and enforce usability standards across a wide range of data deliverables. Or you can convince yourself it’s somebody else’s problem.
But be warned. A data team that relies on downstream workarounds to handle data quality issues is causing more harm than it realizes. Dirty data carries a high cost, and usability is no luxury. Leaders may assume everything is OK because a system like this keeps sputtering along; in truth, they’re missing a vital opportunity to improve their organization along with their results.
