The Syntax, Semantics, and Pragmatics Gap in Data Quality Validation Testing

What is the full range of data quality validation tests for data at rest and data in use? Linguistics provides an organizing principle: syntax, semantics, and pragmatics.


Data teams often have too many things on their ‘to-do’ list. Customers are asking for new data, people need questions answered, and the tech stack is barely running, so data engineers don’t have time to create tests. They have a backlog full of new customer features or data requests, and they go to work every day knowing that they won’t and can’t meet customer expectations.

To understand the time crunch that data engineers face and how it prevents them from testing their data sufficiently, we can think about three categories of data tests, using linguistics as an analogy: data syntax vs. semantics vs. pragmatics. In linguistics, syntax is the study of sentence structure and grammar rules. While people can do what they want with language (and many often do), syntax helps ordinary language users understand how to organize words to make the most sense.

On the other hand, semantics is the study of the meaning of sentences. The sentence “Colorless green ideas sleep furiously” makes syntactic sense but is meaningless. Pragmatics takes semantics one step further because it studies the meaning of sentences within a particular context.

Let’s apply this idea from linguistics to data engineering challenges.

Syntax-Based Profiling and Testing. By profiling the columns of data in a table, you can look at the values in a column to understand and craft rules about what is allowed in that column. For instance, if a column is filled with US zip codes, a row should not have the word “bacon” in it. Data engineers need a way to profile data quickly and automatically, casting a wide net to catch data anomalies. It’s analogous to setting up a burglar alarm in your home by deploying sensors at all possible entrances to catch a burglar who may only try one window. Automatically creating tests from profile data allows teams to maintain maximum sensitivity to real problems while minimizing false positives that are not worth the follow-up.
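To make the zip code example concrete, here is a minimal sketch of a syntax-level check in plain Python. It is not how DataOps TestGen works internally; the function names, the “shape” summary, and the zip code pattern are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed pattern for illustration: 5-digit US zip code with optional +4 suffix.
US_ZIP = re.compile(r"\d{5}(-\d{4})?")


def profile_column(values):
    """Summarize a column: row count, null count, and character 'shapes'."""
    shapes = Counter()
    nulls = 0
    for value in values:
        if value is None or value == "":
            nulls += 1
            continue
        # Reduce each value to a shape: digits become 9, letters become A.
        shape = re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(value)))
        shapes[shape] += 1
    return {"rows": len(values), "nulls": nulls, "shapes": shapes}


def find_zip_violations(values):
    """Return the non-null values that do not look like a US zip code."""
    return [
        v for v in values
        if v not in (None, "") and not US_ZIP.fullmatch(str(v))
    ]


zips = ["02139", "10001-1234", "bacon", "94105", None]
profile = profile_column(zips)
violations = find_zip_violations(zips)
print(profile["shapes"].most_common())
print(violations)
```

The profile shows which shapes dominate the column, and a rule derived from the dominant shape flags the stray “bacon” row without anyone having to write that test by hand.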

Semantics-Based Business Rule Testing. What is a meaningful test for your business? Do you know as a data engineer? For example, you can compare current data to previous or expected values. These tests rely upon historical values as a reference to determine whether data values are reasonable (or within the range of reasonable). For example, a test can check the top fifty customers or suppliers. Did their values unexpectedly or unreasonably go up or down relative to historical values? What is the acceptable rate of change? 10%? 50%? Data engineers are only sometimes able to make these business judgments. They must thus rely on data stewards or business customers to ‘fill in the blank’ on various data testing rules.
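A rate-of-change rule like this can be sketched in a few lines of Python. The function name, the customer names, and the figures below are all hypothetical; the point is that the threshold is the ‘blank’ a data steward fills in, while the comparison logic stays the same.

```python
def check_rate_of_change(current, previous, max_change=0.10):
    """Return entities whose value moved more than max_change vs. the prior period.

    max_change is the business-supplied threshold (0.10 means 10%).
    """
    failures = {}
    for name, prev in previous.items():
        cur = current.get(name)
        if cur is None:
            failures[name] = None  # entity disappeared from the current data
            continue
        if prev == 0:
            continue  # relative change is undefined; skipped in this sketch
        change = (cur - prev) / prev
        if abs(change) > max_change:
            failures[name] = round(change, 4)
    return failures


# Hypothetical totals for three customers in two consecutive periods.
previous = {"Acme": 1000.0, "Globex": 500.0, "Initech": 200.0}
current = {"Acme": 1050.0, "Globex": 800.0, "Initech": 190.0}

failures = check_rate_of_change(current, previous, max_change=0.10)
print(failures)
```

Whether a 10% or a 50% swing is alarming depends on the business, which is why the threshold is a parameter rather than a constant in the code.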

Pragmatics-Based Custom Testing. Many companies have widely diverging business units. For example, a pharmaceutical company may be organized into Research and Development (R&D), Manufacturing, Marketing and Sales, Supply Chain and Logistics, Human Resources (HR), and Finance and Accounting. Each unit will have unique data sets with specific data quality test requirements. Drug discovery data is so different from manufacturing data that data test cases require unique domain knowledge or a specific, pragmatic business context based on each group’s unique data and situation.
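One way to picture context-dependent tests is a small registry that maps each business unit to its own hand-written checks. This is only a sketch under assumed names: the domains, column names, and rules below are invented for illustration and do not come from any real data set or product.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
DomainTest = Callable[[List[Row]], bool]


def manufacturing_yield_in_range(rows: List[Row]) -> bool:
    # Hypothetical rule: batch yield is a fraction between 0 and 1.
    return all(0.0 <= r["yield"] <= 1.0 for r in rows)


def sales_forecast_not_negative(rows: List[Row]) -> bool:
    # Hypothetical rule: forecast unit counts can never be negative.
    return all(r["forecast_units"] >= 0 for r in rows)


# Each business unit registers the checks that make sense in its context.
DOMAIN_TESTS: Dict[str, List[DomainTest]] = {
    "manufacturing": [manufacturing_yield_in_range],
    "sales": [sales_forecast_not_negative],
}


def run_domain_tests(domain: str, rows: List[Row]) -> Dict[str, bool]:
    """Run every test registered for a domain; map test name to pass/fail."""
    return {test.__name__: test(rows) for test in DOMAIN_TESTS.get(domain, [])}


batches = [{"yield": 0.93}, {"yield": 1.4}]
results = run_domain_tests("manufacturing", batches)
print(results)
```

The same rows could pass in one unit and fail in another, because each rule encodes knowledge that only that unit has.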

How to Do Data Syntax-Based Testing: DataOps TestGen

DataOps TestGen’s first step is to profile data and produce a precise understanding of every table and column. It looks at 51 different data characteristics that have proven critical to developing practical data tests, regardless of the data domain. TestGen then performs 13 ‘Bad Data’ detection tests, providing early warnings about data quality issues, identifying outlier data, and ensuring data are of the highest quality.

One of the standout features of DataOps TestGen is the power to auto-generate data tests. With a library of 28 distinct tests automatically generated based on profiling data, TestGen simplifies the testing process and saves valuable time. These tests require minimal or no configuration, taking the heavy lifting out of your hands, so you can focus on what matters: extracting insights from your data.

How to Do Data Semantics-Based Testing: DataOps TestGen

TestGen also offers 11 business rule data tests that, with minimal configuration, can be used for more customized tests. These tests allow users to customize testing protocols to fit specific business requirements with a “fill in the blank” model, combining speed and robustness in data testing. They ensure your data not only meets general quality standards but also aligns with your unique business needs and rules. Data stewards, who may know more about the business than a data engineer, can quickly change a setting to adjust the parameters of a data test, without coding.

How to Do Data Pragmatics-Based Testing: DataOps Automation

Every company is unique. Every company has complex data and tools within each team, product, and division. Continually monitoring and testing the data to ensure its reliability is crucial. It is equally important to build domain-specific data validation tests that, for example, check the results of data models for accuracy and relevance and evaluate the effectiveness of data visualizations. Checking the result of a model, an API call, or data-in-use in a specific analysis tool is critical to ensure that data delivery mechanisms are operating optimally. DataOps Automation provides robust testing and evaluation processes throughout the ‘last mile’ of the Data Journey.

Conclusion

DataKitchen’s DataOps Observability product enables this Data Journey monitoring and alerting. DataOps Observability is designed to extract status details and test results, including those produced by DataOps TestGen and Automation, from every Data Journey quickly and with no changes to current jobs and processes. It compares them to expectations and alerts when variances exist, allowing data teams to use and share the data test results that close the syntax, semantics, and pragmatics gap in data quality validation testing.
