DataKitchen’s Data Quality TestGen Found 18 Potential Data Quality Issues In A Few Minutes!

Imagine a free tool that you can point at any dataset and find actionable data quality issues immediately! I took DataKitchen's Data Quality TestGen for a test drive on ~600k rows of Boston City data and found 18 data quality hygiene issues in a few minutes.

DataKitchen’s Data Quality Testgen Found 18 Potential Data Quality Issues In A Few Minutes (Including Install Time) On Data.Boston.Gov Building Permit Data!


Imagine a free tool that you can point at any dataset and find actionable data quality issues immediately!   It sure beats having your data consumers tell you about problems they find when you are trying to enjoy your weekend.  There is such a tool:  DataKitchen’s Data Quality TestGen.  I took TestGen for a test drive on ~600k rows of Boston City data and found 18 data quality hygiene issues in a few minutes.

First, I grabbed building permit data from Approved Building Permits from data.boston.gov and used Superset to load it into a Postgres DB.  Second, I installed TestGen, which took just a few minutes. Then, I started profiling, which completed in 3 minutes and 16 seconds on 646,899 records containing 23 columns (14,878,677 data points).  TestGen finds potential data quality issues during the profiling process!   Here is a summary of the data hygiene issues that TestGen found:

Try TestGen on your data to have your data suppliers improve their data and alert your users to potential issues.  Here are TestGen’s install instructions.

The one “Definite” finding was non-standard blank values in the comment column.  It turns out that in this data set, there were four ways to indicate no comment for the permit:  [NULL], “?”, “none”, and “tbd”.  You can see the frequencies from this screenshot:

What else did TestGen find?   It seems like the city should be known for a permit, so I took at the findings on the city column:

Good news:  Only 0.03% of the records had no city.  Bad news: 185 permits were missing the city.  This could be an issue for a quality circle or a data steward to fix.

Looking further at the city column and pressing the “Source Data” button for the “Similar Values Match When Standardized” Hygiene Issue, I see that some names have an initial capital and others are all uppercase.  Are two systems feeding this dataset?  There are a few slashes that worked their way into the data, too: 

Zooming in on some values:

A little more scrolling shows that a stray quote worked its way in there three times.

Zooming in on the values:

Switching to the profiling results on the city column and looking at the “Value Analysis” section, I observed that while the Maximum Text looked good, “WEST ROXBURY”, I did not recognize the city with the Minimum Text, “/”. 

A quick query revealed that there were 8 records with a city of “/”

select * from bostonpermits where length(city) < 5;

The applicant field had similar issues:

Shouldn’t every permit have an applicant?  1.47% blank is low, but that is still 9,365 permits with no applicant.  Also, if one wanted to analyze applicants, the field would benefit from some standardization.

Zooming in on the values:

Twelve other data hygiene issues were flagged and are not discussed in this article. You can try this experiment on your computer and see what you think. Better yet, try it on your own data.

My next steps were to generate data quality tests (this took about 2 seconds).  These tests can be run in production to quickly catch errors on new batches of Boston permit data.  To benchmark the testing operation, I ran the tests on the existing dataset, which took 23 seconds. 

Here was my setup:

  • Data Quality Tool:  DataKitchen DataOps TestGen v2.7.0 open-source edition
  • Superset running in docker
  • Postres SQL running in docker
  • SQL Editor:  DBeaver
  • Hardware:  MacBook Pro, Processor 2.2 GHz 6-Core Intel Core i7, Memory 16 GB

Software cost: $0



I invite you to try TestGen on your data to have your data suppliers improve their data and alert your users to potential issues.  Here are TestGen’s install instructions.

 

Sign-Up for our Newsletter

Get the latest straight into your inbox

Open Source Data Observability Software

DataOps Observability: Monitor every Data Journey in an enterprise, from source to customer value, and find errors fast! [Open Source, Enterprise]

DataOps Data Quality TestGen: Simple, Fast Data Quality Test Generation and Execution. Trust, but verify your data! [Open Source, Enterprise]

DataOps Software

DataOps Automation: Orchestrate and automate your data toolchain to deliver insight with few errors and a high rate of change. [Enterprise]

recipes for dataops success

DataKitchen Consulting Services


Assessments

Identify obstacles to remove and opportunities to grow

DataOps Consulting, Coaching, and Transformation

Deliver faster and eliminate errors

DataOps Training

Educate, align, and mobilize

Commercial Data & Analytics Platform for Pharma

Get trusted data and fast changes to create a single source of truth

 

dataops-cookbook-download

DataOps Learning and Background Resources


DataOps Journey FAQ
DataOps Observability basics
Data Journey Manifesto
Why it matters!
DataOps FAQ
All the basics of DataOps
DataOps 101 Training
Get certified in DataOps
Maturity Model Assessment
Assess your DataOps Readiness
DataOps Manifesto
Thirty thousand signatures can't be wrong!

 

DataKitchen Basics


About DataKitchen

All the basics on DataKitchen

DataKitchen Team

Who we are; Why we are the DataOps experts

Careers

Come join us!

Contact

How to connect with DataKitchen

 

DataKitchen News


Newsroom

Hear the latest from DataKitchen

Events

See DataKitchen live!

Partners

See how partners are using our Products

 

Monitor every Data Journey in an enterprise, from source to customer value, in development and production.

Simple, Fast Data Quality Test Generation and Execution. Your Data Journey starts with verifying that you can trust your data.

Orchestrate and automate your data toolchain to deliver insight with few errors and a high rate of change.