Writing DataOps Tests with the DataKitchen Platform

Tests identify data and code errors in analytics pipelines. Automated orchestration of tests is especially important in heterogeneous technical environments with streaming data. The DataKitchen Platform makes it easy to write tests that check and filter data across the end-to-end data pipeline.

Nodes, Recipes, Orders, and Variations

Our discussion of testing will flow more easily if we define some terminology upfront. The data pipeline is best conceptualized as a directed acyclic graph (DAG), as shown in the diagram below. Here are some terms that are helpful to DataKitchen users:

  • Node – Each Node in the DAG represents a processing or transformation operation. A pipeline Node may contain substeps and tests. A DataOps Platform incorporates complex toolchains into a coherent work environment. When the graph spans a heterogeneous technical environment, each Node includes the appropriate tooling-specific steps.
  • Recipe – defines a collection of one or more Nodes for automated orchestration
  • Order – an orchestrated execution of a Recipe
  • OrderRun – a specific instance of an Order execution
  • Recipe-Variation (or just Variation) – a Recipe that executes based on parameters. Parameters include graph definition, Order schedules, runtime resource configuration, tooling-instance connections, source-data instances, and more.


Figure 1: The data pipeline illustrated as a directed acyclic graph.

In DataOps, automated testing assures quality and verifies the absence of errors. The data team writes tests to validate each Node's inputs and outputs. Using the DataKitchen Platform, tests can be defined to execute as part of data pipeline orchestration. Tests span a wide range of complexity, from simple metrics, like checking row counts, to evaluating results using statistical controls or sophisticated business logic. Tests may also be configured to take action. For example, a failed test may stop an OrderRun in place, transmit an alert, or simply log results.

This post discusses the step-by-step approach to writing a Node input test using the DataKitchen Platform.

Test to Verify Record Count

Step 1: Define the runtime variable

The CSV file 'global superstore' is ingested into a database using an SQL script. The last line in the SQL script creates a variable that stores the number of records in the data table. We would like to add a test that verifies that the record count is above a certain threshold.
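As a rough sketch of what this step amounts to, the snippet below loads a few rows into a table named global_superstore using Python's sqlite3 module and runs the same kind of final count query. The schema and sample rows are illustrative assumptions, not the actual Global Superstore data; in the Platform, the count comes from the last statement of the Node's SQL script.

```python
import sqlite3

# Illustrative stand-in for the ingested table (the real pipeline loads
# the CSV via a SQL script; this schema is a hypothetical simplification).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE global_superstore (order_id TEXT, order_date TEXT)")
rows = [
    ("CA-0001", "2021-05-03"),
    ("CA-0002", "2021-05-04"),
    ("CA-0003", "2021-05-10"),
]
conn.executemany("INSERT INTO global_superstore VALUES (?, ?)", rows)

# The SQL script's last line is essentially this count query; its scalar
# result becomes the value of the runtime variable.
result_global_superstore = conn.execute(
    "SELECT COUNT(*) FROM global_superstore"
).fetchone()[0]
print(result_global_superstore)  # 3
```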

The screenshot below shows the DataKitchen UI for creating a test of a Variation. Since the count is a scalar value, we select the result type as scalar value and define a runtime variable result_global_superstore.

Figure 2: The DataKitchen UI provides the user with a simple way to create tests of Variations.

Step 2: Select the Test tab and add a test

Click on the Test tab at the top of the screen and select +Add test.

Figure 3: The user adds a test of a Variation using a simple UI.

The system creates a default 'test1'. We specify the details of the test on the right-hand side of the screen.

Step 3: Select the test variable, comparison, and control value

We change the test name to test_global_superstore, a more meaningful name. The description field can be used to explain what the test checks and why it is needed. The Failure Action field determines what happens if your Node test fails. Here are the standard options:

  • Stop: The OrderRun will be stopped, and subsequent Nodes will not be executed.
  • Warn: The Order will continue to run, but a warning message will be displayed.
  • Log: The results will be logged whether the test passes or not.

Let us select the Warn option for this test.
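The difference between the three failure actions can be sketched as a small dispatch function. This is purely an illustration of the behavior described above, not DataKitchen's implementation:

```python
# Illustrative sketch (not DataKitchen's code): how the three standard
# failure actions behave when a test evaluates to False.
def apply_failure_action(test_passed: bool, action: str) -> str:
    if action == "Log":
        # Log records the result whether the test passes or not,
        # and never halts the OrderRun.
        return "logged"
    if test_passed:
        return "passed"
    if action == "Stop":
        # Stop halts the OrderRun; subsequent Nodes are not executed.
        return "stopped"
    if action == "Warn":
        # Warn lets the Order continue but surfaces a warning message.
        return "warned"
    raise ValueError(f"unknown failure action: {action}")

print(apply_failure_action(False, "Warn"))  # warned
print(apply_failure_action(False, "Stop"))  # stopped
```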

The Test Logic section defines the test condition:

  • 'Compare Variable against Metric' uses the UI to build the test.
  • 'Specify Python Expression' requires a Python expression.

For this post, we create a simple test to evaluate the table row count using the option Compare Variable against Metric.

Click on the 'Test Variable' field to view a list of available variables. Select result_global_superstore, which was defined earlier. Suppose that we wish to verify that the value is greater than ten. In a real-world application, this might be a historical balance test; for example, the number of YTD sales orders should never decline.

Figure 4: The test variable name, type, and comparison value are defined using the DataKitchen UI.

In the Type Conversion field, select 'integer' from the drop-down list to assign result_global_superstore a type.
Figure 5: Test variables can be assigned to one of several basic data types.

In the comparison field, select '>' from the drop-down list and set the 'Control Value' to ten. Click Update.
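Put together, the configured test amounts to a type conversion followed by a comparison against the control value. The sketch below mirrors the UI fields in plain Python; the names are labels for the UI concepts, not a real API, and the value of 19 is the record count reported in the test results for this table.

```python
# Sketch of the test as configured in the UI (names mirror the UI
# fields; this is not DataKitchen's API).
test_variable = "19"   # scalar result captured from the SQL script
control_value = 10     # 'Control Value' set in the UI

converted = int(test_variable)           # Type Conversion: integer
test_passed = converted > control_value  # Comparison: '>'
print(test_passed)  # True
```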

Figure 6: The test compares the variable result_global_superstore to a control value.

Step 4: Run Variation and check test results

Click the Run Variation option in the DataKitchen UI. Then click the Run button in the Run Variation overlay.

Figure 7: Executing the Variation using the DataKitchen UI.

Confirm the Variation execution by clicking the Run button.

Figure 8: The Run Variation dialog box.

After running the Variation, view Order status by clicking on the 'Order ID' link to open the Orders screen.

Figure 9: After initiating an OrderRun, navigate to the Orders screen using the link shown.

On the Orders screen, click the Refresh Data From Server button if needed to display the OrderRun ID for the Order. In this case, our Order has completed running, and the Order status is Order Complete.

Figure 10: The Orders table shows that the OrderRun is complete.

Click the linked OrderRun ID on the Orders screen to advance to the OrderRun Details screen, which shows the Recipe map. The color of each Node reflects its status:

  • Blue – Order is still active
  • Red – Error in the OrderRun
  • Green – OrderRun is complete

Figure 11: The OrderRun containing one Node Global_Superstore has completed.

Scroll below the graph and expand the Test Results section. In the Test Results, we can see that the test did not fail, did not produce a warning, and did not record any messages in the log. The Tests: Passed section shows one test for the Node 'Global_Superstore', named 'test_global_superstore', and the values from that test. This is the test that we defined above. Nineteen records were counted in the data table, which is indeed greater than our 'Control Value' of ten.

Figure 12: Test results for the OrderRun of Node Global_Superstore

Test to Check the Latest Order Week

Let's take a look at another example test. The test in this section verifies that we are using the latest version of the data. If we are using stale data, the difference between the date of the most recent sales order and today's date will be more than one week. In the screenshot below, we can see that the last line of the SQL script calculates the difference in weeks between the current system date and the maximum (latest) sales order date. The resultant value is assigned to the runtime variable result_global_superstore_latest_week. If the data is current, the value in result_global_superstore_latest_week will be '-1'. If the data hasn't been updated this week, the calculated value will be '-2'.
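One way to reproduce the week-difference convention described above is floor division of the signed day difference by seven. This is a hedged sketch of the SQL calculation with illustrative dates, not the script itself: an order within the past week yields -1, and data that is one week stale yields -2.

```python
from datetime import date, timedelta

# Hypothetical reconstruction of the week-difference calculation
# (the real pipeline computes this in SQL). Floor division of the
# signed day difference keeps past dates negative.
def latest_order_week_delta(latest_order: date, today: date) -> int:
    return (latest_order - today).days // 7

today = date(2021, 5, 17)  # illustrative "current system date"
print(latest_order_week_delta(today - timedelta(days=3), today))   # -1 (current)
print(latest_order_week_delta(today - timedelta(days=10), today))  # -2 (stale)
```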

Figure 13: Defining a runtime variable for testing the age of the data table.

In this step we select the test variable result_global_superstore_latest_week created in the previous step and select 'integer' under the 'Type Conversion' section. The comparison operator is '==' (test if one value equals another), and the 'Control Value' is '-1'. In other words, the test verifies that result_global_superstore_latest_week is equal to '-1'. If it is any other value, the test will fail. Next, we update the changes and run the Variation.

Figure 14: Testing if result_global_superstore_latest_week is equal to '-1'.

Click the linked OrderRun ID on the Orders screen to advance to the OrderRun Details screen.

Figure 15: The Order results table

Scroll below the graph and expand the Test Results section. The Tests: Passed section shows one test for the Node 'Global_Superstore', named 'test_global_superstore_latest_week', and the values in that test. We can see that the test passed because result_global_superstore_latest_week was equal to '-1'.

Figure 16: Test results after executing the Node Global_Superstore

What happens if we run this test on a data table with stale data? We update the raw data file by deleting the sales orders from the most recent week. The test should produce a warning. Without making any changes to the test that is already in place, let us run the Variation and see what happens.

Figure 17: The test correctly identifies stale data and, as configured, issues a warning.

Recall that we selected the 'Warn' option in the Failure Action field on the Test tab. In the test results section on the OrderRun Details page, we can see that the test did not fail or log results, but it did produce a warning. The Tests: Warning section shows one test for the Node 'Global_Superstore', named 'test_global_superstore_latest_week', and the values in that test. The value result_global_superstore_latest_week was equal to '-2', which correctly triggered the warning.

Conclusion

The DataKitchen Platform provides a straightforward way to write powerful tests that verify the validity of data. DataKitchen users interact with a simple UI that abstracts away the complexity of the heterogeneous toolchains characteristic of most data pipelines. The UI enables the user to easily specify test conditions, reporting, and conditional actions. Tests prevent erroneous or missing data from impacting the quality of end deliverables, which ultimately reduces the unplanned work that diminishes productivity.

To learn more about DataOps testing, visit our blog, Add DataOps Tests for Error-Free Analytics.
