Add DataOps Tests to Deploy with Confidence

DataOps is the art and science of automating the end-to-end life cycle of data-analytics to improve agility, productivity and reduce errors to virtually zero. The foundation of DataOps is orchestrating three pipelines: data operations, analytics development/deployment and sandbox environments, which enable parallel development and the seamless integration of new analytics into production. 

Automation enables the data team to increase its productivity by order(s) of magnitude, but velocity can’t be improved if the orchestrated pipelines execute with errors. Integrating tests into the data-analytics pipelines provides the quality control needed to unleash the power of DataOps automation. We’ve recently written about data production pipeline tests and provided a tutorial on how simple it is to add data tests using the DataKitchen Platform UI. Now we tackle the concept of testing in the context of new analytics development and continuous integration.

Lessons from Software Development

Data organizations know they need to be more agile. In order to succeed, they need to learn lessons from software development organizations. In the 1990’s software teams adopted Agile Development and committed to producing value in rapid, successive increments. If you think a little about process bottlenecks in software development, Agile moved the bottleneck from the development team to the release team. For example, if the quality assurance (QA), regression and deployment procedures require three months, it hardly makes sense to operate with weekly dev sprints.

The solution is DevOps which automates the release process. Once an automated workflow is in place, releases can occur in minutes or seconds. Companies like Amazon perform code updates every 11.6 seconds. That’s millions of releases per year. Data analytics teams can achieve the same velocity by adopting DataOps which incorporates the DevOps methodology (along with other techniques that are specific to data analytics).

In DataOps, automated testing is built into the release and deployment workflow. The image below is the Innovation Pipeline which represents the flow of new ideas into development and then deployment. Testing proves that analytics code is ready to be promoted to production.
Test before promoting to production

Figure 1: Tests should be run in development before deploying new analytics to production.

Benefits of Testing

Testing produces many benefits. It increases innovation by reducing the amount of unplanned work that derails productivity. Tests catch errors that undermine trust in data and in the data team. This reduces stress and makes for happier data engineers, scientists and analysts. 

As a group, technical professionals tend to dislike making mistakes, especially when the errors are splashed across analytics that are used by the enterprise’s decision makers. Testing catches embarrassing and potentially costly mistakes so they can be quickly addressed.

Incorporate Testing into Each Step

Testing enables the data team to deploy with confidence. Testing must be incorporated into each processing stage of an analytics pipeline. Every processing or transformation step should include tests that check inputs, outputs and evaluate results against business logic.

Below are some example tests:

  • The number of customers should always be above a certain threshold value
  • The number of customers is not decreasing
  • The zip code for pharmacies has five digits.

SQL query

Figure 2: Every processing or transformation step should include tests that check inputs, outputs and evaluate results against business logic.

 

The image below shows the many different kinds of tests that should be performed. We explain each of these types of tests in our recent blog on impact view.Types of tests

Figure 3: A broad set of tests can validate that the analytics work and fit into the overall system.

Tests Lighten Bureaucracy

When updates create outages, some enterprises react by instituting onerous procedures for releasing changes. For example:

  1. Add the new development task to the project plan. Wait for approval.
  2. Request access to a new data set and wait for IT to grant.
  3. Write a functional specification and submit the proposed change to the Impact Review Board (IRB). Wait for approval.
  4. Implementation and unit testing
  5. Manual system QA based off a checklist 
  6. Addition of the new feature to the production deployment calendar
  7. Hold your breath and deploy

The above procedure could take months to complete. One of the bottlenecks is impact review. The Impact Review Board (IRB) is a meeting of the company’s senior data experts. They review all changes to data operations. How does a change in one area affect other areas? Naturally, the company’s most senior technical contributors are extremely busy and subject to high-priority interruptions. Sometimes it takes weeks to gather them in one room. It can take additional weeks or months for them to approve a change. Note that the purpose of an IRB is to slow down and control change. It diminishes the ability of individuals to work in parallel. 

Let’s contrast the above heavy-weight process with a lightweight DataOps development and deployment process:

  1. Data scientist spins up a sandbox virtual environment with test data and a clone of the data operations pipeline
  2. Implementation and testing
  3. Run impact review tests
  4. Push to automated deployment

Implementation time (step 2) certainly varies, but the rest of this process could take only hours or days. The collective knowledge of the impact review board is represented by the impact review test suite which is run by any data analyst at any time. It’s like having a personal, on-demand IRB. With automated testing and deployment, a new employee who is becoming familiar with the enterprise’s data architecture can deploy new analytics just as easily as a seasoned veteran. The least experienced member of the team deploys like the most experienced member. 

DataOps shortens the cycle time of a single development and deployment, but it has also removed several bottlenecks that diminished overall team productivity. Development of analytics is now parallelized as data scientists code, test, and deploy independently.

Testing Breadth

Some data professionals perform unit tests and stop there. If they are adding 200 new lines of SQL, they test those 200 lines separate from the system. However, experience shows that a full breadth of tests are needed. For example, performance tests should verify that the analytics meet performance targets under actual system loading. End-to-end tests validate that a change to the system doesn’t create unwanted side effects and confirm that new or updated analytics fit into the production analytics pipeline. Confidently deploying to production encourages autonomy and ultimately improves velocity by lessening bureaucracy and the need for manual review. 

Test Metrics

Test metrics can help determine whether test coverage is adequate. The diagram below shows the number of tests in each node of execution in an analytics pipeline. As a starting point, each node should have multiple tests. The number of tests should reflect the complexity of the system. Tests should cover all nodes and data sets, but should not overwhelm your processing resources.

Automated tests embedded in recipes

Figure 4: This graph shows how many tests are being run at each Node in the analytics pipeline.

 

Metrics can also track the overall number of tests and the tests per execution Node in the end-to-end system. In the diagram below, we see that even though the overall number of tests is increasing, the average number of tests per node has drifted downward. This feedback may encourage the team to slightly increase its test production.

 

graph test coverage

Figure 5: Graphs and metrics help determine whether the analytics are being adequately tested.

Testing Code and Data

Tests in the development workflow focus on verifying code correctness. In development (the Innovation Pipeline) the developer usually works with a fixed data set. Code changes as the data analyst makes changes, but data remains the same. The table below draws the distinction between the Innovation Pipeline (with fixed data and variable code) and the Value Pipeline (data operations) which has fixed code and variable data flowing through analytics transformations. Testing is needed in both the Innovation and Value Pipelines.

Code Data Fixed Variable

Figure 6: In the Value Pipeline (data operations) code is fixed and data flows continuously. In the Innovation Pipeline (development), test data is kept constant while code changes.

 

Tests written by the data analyst or data scientist fulfill a dual role. Most of the tests that are written to validate code during the analytics development phase are promoted into production with the analytics to verify and validate the operation of the data pipeline.

Tests dual role

Figure 7: Most tests written for the Innovation Pipeline to verify code can be used in the Value Pipeline to test data.

 

Testing Enables DataOps

Automated testing is one of the critical elements that enable development teams to push their work into production without having to engage in lengthy manual QA. In production, tests ensure that data flowing through analytics is error free and that changes to data sources or databases do not break analytics. The time spent on testing is multiplied many times in terms of increased productivity. Tests enable automated deployment which saves a lot of manual effort. Tests also monitor the data pipeline which reduces unplanned downtime. Testing is a cornerstone of DataOps which brings these benefits and many others to data teams.

To learn more about the benefits of testing in the Production (aka Value) pipeline, read our blog, Add DataOps Tests for Error-Free Analytics.

Sign-Up for our Newsletter

Get the latest straight into your inbox

Monitor every Data Journey in an enterprise, from source to customer value, in development and production.

Simple, Fast Data Quality Test Generation and Execution. Your Data Journey starts with verifying that you can trust your data.

Orchestrate and automate your data toolchain to deliver insight with few errors and a high rate of change.

recipes for dataops success

DataKitchen Consulting Services


Assessments

Identify obstacles to remove and opportunities to grow

DataOps Consulting, Coaching, and Transformation

Deliver faster and eliminate errors

DataOps Training

Educate, align, and mobilize

Commercial Pharma Agile Data Warehouse

Get trusted data and fast changes from your warehouse

 

dataops-cookbook-download

DataOps Learning and Background Resources


DataOps Journey FAQ
DataOps Observability basics
Data Journey Manifesto
Why it matters!
DataOps FAQ
All the basics of DataOps
DataOps 101 Training
Get certified in DataOps
Maturity Model Assessment
Assess your DataOps Readiness
DataOps Manifesto
Thirty thousand signatures can't be wrong!

 

DataKitchen Basics


About DataKitchen

All the basics on DataKitchen

DataKitchen Team

Who we are; Why we are the DataOps experts

Careers

Come join us!

Contact

How to connect with DataKitchen

 

DataKitchen News


Newsroom

Hear the latest from DataKitchen

Events

See DataKitchen live!

Partners

See how partners are using our Products