Evaluating Machine Learning Models with MLOps and the DataKitchen Platform

Most projects naively assume that most of the time and resources will be spent in the “black box,” building the machine learning (ML) model, whereas a majority of the project time is actually needed in the green boxes – the ML system.

Data Science workflows traditionally follow the trajectory of the path shown in Figure 1. Most projects naively assume that most of the time and resources will be spent in the “black box,” building the machine learning (ML) model, whereas a majority of the project time is actually needed in the green boxes – the ML system. 


Figure 1: A traditional data science workflow often focuses exclusively on the model and neglects the system.


After building a model, it is typically deployed into production using slow, inflexible, disjointed, manual processes. Google classifies businesses practicing these rudimentary methods as MLOps Level 0.  Characteristics include:

DataOps methods, backed by the DataKitchen DataOps Platform, enable enterprises to address each of these Level 0 constraints. In machine learning contexts, DataOps practices are also known as MLOps or ModelOps. Continuous Delivery is one of the most challenging aspects of MLOps.  In this post, we’ll focus on the “Evaluate Model” step shown in Figure 1. 

In a traditional SDLC process, after a new chunk of code or a feature has been implemented, it undergoes DevOps automated testing. Tests defined in the CI/CD pipeline check if the code is deployable or not. A software application is tested once for each deployment of new code.

Machine learning systems differ from traditional software applications in that ML systems depend on data. A predictive ML model undergoes periodic retraining and redeployment as new data becomes available. Each time the model is updated, it must undergo testing before it is deployed. While a traditional software application could theoretically be deployed and forgotten, ML systems undergo a continuous cycle of periodic retraining, retesting and redeployment for as long as the model is in production. Evaluating model accuracy is very important in an ML continuous deployment. If an MLOps deployment determines that an updated model falls short of its target accuracy, then the orchestration can halt the deployment of the updated model or take some other corrective action. Designing an ML system that handles all of the contingencies of an ML continuous deployment may sound like a substantial undertaking, but it is quite simple when you use the DataKitchen DataOps Platform.


ML System Evaluation with the DataKitchen Platform

Imagine an enterprise that has deployed an ML system that forecasts sales of products. Figure 2 shows an end-to-end orchestrated pipeline which performs the following actions:

  • Transfers data files from an SFTP server to AWS S3.
  • Loads and validates the data from those files into a star schema in AWS Redshift.
  • Forecasts sales using the SARIMA time series model defined in the Jupyter Notebook inspired from this github repository.
  • Updates the data catalog.
  • Publishes detailed reports in Tableau server.


Figure 2: The graph represents an orchestrated ML system that forecasts product sales.


Suppose the enterprise wishes to update the sales forecasting models quarterly. The DataKitchen Platform makes it easy to evaluate the updated model by extracting test metrics from the ML_Sales_Forecasting step, as shown in Figure 2. The deployed system contains three instances of a time series model, which forecast sales for three product categories, i.e., Furniture, Technology and Office Supplies. When new quarterly data becomes available, the model instances are updated. We can configure the DataKitchen Platform orchestration to evaluate, log and take action based on the updated model’s performance.  


Exporting and Testing Metric Values

An Order is scheduled within the DataKitchen Platform to calculate the Root Mean Square Error (RMSE) after each batch of quarterly data becomes available. The RMSE metric enables DataKitchen orchestration to compare current model performance, using the most recent data, with historical results.



Figure 3: The DataKitchen Platform exports scalar values which are used in data pipeline tests.


The DataKitchen Platform provides the option of exporting scalar values of variables after a script has completed execution. In Figure 3, we export RMSE values for total sales, furniture, technology and office supply sales. In general, exported values can be numeric, text, boolean or date type.


Figure 4: The DataKitchen UI enables the user to define tests that run as part of the data pipeline.


After the model metrics have been exported, they can be used to formulate tests as shown in Figure 4, where we conduct a simple test to see if the exported RMSE value is lower than the control metric (5000) as specified in the control value field.

If this test fails, then we have the option to either “LOG” the value, send a “WARN” message or “STOP” the execution of the recipe (orchestrated sequence) to prevent it from moving on to the next step.


Model Governance with DataKitchen

The DataKitchen Platform stores the values of every exported test metric variable in a MongoDB backend. Historical order-run logging provides two benefits. First, it enables the Platform orchestration to natively compare a current variable value with a historical value. This feature, coupled with the test-failure control actions specified above, acts as an automated QA mechanism. It checks a model relative to its predecessor and prevents inferior models from being deployed to the production environment.

The second benefit relates to statistical process control (SPC) applied to exported metrics. The Platform provides native graphs of exported variable values over time. Example SPC graphs can be seen in Figure 5.:



Figure 5: The DataKitchen Platform natively displays statistical process control graphs.


The graphs provide a straightforward assessment of model performance. The Y-axes of the above graphs represent RMSE error metric values, whereas the X-axes represent the timestamp of each run (i.e., when new quarterly data was made available).  The green-colored circles represent the data points, whereas the line composed of small black circles represents the control metric value. If the RMSE exceeds the control value, a message or action is triggered.

Note that in the technology_rmse_metric graph, there was one instance on July 27 at 1:33:01 PM when the RMSE value crossed the predefined threshold of 5000, raising a warning message that can be visibly seen by the yellow triangular tick.

These visuals are provided natively within the DataKitchen Platform, but you also have the option to reproduce them in any visualization tool of your choice, for example, Tableau, which can be seen in Figure 6.:


Figure 6: The SPC data easily exports to 3rd party visualization tools.



The DataKitchen Platform makes it simple to apply statistical process control to machine learning models. We showed how a scheduled orchestration calculates and exports metrics that evaluate model performance over time. With the DataKitchen DataOps Platform, data scientists can easily implement an automated MLOps and Model Governance process, thereby ensuring model quality and delivering a more robust ML pipeline.

About the Author

Abhinav Tiwari
Abhinav Tiwari is Senior DataOps Implementation Engineer at DataKitchen who has worked extensively in the data analytics domain and is also interested in Machine Learning systems.

Sign-Up for our Newsletter

Get the latest straight into your inbox

Open Source Data Observability Software

DataOps Observability: Monitor every Data Journey in an enterprise, from source to customer value, and find errors fast! [Open Source, Enterprise]

DataOps TestGen: Simple, Fast Data Quality Test Generation and Execution. Trust, but verify your data! [Open Source, Enterprise]

DataOps Software

DataOps Automation: Orchestrate and automate your data toolchain to deliver insight with few errors and a high rate of change. [Enterprise]

recipes for dataops success

DataKitchen Consulting Services


Identify obstacles to remove and opportunities to grow

DataOps Consulting, Coaching, and Transformation

Deliver faster and eliminate errors

DataOps Training

Educate, align, and mobilize

Commercial Pharma Agile Data Warehouse

Get trusted data and fast changes from your warehouse



DataOps Learning and Background Resources

DataOps Journey FAQ
DataOps Observability basics
Data Journey Manifesto
Why it matters!
DataOps FAQ
All the basics of DataOps
DataOps 101 Training
Get certified in DataOps
Maturity Model Assessment
Assess your DataOps Readiness
DataOps Manifesto
Thirty thousand signatures can't be wrong!


DataKitchen Basics

About DataKitchen

All the basics on DataKitchen

DataKitchen Team

Who we are; Why we are the DataOps experts


Come join us!


How to connect with DataKitchen


DataKitchen News


Hear the latest from DataKitchen


See DataKitchen live!


See how partners are using our Products


Monitor every Data Journey in an enterprise, from source to customer value, in development and production.

Simple, Fast Data Quality Test Generation and Execution. Your Data Journey starts with verifying that you can trust your data.

Orchestrate and automate your data toolchain to deliver insight with few errors and a high rate of change.