← Back to Blog

DataOps is Not Just a DAG for Data

Written by DataKitchen Marketing Team on February 25, 2020

Orchestration
DataOps is Not Just a DAG for Data

We often hear people say that DataOps is just automating the data pipeline — using orchestration to execute directed acyclic graphs (DAGs). Enterprises may already use orchestration tools (Airflow, Control-M, etc.) and mistakenly conclude that they have DataOps covered. While it is true that automated orchestration is a critically important element of DataOps, the methodology is about much more than orchestration.

What is Orchestration?

Many data professionals spend the majority of their time on inefficient manual processes and fire drills. They could do amazing things, but not while held back by bureaucratic processes. An organization is not going to surpass competitors and overcome obstacles while its highly trained data scientists are hand editing CSV files. The data team can free itself from the inefficient and monotonous (albeit necessary) parts of the data job using an automation technology called orchestration.

Think of your analytics development and data operations workflows as a series of steps that can be represented by a set of directed acyclic graphs (DAGs). Each node in the DAG represents a step in your process. For example, data cleaning, ETL, running a model, or even provisioning cloud infrastructure. You may execute these steps manually today. An orchestration tool runs a sequence of steps for you under automated control. You need only push a button (or schedule a job) for the orchestration tool to progress through the series of steps that you have defined. Steps can be run serially, in parallel or conditionally.

Figure 1: The set of steps that produce analytics represented as a directed acyclic graph (DAG)

There are numerous data pipeline orchestration tools that manage processes like ingesting, cleaning, ETL, and publishing data. There are also DevOps tools that focus on orchestrating development activities like spinning up a development environment. For examples of these tools, check out Paolo Di Tommaso’s list of awesome data pipeline and build automation tools. DataOps doesn’t replace these tools; it works alongside them. The typical enterprise deploys analytics using numerous data platforms, tools, languages and workflows. DataOps unifies this cacophony into coherent pipelines. In DataOps, orchestration must automate both the data pipeline (“Value Pipeline”) and the analytics development pipeline (“Innovation Pipeline”). If you observe organizational workflows, staffing and the activities of self-service data users, you’ll likely see that these two major pipelines are actually comprised of numerous and myriad smaller pipelines flowing in every direction. However, to keep it simple, we’ll speak of the Value and Innovation Pipelines in broad terms.

Figure 2: DataOps must orchestrate both the Value and Innovation Pipelines – data operations and analytics development.

Value Pipeline Orchestration

The Value Pipeline extracts value from data. Data enters the pipeline and progresses through a series of stages – access, transforms, models, visualizations, and reports. When data exits the pipeline, in the form of useful analytics, value is created for the organization.

In most enterprises, the Value Pipeline is not one DAG; it is actually a DAG of DAGs. The figure below shows various groups within the organization delivering value to customers. Each group uses its preferred tools. The toolchain may include one or many orchestration tools. DataOps complements these underlying tools through meta-orchestration – orchestrating the DAG of DAGs. DataOps embraces and harmonizes the heterogeneous tools environment typical of most data organizations.

Figure 3: DataOps orchestrates the DAG of DAGs that is your data pipeline

Innovation Pipeline Orchestration

It’s not enough to only apply orchestration to the data operations pipeline. DataOps also orchestrates the Innovation Pipeline, which seeks to enhance and extend analytics by implementing new ideas that yield analytic insights. It is essentially the new analytics development processes and workflows. DataOps orchestrates the Innovation Pipeline based on the DevOps model of continuous deployment. As shown in the figure below, each team in the data organizations that produces analytics has its own workflow processes that reflect its unique structure. This includes self-service users who are both consumers and producers of analytics. Like the data pipeline, the Innovation Pipeline is not one DAG; it is a DAG of DAGs.

As mentioned previously, each DAG orchestration relies upon its preferred tools. DataOps meta-orchestration doesn’t impose any restrictions on the tools that may be chosen for any given task.

Figure 4: The development of new analytics by various teams and self-service users is a DAG of DAGs that flow into production.

Orchestration Alone is Not Enough

Automated orchestration is a key part of an overall DataOps initiative, but by itself, it can’t fully deliver the benefits of DataOps. For example, data operations could be fully automated, but without tests and process controls, data and code errors could propagate into analytics wreaking havoc. A data team that is constantly fighting fires is unable to achieve peak productivity. A comprehensive DataOps initiative requires several key methods and processes in addition to orchestration.

The End-to-End Data Life Cycle of DataOps

Even as we describe the unique requirements of enterprise data orchestration, it is important to keep a broad understanding of DataOps in mind. DataOps governs the end-to-end life cycle of data. Here are some key ingredients of DataOps:

In summary, DataOps governs how analytics are planned, developed, tested, deployed and maintained.

Figure 5: Data professionals access the capabilities of DataOps using a virtual environment called a “Kitchen.”

DataOps Kitchens

It might be challenging for a person in a non-DataOps enterprise or an organization that is only doing orchestration to understand how DataOps accomplishes the aims that we have described. DataOps is not a new tool that replaces your existing tools. It is a set of tools and methods that improve the way you use your existing tools. Many of the DataOps capabilities that we described above are localized in a new virtual construct that we call a “Kitchen.” In DataOps, Kitchens are the controlled and repeatable environments where data professionals do their work. Data professionals access the capabilities of DataOps using a Kitchen.

A Kitchen reflects the production technical environment and integrates and enables data testing, code quality control, process controls, version control, parallel development, environments, toolchains, component reuse, containers, conditional execution, vault security, workflow management and yes, orchestration of data operations and analytics-development pipelines. A Kitchen is a repeatable environment that promotes sharing among team members. Kitchens may be easiest to understand in the context of a data team working to produce new analytics on a tight schedule – see our user story here.

Kitchens Enable DataOps

Kitchens enable DataOps to coordinate tasks between and within teams. Kitchens are the virtual environments where all DataOps functionality comes together. Using Kitchens, DataOps delivers several important benefits:

DataOps accomplishes these goals by streamlining the end-to-end data life cycle. It provides a way for enterprises to cope with massive amounts of data, real-time application requirements, and organizational/workflow complexity. Orchestration plays a key role in DataOps, but it is only one tool of many available in DataOps Kitchens. DataOps provides the roadmap for data organizations to adapt their structure, tools, and methods to thrive in increasingly complex and competitive markets.

For more information on the DataKitchen Platform, see our whitepaper, Meeting the Challenges of Exponential Data Growth.

DataKitchen Marketing Team

The DataKitchen marketing team curates industry news, resources, and thought leadership on DataOps, data quality, and data observability.