Almost every data analytics tool can be used in DataOps, but some don't enable the full breadth of DataOps benefits.
DataOps views data-analytics pipelines as a manufacturing process that can be represented by directed acyclic graphs (DAGs). Each node in the graph represents an operation on data as it flows through the pipeline (integration, preprocessing, ETL, modeling, rendering, etc.). In Figure 1 below, we see that the "check_data_quality" node accepts an input (data), performs some processing (quality tests), and produces an output (test results).
Figure 1: Each step in the data pipeline usually involves an input and processing to create a result or output.
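To make the node-as-operation idea concrete, here is a minimal sketch in plain Python (our illustration, not DataKitchen's actual API) of a step like the "check_data_quality" node: it accepts data, runs quality tests, and emits test results that downstream nodes can consume.

```python
# A minimal sketch (illustrative, not DataKitchen's API) of a DAG node:
# input (data) -> processing (quality tests) -> output (test results).

def check_data_quality(rows):
    """Run simple quality tests on incoming rows and return test results."""
    results = {
        "row_count": len(rows),
        "null_ids": sum(1 for row in rows if row.get("id") is None),
    }
    results["passed"] = results["row_count"] > 0 and results["null_ids"] == 0
    return results

# Downstream nodes consume this output, forming the acyclic graph:
# extract -> check_data_quality -> transform -> publish
test_results = check_data_quality([{"id": 1}, {"id": 2}])
print(test_results)  # {'row_count': 2, 'null_ids': 0, 'passed': True}
```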
If we view the data pipeline from a tools perspective, we see how data flows from one tool to another (Figure 2). The typical enterprise uses many different tools, so tool integration becomes vitally important. The DataKitchen Platform employs two basic tool integration strategies.
- Native Support – Integration is achieved through dedicated Data Sources and Data Sinks. The tool supports a direct connection to the DataKitchen Platform.
- Container Support – Integration is achieved using Containers, a lightweight form of machine virtualization that encapsulates an application and its environment. For more on the ease of using containers with the DataKitchen Platform, see our containers blog; a minimal sketch of the pattern follows this list.
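As an illustration of the container strategy, here is a hedged sketch using the docker Python SDK; the image name, command, and volume paths are hypothetical, and a real DataKitchen integration may differ.

```python
# A minimal sketch of container-based tool integration using the docker
# Python SDK (pip install docker). Image, command, and paths are
# hypothetical placeholders.
import docker

client = docker.from_env()

# The container bundles the tool and its environment, so the
# orchestrator only needs to mount inputs and collect outputs.
logs = client.containers.run(
    image="example/profiler:1.0",            # hypothetical tool image
    command="profile --input /data/in.csv",  # hypothetical tool command
    volumes={"/tmp/pipeline": {"bind": "/data", "mode": "rw"}},
    remove=True,  # clean up the container after the step finishes
)
print(logs.decode())
```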
Figure 2: The DataOps pipeline shown from a tools perspective illustrates the importance of tool integration (multiple tool alternatives shown in steps).
Using natively supported or container-supported integration, a DataOps Platform like DataKitchen can interface with any tool. However, DataOps touches on many aspects of workflow, and some tools are better suited to DataOps than others. To further illustrate, we have created a simple rubric that scores a tool's receptiveness to integration in a DataOps orchestrated pipeline.
A high-scoring tool offers support and functionality in four important areas:
- Source code – A tool that produces source code enables many aspects of DataOps. Source code can be version controlled, allowing change management, parallel development, and static testing (debugging via automated analysis of source code prior to execution).
- Environments – DataOps supports multiple environments, for example, development, staging, and production. A tool should be able to support segmented access. For example, a Redshift database may be partitioned using clusters, databases, and schemas so that users are isolated from one another.
- API – A tool with an API is easy to orchestrate. An API that is machine callable and supports the loading of code/parameters works well in the continuous deployment methodology that DataOps seeks to create.
- DevOps – The tool supports being spun up and shut down under automated orchestration, and it can scale to the desired number of instantiations (see the sketch after this list).
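To make these four areas concrete, here is a hedged sketch of what an API-friendly, DevOps-ready tool looks like to an orchestrator. Every endpoint and file name below is hypothetical; real tools differ, but the lifecycle (spin up, load code, run with environment parameters, poll status, shut down) is the point.

```python
# Hypothetical tool-API lifecycle from an orchestrator's point of view.
import time
import requests

BASE = "https://tool.example.com/api/v1"  # hypothetical tool API

# DevOps: spin up an instance under automated orchestration.
requests.post(f"{BASE}/instances", json={"size": "small"})

# Source code + API: load version-controlled code into the tool.
with open("etl.sql") as f:  # hypothetical checked-in artifact
    requests.put(f"{BASE}/jobs/nightly/code", data=f.read())

# Environments: run the job parameterized for a target environment.
requests.post(f"{BASE}/jobs/nightly/run", json={"env": "staging"})

# Testing/monitoring: poll status until the step completes.
while requests.get(f"{BASE}/jobs/nightly/status").json()["state"] == "running":
    time.sleep(10)

# DevOps: shut the instance down when the pipeline is done.
requests.delete(f"{BASE}/instances")
```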
In addition to these four main areas, the rubric scores some other characteristics. The rubric is shown below. The higher a tool scores, the easier it will be to integrate into a DataOps Platform.
Rubric for DataOps Ease of Integration
Step 1: If your tool is a programming language or produces code as an artifact, check all that apply.

| Category | Sub-Category | Match | Points | DataOps Benefit |
| --- | --- | --- | --- | --- |
| Source Code | Saves or exports binary, XML, JSON, or source code (e.g., SQL, Python) | • | +8 | Version Control, Reuse |
| Source Code | Produces source code that can be checked into version control, but cannot be auto-merged (e.g., binary format, XML) | • | +4 | Version Control |
| Source Code | Produces source code that is line mergeable (e.g., SQL, Python) | • | +8 | Version Control, Branch and Merge |
| Source Code | Code supports static analysis | • | +2 | Static Testing |
Step 2: Check all that apply from these sub-categories.

| Category | Sub-Category | Match | Points | DataOps Benefit |
| --- | --- | --- | --- | --- |
| Environments (e.g., Prod, QA, Dev) | The tool can be parameterized or partitioned | • | +8 | Supports Release Environments, Parameterize Processing |
| DevOps | Provides a way to spin up the tool, i.e., create it in a VM, Container, or Machine Image | • | +8 | Environments, Containers, Reuse, Orchestration |
| DevOps | Able to be scaled to the number of instances required | • | +4 | Environments, Reuse, Orchestration |
| API | Offers an API to start the tool | • | +4 | Orchestration |
| API | Supports an API to stop the tool | • | +4 | Orchestration |
| API | Supports an API to save and load source code | • | +8 | Reuse, Orchestration |
| API | Provides the ability to check its status | • | +2 | Testing, Monitoring |
| API | Enables data to be utilized in tests | • | +16 | Testing |
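To show how the tally works, here is a toy Python sketch that sums the point values from the tables above for whichever rubric items a tool satisfies; the short item labels are our own shorthand for the sub-categories.

```python
# Toy tally of the rubric using the point values from the tables above.
RUBRIC_POINTS = {
    "exports code artifact": 8,
    "version-controllable but not auto-mergeable": 4,
    "line-mergeable source code": 8,
    "supports static analysis": 2,
    "parameterized or partitioned environments": 8,
    "can be spun up (VM/container/image)": 8,
    "scales to required instances": 4,
    "API to start": 4,
    "API to stop": 4,
    "API to save/load code": 8,
    "API to check status": 2,
    "data usable in tests": 16,
}

def rubric_score(checked):
    """Sum the points for every rubric item the tool satisfies."""
    return sum(RUBRIC_POINTS[item] for item in checked)

# Example: a tool that checks these boxes scores 32 of a possible 76.
print(rubric_score({"line-mergeable source code",
                    "parameterized or partitioned environments",
                    "can be spun up (VM/container/image)",
                    "API to start", "API to stop"}))
```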
Tools that Measure Up
We've found that the rubric is most useful when comparing two tools of the same type or category, for example, differentiating between a traditional on-prem database and a cloud database. Cloud databases like Amazon Redshift, Google BigQuery, Azure Synapse, or Snowflake can be spun up on demand for sandboxes and scaled to the number of instances required. Cloud databases therefore score higher than alternative databases that lack these features.
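For instance, a sandbox Redshift cluster can be created and torn down programmatically. Below is a hedged sketch using boto3's Redshift client; the identifiers and credentials are placeholders.

```python
# A sketch of on-demand sandbox spin-up with boto3's Redshift client.
# Identifiers and credentials are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Create a sandbox cluster on demand...
redshift.create_cluster(
    ClusterIdentifier="dataops-sandbox",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="example-password",  # placeholder; use a secrets manager
)

# ...and tear it down when the sandbox is no longer needed.
redshift.delete_cluster(
    ClusterIdentifier="dataops-sandbox",
    SkipFinalClusterSnapshot=True,
)
```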
The tools that score more favorably on the rubric do a better job of enabling aspects of DataOps. For example, tools such as Pentaho and SQL Workbench do not support an API and focus on interactive use. SQL Workbench scores higher than Pentaho because it saves SQL that is line mergeable. Pentaho exports XML, which can be stored in source control but does not easily enable version control, branching, and merging, which are essential to parallel development. Pentaho and Tableau are examples of popular tools whose saved/exported files do not support auto-merging in version control. Almost every tool can be used in a DataOps pipeline, but some don't enable the full breadth of DataOps benefits.
Programming languages, such as Python, R, and SQL, are a special case. They score well in the source code category but can't be evaluated in the environments, DevOps, or API categories unless considered in the context of a toolchain or environment. We would say that programming languages are ideally suited to DataOps because they are line mergeable.
The good news is that tool vendors increasingly understand that integrating into an orchestrated data pipeline is becoming more important as DataOps automation grows. If a tool that you use (or wish to purchase) scores low on the rubric, you may want to share your observations with the tool vendor.
We welcome comments about our rubric. Please email us at info@www.datakitchen.io or reach out to us on Twitter at @www.datakitchen.io.