Run Analytics Seamlessly Across Multi-Cloud Environments with DataOps

by | Jun 19, 2020 | Blog

Data analytics is performed using a seemingly infinite array of tools. In larger enterprises, different groups can choose different cloud platforms. Multi-cloud or multi-data-center integration can be one of the greatest challenges for an analytics organization. DataKitchen works across different tool platforms, enabling groups that use different clouds to work together seamlessly. There’s nothing special about clouds: DataKitchen treats on-prem as just another toolchain environment, so this discussion also applies to hybrid clouds.

Figure 1: A multi-cloud environment creates challenges for an analytics organization.

We recently encountered an enterprise with two groups that use different cloud platforms (see Figure 1). One group consists of data engineers who use tools like Talend, Python, Redshift and S3 on Amazon Web Services (AWS). The second group consists of data scientists who use Google Cloud Storage (GCS), BigQuery and Python on Google Cloud Platform (GCP). It’s very difficult to integrate analytics across a heterogeneous environment such as this. Note that we highlight Amazon and Google in this example, but the same difficulties hold true for other cloud and on-prem technologies.

In the AWS environment, the data flows through three processing steps. The results are fed to the Google cloud where three more steps are performed before charts, graphs and visualizations are delivered to users. In technology-driven companies, tools, workflows and incentives tend to drive people into isolated silos. It’s very hard to keep two teams such as these coordinated. How do you balance centralization and freedom? In other words, how do you keep control over the end-to-end process without imposing bureaucracy that stifles innovation? 

Figure 2: In a heterogeneous architecture, the two halves of the solution must work together to ensure data quality.

Figure 2 illustrates a few challenging aspects of heterogeneous-architecture integration. The two halves of the solution must work together to ensure data quality. Can test metrics be passed from one environment to the other so that data can be checked against statistical controls and business logic? Can the two groups coordinate alerts that notify the right development team of an issue that requires someone’s attention?
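To make the idea of passing test metrics between environments concrete, here is a minimal Python sketch. It assumes the upstream environment emits a small dictionary of metrics and that the downstream side keeps a history of past values. The metric names (`row_count`, `null_customer_ids`), the three-sigma limit and the function names are all hypothetical choices for illustration, not part of any DataKitchen API.

```python
from statistics import mean, stdev

# Hypothetical history of a metric reported by earlier upstream runs.
HISTORY = {"row_count": [1000, 1010, 990, 1005, 995]}


def within_control_limits(history, value, sigmas=3.0):
    """Return True if value lies within mean +/- sigmas * stdev of history."""
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) <= sigmas * spread


def check_metrics(metrics, history):
    """Check upstream metrics against a statistical control and a business rule."""
    failures = []
    # Statistical control: the row count should resemble previous runs.
    if not within_control_limits(history["row_count"], metrics["row_count"]):
        failures.append("row_count outside control limits")
    # Business logic (assumed rule): every record must carry a customer id.
    if metrics["null_customer_ids"] != 0:
        failures.append("null customer ids present")
    return failures


if __name__ == "__main__":
    upstream_metrics = {"row_count": 1002, "null_customer_ids": 0}
    print(check_metrics(upstream_metrics, HISTORY))
```

A non-empty list of failures is the natural trigger for an alert routed to whichever team owns the failing half of the pipeline.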

The two teams must also coordinate their processes. How do the teams perform impact analysis, i.e., if someone makes a change, how does it affect everyone else? When one team makes an architectural change, is the other team aware? The two teams may have different iteration cadences or business processes, which keep them out of lockstep. How can the two groups work together seamlessly while maintaining their independence?

Multi-Cloud with the DataKitchen Platform

The challenges of integrating multiple environments are greatly simplified when using the DataKitchen Platform. DataKitchen creates a coherent framework that interoperates with each of the two technical environments. The DataKitchen Platform supports orchestrated data pipelines called “Recipes.”

Figure 3 shows a Recipe (data pipeline) that consists of four steps:

  1. call-aws – execute analytics on the AWS platform
  2. move-data – move data from AWS to GCP
  3. verify-move – test the data just moved
  4. call-gcp – execute analytics on the GCP platform

The call-aws and call-gcp nodes are different from the others: they are Recipes used as sub-components of a top-level Recipe. We call Recipes used this way “Ingredients.” When called, the call-aws and call-gcp Ingredients execute on their respective cloud platforms.

Figure 3: A Recipe that uses Ingredients to execute on respective cloud platforms.
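The nesting of Ingredients inside a top-level Recipe can be sketched in a few lines of Python. This is an illustration of the composition pattern only, not DataKitchen's actual Recipe format: the `Step` and `Recipe` classes, the shared `context` dictionary and the stand-in actions are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Step:
    """A leaf node that runs one action against a shared context dict."""
    name: str
    action: Callable[[dict], None]

    def run(self, context: dict, log: List[str]) -> None:
        log.append(self.name)
        self.action(context)


@dataclass
class Recipe:
    """An ordered pipeline; a Recipe nested in another acts as an Ingredient."""
    name: str
    nodes: List[Union[Step, "Recipe"]] = field(default_factory=list)

    def run(self, context: dict, log: List[str]) -> None:
        log.append(self.name)
        for node in self.nodes:
            node.run(context, log)


def produce_on_aws(context: dict) -> None:
    context["aws_rows"] = [("a", 1), ("b", 2)]  # stand-in for AWS processing


def move_data(context: dict) -> None:
    context["gcp_rows"] = list(context["aws_rows"])  # stand-in for a transfer


def verify_move(context: dict) -> None:
    if len(context["gcp_rows"]) != len(context["aws_rows"]):
        raise ValueError("row count changed during move")


def consume_on_gcp(context: dict) -> None:
    context["report_rows"] = len(context["gcp_rows"])  # stand-in for GCP analytics


def build_top_level() -> Recipe:
    """Assemble the four-step pipeline, with two nested Recipes as Ingredients."""
    call_aws = Recipe("call-aws", [Step("aws-processing", produce_on_aws)])
    call_gcp = Recipe("call-gcp", [Step("gcp-processing", consume_on_gcp)])
    return Recipe("top-level", [
        call_aws,
        Step("move-data", move_data),
        Step("verify-move", verify_move),
        call_gcp,
    ])


if __name__ == "__main__":
    context, log = {}, []
    build_top_level().run(context, log)
    print(log)
```

Because the top-level pipeline only sees each Ingredient through its `run` method, either team could change the steps inside its own Ingredient without touching the other's.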

The DataKitchen architecture abstracts the interface to target toolchains. It references the tools in each of the technical environments using variables. During automated orchestration of the pipelines, references to the variables are overridden and diverted to a specific tool instantiation in the correct technical environment. The target environment can then be changed by modifying the overrides. DataKitchen embeds software agents in the technical environments to help facilitate this interface modularity. 
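As a rough illustration of the variable-override idea, the sketch below resolves tool references from a set of defaults plus per-environment overrides. The variable names, bucket names and the use of Python's `string.Template` are assumptions for the example; the platform's real mechanism may look quite different.

```python
from string import Template

# Hypothetical default tool references used by a pipeline step.
DEFAULTS = {"warehouse": "redshift-dev", "bucket": "s3://dev-bucket"}


def resolve(template_text, defaults, overrides):
    """Fill ${variables} in a step definition; overrides win over defaults."""
    variables = {**defaults, **overrides}
    # substitute() raises KeyError if a referenced variable is undefined,
    # so a missing tool reference fails loudly instead of passing through.
    return Template(template_text).substitute(variables)


if __name__ == "__main__":
    step = "load ${bucket}/orders.csv into ${warehouse}"
    print(resolve(step, DEFAULTS, {}))
    # Retarget the same step by changing only the overrides.
    print(resolve(step, DEFAULTS, {"bucket": "gs://prod-bucket", "warehouse": "bigquery-prod"}))
```

The step definition itself never changes; only the override dictionary decides which environment it runs against.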

DataKitchen executes testing within each Ingredient pipeline and across the cloud environments. Once the top-level Recipe is in place, it serves as a robust and modular interface between the two groups. Each group can work independently, focusing on its own Ingredient: call-aws for one group and call-gcp for the other. The two groups do not have to understand each other’s toolchain or environment. Data flowing between the two cloud platforms is compartmentalized and error-checked in the move-data and verify-move steps of the top-level Recipe.
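One plausible way to error-check data after a move is to compare a row count and a content hash computed on each side. The following Python sketch assumes the rows fit in memory as tuples; the function names and the choice of SHA-256 are illustrative assumptions, not a description of how the verify-move step is actually implemented.

```python
import hashlib


def fingerprint(rows):
    """Return (row count, order-independent SHA-256 digest) for a list of row tuples."""
    digest = hashlib.sha256()
    # Sort first so the same rows in a different order hash identically.
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return len(rows), digest.hexdigest()


def verify_move(source_rows, moved_rows):
    """Return True if the moved data matches the source in count and content."""
    return fingerprint(source_rows) == fingerprint(moved_rows)


if __name__ == "__main__":
    source = [("a", 1), ("b", 2)]
    print(verify_move(source, [("b", 2), ("a", 1)]))  # same rows, different order
    print(verify_move(source, [("a", 1)]))            # a row was lost in transit
```

In practice each cloud would compute its own fingerprint locally and only the small fingerprint would cross the boundary, which keeps the check cheap.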

The DataKitchen Platform serves as a unifying platform for the two cloud environments. It orchestrates both the local and global data pipelines. It tests data at each processing step and across the end-to-end data flow so that quality is maintained. DataKitchen supports the creation and management of testing across all cloud environments, simplifying end-to-end testing. It liberates the two teams from having to understand each other’s toolchains or workflow processes. With DataKitchen in place, the teams work efficiently and independently, without hindrance from quality, process coordination or integration issues.

To learn more on this topic, read the white paper “Your Cloud Migration is Actually an Agility Initiative.”
