Orchestrating StreamSets with the DataKitchen DataOps Platform

A cacophony of tools and mission-critical deliverables makes modern data organizations highly complex. Data groups include a wide range of roles and functions that are intricately woven together by their “Data”: data scientists, business analysts, data analysts, statisticians, data engineers, and many more. Each of these roles has a unique mindset, specific goals, distinct skills, and a preferred set of tools. It’s not news that everyone loves their tools and is seldom willing to give them up.

 

[Figure: Fragmented Toolchain]

 

DataOps orchestrates your end-to-end multi-tool, multi-environment pipelines – from data access to value delivery. Using the DataKitchen DataOps Platform, you can continue to use the tools you love.  For example, one popular tool is StreamSets, which allows teams to build and operate smart data pipelines for data ingestion and ETL.  However, an ETL tool on its own will not deliver the benefit of DataOps.  Although tools like StreamSets play an important role in the data operations pipeline, they do not ensure that each step is executed and coordinated as a single, integrated, and accurate process or help people and teams better collaborate. A platform like DataKitchen is needed to orchestrate StreamSets (or any ETL tool) as part of your end-to-end data pipeline.  StreamSets and DataKitchen are complementary tools. Orchestrating StreamSets with DataKitchen enables you to achieve key elements of DataOps, such as the ability to: 

  • Manage and seamlessly deploy StreamSets pipelines across different release environments (DEV/QA/PROD).
  • Automate the testing and monitoring of StreamSets pipelines using the Python SDK or API exposed by StreamSets.
  • Easily design data pipelines (Recipes) that include StreamSets along with any of your favorite tools.
  • Manage and control different versions of the StreamSets pipeline configuration.
  • Facilitate collaboration amongst users by allowing easy sharing of resources (Ingredients) including the StreamSets pipeline in their respective isolated work environments (Kitchens).

How It Works

For the infrastructure setup, StreamSets runs inside a Docker container on an AWS EC2 instance. DataKitchen interacts with StreamSets using the StreamSets CLI. StreamSets also offers a subscription-based Python SDK, which can be leveraged to exchange information that is later used to configure automated QA tests in the DataKitchen DataOps Platform.
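Concretely, each DataKitchen node shells into the EC2 host and invokes the StreamSets CLI. As a minimal sketch of how such an invocation can be assembled (the helper below is our own illustration, not part of either product):

```python
def sdc_cli(url, *args):
    """Build the argv for a StreamSets Data Collector CLI call.

    Hypothetical helper: wraps the `bin/streamsets cli` entry point
    that a Docker node would execute on the EC2 host.
    """
    return ["bin/streamsets", "cli", "-U", url, *args]

# Example: the command a node would run to check pipeline status.
cmd = sdc_cli("http://localhost:18630", "manager", "status")
```

Building the argv as a list (rather than a single shell string) keeps URLs and pipeline IDs from needing shell quoting.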

The picture below shows a basic recipe, built in less than a day, that imports an existing StreamSets pipeline, starts it, triggers it by mimicking an event, and finally stops it. Each of these tasks is performed in an individual node, and each node can be configured with automated QA tests that pass, fail, or issue a warning upon execution.
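For instance, a QA test attached to one of these nodes might map the pipeline status reported by StreamSets onto a pass/warning/fail outcome. A minimal sketch, where the mapping is purely illustrative (not DataKitchen's built-in logic):

```python
def qa_outcome(status: str) -> str:
    """Map a StreamSets pipeline status to a QA test outcome.

    Illustrative mapping only; 'RUNNING', 'FINISHED', 'STOPPED',
    and 'RUN_ERROR' are typical Data Collector status values.
    """
    if status in ("RUNNING", "FINISHED"):
        return "pass"
    if status in ("STOPPED", "STOPPING"):
        return "warning"
    return "fail"  # e.g. RUN_ERROR, START_ERROR
```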

[Figure: DataKitchen recipe orchestrating a StreamSets pipeline]

Step 1:  Import the StreamSets pipeline

This step is performed by the import_streamsets_pipeline node shown in the picture above. It uses a Docker container to log into the AWS EC2 machine and import the StreamSets pipeline, as defined by a JSON extract provided by the user, using the StreamSets CLI. The JSON extract can be exported from StreamSets and passed into the Docker node as a source file. This step also ensures that the StreamSets pipeline is versioned in GitHub.

 

bin/streamsets cli -U http://localhost:18630 store import -n "$pipelineID" -f <file_name>.json
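Before running the import, the node can sanity-check the exported JSON, which doubles as a cheap QA test. A sketch, assuming only that the export is a JSON document (the helper and its checks are ours, not part of the StreamSets CLI):

```python
import json

def check_pipeline_export(path):
    """Verify that an exported StreamSets pipeline file parses as JSON
    and is a non-empty object before handing it to `store import`.
    (Hypothetical pre-import check.)"""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)  # raises ValueError on malformed JSON
    if not isinstance(doc, dict) or not doc:
        raise ValueError(f"{path} does not look like a pipeline export")
    return doc
```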

 

The sample StreamSets pipeline we designed for the demo (shown in the picture below) moves a file from a local directory on the EC2 instance to an Amazon S3 bucket.

 

[Figure: the sample StreamSets pipeline, moving files from a local directory to Amazon S3]

 

Step 2:  Start the StreamSets Pipeline

The start_streamsets_pipeline node uses a Docker container to log into the AWS EC2 machine and start the StreamSets pipeline via the StreamSets CLI.

bin/streamsets cli -U http://localhost:18630 manager start --name <pipelineID>
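Because starting a pipeline may not be instantaneous, a node can poll until the pipeline reports the desired status before moving on. A small sketch with the status lookup injected as a callable (how you fetch the status — CLI or SDK — is up to you):

```python
import time

def wait_for_status(get_status, target="RUNNING", timeout=60.0, interval=1.0):
    """Poll get_status() until it returns `target`.

    Returns True on success, False if `timeout` seconds elapse first.
    """
    deadline = time.time() + timeout
    while True:
        if get_status() == target:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

A False return is a natural place to fail the node's QA test and halt the recipe.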

Step 3:  Create a new file

The create_new_file node mimics the arrival of a new file in the local directory by creating a new file on the EC2 instance, to test whether the StreamSets pipeline is operating as expected.
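A sketch of what create_new_file could do on the EC2 instance, using a collision-proof file name (the directory, file contents, and helper name are illustrative):

```python
import os
import time
import uuid

def drop_test_file(watch_dir, contents="id,value\n1,hello\n"):
    """Mimic an arriving event by writing a uniquely named CSV into the
    directory that the StreamSets pipeline's origin watches."""
    name = f"event_{int(time.time())}_{uuid.uuid4().hex[:8]}.csv"
    path = os.path.join(watch_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(contents)
    return path
```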

Step 4:  Stop the StreamSets Pipeline

The stop_streamsets_pipeline node uses a Docker container to log into the AWS EC2 machine and stop the StreamSets pipeline via the StreamSets CLI.

bin/streamsets cli -U http://localhost:18630 manager stop --name <pipelineID>

By following similar steps, any tool can be easily orchestrated with the DataKitchen DataOps Platform.  To learn more about how orchestration enables DataOps, please visit our blog, DataOps is Not Just a DAG for Data.

About the Author

Priyanjna Sharma
Priyanjna Sharma is a Senior DataOps Implementation Engineer at DataKitchen. Priya has more than 10 years of experience in IT, Data Warehousing, Reporting, Business Intelligence, Data Mining and Engineering, and DataOps.
