Orchestrating StreamSets with the DataKitchen DataOps Platform

A cacophony of tools and mission-critical deliverables is the reason modern data organizations are so complex. Data groups include a wide range of roles and functions, intricately woven together by their “Data”: data scientists, business analysts, data analysts, statisticians, data engineers, and many more. Each of these roles has a unique mindset, specific goals, distinct skills, and a preferred set of tools. It’s not news that everyone loves their tools and is seldom willing to give them up.

 

[Image: A fragmented toolchain]

 

DataOps orchestrates your end-to-end multi-tool, multi-environment pipelines – from data access to value delivery. Using the DataKitchen DataOps Platform, you can continue to use the tools you love.  For example, one popular tool is StreamSets, which allows teams to build and operate smart data pipelines for data ingestion and ETL.  However, an ETL tool on its own will not deliver the benefit of DataOps.  Although tools like StreamSets play an important role in the data operations pipeline, they do not ensure that each step is executed and coordinated as a single, integrated, and accurate process or help people and teams better collaborate. A platform like DataKitchen is needed to orchestrate StreamSets (or any ETL tool) as part of your end-to-end data pipeline.  StreamSets and DataKitchen are complementary tools. Orchestrating StreamSets with DataKitchen enables you to achieve key elements of DataOps, such as the ability to: 

  • Manage and seamlessly deploy StreamSets pipelines across different release environments (DEV/QA/PROD).
  • Automate the testing and monitoring of StreamSets pipelines using the Python SDK or API exposed by StreamSets.
  • Easily design data pipelines (Recipes) that include StreamSets along with any of your favorite tools.
  • Manage and control different versions of the StreamSets pipeline configuration.
  • Facilitate collaboration by allowing users to easily share resources (Ingredients), including the StreamSets pipeline, in their respective isolated work environments (Kitchens).

How It Works

For the infrastructure setup, StreamSets runs inside a Docker container on an AWS EC2 instance. DataKitchen interacts with StreamSets through the StreamSets CLI. StreamSets also offers a subscription-based Python SDK that can be used to retrieve pipeline information (such as status and metrics), which can later be used to configure automated QA tests in the DataKitchen DataOps Platform.
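As a rough illustration, a test node might pull the pipeline’s status from the Data Collector and pass or fail on the result. The sketch below uses the Data Collector’s REST interface rather than the subscription SDK to keep it dependency-free; the URL, credentials, endpoint path, and pipeline ID are placeholders for this demo setup and should be checked against your StreamSets version.

# Hypothetical QA check: ask the Data Collector for a pipeline's status.
# URL, credentials, endpoint path, and pipeline ID are placeholders.
import requests

SDC_URL = "https://localhost:18630"
PIPELINE_ID = "my_demo_pipeline"

resp = requests.get(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/status",
    auth=("admin", "admin"),   # default demo credentials; replace in practice
    verify=False,              # the demo instance uses a self-signed certificate
)
resp.raise_for_status()
status = resp.json().get("status")
print(f"Pipeline {PIPELINE_ID} status: {status}")  # e.g. RUNNING or STOPPED

A DataKitchen test can then assert on that value and pass, fail, or warn the node accordingly.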

The picture below shows a basic recipe, built in less than a day, that imports an existing StreamSets pipeline, starts it, triggers it by mimicking an event, and finally stops it. Each of these tasks is performed in an individual node, and each node can be configured with automated QA tests that pass, fail, or issue a warning upon execution.

[Image: The StreamSets orchestration recipe in DataKitchen]

Step 1:  Import the StreamSets Pipeline

This step is performed by the import_streamsets_pipeline node shown in the picture above. It uses a Docker container to log into the AWS EC2 machine and import the StreamSets pipeline, defined by a JSON extract provided by the user, via the StreamSets CLI. The JSON extract can be exported from StreamSets and passed into the Docker node as a source file. This step also ensures that the StreamSets pipeline is versioned in GitHub.

 

bin/streamsets cli -U https://localhost:18630 store import -n "$pipelineID" -f <file_name>.json
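
In practice, the import_streamsets_pipeline node just needs a way to run that command on the EC2 host. A minimal sketch of the idea, assuming SSH key access and using placeholder host, key, pipeline, and file names (the real recipe node runs this from inside its Docker container):

# Minimal sketch: run the StreamSets CLI import command on the EC2 host over SSH.
# Host, user, key path, pipeline ID, and file path are placeholders for this demo.
import paramiko

EC2_HOST = "ec2-demo-host.compute.amazonaws.com"   # placeholder EC2 address
PIPELINE_ID = "my_demo_pipeline"                   # placeholder pipeline ID

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(EC2_HOST, username="ec2-user", key_filename="/keys/demo.pem")

# Assumes the StreamSets CLI is on the host's path (or wrapped in a docker exec).
cmd = (
    "bin/streamsets cli -U https://localhost:18630 "
    f'store import -n "{PIPELINE_ID}" -f /tmp/pipeline_export.json'
)
stdin, stdout, stderr = client.exec_command(cmd)
print(stdout.read().decode(), stderr.read().decode())
client.close()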

 

The sample StreamSets pipeline we designed for the demo (shown in the picture below) moves a file from a local directory on the EC2 instance to an Amazon S3 bucket.

 

[Image: Sample StreamSets pipeline moving a file from a local directory to Amazon S3]

 

Step 2:  Start the StreamSets Pipeline

The start_streamsets_pipeline node uses a Docker container to log into the AWS EC2 machine and start the StreamSets pipeline via the StreamSets CLI.

bin/streamsets cli -U https://localhost:18630 manager start --name <pipelineID>
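
Because the next step assumes the pipeline is actually running, it can be worth adding a small check between starting and triggering it. A rough sketch, reusing the same hypothetical status call as above (URL, credentials, and pipeline ID remain placeholders):

# Rough sketch: wait until the pipeline reports RUNNING before triggering it.
import time
import requests

SDC_URL = "https://localhost:18630"
PIPELINE_ID = "my_demo_pipeline"

for _ in range(12):                      # poll for up to ~1 minute
    resp = requests.get(
        f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/status",
        auth=("admin", "admin"),
        verify=False,
    )
    if resp.ok and resp.json().get("status") == "RUNNING":
        break
    time.sleep(5)
else:
    raise RuntimeError("Pipeline never reached RUNNING; fail the node.")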

Step 3:  Create a new file

The create_new_file node mimics the arrival of a new file in the local directory by creating a new file on the EC2 instance, to test whether the StreamSets pipeline is operating as expected.
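
A follow-on QA test in the recipe can then confirm that the new file actually lands in the S3 bucket. A rough sketch, with the bucket name and key prefix as placeholders:

# Rough sketch of a QA check: did the new file arrive in the S3 bucket?
# Bucket name, key prefix, and timing are placeholders for this demo.
import time
import boto3

BUCKET = "my-demo-bucket"        # placeholder destination bucket
KEY_PREFIX = "streamsets-demo/"  # placeholder prefix the pipeline writes to

s3 = boto3.client("s3")
for _ in range(12):              # poll for up to ~1 minute
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=KEY_PREFIX)
    if listing.get("KeyCount", 0) > 0:
        print("File arrived in S3; the pipeline is working.")
        break
    time.sleep(5)
else:
    raise RuntimeError("File never arrived in S3; fail the test node.")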

Step 4:  Stop the StreamSets Pipeline

The stop_streamsets_pipeline node uses a Docker container to log into the AWS EC2 machine and stop the StreamSets pipeline via the StreamSets CLI.

bin/streamsets cli -U https://localhost:18630 manager stop --name <pipelineID>

By following similar steps, any tool can be easily orchestrated with the DataKitchen DataOps Platform.  To learn more about how orchestration enables DataOps, please visit our blog, DataOps is Not Just a DAG for Data.

About the Author

Priyanjna Sharma
Priyanjna Sharma is a Senior DataOps Implementation Engineer at DataKitchen. Priya has more than 10 years of experience in IT, Data Warehousing, Reporting, Business Intelligence, Data Mining and Engineering, and DataOps.
