From Cattle to Clarity: Visualizing Thousands of Data Pipelines with Violin Charts

Most data teams work with a dozen or a hundred pipelines in production. What do you do when you have thousands of data pipelines in production? How do you understand what is happening to those pipelines? Is there a way to visualize what is happening in production quickly and easily? We’ve developed a new feature in the enterprise cloud version of our DataOps Observability product that does just that. It is called the Violin Chart. We will discuss how and why we use it for data pipeline durations. Stop fiddling with time!

One of our customers, who runs several thousand pipelines daily, wanted to know more about their pipelines. They treat their pipelines as cattle, not pets. The following chart shows run durations: the list of pipelines is on the Y-axis (truncated for privacy) and is sorted so that the pipelines that take the longest to run appear at the top. That is useful if you want to focus on what is slow. What the chart does not show is the trend over time. What is an effective way to see the trend?
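As a point of reference, a sorted view like that one can be produced with a few lines of pandas. The sketch below uses made-up pipeline names and durations, and the column names are illustrative assumptions rather than our customer's schema.

```python
# A minimal sketch of the sorted run-duration view described above.
# Pipeline names, durations, and column names are illustrative assumptions.
import pandas as pd

runs = pd.DataFrame({
    "pipeline": ["pipeline_a", "pipeline_a", "pipeline_b", "pipeline_c", "pipeline_c"],
    "duration_min": [2.4, 2.6, 95.0, 12.0, 14.5],
})

# Median duration per pipeline, sorted so the slowest pipelines come first,
# matching the Y-axis ordering of the chart described above.
summary = (
    runs.groupby("pipeline")["duration_min"]
        .median()
        .sort_values(ascending=False)
)
print(summary)
```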

This article explains what a violin chart is and gives three examples of how you can quickly spot anomalies with one.

What is a violin chart?

Our customer suggested we use a violin chart. I thought I was up on the latest charts and graphs after working with data (I won’t say for how many years) and taking a Tufte course on the display of quantitative information on my birthday (in person, back when there were 3, not 5, books). “The violin plot was proposed in 1997 by Jerry L. Hintze and Ray D. Nelson as a way to display even more information than box plots [also called box-and-whisker plots], created by John Tukey in 1977. The name comes from the plot’s alleged resemblance to a violin.” [Wikipedia]

What is a violin chart good for?

You can see the minimum and maximum values along with a histogram of everything in between. The shape of each violin lets you spot at a glance when something is different.
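If you want to try this on your own run data, here is a minimal sketch using pandas and seaborn. The data is synthetic, and the column names (run_date, duration_min) are illustrative assumptions, not the schema of our product.

```python
# A minimal sketch of a per-day violin chart of pipeline run durations.
# The data is synthetic; column names are illustrative assumptions.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data: ~3,000 runs per day, most taking around 2.5 minutes.
rng = np.random.default_rng(42)
days = pd.date_range("2023-08-01", periods=7, freq="D")
frames = []
for day in days:
    durations = rng.normal(loc=2.5, scale=0.4, size=3000).clip(min=0.5)
    frames.append(pd.DataFrame({"run_date": day.strftime("%m-%d"),
                                "duration_min": durations}))
runs = pd.concat(frames, ignore_index=True)

# One violin per day: the width at each duration is a smoothed histogram,
# so a change in shape signals a change in run-time behavior.
sns.violinplot(data=runs, x="run_date", y="duration_min", cut=0)
plt.xlabel("Run date")
plt.ylabel("Run duration (minutes)")
plt.title("Pipeline run durations by day")
plt.tight_layout()
plt.show()
```

Here, cut=0 keeps the smoothed density from extending past the observed data, so the tips of each violin still mark the true minimum and maximum run times.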

Real-world examples

Looking at these charts, you can see that the runs on August 5 and August 6 had longer-than-usual durations.

Most of the runs take around 2.5 minutes, but there are outliers of 100 minutes on August 5 and 6. These should be monitored for future occurrences and may warrant investigation. The chart is a convenient way to absorb a lot of data, because Pipeline 1 has about 3,000 runs on a typical day.
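One way to watch for those future occurrences is a simple per-day outlier rule. The sketch below applies an interquartile-range threshold to the same illustrative run_date/duration_min columns; the rule and its multiplier are assumptions for illustration, not the product's built-in alerting.

```python
# A minimal sketch of flagging unusually long runs with a per-day IQR rule.
# Column names and the threshold multiplier are illustrative assumptions.
import pandas as pd

def flag_outlier_runs(runs: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Return runs whose duration exceeds Q3 + k * IQR for their day."""
    grouped = runs.groupby("run_date")["duration_min"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    upper = q3 + k * (q3 - q1)
    return runs[runs["duration_min"] > upper]

# Tiny illustrative example: two days, one obvious 100-minute outlier on 08-05.
runs = pd.DataFrame({
    "run_date":     ["08-04"] * 5 + ["08-05"] * 5,
    "duration_min": [2.3, 2.4, 2.5, 2.5, 2.6, 2.4, 2.5, 2.5, 2.6, 100.0],
})
print(flag_outlier_runs(runs))
```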

In this next chart of Pipeline 2, the shape of the violin changed on August 26 and 27. At a glance, we can see that there are more long runs, some significantly longer than usual.

On August 26, the maximum is about 85 minutes, and on August 27, it is 50 minutes.

In this final chart, the shape of this sporadic job changed on August 23. Usually, this job takes less than 30 seconds. On August 23, one run was below 30 seconds, but the others were over 2 minutes.

The violin chart is helpful for more than just seeing which jobs took longer than usual. Because the chart groups runs and draws a histogram for each period, it is easy to see trends and anomalies and to pinpoint areas for further investigation, which in turn improves your Data Operations.
