From Cattle to Clarity: Visualizing Thousands of Data Pipelines with Violin Charts
Most data teams work with a dozen or a hundred pipelines in production. What do you do when you have thousands of data pipelines in production? How do you understand what is happening to those pipelines? Is there a way that you can visualize what is happening in production quickly and easily? We've developed a new feature in the enterprise cloud version of our DataOps Observability product that does just that. It is called the Violin Chart. We will discuss how and why we use it for data pipeline durations. Stop fiddling with time!
One of our customers, who runs several thousand pipelines daily, wanted to know more about their pipelines. They treat their pipelines as cattle, not pets. The following chart shows run durations. The list of pipelines is on the Y-axis (truncated for privacy) and is sorted so you can easily see the pipelines at the top take the most time to run. That is useful if you want to focus on what is slow. What the chart does not show is the trend over time. What is an effective way to see the trend?
This article explains what a violin chart is and gives three examples of how you can quickly spot anomalies with it.
What is a violin chart?
Our customer suggested we use a violin chart. I thought I was up on the latest charts and graphs after working with data (I won’t say how many years) and taking a Tufte course on the display of quantitative information on my birthday (in person, when there were 3, not 5, books). “The violin plot was proposed in 1997 by Jerry L. Hintze and Ray D. Nelson as a way to display even more information than box plots [also called box-and-whisker plots], created by John Tukey in 1977. The name comes from the plot’s alleged resemblance to a violin.” [Wikipedia]
What is a violin chart good for?
You can see the minimum and maximum values and a histogram of the values. The shape of each violin lets you visually spot when something is different.
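To make the idea concrete, here is a minimal sketch of the numbers a single violin encodes: the minimum, the maximum, and a histogram of run durations for each day. It is an illustration only, not how the product computes its chart. The function name `violin_summary`, the one-minute bin width, and the sample data are all made up for this example.

```python
from collections import defaultdict


def violin_summary(runs, bin_width=1.0):
    """Group (day, duration_minutes) pairs by day and return, per day,
    the min, the max, and a histogram (bin start -> run count)."""
    by_day = defaultdict(list)
    for day, minutes in runs:
        by_day[day].append(minutes)

    summary = {}
    for day, durations in sorted(by_day.items()):
        histogram = defaultdict(int)
        for d in durations:
            # Bin start is the duration rounded down to a multiple of bin_width.
            bin_start = (d // bin_width) * bin_width
            histogram[bin_start] += 1
        summary[day] = {
            "min": min(durations),
            "max": max(durations),
            "histogram": dict(sorted(histogram.items())),
        }
    return summary


# Hypothetical sample data (minutes), not the customer's real runs.
sample_runs = [
    ("08-04", 2.4), ("08-04", 2.6), ("08-04", 2.5),
    ("08-05", 2.5), ("08-05", 2.7), ("08-05", 100.0),
]
summary = violin_summary(sample_runs)
for day, stats in summary.items():
    print(day, stats)
```

Each day's histogram is what gives a violin its width at a given duration, and the min and max mark its two ends. A day with a single 100-minute run shows up as a long thin tail above the usual bulge.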
Real-world examples
Looking at these charts, you can see that the runs on August 5 and August 6 had longer-than-usual durations.
Most of the runs are around 2.5 minutes. There are outliers of 100 minutes on August 5 and 6. This should be monitored for future occurrences and can be a cause for investigation. This is a convenient way to absorb lots of data because there are 3,000 runs of Pipeline 1 on a typical day.
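If you wanted to monitor for this kind of outlier automatically, one simple approach is to flag any day whose longest run is far above the typical duration. The sketch below is an assumption-laden illustration, not a feature of the product: the function name `flag_unusual_days`, the factor of 10, and the sample data are hypothetical, and it assumes every day has at least one run.

```python
from statistics import median


def flag_unusual_days(durations_by_day, factor=10.0):
    """Return the days whose longest run exceeds `factor` times
    the median duration across all runs on all days."""
    all_durations = [d for runs in durations_by_day.values() for d in runs]
    typical = median(all_durations)
    return sorted(
        day
        for day, runs in durations_by_day.items()
        if max(runs) > factor * typical
    )


# Hypothetical durations in minutes, loosely modeled on the chart above.
sample = {
    "08-04": [2.4, 2.5, 2.6],
    "08-05": [2.5, 2.6, 100.0],
    "08-06": [2.4, 2.5, 100.0],
    "08-07": [2.5, 2.5, 2.6],
}
flagged = flag_unusual_days(sample)
print(flagged)
```

The median is used as the baseline because a handful of very long runs barely moves it, whereas they would pull a mean upward. The factor is arbitrary and would need tuning per pipeline.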
In this next chart of Pipeline 2, the shape of the violin changed on August 26 and 27. At a glance, we can see that there are more long runs, some of them significantly longer than usual.
On August 26, the maximum is about 85 minutes, and on August 27, it is 50 minutes.
In this final chart, the shape of this sporadic job changed on August 23. Usually, this job takes less than 30 seconds. On August 23, one run was below 30 seconds, but the others were over 2 minutes.
The violin chart is helpful for more than just seeing which jobs took longer than usual. Because the chart groups runs by day and draws a histogram for each group, it is easy to see trends and anomalies and to pinpoint areas for further investigation, which can improve your Data Operations.