The term DataOps is currently gaining a lot of traction, and the solutions emerging around it have matured significantly. Let's discuss what DataOps is all about.
I can start by citing the first sentence of the Wikipedia page, which reads: "DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics." Beyond this condensed summary, this means that DataOps makes it possible to quickly meet the data analytics needs of the business, with reliable figures. Quickly, because all the tools are industrialized and facilitate collaborative work, but also because we are confident in the quality of the figures reported.
So, what are the necessary profiles and their associated roles? Here is the list:
The idea in the end is to have two pipelines: a continuous data ingestion pipeline, and a pipeline for new developments, which meet during data production. Ideally, therefore, a unified platform is needed to handle all this and centralize people around the same tool. Tools such as DataKitchen or Saagie exist to monitor the data production chain. This chain, where the typical steps of data access, transformation, modeling, and visualization and reporting are performed, must be traceable from start to finish, and must also allow for a unified view of the non-regression tests. The tests to implement are the typical tests we are used to having, to which we will add "Statistical Process Control" (SPC) tests. These tests consist of checking that the reported metrics remain within their normal range. If you measure stock consumption in a factory, you do not normally expect it to increase by 50% in one month. SPC is a broad subject that would deserve a book-length treatment of its own; for now, I'll simply redirect you to its Wikipedia page.
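To make the SPC idea concrete, here is a minimal sketch of such a check, assuming a simple control-limit rule (mean plus or minus three standard deviations over recent history). The function name `check_metric` and the sample data are purely illustrative, not from any DataOps tool.

```python
# Hypothetical SPC-style non-regression check: flag a metric that falls
# outside control limits derived from its recent history.
from statistics import mean, stdev

def check_metric(history, new_value, n_sigmas=3.0):
    """Return True if new_value lies within mean +/- n_sigmas * stdev of history."""
    mu = mean(history)
    sigma = stdev(history)
    lower, upper = mu - n_sigmas * sigma, mu + n_sigmas * sigma
    return lower <= new_value <= upper

# Monthly stock consumption in a factory: a sudden +50% jump is suspect.
monthly_consumption = [100, 104, 98, 101, 99, 103, 97, 102]
print(check_metric(monthly_consumption, 101))  # within the control limits
print(check_metric(monthly_consumption, 150))  # out of control: ~+50% jump
```

In a real pipeline, such a check would run automatically after each data production and raise an alert rather than print a boolean; the three-sigma rule is only the most basic of the SPC rules you could apply.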
In terms of capabilities, everyone also needs a personal sandbox, with the caveat that the sandbox must contain a fresh local dataset. And, of course, all of this should be under version management! This allows you to properly manage the whole big data ecosystem you will orchestrate, from the ingestion of the data to its final restitution to business people.
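As an illustration of the sandbox idea, here is a hedged sketch of refreshing a personal sandbox with a fresh, reduced copy of a production dataset. The paths, the function name `refresh_sandbox`, and the sampling ratio are all assumptions for the example, not part of any specific platform.

```python
# Hypothetical sandbox refresh: copy a random sample of rows from a
# production CSV extract into a developer's local sandbox dataset.
import csv
import random

def refresh_sandbox(source_path, sandbox_path, sample_ratio=0.1, seed=42):
    """Write a reproducible random sample of source rows to the sandbox file."""
    random.seed(seed)  # fixed seed: the sample is reproducible, which plays
                       # well with version management of the sandbox dataset
    with open(source_path, newline="") as src, \
         open(sandbox_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # keep the header row
        for row in reader:
            if random.random() < sample_ratio:
                writer.writerow(row)
```

The key design point is freshness and reproducibility: the sandbox is rebuilt from current data on demand, rather than drifting away from production over time.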
All this in order to set up a DataOps process, where the steps are as follows:
I hope this article was useful as an introduction to DataOps, and will help you in your DataOps adoption!