“People are getting stuck with data saying ‘I have my infrastructure layer automated and self-serviced so a developer can push a button and an environment can be configured automatically. I have made my entire CI/CD pipeline, my entire software delivery life cycle automated. I can promote code. I can test code. I can automate testing. But the last layer is data. I need data everywhere,’” said Sanjeev Sharma, vice president and global practice director for data modernization and strategy at Delphix.
As a result, development teams are starting to turn to DataOps to help speed up that data layer. SD Times recently caught up with Sharma, who spoke about what DataOps means, how to be successful, and what’s next for data.
Sharma: If you look at the history of the word DataOps, it started off mainly from the data science people — people wanting to do artificial intelligence and machine learning who were struggling to manage their data.
I was talking to a client of ours who was saying most data scientists don’t come from a computer science background, so their method of versioning data is “save as” and put a number at the end of the file name. It is that primitive. Of course he was exaggerating, but what he was saying is that there is no way to manage data.
Our perspective of DataOps is very simple. In your enterprise, you have data owners: the people who create the data, either because they own an application that customers use and that generates data, or because the data comes from logs [such as] telemetry from a mobile application or log data from something running in production. Then there are data managers. These are the database administrators and security people whose job is to manage the data, store it and secure it. Then there are the data consumers. These are your data scientists, your AI and ML experts, your developers and your testers who need the data to be able to do their job. How do you make these three sets of stakeholders work together and collaborate in a lean and efficient manner? That is DataOps.
It involves process improvement, and it involves technology.
DataKitchen, the team that wrote the manifesto, is a data science company, so they have a data science-centric view of data, but the manifesto is a great thing. It puts some of these things I am talking about out in the open to say it is not just technology. It is not just building a data pipeline. If you don’t change the organizational ownerships and share responsibility between the data consumers, data owners and data managers, you are not going to succeed. That explains it very well. I think it is a great opening move. I wouldn’t say it is the final word though.
DataOps has two perspectives. If you are looking at it through a data science lens, you are asking whether your data science activities have reached a stage where the biggest source of friction is the inability to get the right data at the right time to the right people.
From a DevOps lens, you are asking yourself if you have reached a stage where you are struggling with getting the right data to the right people at the right time… and you might not experience that unless you are Agile. If you still have a six-month waterfall life cycle, six months is enough time to make a copy of a database. But if you are doing daily builds, true CI/CD and doing daily deployments to test environments — you need that data to be refreshed daily, sometimes multiple times a day. You need developers to be able to create local data sets for themselves, and be able to branch data to do A/B testing. You are more likely to hit that friction point when you have already done some level of automation around environments and code. Data won’t be what you address first.
Data managers are hired and paid to manage data, store it, make it available to the people who need it, and secure it to make sure they don’t get hacked. They are there to manage data in a lean and efficient manner. Making copies of data for data consumers is not their job. It is something a developer opens a ticket and tells a DBA to do. That ticket is the last one on the list because the database admin has other tickets that say this database needs to be fine-tuned because it is not performing properly; this index needs to be rebuilt; I need to add a new database for this new production environment; or I am running out of storage. All those will have a higher priority over a developer asking for a copy.
Why not automate that and provide self-service to the data consumer? It makes the data managers’ job more efficient because they can focus on high-priority tasks like managing data schemas or tuning the database, rather than making low-level copies.
From a data owner perspective, if the data is not being used, what use is it? It is just being stored. It is just sitting there. They have data for 20 years, but the data consumer only has access to the last three years. A business owner is looking at what information, what insights and inferences they are not able to access because of a policy that says I can’t give that to anyone. They want data they can use as an asset: something that can be mined, used to draw inferences to better understand their customers, make better predictions, and make better investment decisions. Getting business value out of data is what DataOps brings to them.
DataOps by itself has no value; it has to be in the context of something else. Either you are doing a data science initiative and you need the data to be Agile for that initiative, or you are doing DevOps and you need data to be Agile for DevOps. A DataOps initiative has to be attached to a DevOps or data science initiative because it serves the purpose of making data lean, Agile and available to the right people.
That train needs to be moving and DataOps is just making the track straighter and faster.
One of the tenets of DevOps is to make production-like environments available, which means the data should be production data. It shouldn’t be synthetic data. Synthetic data doesn’t have the noise and the texture of production data. You will need synthetic data if you are building a new feature where the data doesn’t exist in production yet, but everywhere else you want to put production data in your lower environments — and that raises security and compliance concerns.
We at Delphix do masking of the data, and we do it at two layers. First, we mask the data: we replace all the sensitive information with dummy information while maintaining the relational integrity.
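The idea of masking while preserving relational integrity can be illustrated with deterministic pseudonymization: the same real value always maps to the same dummy value, so foreign-key joins across tables still line up. This is a minimal sketch, not Delphix’s actual implementation; the `SECRET_SALT` and table shapes are assumptions for illustration.

```python
import hashlib

# Assumption: a secret salt kept outside the dataset, so masked values
# cannot be trivially reversed by re-hashing known inputs.
SECRET_SALT = "rotate-me"

def mask(value: str) -> str:
    """Replace a sensitive value with a stable pseudonym.

    Deterministic: the same input always yields the same output,
    which is what keeps joins between masked tables intact.
    """
    digest = hashlib.sha256((SECRET_SALT + value).encode()).hexdigest()
    return "user_" + digest[:8]

# Two toy tables that share "email" as a join key.
customers = [{"email": "ann@example.com", "plan": "pro"}]
orders = [{"email": "ann@example.com", "total": 42}]

masked_customers = [{**row, "email": mask(row["email"])} for row in customers]
masked_orders = [{**row, "email": mask(row["email"])} for row in orders]

# Relational integrity holds: the masked keys still match across tables.
assert masked_customers[0]["email"] == masked_orders[0]["email"]
```

A real masking tool would also format-preserve values (masked emails still look like emails, masked card numbers still pass checksums), but the deterministic mapping above is the core property that keeps lower environments usable for testing.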
The second thing we do is put in identity and access management controls. For instance, we can put in policies that say if the data is not masked and classified at a certain level, you cannot provision it to an overseas environment.
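A policy like the one described can be sketched as a simple provisioning guard. The names, fields, and classification scale below are hypothetical, not Delphix’s API; the sketch only encodes the stated rule that unmasked or highly classified data may not be provisioned to an overseas environment.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    masked: bool
    classification: int  # assumed scale: 0 = public ... 3 = restricted

def may_provision(ds: Dataset, region: str, home_region: str = "us") -> bool:
    """Return True if the dataset may be provisioned to the given region.

    Encodes the rule from the text: data that is not masked, or that is
    classified above a threshold, cannot leave the home region.
    """
    overseas = region != home_region
    if overseas and (not ds.masked or ds.classification >= 2):
        return False
    return True

# Masked, low-classification data may go overseas; unmasked data may not.
print(may_provision(Dataset(masked=True, classification=1), "eu"))   # True
print(may_provision(Dataset(masked=False, classification=1), "eu"))  # False
```

In practice such checks sit in the provisioning pipeline itself, so a developer’s self-service request is evaluated against policy automatically rather than by a ticket queue.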
It is where DevOps was maybe 8 years ago, when we were spending time explaining to people what DevOps was. Today, we don’t do that. We don’t need to explain to anyone what DevOps is. It is very well established; even though there are multiple definitions floating around, they are all at least on the same playing field.
With DataOps, I think we are still at that “what is DataOps and does it apply to me” stage. I would say there are still a couple of years before you have a DataOps Day or a conference dedicated to it.
Most of the world’s data is still living on a mainframe, so that spectrum needs to be addressed. Our goal is to say no matter what kind of data it is or where it is, we will allow you to manage it like code.