The Ten Standard Tools To Develop Data Pipelines In Microsoft Azure.
While working in Azure with our customers, we have noticed several standard Azure tools people use to develop data pipelines and ETL or ELT processes. We counted ten ‘standard’ ways to transform and set up batch data pipelines in Microsoft Azure. Is it overkill? Don’t they all do the same thing? Is this the paradox of choice? Or the right tool for the right job? Let’s go through the ten Azure data pipeline tools:
Azure Data Factory: This cloud-based data integration service allows you to create data-driven workflows for orchestrating and automating data movement and transformation.
Azure Databricks Workflows: Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure. You can use it for big data analytics and machine learning workloads. Workflows is a DAG runner embedded in Databricks, and Databricks Notebooks are often used in conjunction with it.
Azure Databricks Delta Live Tables: These provide a more straightforward way to build and manage data pipelines that deliver fresh, high-quality data in Delta Lake.
SQL Server Integration Services (SSIS): You know it; your father used it. It does the job.
Azure Synapse Analytics Pipelines: Azure Synapse Analytics (formerly SQL Data Warehouse) provides data exploration, data preparation, data management, and enterprise data warehousing capabilities. It has a data pipeline tool as well. We only just learned of this one, too.
Azure Logic Apps: This service helps you schedule, automate, and orchestrate tasks, business processes, and workflows when integrating apps, data, systems, and services across enterprises or organizations.
Azure Functions: You can write small pieces of code (functions) that will do the transformations for you.
Oozie on HDInsight: Azure HDInsight is a fully managed cloud service that makes processing massive amounts of data easy, fast, and cost-effective. It can use open-source frameworks like Hadoop, Apache Spark, etc. Oozie is an open-source DAG runner.
Power BI dataflows: Power BI dataflows are a self-service data preparation tool. They allow users to create data preparation logic and store the output data in tables known as entities.
Azure Data Factory Managed Airflow: Apache Airflow, the super popular DAG runner, managed by Azure.
Extra Bonus: choose one of the 100+ ETL and data pipeline tool options in the Azure Marketplace: dbt Cloud, Matillion, Informatica, Talend, yadda-yadda.
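Several of the tools above (Databricks Workflows, Oozie, Managed Airflow) are described as DAG runners. As a rough illustration of what that means, here is a minimal sketch in plain Python: tasks are declared with their dependencies, and each task runs only after everything it depends on has finished. The `run_dag` helper and the three task names are hypothetical, and real runners add scheduling, retries, and distributed execution on top of this ordering idea.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run each task once, after all tasks it depends on have run.

    tasks: maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of task names it depends on.
    Returns the task names in the order they were executed.
    """
    order = []
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


# Hypothetical three-step pipeline: extract -> transform -> load.
log = []
tasks = {
    "extract": lambda: log.append("extracted"),
    "transform": lambda: log.append("transformed"),
    "load": lambda: log.append("loaded"),
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
executed = run_dag(tasks, dependencies)
```

Whatever the product, the core contract is the same: you describe the dependency graph, and the runner decides the execution order.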
Indeed, data pipeline designs often combine multiple tools, each with its own strengths. Here are a few examples we have seen of how this can be done:
Batch ETL with Azure Data Factory and Azure Databricks: In this pattern, Azure Data Factory is used to orchestrate and schedule batch ETL processes. Azure Blob Storage serves as the data lake to store raw data. Azure Databricks, a big data analytics platform built on Apache Spark, performs the actual data transformations. The cleaned and transformed data can then be stored in Azure Blob Storage or moved to Azure Synapse Analytics for further analysis and reporting.
Data warehousing with Azure Synapse Analytics and Power BI: Data from various sources can be ingested into Azure Synapse Analytics, where it can be transformed and modeled to suit the needs of your analysis. The transformed data is then served to Power BI, where it can be further transformed, visualized, and explored.
Machine learning workflows with Azure Machine Learning and Azure Databricks: Azure Databricks can be used to preprocess and clean data, then the transformed data can be stored in Azure Blob Storage. Azure Machine Learning can then use this data to train, test, and deploy machine learning models. The models can be served as a web service through Azure Functions or Azure Kubernetes Service.
Serverless data processing with Azure Logic Apps and Azure Functions: Azure Logic Apps can orchestrate complex workflows involving many different services. As part of these workflows, Azure Functions can be used to perform small pieces of data transformation logic. This pattern is suitable for scenarios where the volume of data is not too large and transformations can be performed statelessly.
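To make "transformations can be performed statelessly" in the serverless pattern concrete, here is a minimal sketch of the kind of logic that fits in a small function. It is plain Python rather than an actual Azure Functions handler: the function name `transform_record` and the record fields are assumptions made for illustration, and in practice a trigger or binding would call logic like this.

```python
import json


def transform_record(raw: str) -> str:
    """Stateless transformation: the output depends only on the input message.

    Hypothetical record shape: a JSON object with "customer", "amount",
    and an optional "currency" field.
    """
    record = json.loads(raw)
    cleaned = {
        # Normalize whitespace and capitalization of the customer name.
        "customer": record["customer"].strip().title(),
        # Store money as integer cents to avoid downstream float issues.
        "amount_cents": round(float(record["amount"]) * 100),
        # Default the currency when the source omits it (an assumption).
        "currency": record.get("currency", "USD").upper(),
    }
    return json.dumps(cleaned, sort_keys=True)


example = transform_record('{"customer": "  ada lovelace ", "amount": "12.50"}')
```

Because the function keeps no state between calls, any number of copies can run in parallel and a failed call can simply be retried, which is what makes this style a good match for small, event-driven workloads.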
There Will Never Be One “Best” Data Transformation Pipeline Tool.
Data transformation is a complex field featuring varying needs and requirements depending on data volume, velocity, variety, and veracity. This complexity leads to diverse tools, each catering to a different need. There are several reasons why there will never be one single “best” data transformation pipeline tool:
- Different Use Cases: Different tools are optimized for different use cases. Some tools are excellent for batch processing (e.g., Azure Data Factory), some are built for real-time streaming (e.g., Azure Stream Analytics), and others might be more suited for machine learning workflows (e.g., Azure Machine Learning). Depending on your use case, one tool (or a mix of tools) might be more suitable.
- Diverse Data Sources: In the modern world, data comes from various sources, including traditional databases, IoT devices, cloud services, APIs, and more. Each source might require a different tool to extract and manage data effectively.
- Processing Needs: The computational requirements can vastly differ based on the nature of the transformation. A lightweight, serverless function might adequately handle simple transformations, while complex, data-intensive tasks might need a robust big data solution like Azure Databricks.
- Scalability: Not all tools are designed to scale in the same way. Some might be excellent at handling small to medium data volumes but struggle with petabyte-scale data.
- Expertise and Learning Curve: Different teams have different skill sets. A tool that is easy to use for a team experienced in Python might not be the best choice for a team specializing in SQL.
- Cost: Different tools have different pricing structures. One tool might be preferred over another, depending on the budget and cost-effectiveness.
The “toolbox approach” in Microsoft Azure and other similar platforms is beneficial. Azure provides a “shopping” experience, where based on the project’s needs, you can pick and choose the tools that best fit your requirements. This flexible and modular approach allows you to build a data pipeline tailored to your needs. In other words, it’s not about finding the best tool but the right tool (or tools) for the job.
Data Pipeline Tools And The Paradox of Choice
Psychologist Barry Schwartz studied the effects of having more choices and noticed that more choices don’t always help us choose better but can make us feel worse about what we chose, even if it was great. Reduced happiness arises from escalating expectations, opportunity cost, regret, and self-blame. When we continuously chase the cool new tool, are we focused on delivering value to our customers?
Schwartz may have been right, but simplicity may not be possible in your organization. Maybe you have a history of legacy data pipelines, your organization structure allows for freedom of tool choice, or your engineering philosophy is the ‘best tool for the job.’ More and more companies are using multiple tools with functional overlap. How can we operate successfully in this multi-tool world?
How To Deal With The Risk Of Multiple Tools
When considering how organizations handle serious risk, you could look to NASA. The space agency created and still uses “mission control” with many screens sharing detailed data about all aspects of a space flight. That shared information is the basis for monitoring mission status, making decisions and changes, and communicating with everyone involved. It is the context for people to understand what’s happening now and to review later for improvements and root cause analysis.
Any data operation, regardless of size, complexity, or degree of risk, can benefit from the same level of visibility as that enabled by DataKitchen DataOps Observability. Its goal is to provide visibility across every journey data takes from source to customer value across every tool, environment, data store, analytic team, and customer so that problems are detected, localized, and raised immediately. DataOps Observability does this by monitoring and testing every step of every data and analytic pipeline in an organization, in development and production, so that teams can deliver insight to their customers with no errors and a high rate of innovation.
We call that multi-tool data assembly line a ‘Data Journey.’ And just like data in your database, the data across the Data Journey, and the technologies that make up the Data Journey, all have to be observed, tested, and validated to ensure success. Successful Data Journeys track and monitor all levels of the data stack, from data to tools to servers to code across all critical dimensions. A Data Journey supplies real-time statuses and alerts on start times, processing durations, test results, technology and tool state, and infrastructure conditions, among other metrics. With this information, data teams will know whether everything is running on time and without errors, and can immediately detect the parts that are not.
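As a rough illustration of the kind of check a Data Journey implies, the sketch below compares one observed pipeline step against expectations for start time, duration, and test results. This is not DataKitchen's API: the `StepRun` and `StepExpectation` classes, the `check_step` function, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StepRun:
    """What was observed for one pipeline step (hypothetical shape)."""
    name: str
    started_at: datetime
    duration: timedelta
    tests_passed: int
    tests_failed: int


@dataclass
class StepExpectation:
    """What we expect of that step (hypothetical shape)."""
    latest_start: datetime
    max_duration: timedelta


def check_step(run: StepRun, expected: StepExpectation) -> list[str]:
    """Return alert messages; an empty list means the step looks healthy."""
    alerts = []
    if run.started_at > expected.latest_start:
        alerts.append(f"{run.name}: started late")
    if run.duration > expected.max_duration:
        alerts.append(f"{run.name}: ran too long")
    if run.tests_failed > 0:
        alerts.append(f"{run.name}: {run.tests_failed} test(s) failed")
    return alerts


# Example values are made up for illustration.
run = StepRun(
    name="load_warehouse",
    started_at=datetime(2024, 1, 1, 2, 15),
    duration=timedelta(minutes=50),
    tests_passed=11,
    tests_failed=1,
)
expected = StepExpectation(
    latest_start=datetime(2024, 1, 1, 2, 0),
    max_duration=timedelta(minutes=45),
)
alerts = check_step(run, expected)
```

The same check works regardless of which tool ran the step, which is the point of observing the journey rather than any single tool.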
So go ahead. Pick any mix of tools from Azure that you need to deliver to your customers. DataKitchen DataOps Observability has your back.
Learn more about how to deal with multiple tools successfully:
- Data Journey Manifesto https://datajourneymanifesto.org/
- Five Pillars of Data Journeys https://datakitchen.io/introducing-the-five-pillars-of-data-journeys/
- Data Journey First DataOps https://datakitchen.io/data-journey-first-dataops/
- DataKitchen DataOps Observability
- DataKitchen DataOps TestGen