Pipeline As a Service (PaaS): our journey from centralised to self-service data workflows
by Luis Coelho
In the Data Infrastructure team at BurdaForward, our primary aim is to empower our stakeholders, or data domain teams, to make data-informed decisions. This entails providing robust infrastructure (including Airflow, Snowflake, and data quality tools), fostering autonomy and collaboration, and reducing barriers to the development of data products, while maintaining standards, governance and guard rails. The ultimate goal is to establish a culture of ownership and innovation throughout the organisation.
In data analytics, reliance on a monolithic data management structure, typically run by a single data engineering team, frequently overburdens that team, resulting in slowdowns and low data quality.
To circumvent these challenges, our current approach integrates principles from the data mesh framework: we strive for decentralised data ownership alongside centralised governance. This strategy not only disperses decision-making authority but also facilitates unified oversight, thereby alleviating bottlenecks and creating a richer data ecosystem (more on this later).
In this article, I will share our journey transitioning from a team that creates data pipelines for stakeholders to establishing a self-service framework, which aligns with our vision of building a data mesh.
Understanding Data Pipelines: The Core of Data Platforms
First, let’s make sure some basic concepts are clear for all readers. What are data pipelines, DAGs and workflow management systems?
A data pipeline consists of interconnected processing elements where the output of one feeds into the next. These pipelines efficiently collect, process, and deliver data to their destination, often a data warehouse, data lake or analytics platform. They are integral for data-driven decision-making and insights.
In a typical data pipeline, data is fetched from a source (e.g., other tables, APIs) and then processed, which may involve tasks like duplicate removal, anonymisation, and additional modeling. Once processed, the data is stored in its destination, such as an object storage bucket or a database like Snowflake. Stakeholders can then access the data via an analytics tool (Superset in our case) to create visualisations.
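The fetch, process, store stages above can be sketched in a few lines of Python. This is a toy illustration, not our production code: the hard-coded records stand in for a real source system, duplicate removal and anonymisation stand in for the processing step, and a local file stands in for a warehouse table.

```python
import hashlib
import json
from pathlib import Path

def fetch():
    """Pretend to pull raw records from a source system (e.g. an API)."""
    return [
        {"user": "alice@example.com", "page": "/home"},
        {"user": "alice@example.com", "page": "/home"},  # duplicate
        {"user": "bob@example.com", "page": "/news"},
    ]

def process(records):
    """Drop exact duplicates and anonymise the user field by hashing it."""
    seen, out = set(), []
    for r in records:
        key = json.dumps(r, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "user": hashlib.sha256(r["user"].encode()).hexdigest()[:12],
            "page": r["page"],
        })
    return out

def store(records, path):
    """Write processed records to the destination (a file standing in for a warehouse)."""
    Path(path).write_text(json.dumps(records))

clean = process(fetch())
```

Each stage only consumes the previous stage's output, which is exactly the "interconnected processing elements" property that makes pipelines composable.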
DAGs
DAGs, or Directed Acyclic Graphs, serve as a graphical representation of data pipelines, illustrating the sequence of data processing tasks and dependencies. The term "acyclic" underscores the unidirectional flow of operations within the graph, meaning that no loops or cycles exist.
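The acyclicity property is what lets a scheduler derive a valid execution order at all. A minimal sketch using Python's standard-library `graphlib` (the task names are illustrative):

```python
from graphlib import TopologicalSorter, CycleError

# Tiny illustrative DAG: each key maps a task to the tasks it depends on.
pipeline = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load", "transform"],
}

# A valid execution order exists precisely because there are no cycles.
order = list(TopologicalSorter(pipeline).static_order())

# Making "extract" depend on "load" introduces a cycle, so the graph
# is no longer a DAG and no execution order can be derived.
cyclic = dict(pipeline, extract=["load"])
has_cycle = False
try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError:
    has_cycle = True
```

Workflow engines like Airflow perform essentially this kind of topological ordering, plus scheduling, retries, and state tracking on top.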
Workflow Management
Pipelines, or DAGs, run at specific time intervals, and once you are handling hundreds of them the resulting complexity makes a workflow management platform essential. At BurdaForward, our tool of choice is Apache Airflow: we maintain a Managed Workflows for Apache Airflow (MWAA) environment which is shared among all stakeholders.
The birth of Pipeline as a Service (PaaS)
BurdaForward is one of the largest digital media companies in Germany, managing a portfolio of 16 brands with very high reach and user interaction. We process terabytes of data daily. Our data pipelines enable stakeholders to harness the potential of data at scale to derive meaningful insights. Ensuring that data pipelines are adopted by teams as a vital tool is essential to fulfilling our team’s mission.
Initially, we created, owned and maintained all DAGs ourselves, based on requirements from stakeholders. This centralised approach limited scalability and hampered the agility required to meet evolving demands. Furthermore, assuming ownership of the DAGs, despite not being the most familiar with the data, made it challenging to address issues in the pipelines and implement changes efficiently.
It became clear quite quickly that we needed to transition to a model that would empower us to delegate ownership to the tenants, enabling them to assume full responsibility for their pipelines. This led to the birth of our Pipeline as a Service (PaaS). For the tenants, this shift would represent a game-changer—enabling quicker deployment and eliminating us as bottlenecks.
PaaS: Early Implementation
The first iteration of the service was quite basic. We provided stakeholders with template-generated repositories, which contained code enabling them to set up a local development container with Airflow for testing their DAGs. The continuous delivery (CD) pipeline definition in these repositories then deployed to our AWS Managed Workflows for Apache Airflow (MWAA) source bucket.
Stakeholders had complete freedom to create DAGs using Python as they saw fit. We offered a set of custom operators they could use. However, this process still introduced bottlenecks and slowdowns as each DAG creation required review by a member of our team. Compounding the issue, BurdaForward has numerous teams with varying technical proficiencies, necessitating extensive handholding and support.
It became a drain on our team: many pull requests to review each week, plus the iterative exchanges they created. Stakeholders grew increasingly frustrated with the prolonged deployment times for their DAGs. Moreover, our team had to review code related to data or use cases that we were not familiar with.
On top of that, we did not have clear constraints to manage resource utilisation, such as setting limits on concurrent task execution, which began to present challenges. This resulted in resource shortages during pipeline execution, leading to errors that further frustrated stakeholders. This also resulted in higher operational costs, as optimal resource utilisation wasn’t being properly enforced.
Additionally, each time we onboarded a new stakeholder, we duplicated the template repository. Consequently, even a minor change to the continuous delivery (CD) deployment logic caused significant disruptions, requiring us to adapt every single repository.
The decentralised and inconsistent management of resources left both the Data Platform team and stakeholders dissatisfied, as they struggled with increased errors, slower deployments, and escalating frustrations. In short: Data Infrastructure team unhappy, stakeholders unhappy.
Overcoming Early Operational Hurdles
Our approach at this point was not fully aligned with our guiding principles. We found ourselves personally reviewing code and providing extensive guidance, which meant stakeholders weren't truly becoming independent. Instead of minimising obstacles, we often ended up shouldering the workload to compensate. Stakeholders were struggling, pipelines were breaking, reviews were taking too long, and work was effectively blocked. The bottleneck was still present, just in a different place.
The balance between the central platform providing features and the complexity of decentralising it for tenants was off. We needed to decouple the centralised part, which provided features (operators) and guardrails (schema validation, governance), from the autonomous part, which allowed stakeholders to operate at their own pace without a middleman.
We implemented the following changes:
Centralised Deployment Logic
Tenants still define DAGs in their repositories, but we centralised the deployment pipeline definition. PaaS repositories now reference this centralised pipeline, allowing us to implement changes efficiently by updating a single location rather than modifying each individual repository. This approach streamlined the deployment process, reduced disruptions, and ensured consistent deployment logic across all repositories.
DAG Factory
We introduced the DAG Factory, an abstraction layer which enables tenants to define their DAGs using JSON. Tenants no longer need to know Python or work with Docker locally; instead, they can focus on defining their workflows in straightforward JSON files.
Stakeholders could now deploy their DAGs independently, without any intervention from our team. If a DAG encountered an issue, the relevant stakeholders would receive alerts in their chosen communication channels, enabling them to promptly address and fix any problems.
We provide comprehensive documentation on the operators available through the framework, ensuring that users have all the necessary resources to create their DAGs effectively.
The DAG Factory module converts JSON files into Python Airflow DAG definitions, ensuring standardised configurations to prevent resource overuse in our shared environment. It incorporates built-in validation mechanisms that eliminate the need for manual reviews and offer gatekeeping controls for managing limits and features. Under the hood, we use Pydantic to define validation classes and Jinja templates to render Python files.
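The validate-then-render idea can be sketched in dependency-free Python. Our real implementation uses Pydantic models and Jinja templates; here plain checks and `string.Template` stand in, and all field names (`dag_id`, `max_active_tasks`), operator names, and limits are invented for illustration.

```python
import json
from string import Template

# A tenant-supplied JSON DAG definition (illustrative).
TENANT_JSON = """
{
  "dag_id": "marketing_daily",
  "schedule": "0 6 * * *",
  "max_active_tasks": 4,
  "tasks": [{"name": "load_events", "operator": "SnowflakeQueryStep"}]
}
"""

ALLOWED_OPERATORS = {"SnowflakeQueryStep", "DbtRunStep"}  # platform-approved operators
MAX_ACTIVE_TASKS = 8  # gatekeeping limit for the shared environment

DAG_TEMPLATE = Template(
    'with DAG("$dag_id", schedule="$schedule", '
    "max_active_tasks=$max_active_tasks):\n$tasks"
)

def validate(cfg: dict) -> dict:
    """Reject configs that would overuse shared resources or use unknown operators."""
    if cfg["max_active_tasks"] > MAX_ACTIVE_TASKS:
        raise ValueError("max_active_tasks exceeds platform limit")
    for task in cfg["tasks"]:
        if task["operator"] not in ALLOWED_OPERATORS:
            raise ValueError(f"unknown operator: {task['operator']}")
    return cfg

def render(cfg: dict) -> str:
    """Render a validated config into Python DAG source code."""
    tasks = "\n".join(
        f'    {t["name"]} = {t["operator"]}(task_id="{t["name"]}")'
        for t in cfg["tasks"]
    )
    return DAG_TEMPLATE.substitute(cfg | {"tasks": tasks})

dag_source = render(validate(json.loads(TENANT_JSON)))
```

Because the validation step runs in CI before rendering, a misconfigured DAG is rejected with an error message instead of requiring a human review.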
External Task Execution: Airflow as an Orchestrator
Airflow uses slots to manage the parallel execution of tasks. The number of slots can be configured, but to put it simply: with MWAA, more slots mean more resources being utilised, especially workers, as MWAA auto-scales. There are tasks, like running a DBT job on Snowflake, that essentially involve sending a query to be processed remotely and waiting. Although no real computing is being done by the workers during the waiting period, a slot can be occupied if misconfigured, leading to cascading consequences for resource utilisation and the efficiency of our pipelines.
We offer stakeholders standardised operators to accomplish mainstream use cases within the organisation. Operators form the foundation of Airflow DAGs, encapsulating the logic for data processing within a pipeline. We ensure all PaaS operators are deferrable, which means they release their slot while waiting for the external task to complete. In order to do so, we use AWS Elastic Container Service (ECS) to execute the operators, which are defined as separate Python modules.
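The payoff of deferring can be illustrated with plain `asyncio`, outside Airflow entirely: many waiting "tasks" share one event loop instead of each occupying a worker slot. `submit_remote_job` and `poll_status` are invented stand-ins for real Snowflake or ECS calls, not actual APIs.

```python
import asyncio

async def submit_remote_job(query: str) -> str:
    """Pretend to hand a query to a remote engine; returns a job id."""
    await asyncio.sleep(0)
    return f"job-{abs(hash(query)) % 1000}"

async def poll_status(job_id: str) -> str:
    """Pretend to poll the remote engine until the job finishes."""
    for _ in range(3):
        await asyncio.sleep(0.01)  # no worker CPU is burned while waiting
    return "SUCCESS"

async def run_task(query: str) -> str:
    job_id = await submit_remote_job(query)
    return await poll_status(job_id)

async def main():
    # Several tasks wait concurrently on one event loop, analogous to
    # deferred Airflow tasks sharing the triggerer instead of each
    # holding a worker slot for the duration of the remote job.
    return await asyncio.gather(*(run_task(q) for q in ["q1", "q2", "q3"]))

statuses = asyncio.run(main())
```

Airflow's deferrable operators apply the same idea: while the remote system works, the task hands its wait over to the triggerer process and frees its slot, so MWAA does not auto-scale workers just to sit idle.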
Tasks, or pipeline steps, that have specific dependencies—referred to as Custom Steps (similar to PythonOperator in traditional Airflow)—can be defined by stakeholders. Their respective images on Amazon Elastic Container Registry (ECR) can be referenced in their JSON DAG definitions. We provide a template that also handles ECS permissions, simplifying the process for stakeholders.
The main idea is to use Airflow solely as an orchestrator, avoiding the execution of actual code on its workers. This approach offers several advantages, including optimised slot allocation and reduced costs. Managing Python dependencies or third-party libraries in Airflow can be challenging, often leading to dependency conflicts. By encapsulating each task as a separate script or image with its own set of dependencies, we can effectively mitigate these issues.
PaaS Today
Currently, Pipeline as a Service (PaaS) is essentially a comprehensive package comprising several key components:
- An Airflow environment and DAG repositories, which form the backbone of our pipeline framework.
- An abstraction layer known as the DAG Factory, facilitating the streamlined definition of pipelines.
- A collection of Airflow operators tailored to address primary use-cases, complemented by customisable operators to accommodate specific requirements.
- Ongoing support to ensure smooth operation and troubleshooting assistance when needed.
- DBT project: the bundle can be set up with a DBT project which connects seamlessly to our Airflow instance.
Together, these components enable our stakeholders to efficiently design, deploy, and manage data pipelines, allowing them to focus on deriving insights and driving business value without the burden of infrastructure management. With the ease of setup, we can create a new repository capable of deploying pipelines within minutes and a DAG can be deployed to production as soon as it is ready.
The goal is to simplify data usage and pipeline creation for teams without Airflow expertise, while avoiding deployment delays and configuration errors. Through abstraction, DAG creation is simplified to a basic JSON file, with validation models automatically approving tenants' DAGs to bypass code reviews.
Pain Points + Future plans
We have made significant progress, but there is still much to accomplish. Our priorities include:
- Extend support for built-in provider packages (e.g., AWS, Snowflake) and Airflow operators via PaaS.
- Better understand stakeholder requirements to address generalisable use cases with PaaS operators.
- Develop better documentation with comprehensive examples.
- Clearly differentiate between the testing environment (sandbox) and production.
- Improve error messages to guide stakeholders through validation errors.
- Create interactive documentation (e.g., Swagger) for instant validation and faster feedback.
- Enhance system interoperability and integrate LLM support for DAG creation to improve accessibility.
By focusing on these areas, we aim to provide a more robust and user-friendly experience for our stakeholders.