Building robust data pipelines often requires specialized engineering skills, creating barriers for domain experts with limited coding expertise. We introduce Prompt2DAG, a modular prompting methodology that transforms natural language descriptions into executable Apache Airflow workflows by decomposing generation into three sequential stages: structured analysis, configuration generation, and code implementation. This approach aligns with established software engineering principles of separation of concerns and progressive refinement. Our evaluation across five different LLMs demonstrates that Prompt2DAG significantly outperforms conventional end-to-end generation, improving code quality (+78.4%) and structural integrity (+43.2%) of generated pipelines. Using a data enrichment case study, we show how this approach enables the development of high-quality workflows through natural language, effectively democratizing data pipeline development.
Alidu, A., Ciavotta, M., De Paoli, F. (2025). Prompt2DAG: A Modular Prompting Approach for Democratizing Data Pipeline Generation. In 2025 IEEE International Conference on Software Services Engineering (SSE) (pp. 1-11). Institute of Electrical and Electronics Engineers Inc. [10.1109/SSE67621.2025.00010].
Prompt2DAG: A Modular Prompting Approach for Democratizing Data Pipeline Generation
Alidu A.; Ciavotta M.; De Paoli F.
2025


