Testing Deep Learning Systems: From Practice to Prioritization in Autonomous Driving Systems

Ali, Q

Deep learning (DL) has become a core enabling technology for autonomous driving systems (ADS), yet systematic testing and validation of such systems remain a major challenge due to their data-driven nature, vast input spaces, and strong dependence on execution context. In particular, existing testing practices, benchmark reuse, and regression testing strategies have not kept pace with the scale and complexity of modern DL-based ADS. This thesis investigates these challenges through a multi-layered empirical and methodological study, spanning software engineering practices, infrastructure interoperability, and behavior-aware test optimization.

First, the thesis presents a large-scale empirical study of testing practices in open-source Python deep learning projects. By analyzing 300 DL repositories, the study characterizes test adoption, automation, coverage practices, and test suite evolution over time. The results show that, although test adoption increases as projects mature, model-specific tests and non-functional testing remain underrepresented. Moreover, test suites grow substantially as projects evolve, intensifying the need for scalable and efficient regression testing. These findings expose systematic maturity gaps in DL testing and motivate the development of more principled testing methodologies, particularly for safety-critical domains such as autonomous driving.

Second, the thesis investigates the reusability of large-scale ADS benchmarks across heterogeneous simulation platforms. It introduces OpenCat, an open-source infrastructure that enables high-fidelity conversion of OpenDRIVE road representations into Catmull-Rom splines, allowing industry-grade benchmarks such as SensoDat to be executed in lightweight academic simulators. While OpenCat achieves near-perfect geometric fidelity across more than 32,000 scenarios, extensive cross-platform evaluation reveals substantial divergence in pass/fail outcomes when scenarios are executed using different Advanced Driver Assistance Systems (ADAS) models. These results expose a fundamental limitation of geometry-centric benchmarking: test outcomes are tightly coupled to the underlying system architecture and execution context, leading to model-specific brittleness that undermines benchmark interoperability and reproducibility.

Finally, the thesis proposes a behavior-aware test suite reduction and prioritization framework for ADS regression testing. The framework combines geometric properties of road segments with dynamic behavioral features extracted from execution traces to cluster, select, and prioritize test scenarios. An extensive evaluation across multiple driving environments and imitation-learning ADAS models demonstrates that the proposed approach reduces test execution cost by up to 89% while retaining the majority of failure-inducing scenarios and significantly improving early fault detection compared to random, geometric-only, and behavior-only baselines. Cross-model experiments further reveal partial transferability of behavior-informed prioritization, highlighting both generalizable driving difficulty patterns and architecture-dependent vulnerabilities.

Overall, this thesis advances the state of the art in DL-based ADS testing by (i) empirically characterizing testing maturity in real-world DL projects, (ii) exposing fundamental interoperability and model-dependence limitations of existing benchmarks, and (iii) introducing scalable, behavior-aware testing techniques that improve regression efficiency without sacrificing fault detection capability. The findings underscore the need to move beyond purely geometric and syntactic notions of test adequacy toward behavior-centric and architecture-aware testing methodologies for reliable and reproducible ADS validation.
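
As an illustration of the spline representation mentioned above, the following is a minimal sketch of uniform Catmull-Rom interpolation through 2D road centerline points. It is not OpenCat's implementation: the abstract does not describe how OpenDRIVE geometry is sampled or which Catmull-Rom parameterization is used, so the uniform variant, the endpoint duplication, and the names `catmull_rom_point` and `sample_spline` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def catmull_rom_point(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    t2 = t * t
    t3 = t2 * t

    def coord(a: float, b: float, c: float, d: float) -> float:
        # Standard uniform Catmull-Rom basis (tension 0.5).
        return 0.5 * (
            2.0 * b
            + (-a + c) * t
            + (2.0 * a - 5.0 * b + 4.0 * c - d) * t2
            + (-a + 3.0 * b - 3.0 * c + d) * t3
        )

    return (coord(p0[0], p1[0], p2[0], p3[0]), coord(p0[1], p1[1], p2[1], p3[1]))


def sample_spline(points: Sequence[Point], samples_per_segment: int = 10) -> List[Point]:
    """Sample a curve that passes through every control point.

    Assumption: the first and last control points are duplicated so that the
    curve starts and ends exactly on them.
    """
    if len(points) < 2:
        raise ValueError("need at least two control points")
    if samples_per_segment < 1:
        raise ValueError("samples_per_segment must be >= 1")

    padded = [points[0], *points, points[-1]]
    curve: List[Point] = []
    for i in range(len(points) - 1):
        p0, p1, p2, p3 = padded[i : i + 4]
        for s in range(samples_per_segment):
            curve.append(catmull_rom_point(p0, p1, p2, p3, s / samples_per_segment))
    curve.append((float(points[-1][0]), float(points[-1][1])))
    return curve


# Hypothetical centerline points, e.g. sampled from a road's reference line.
road = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0), (30.0, 5.0)]
centerline = sample_spline(road, samples_per_segment=4)
print(len(centerline), "sampled points")
```

Because a Catmull-Rom spline interpolates its control points, points sampled from the original road geometry are reproduced exactly. This property is one plausible reason such a representation can preserve geometric fidelity, although the fidelity figures reported above come from the thesis evaluation, not from this sketch.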

Ali, Q (2026). Testing Deep Learning Systems: From Practice to Prioritization in Autonomous Driving Systems. (Doctoral thesis, 2026).