The ever-increasing availability of data generated from sequencing experiments on biological samples brings the need for efficient, scalable and reproducible algorithmic strategies for their analysis and integration. This is particularly true in cancer research, in which large amounts of high-dimensional data at single-cell resolution can now be generated from patient biopsies and patient-derived models. In this work, I will present the achievements obtained in three main areas, namely: (A) the development of methods for the analysis of single-omics data types (DNA, RNA and ATAC), (B) the design of strategies for their integration, and (C) the implementation of a reproducible, scalable and flexible pipeline for the comprehensive analysis of single-cell data, which includes both the new methods developed in tasks (A) and (B) and additional state-of-the-art techniques. All these tasks were carried out within the Bioinformatics programme of the "Single-cell Cancer Evolution in the Clinic" (SCCEiC) CRUK/AIRC Accelerator project, which aims at integrating the efforts of wet- and dry-lab scientists to deliver a fine characterization of cancer evolution, with expected repercussions in clinical settings. However, the achievements of this work will have a more general impact, especially on the broad field of computational science and, in particular, on cancer data science and biomedical Artificial Intelligence. With regard to task (A), I will show the most extensive benchmarking to date of denoising and imputation methods for single-cell RNA-sequencing data, which might guide researchers both in the application of existing methods to real-world problems and in the design of new algorithmic strategies.
I will then present a new algorithmic framework applied to the inference of clonal trees from single-cell mutational profiles, which provides both the first method to characterize and visualise the solution space explored during MCMC search, and a new method to reconstruct a consensus optimal tree summarising the explored solutions. With regard to task (B), I will illustrate two methods for the diagonal integration of multimodal data, which aim at integrating DNA-RNA and DNA-RNA-ATAC, respectively. Both methods rely on a sound Bayesian framework, learn their parameters through stochastic variational inference, and have the goal of mapping multiple omics onto the latent space of genomic alterations. Not only did both methods prove robust on simulations, but their application to real-world datasets also showed their effectiveness in producing usable knowledge with translational relevance. Last, in task (C) the efforts were directed toward the definition of a comprehensive pipeline, with the general goal of enhancing the reproducibility and standardization of data analysis workflows. The resulting pipeline includes multiple building blocks tailored to the distinct omic data types and, in its current form, includes both state-of-the-art methods and techniques developed during tasks (A) and (B). All in all, starting from the specific questions of the SCCEiC project, this work produced theoretical frameworks and tools that proved effective in extracting knowledge from complex experimental settings and in generating data-driven experimental hypotheses, confirming the current necessity of multi-disciplinary efforts in real-world scenarios.
Patruno, L. (2023). Computational strategies for single-cell multi-omics data analysis and integration. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2023).
Computational strategies for single-cell multi-omics data analysis and integration
PATRUNO, LUCREZIA
2023
| File | Description | Attachment type | Access | Size | Format |
|---|---|---|---|---|---|
| phd_unimib_795291.pdf | Computational strategies for single-cell multi-omics data analysis and integration | Doctoral thesis | Open access | 17.72 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.