Data integration for clinical genomics

Calabria, A

Genetics and Molecular Biology are key to understanding the mechanisms of many human diseases that have strongly harmful effects. The empirical mission of Genetics is to translate these mechanisms into clinical benefits, thus bridging in-silico findings to the patient's bedside: approaching this goal means achieving what is commonly referred to as clinical genomics or personalized medicine. In this process, technologies are assuming an increasing role. With the introduction of new experimental platforms (microarrays, sequencing, etc.), today's analyses are much more detailed and can cover a wide spectrum of applications, from gene expression to Copy Number Variant detection. The advantages of technological improvements usually come with data management drawbacks: the explosion of data throughput creates a real need for new systems for data rationalization and management, data access, query and extraction. Our partner genetics laboratories encountered all of these issues: what they need is a tool that allows data integration and supports biological data analysis by exploiting computational infrastructures in distributed environments. From these needs, we defined two main goals: (1) Computer Science goal: to design and implement a framework that integrates and manages data and genetic analyses; (2) Genetics and Molecular Biology goal (application domains): to solve biological problems through the framework and to develop new methods. Given these requirements and the related specifications, we designed an extensible framework based on three interconnected layers: (1) Experimental data layer, which integrates data from high-throughput platforms (also called horizontal data integration); (2) Knowledge data layer, which integrates knowledge data (also called vertical integration); (3) Computational layer, which provides access to distributed environments for data analysis, in our case GRID and Cluster technologies.
On top of these three layers, individual biological problems can be supported and custom user interfaces implemented. Two relevant biological problems from our partner laboratories have been addressed: (1) Linkage Analysis: given a large pedigree in which subjects were genotyped with chips of 1 million SNPs, the linkage analysis problem presented real computational limits. We designed a heuristic method to overcome these computational restrictions and implemented it within our framework, exploiting GRID and Cluster environments. Using our approach, we obtained genetic results that were successfully validated by end users. We also tested the performance of the system and report comparative results. (2) SNP selection and ranking: given the problem of ranking SNPs based on a-priori information, we developed a novel method for biological data mining on gene annotations. The method has been implemented as a web tool, SNP Ranker, which is under extensive validation by our partner laboratories. The framework designed and implemented here demonstrated that this approach is consistent and can have a potential impact on the scientific community.

Calabria, A. (2011). Data integration for clinical genomics. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011).