Relational clustering for knowledge discovery in life sciences

Giordani, I

Clustering is one of the most common machine learning techniques and has been widely applied in genomics, proteomics and, more generally, in the life sciences. It is an unsupervised technique that, based on geometric concepts such as distance or similarity, partitions objects into groups so that objects with similar characteristics are clustered together and dissimilar objects fall into different clusters. In many domains where clustering is applied, background knowledge is available in different forms: labelled data (specifying the category to which an instance belongs); complementary information about the "true" similarity between pairs of objects or about the relationship structure present in the input data; and user preferences (for example, specifying whether two instances should be in the same or in different clusters). In many real-world applications, such as biological data processing, social network analysis and text mining, data do not exist in isolation: a rich structure of relationships exists among them. A simple example comes from the biological domain, where there are many relationships between genes and proteins based on many experimental conditions. Another, perhaps more familiar, example is the Web search domain, where there are relations between documents and the words in a text, or between web pages, search queries and web users. Our research focuses on how this background knowledge can be incorporated into traditional clustering algorithms to optimize the process of pattern discovery (clustering) among instances.
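As an illustration of how user preferences about pairs of instances can enter a distance-based clustering step, the following minimal Python sketch assigns each point to the nearest centroid that does not break a pairwise constraint. It is not the method developed in the thesis; the function names `violates` and `constrained_assign`, the greedy assignment order and the toy data are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if placing point i in `cluster` breaks a pairwise constraint.

    `labels` maps already-assigned point indices to cluster indices.
    """
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner already placed in another cluster is a violation.
        if other is not None and labels.get(other) not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner already placed in this cluster is a violation.
        if other is not None and labels.get(other) == cluster:
            return True
    return False


def constrained_assign(points, centroids, must_link=(), cannot_link=()):
    """Greedy assignment (hypothetical helper): nearest admissible centroid."""
    labels = {}
    for i, p in enumerate(points):
        # Try centroids from nearest to farthest.
        order = sorted(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for k in order:
            if not violates(i, k, labels, must_link, cannot_link):
                labels[i] = k
                break
        else:
            raise ValueError(f"no admissible cluster for point {i}")
    return [labels[i] for i in range(len(points))]


# Toy data, invented for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
unconstrained = constrained_assign(points, centroids)
separated = constrained_assign(points, centroids, cannot_link=[(0, 1)])
```

Without constraints, the two points near the origin share a cluster; with the cannot-link pair `(0, 1)`, the second point is pushed to the other cluster even though it is geometrically closer to the first centroid. A full algorithm would also re-estimate the centroids and iterate, which this sketch omits.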
In this thesis, we first provide an overview of traditional clustering methods and some important distance measures. We then analyze three challenges, which we try to overcome with different proposed methods: "feature selection", to reduce the high-dimensional input space and remove noise from the data; "mixed data types", to handle both numeric and categorical values in the clustering procedure, as is typical of life science applications; and, finally, "knowledge integration", to improve the semantic value of clustering by incorporating background knowledge. For the first challenge, we propose a novel approach that uses genetic programming, an evolutionary algorithm-based methodology, to perform feature selection automatically. For the second challenge, different clustering algorithms have been investigated; a modified version of one particular algorithm is proposed and applied to clinical data. Particular attention is given to the final challenge, the most important objective of this thesis: the development of a new relational clustering framework that improves the semantic value of clustering by taking into account, within the clustering algorithm, relationships learned from background knowledge. We investigate and classify existing clustering methods into two principal categories. Structure-driven approaches are bound to the data structure: the clustering problem is tackled along several dimensions, clustering the columns and rows of a given dataset concurrently, as in biclustering algorithms or vertical 3-D clustering. Knowledge-driven approaches use domain information to drive the clustering process and interpret its results; among them, semi-supervised clustering, which uses both labelled and unlabelled data, has attracted significant attention. This kind of clustering algorithm represents the first step towards implementing the proposed general framework, which itself falls into this category.
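To make the "mixed data types" challenge concrete, the sketch below computes a Gower-style dissimilarity between two records that contain both numeric and categorical values. The abstract does not say which algorithm the thesis modifies, so this is a generic, well-known measure chosen for illustration. The function name `gower_distance`, the `numeric_ranges` parameter and the example records are assumptions, not taken from the thesis.

```python
def gower_distance(x, y, numeric_ranges):
    """Gower-style dissimilarity in [0, 1] between two mixed-type records.

    `numeric_ranges` maps the index of each numeric feature to its range
    (max - min over the dataset); every other feature is treated as
    categorical and compared by simple matching.
    """
    if len(x) != len(y):
        raise ValueError("records must have the same number of features")
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if j in numeric_ranges:
            value_range = numeric_ranges[j]
            # Range-normalised absolute difference; a constant feature adds 0.
            total += abs(a - b) / value_range if value_range > 0 else 0.0
        else:
            # Categorical feature: 0 if the values match, 1 otherwise.
            total += 0.0 if a == b else 1.0
    return total / len(x)


# Invented records: (age, sex, treatment); only feature 0 is numeric.
patient_a = (50, "M", "drug_a")
patient_b = (70, "F", "drug_a")
d = gower_distance(patient_a, patient_b, numeric_ranges={0: 40})
```

Here the age difference contributes 20 / 40 = 0.5, the differing sex contributes 1 and the matching treatment contributes 0, so `d` is 0.5. A matrix of such dissimilarities could then be passed to any clustering method that accepts precomputed distances.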
In particular, the thesis focuses on the development of a general framework for relational clustering and instantiates it for three different life science applications. The first aims to find groups of genes with similar behaviour with respect to their expression and regulatory profiles. The second is a pharmacogenomics application, in which the relational clustering framework is applied to a benchmark dataset (NCI60) to identify a drug treatment for a given cell line based on both the drug activity pattern and the gene expression profile. Finally, the proposed framework is applied to clinical data: a dataset containing different kinds of information about patients on anticoagulant therapy has been analyzed to find groups of patients with similar behaviour and responses to the therapy.

Bicocca Open Archive