Nembrini, S. (2013). Machine Learning Methods for Feature Selection and Rule Extraction in Genome-wide Association Studies (GWASs). (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2013).

Machine Learning Methods for Feature Selection and Rule Extraction in Genome-wide Association Studies (GWASs).

NEMBRINI, STEFANO
2013

Abstract

Tree-based ensembles have recently gained popularity in genome-wide association studies (GWASs) because they are particularly well suited to discovering interactions and non-linear effects. In this context, it is important both to identify a small subset of single-nucleotide polymorphisms (SNPs) associated with the outcome of interest (feature selection phase) and to provide results that are interpretable (rule extraction phase). Using a dataset of 300K SNPs from a previous study, we propose a feature selection method that applies Random Forests in a two-stage approach to select SNPs relevant to the prediction of the estimated glomerular filtration rate (eGFR). The work focuses on the application of Random Forests to extremely large datasets, along with three different wrappers around the Random Forest algorithm. The results of this analysis are compared with the findings of the original GWA study and show some overlap. Moreover, additional SNPs have been identified as potentially associated with the outcome. Subsequently, to overcome the limitations of black-box models, we carry out a rule extraction phase so as to obtain a clear model for interpretation purposes.
ZUCCOLOTTO, PAOLA
PATTARO, CRISTIAN
machine learning, genome-wide association studies, feature selection, rule extraction
SECS-S/01 - Statistics
English
21 March 2013
Statistics and Applications - 62R
25
2011/2012
open
Files in this record:
phd_unimib_734383.pdf
Access: Open access
Attachment type: Doctoral thesis
Size: 1.67 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/43581