Population stratification in genome-wide association studies: a comparison among different multivariate analysis methods for dimensionality reduction

Menni, C

INTRODUCTION: Genome-wide association studies (GWAS) perform large-scale association mapping using SNPs and make no assumptions about the genomic location of the causal variant. They hold substantial promise for unraveling the genetic basis of common human diseases. A well-known problem with such studies is population stratification (PS), a form of confounding that arises when the study population contains two or more strata and both the risk of disease and the frequency of marker alleles differ between strata. The risk of disease may then appear to be related to the marker alleles when in fact it is not. Many statistical methods have been developed to account for PS so that association studies can proceed even in the presence of structure; for GWAS, linear principal components analysis (PCA) is something of a gold standard. PCA uses genotype data to extract continuous (principal) axes of variation, which can be used to adjust for association attributable to ancestry along each axis. The assumption underlying PCA, however, is that the variables under study are continuous, so SNPs are quantified by fixing a reference and a variant allele for each marker and counting the number of variant alleles. This implies that the distance between homozygous wild type and heterozygous is the same as the distance between heterozygous and homozygous mutant, and thus an additive model of inheritance. This model is very conservative and static and, most importantly, is not necessarily the correct one.
AIM: The aim of this thesis is to treat SNPs as ordinal qualitative variables. The three genotypes (homozygous wild type, heterozygous, homozygous mutant) remain ordered, but the distance between adjacent genotypes is not necessarily the same. We therefore no longer assume any model of inheritance and can potentially capture information that linear PCA misses.
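
The difference between the additive coding and an ordinal treatment of genotypes can be illustrated with a minimal Python sketch. The function names and the heterozygote value 1.6 are hypothetical choices made for illustration only; they are not taken from the thesis, where the genotype quantifications are estimated from the data rather than fixed by hand.

```python
import numpy as np

def additive_coding(genotypes):
    """Additive coding: each genotype is the count of variant alleles (0, 1 or 2)."""
    return np.asarray(genotypes, dtype=float)

def ordinal_quantification(genotypes, het_value):
    """Hypothetical ordinal scaling: the genotype order is preserved, but the
    heterozygote may sit anywhere between the two homozygotes."""
    if not 0.0 <= het_value <= 2.0:
        raise ValueError("het_value must lie between 0 and 2 to keep the order")
    levels = np.array([0.0, het_value, 2.0])
    return levels[np.asarray(genotypes, dtype=int)]

genotypes = [0, 1, 2]  # homozygous wild type, heterozygous, homozygous mutant

# Under additive coding the two adjacent distances are equal by construction.
additive_gaps = np.diff(additive_coding(genotypes))

# Under an ordinal quantification (illustrative value 1.6) they may differ.
ordinal_gaps = np.diff(ordinal_quantification(genotypes, het_value=1.6))
```

With `het_value=1.0` the ordinal quantification reduces to the additive coding, so the additive model is one special case of the ordinal treatment.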
METHODS: To reduce dimensionality in the presence of non-metric data, we apply a multivariate technique known as nonlinear principal components analysis (NLPCA, also known as PRINCALS: principal components analysis by means of alternating least squares). PRINCALS belongs to "Gifi's system", a unified theoretical framework under which many well-known descriptive multivariate techniques are organised. We apply both PCA and PRINCALS to a sample dataset of 90 individuals belonging to three very distinct subpopulations, genotyped at 1,000 randomly chosen uncorrelated SNPs, and compare the results in three ways: graphically, with a Procrustean superimposition approach and the associated PROTEST test, and with a scenario analysis.
RESULTS: When we compare the performance of PCA and PRINCALS, we find that the two methods yield similar scores for markers with low or null genotypic variability across the study sample, while the scores differ as the level of genotypic variability increases. This suggests that the two methods capture intra-subject variability differently. Procrustes analysis and scenario analysis confirm this. The matrix of principal components obtained with PCA and the matrix of dimensions obtained with PRINCALS are shown to be statistically different by the PROTEST test and, in the scenario analysis, PRINCALS appears to outperform PCA as the level of PS increases.
CONCLUSION: PCA and PRINCALS behave differently. Validation analyses are needed to confirm these results.
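
The comparison described in the methods can be sketched in Python under strong simplifying assumptions. The data below are simulated (three subpopulations of 30 individuals with hypothetical allele frequencies), not the thesis dataset. PRINCALS itself is not implemented: as a labelled stand-in, PCA is rerun on genotypes recoded with one fixed, arbitrary heterozygote value, whereas PRINCALS estimates the quantifications by alternating least squares. The `protest` function is a simple permutation version of the Procrustes test, built on `scipy.spatial.procrustes`.

```python
import numpy as np
from scipy.spatial import procrustes

def simulate_genotypes(rng, n_per_pop=30, n_snps=1000):
    """Simulated 0/1/2 genotypes for three subpopulations (hypothetical frequencies)."""
    freqs = rng.uniform(0.05, 0.95, size=(3, n_snps))
    blocks = [rng.binomial(2, freqs[k], size=(n_per_pop, n_snps)) for k in range(3)]
    return np.vstack(blocks).astype(float)

def pca_scores(X, n_components=2):
    """Linear PCA scores from the SVD of the column-centred matrix."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def protest(A, B, n_perm=199, rng=None):
    """Permutation test on the Procrustes disparity between configurations A and B.

    Returns the observed disparity and the proportion of row permutations of B
    that fit A at least as well as the observed pairing.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, _, observed = procrustes(A, B)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(B.shape[0])
        _, _, disparity = procrustes(A, B[perm])
        if disparity <= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
X = simulate_genotypes(rng)

# Configuration 1: PCA on the additive 0/1/2 coding.
scores_additive = pca_scores(X)

# Configuration 2 (stand-in for PRINCALS, NOT the real algorithm): PCA after
# recoding heterozygotes to an arbitrary illustrative value of 1.6.
levels = np.array([0.0, 1.6, 2.0])
scores_recoded = pca_scores(levels[X.astype(int)])

disparity, p_value = protest(scores_additive, scores_recoded, rng=rng)
```

In this permutation setup a small p-value indicates that the two configurations agree more closely than expected by chance, so the sketch shows only the mechanics of the comparison and does not reproduce the results reported above.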

Menni, C. (2011). Population stratification in genome-wide association studies: a comparison among different multivariate analysis methods for dimensionality reduction. (Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011).