4.6 Discussion

Overall, accounting for the linkage disequilibrium (LD) structure of the genome and aggregating highly correlated SNPs proves to be a powerful alternative to standard single-marker analysis in the context of GWAS. In terms of risk prediction, our algorithm is very effective at classifying individuals given their genotype; in terms of locus identification, it identifies genomic regions associated with a disease with higher precision than standard methods.

It is worth mentioning that our algorithm can also accommodate imputed variants, since imputation in GWAS itself uses the linkage disequilibrium between variants to improve coverage. Because our method relies on LD to define groups of common variants, we expect the group structure not to be affected by imputation.

In this work, we proposed a four-step method explicitly designed to exploit linkage disequilibrium in GWAS data. The method combines unsupervised learning, which clusters correlated SNPs, with supervised learning, which identifies the optimal number of clusters and reduces the dimension of the predictor matrix. We evaluated the method on numerical simulations and real datasets, and compared the results with standard single-marker analysis and with group-based approaches (SKATtree and SKATnotree). We observed that combining our aggregating function with a ridge regression model leads to a major improvement in predictive power when the linkage disequilibrium structure is strong enough, which suggests the existence of multivariate effects arising from the combination of several SNPs. These results remained consistent across two applications involving several binary traits (the WTCCC and ankylosing spondylitis datasets).
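As a rough illustration of the clustering, aggregation and ridge steps summarized above, the following Python sketch may help. It is not the implementation used in this work. The neighbour-chaining rule in `ld_blocks`, the `r2_threshold` value, the sum-of-allele-counts aggregation and the closed-form linear ridge fit are all simplifying assumptions, and every function name is hypothetical. In particular, binary traits would in practice call for a penalized logistic model, and the number of clusters would be chosen by a supervised criterion rather than fixed by a threshold.

```python
import numpy as np


def ld_blocks(genotypes, r2_threshold=0.5):
    """Assign adjacent SNPs to the same block while neighbouring r^2 stays high.

    genotypes: (n_individuals, n_snps) array of allele counts coded 0/1/2.
    Returns an integer block label per SNP (assumed stand-in for the clustering step).
    """
    n_snps = genotypes.shape[1]
    labels = np.zeros(n_snps, dtype=int)
    for j in range(1, n_snps):
        r = np.corrcoef(genotypes[:, j - 1], genotypes[:, j])[0, 1]
        # A monomorphic SNP yields an undefined correlation; start a new block.
        same_block = np.isfinite(r) and r ** 2 >= r2_threshold
        labels[j] = labels[j - 1] if same_block else labels[j - 1] + 1
    return labels


def aggregate(genotypes, labels):
    """Sum allele counts within each block (one assumed aggregating function)."""
    n_blocks = int(labels.max()) + 1
    out = np.zeros((genotypes.shape[0], n_blocks))
    for k in range(n_blocks):
        out[:, k] = genotypes[:, labels == k].sum(axis=1)
    return out


def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y.mean() - x_mean @ beta
    return intercept, beta


# Toy data: SNPs 0-1 are in perfect LD, as are SNPs 2-3; the two pairs are uncorrelated.
G = np.array([
    [0, 0, 2, 2],
    [1, 1, 0, 0],
    [2, 2, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [2, 2, 2, 2],
])
blocks = ld_blocks(G)          # two blocks of two SNPs each
Z = aggregate(G, blocks)       # aggregated predictors take values in 0..4
```

The aggregated columns of `Z` take more distinct values than a single 0/1/2 genotype, which is what makes them usable as quasi-continuous predictors in a downstream penalized regression.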

Regarding the identification of associated loci, across the different simulation scenarios our method retrieved true causal SNPs and/or clusters of SNPs with substantially higher precision while maintaining good power. On real GWAS data, it recovered a genomic region associated with ankylosing spondylitis (the HLA region on chromosome 6) with higher precision than standard single-marker analysis.

By making use of the continuous nature of aggregated-SNP variables (in contrast to the ordinal nature of single-SNP variables), we were able to further improve our method using generalized additive models and natural cubic splines. In terms of predictive power, applying such models to the ankylosing spondylitis (AS) data proved more effective than linear regression models such as group lasso, lasso and ridge regression. As for the detection of non-linear behaviour, the results obtained on the AS dataset show interesting non-linear patterns between some aggregated-SNP variables in the HLA region of chromosome 6 and the phenotype. However, the use of cubic splines did not identify any chromosomal regions beyond those previously identified with a classical linear regression model. It would thus be interesting to analyse other datasets with this methodology to see whether it can detect relevant associations never identified before.
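To make the spline idea concrete, the sketch below builds a natural cubic spline basis for a single aggregated-SNP variable, using the standard truncated-power construction, and fits it by ordinary least squares. It is only an illustration under stated assumptions: the function names and knot positions are hypothetical, the fit is unpenalized and uses a Gaussian response for brevity, and it does not reproduce the generalized additive models applied to the AS data.

```python
import numpy as np


def natural_cubic_basis(x, knots):
    """Natural cubic spline basis (truncated-power form), linear beyond the boundary knots.

    x: 1-D array of predictor values (e.g. an aggregated-SNP variable).
    knots: strictly increasing sequence of K >= 2 knots; returns an (n, K) matrix.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    n_knots = len(knots)

    def d(k):
        # Scaled difference of truncated cubics anchored at knot k and the last knot.
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - knots[-1], 0.0) ** 3
        return num / (knots[-1] - knots[k])

    cols = [np.ones_like(x), x]
    d_last = d(n_knots - 2)
    for k in range(n_knots - 2):
        cols.append(d(k) - d_last)
    return np.column_stack(cols)


def fit_spline(x, y, knots):
    """Least-squares coefficients of y on the natural cubic spline basis of x."""
    B = natural_cubic_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef


# Hypothetical aggregated variable taking values between 0 and 4.
x_demo = np.linspace(0.0, 4.0, 9)
knots_demo = [0.0, 1.0, 2.0, 3.0, 4.0]
y_demo = 2.0 + 3.0 * x_demo
coef_demo = fit_spline(x_demo, y_demo, knots_demo)
fitted_demo = natural_cubic_basis(x_demo, knots_demo) @ coef_demo
```

Because the basis contains the intercept and the linear term, a purely linear effect is reproduced exactly; departures of the fitted curve from a straight line are what would signal a non-linear relationship between an aggregated variable and the phenotype.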