Conclusions
Over the last decade, rapid advances in genotyping technologies have changed the way genes involved in Mendelian disorders and complex diseases are mapped, moving from candidate-gene approaches to linkage disequilibrium (LD) mapping, of which GWAS is a large-scale example. In the mid-1990s, some researchers already foresaw the coming of the GWAS era and the crucial contribution of high-throughput genotyping technologies to the field of genetic epidemiology. Indeed, Risch and Merikangas (1996) noted that small genetic effects could be detected with greater power by association analyses and proposed that genome-wide LD mapping (GWAS) could be applied if technologies were developed to study SNP frequencies in all genes, contrasting cases with control subjects. In parallel, Lander (1996) put forward the common disease common variant (CDCV) hypothesis and proposed cataloguing common SNP (with MAF \(\geq 5\%\)) and studying their association with disease in large samples. The GWAS strategy under the CDCV hypothesis assumes that many different common SNP have small effects on each disease, and that some can be found by testing enough SNP in enough people.
Since 2005 (Klein et al. 2005), GWAS have produced strongly significant evidence that specific common DNA sequence differences among people influence their genetic susceptibility to many different common diseases (Manolio, Brooks, and Collins 2008). However, they are also subject to several limitations, intrinsic both to the types of data and to the statistical methods used. On the one hand, the strong correlations between genetic variants, population structure, epistasis and the effect sizes of rare variants are partly responsible for the missing heritability. On the other hand, although the single-marker method remains the most widely used approach in GWAS, its relevance may be called into question in the context of complex diseases.
The new methodologies developed during this PhD are part of this context. With this manuscript, we have tried to provide a thorough introduction to GWAS, first by recalling the genetic concepts fundamental to the understanding of our work and then by introducing the concept of statistical learning. We chose not only to detail several state-of-the-art methods used in GWAS but also to put particular emphasis on statistical learning by devoting an entire chapter to it. This choice was motivated by the conviction that a multidisciplinary approach combining biological and statistical learning knowledge can help to understand the limits of the traditional methods used in GWAS and to identify potential levers for methodological improvement.
Discussions on LEOS algorithm
Baseline single-marker analysis in GWAS is strongly affected by the multiple testing burden: the high dimensionality of the data prevents the identification of variants having a small effect on the phenotype. Based on this observation, we first came up with the idea of aggregating SNP within the same LD block for dimension reduction. This reasoning led to the development of the LEOS method, described in Chapter 4. In this work we proposed a four-step algorithm explicitly designed to take advantage of the linkage disequilibrium structure in GWAS data. LEOS combines, on the one hand, unsupervised learning methods that cluster correlated SNP and, on the other hand, supervised learning techniques that identify the optimal number of clusters and reduce the dimension of the predictor matrix.
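As an illustration of the dimension-reduction idea, aggregating SNP that belong to the same cluster can be sketched as follows. This is a minimal sketch and not the LEOS implementation: the function name `aggregate_by_cluster`, the toy genotype matrix and the use of a plain average as aggregating function are assumptions made for the example (the aggregating function actually used is the one described in Chapter 4).

```python
import numpy as np

def aggregate_by_cluster(genotypes, labels):
    """Average the SNP columns that share a cluster label.

    genotypes: (n_samples, n_snps) array of minor-allele counts (0, 1, 2).
    labels:    (n_snps,) array assigning each SNP to a cluster (e.g. an LD block).
    Returns the (n_samples, n_clusters) aggregated matrix and the cluster ids.
    NOTE: the plain mean is an illustrative stand-in, not the LEOS function.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    aggregated = np.column_stack(
        [genotypes[:, labels == c].mean(axis=1) for c in cluster_ids]
    )
    return aggregated, cluster_ids

# Toy data: 3 individuals, 4 SNP grouped into 2 hypothetical LD blocks.
genotypes = np.array([[0, 1, 2, 0],
                      [1, 1, 0, 2],
                      [2, 2, 1, 1]])
labels = np.array([0, 0, 1, 1])
aggregated, cluster_ids = aggregate_by_cluster(genotypes, labels)
```

With four SNP in two blocks, the predictor matrix shrinks from four columns to two, which is the mechanism by which the number of tests, and hence the multiple testing burden, is reduced.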
The method was evaluated from both a predictive and an explanatory point of view. One part of the method consists in finding the optimal group structure in order to construct a matrix of new aggregated-SNP variables using supervised learning techniques. In the assessment of the method on simulated and real datasets, we noticed that combining our aggregating function with a ridge regression model leads to a major improvement in predictive power when the linkage disequilibrium structure is strong enough, which suggests the existence of multivariate effects due to the combination of several SNP. Furthermore, when using a high-dimensional generalized additive model (HGAM) in place of linear models, we observed a further increase in predictive accuracy. These results point to a first interesting feature of our method for predicting a phenotype based solely on genetic markers, with possible applications in personalized medicine. However, these preliminary results, although encouraging, must be subjected to additional tests, such as a comparative analysis with other machine learning algorithms specialized in prediction. It also seems important to confirm the robustness of these results on other datasets and in replication studies.
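The predictive step can be illustrated by fitting a ridge regression on an aggregated predictor matrix. The sketch below uses the closed-form ridge estimator on centred data; the function name `ridge_fit`, the simulated matrix and the penalty values are assumptions chosen for the example and do not reproduce the experiments of Chapter 4.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression on centred data.

    Solves (Xc'Xc + lam * I) coef = Xc'yc and recovers the intercept,
    so that the intercept itself is not penalised.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Simulated stand-in for an aggregated-SNP matrix (hypothetical data).
rng = np.random.default_rng(0)
X_agg = rng.normal(size=(100, 3))
y = X_agg @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

_, coef_weak = ridge_fit(X_agg, y, lam=1.0)      # light shrinkage
_, coef_strong = ridge_fit(X_agg, y, lam=100.0)  # heavy shrinkage
```

In practice the penalty `lam` would be chosen by cross-validation rather than fixed by hand; the two values above only show that a larger penalty shrinks the coefficients further towards zero.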
Although the predictive aspect of the algorithm is of crucial importance, our main objective while developing the method was to find a way to increase statistical power and precision in GWAS. In this respect, accounting for the linkage disequilibrium structure of the genome and aggregating highly correlated SNP appears to be a powerful alternative to standard single-marker analysis. Indeed, in different simulation scenarios, LEOS retrieved true causal SNP and/or clusters of SNP with substantially higher precision than standard approaches while maintaining good power. Even though the method recovered a genomic region known to be associated with ankylosing spondylitis (AS), we were not able to detect new genomic regions significantly associated with the disease. This suggests that some effects might still be too small to be detected, or that there are other causes that cannot be detected with this type of approach, such as interactions with the environment or epistasis. We also investigated, using HGAM on the aggregated-SNP matrix, the possibility of detecting non-linear relationships with the phenotype. Although the regions identified did not differ from those previously identified with a classical linear regression model, the results obtained on the AS dataset still point to interesting non-linear patterns between some aggregated SNP in the HLA region of chromosome 6 and the phenotype. We therefore remain convinced that generalized additive models could be of great benefit in GWAS, particularly in terms of predictive power but also for the identification of non-linear behaviour.
Discussions on SICOMORE algorithm
One possible way to understand the expression of certain diseases is to consider gene-environment interactions. Sensitivity to environmental risk factors for a disease may be inherited, leading to cases where individuals exposed to the same environment but with different genotypes can be affected differently, resulting in different disease phenotypes. In the context of medical genetics and epidemiology, the study of gene-environment interactions is of great importance. Indeed, if we estimate only the separate contributions of genes and environment to a disease, and ignore their interactions, we will incorrectly estimate the fraction of phenotypic variance attributable to genes, environment, and their joint effect. Restricting analysis of environmental factors in epidemiological studies to individuals who are genetically susceptible to the exposure should increase the magnitude of relative risks, increasing the confidence that the observed associations are not due to chance (Hunter 2005).
One possible way to investigate gene-environment interactions is to take into account the contribution of microbial communities to the expression of a phenotype. As previously stated, there is growing evidence of the role of the microbiome in basic biological processes, whether in the progression of major human diseases or in plant growth. These facts motivated the development of a new statistical method to tackle the detection of such interactions in a GWAS context. This topic raises many statistical challenges, among them the multiple testing burden. That is why we chose to compress the data, as with the LEOS method, and to combine several statistical learning methods into an algorithm dedicated to the search for statistical interactions, with a focus on genomic and metagenomic data.
The SICOMORE method, described in Chapter 5, combines the strengths of several existing methods in a single algorithm. First, we constructed the hierarchy of the genetic data with a well-proven spatially constrained hierarchical clustering adapted to SNP data, developed by Dehman, Ambroise, and Neuvial (2015). Second, taking the average values of strongly correlated predictors, such as SNP within the same LD block, and using them in a predictive model has already been shown by Park, Hastie, and Tibshirani (2007) to be a powerful approach. Finally, we took advantage of the weighting scheme proposed by Grimonprez (2016) for the selection of the supervariables in the lasso procedure, where we used a penalty factor defined by the length of the gap in the hierarchical tree, as explained in Section 5.3.3.
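To make the notion of a statistical interaction between two supervariables concrete, the sketch below tests the product term in a linear model with one genomic and one metagenomic supervariable. It is an illustration under stated assumptions, not the SICOMORE procedure: the function name `interaction_t_stat`, the simulated data and the choice of a quantitative phenotype with an ordinary least squares fit are all hypothetical.

```python
import numpy as np

def interaction_t_stat(y, g, m):
    """t-statistic of the g*m coefficient in the model y ~ 1 + g + m + g*m.

    g: a genomic supervariable (e.g. the average of the SNP in one group).
    m: a metagenomic supervariable (e.g. the average of one group of taxa).
    """
    y, g, m = (np.asarray(v, dtype=float) for v in (y, g, m))
    design = np.column_stack([np.ones_like(g), g, m, g * m])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = len(y) - design.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof            # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[3] / np.sqrt(cov[3, 3])

# Simulated supervariables and two phenotypes (hypothetical data).
rng = np.random.default_rng(0)
g = rng.normal(size=300)
m = rng.normal(size=300)
noise = rng.normal(size=300)
y_interaction = g + m + 1.5 * g * m + noise   # true interaction effect
y_additive = g + m + noise                    # no interaction effect

t_interaction = interaction_t_stat(y_interaction, g, m)
t_additive = interaction_t_stat(y_additive, g, m)
```

Such a test would be repeated for every pair of selected supervariables, which is why reducing the number of candidates beforehand matters for the multiple testing correction applied afterwards.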
We evaluated the performance of SICOMORE and compared it with other methods in terms of power and precision. In terms of precision, all methods performed poorly, mainly because the algorithms select groups that contain too many variables. In terms of statistical power, SICOMORE always exhibited the strongest recall among the compared methods in the numerical simulations. The application of our method to the Medicago truncatula dataset highlighted some significant interactions between genomic and metagenomic features in relation to three different phenotypes. However, although promising, these results need to be confirmed by a relevant biological interpretation, which will be carried out in discussion with our collaborators from INRA, who have graciously provided us with these data. This should allow a biological interpretation of these results to be added to the forthcoming paper (currently available as a preprint).
Despite these interesting results, SICOMORE is subject to some limitations that need to be addressed in future work. First, although the lasso procedure used to select the supervariables in both complementary datasets is relevant for dimension reduction, it may induce some bias in the multiple testing procedure applied afterwards, because a variable selection step is performed before the \(p\)-values are adjusted. One way around this problem could be to use post-hoc inference for multiple comparisons (Goeman, Solari, and others 2011).
Second, as observed in the analysis of the Medicago truncatula dataset, the stability of the variable selection step is problematic. Using a variable selection model other than the lasso may circumvent this issue, for instance the Bolasso model (Bach 2008), in which the author proposed to intersect the supports of replicated bootstrapped lasso estimates for consistent model selection. In the same fashion, Meinshausen and Bühlmann (2010) introduced stability selection, which is based on subsampling in combination with high-dimensional selection algorithms.
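The Bolasso idea, intersecting the supports of lasso estimates fitted on bootstrap replicates, can be sketched as follows. This is a minimal sketch under assumptions: the coordinate-descent solver `lasso_cd`, the function `bolasso_support`, the fixed penalty `lam`, the number of replicates and the simulated data are all choices made for the example, and a real analysis would rely on a tested lasso implementation and a tuned penalty.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent.

    Minimises (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept,
    so X and y are expected to be centred).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()            # equals y - X @ beta for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]        # remove the contribution of j
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]        # add the updated contribution
    return beta

def bolasso_support(X, y, lam, n_boot=20, seed=0):
    """Indices of the variables selected in every bootstrap lasso fit."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    support = np.ones(p, dtype=bool)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample rows with replacement
        Xb = X[idx] - X[idx].mean(axis=0)
        yb = y[idx] - y[idx].mean()
        support &= (lasso_cd(Xb, yb, lam) != 0)
    return np.flatnonzero(support)

# Simulated data: only variables 0 and 5 carry signal (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 0] - 3.0 * X[:, 5] + 0.1 * rng.normal(size=200)
selected = bolasso_support(X, y, lam=0.1)
```

Stability selection follows a similar pattern but draws subsamples without replacement and keeps the variables whose selection frequency exceeds a threshold, instead of requiring selection in every replicate.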
References
Bach, Francis R. 2008. “Bolasso: Model Consistent Lasso Estimation Through the Bootstrap.” In Proceedings of the 25th International Conference on Machine Learning, 33–40. ACM.
Dehman, A., C. Ambroise, and P. Neuvial. 2015. “Performance of a Blockwise Approach in Variable Selection Using Linkage Disequilibrium Information.” BMC Bioinformatics 16: 148.
Goeman, Jelle J, Aldo Solari, and others. 2011. “Multiple Testing for Exploratory Research.” Statistical Science 26 (4): 584–97.
Grimonprez, Quentin. 2016. “Selection de Groupes de Variables corrélées En Grande Dimension.” PhD thesis, Université de Lille; Lille 1.
Hunter, David J. 2005. “Gene–Environment Interactions in Human Diseases.” Nature Reviews Genetics 6 (4): 287.
Klein, Robert J, Caroline Zeiss, Emily Y Chew, Jen-Yue Tsai, Richard S Sackler, Chad Haynes, Alice K Henning, et al. 2005. “Complement Factor H Polymorphism in Age-Related Macular Degeneration.” Science 308 (5720): 385–89.
Lander, Eric S. 1996. “The New Genomics: Global Views of Biology.” Science 274 (5287): 536–39.
Manolio, Teri A, Lisa D Brooks, and Francis S Collins. 2008. “A HapMap Harvest of Insights into the Genetics of Common Disease.” The Journal of Clinical Investigation 118 (5): 1590–1605.
Meinshausen, Nicolai, and Peter Bühlmann. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73.
Park, Mee Young, Trevor Hastie, and Robert Tibshirani. 2007. “Averaged Gene Expressions for Regression.” Biostatistics 8 (2): 212–27.
Risch, Neil, and Kathleen Merikangas. 1996. “The Future of Genetic Studies of Complex Human Diseases.” Science 273 (5281): 1516–17.