5.1 Introduction

5.1.1 Background

GWAS are a powerful tool for investigating the genetic architecture of complex diseases and have been successful in identifying hundreds of associated variants. However, they explain only a small proportion of the disease heritability estimated from classical family studies. As stated in Section 3.5, part of this missing heritability may nonetheless be uncovered by taking into account correlations among variables, interactions with the environment and epistasis, although the multiple testing burden makes this difficult.

Other avenues for explaining the variability of traits of interest have yet to be explored. One interesting lead is the contribution of microbial communities to the expression of a phenotype. Indeed, there is growing evidence of the role of gut microbiota in basic biological processes and in the development and progression of major human diseases such as infectious diseases, gastrointestinal cancers and metabolic diseases (Wang et al. 2017). In plants, the role of rhizosphere[1] microflora in plant growth is well known and has been widely studied (Mukerji, Manoharachary, and Chamola 2002; Pinton, Varanini, and Nannipieri 2007).

Analyses equivalent to GWAS have been conducted using the metagenome[2] rather than the genome of an individual and are known as metagenome-wide association studies (MWAS) (J. Wang and Jia 2016; Segata et al. 2011). These metagenome association analyses can explain a larger share of the phenotypic variation than classical GWAS and have been successful in finding relevant associations for complex pathologies such as obesity, Crohn’s disease and colorectal cancer.

5.1.2 Combining genome and metagenome analyses

One possible way to relate genetic and metagenomic data consists in considering the metagenome as a phenotype and performing quantitative trait locus (QTL) mapping on it. This kind of metagenome QTL analysis demonstrates the role of host genetics in shaping metagenomic diversity between individuals (J. Wang et al. 2016; Srinivas et al. 2013).

Another possibility for taking both types of variables into account consists in including metagenomic variables as environmental variables in GWAS. In that case, interactions may naturally be modelled using a classical generalized linear model with interaction terms (Lin et al. 2013).
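As an illustration, such a model might be written as follows. The notation is an assumption introduced here and is not taken from the cited work: y_i is the phenotype of individual i, x_{ij} the j-th genomic variable, m_{ik} the k-th metagenomic variable, g a link function, and \theta_{jk} the interaction coefficients.

```latex
g\big(\mathbb{E}[y_i]\big)
  = \beta_0
  + \sum_{j=1}^{p} \beta_j \, x_{ij}
  + \sum_{k=1}^{q} \gamma_k \, m_{ik}
  + \sum_{j=1}^{p} \sum_{k=1}^{q} \theta_{jk} \, x_{ij} \, m_{ik}
```

Testing for an interaction between genomic variable j and metagenomic variable k then amounts to testing whether \theta_{jk} = 0.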

The main drawback of the latter idea lies in the number of interactions to test, since both datasets contain a large number of variables. To reduce the dimension of the problem, variable selection or variable compression may be of use.
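To give an order of magnitude, the short sketch below counts the pairwise interactions between two sets of variables and the per-test significance threshold that results. The dataset sizes are hypothetical, and the Bonferroni correction is used only as a simple illustrative choice, not as the correction applied in this chapter.

```python
def interaction_burden(n_genomic, n_metagenomic, alpha=0.05):
    """Count pairwise genomic x metagenomic interactions and return
    the Bonferroni per-test threshold for a family-wise level alpha."""
    n_tests = n_genomic * n_metagenomic
    return n_tests, alpha / n_tests


# Hypothetical sizes: raw variables versus compressed groups of variables.
raw_tests, raw_threshold = interaction_burden(500_000, 1_000)
grouped_tests, grouped_threshold = interaction_burden(50, 20)

print(f"raw:     {raw_tests} tests, threshold {raw_threshold:.1e}")
print(f"grouped: {grouped_tests} tests, threshold {grouped_threshold:.1e}")
```

With these illustrative figures, compressing the variables into groups reduces the number of tests by several orders of magnitude, which is the motivation for the dimension reduction discussed next.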

5.1.3 Taking structures into account in association studies

Data compression for dimension reduction may be achieved in various ways. A distinction is usually made between feature selection and feature extraction. Feature selection consists in selecting a few relevant variables among the original ones, while feature extraction consists in computing new representative variables.

In association studies, feature selection is often preferred to feature extraction because its results are easier to interpret. In this chapter, we advocate a mixed approach that combines feature extraction and feature selection. The basic idea is to group close variables with an unsupervised approach. Supervariables are then computed to summarize the information of each cluster of variables, and finally the best supervariables are selected using a penalized regression approach.
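The grouping and summarizing steps can be sketched as follows. This is a minimal, dependency-free illustration and not the implementation used in this chapter: the distance 1 - |r| between variables, the average linkage, the function names and the toy data are all assumptions made for the example.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def cluster_variables(columns, n_groups):
    """Agglomerative clustering of variables with average linkage.
    Assumed distance between two variables: 1 - |correlation|."""
    clusters = [[j] for j in range(len(columns))]

    def linkage(a, b):
        dists = [1 - abs(pearson(columns[i], columns[j])) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > n_groups:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest clusters
        del clusters[b]
    return clusters


def supervariables(columns, clusters):
    """Summarize each cluster by the sample-wise mean of its variables."""
    n_samples = len(columns[0])
    return [[sum(columns[j][i] for j in group) / len(group)
             for i in range(n_samples)]
            for group in clusters]


# Toy data: four variables observed on five samples (one list per variable).
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.0, 2.9, 4.2, 5.1],  # strongly correlated with the first variable
    [5.0, 1.0, 4.0, 2.0, 3.0],
    [5.2, 0.9, 4.1, 2.1, 2.8],  # strongly correlated with the third variable
]
clusters = cluster_variables(columns, n_groups=2)
supers = supervariables(columns, clusters)
print(clusters)
```

On this toy example the two pairs of correlated variables end up in separate clusters, and each cluster is replaced by a single averaged supervariable that can then enter a penalized regression.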

We already investigated the idea of considering groups of variables in Chapter 4. It has also been suggested in the context of MWAS by Qin et al. (2012). In the context of prediction from gene expression data, the HCAR method developed by Park, Hastie, and Tibshirani (2007) and described in Section 2.5.2 shows that regressing on supergenes improves the precision if the correlation structure is strong enough. Moreover, Mary-Huard and Robin (2009) proposed a strategy, called aggregation, to deal with high-dimensional datasets in classification. It consists of a step that clusters redundant variables, using kNN or Classification and Regression Tree (CART) algorithms, followed by a group-compression step. They develop a statistical framework to define tailored aggregation methods that can be combined with selection methods to build reliable classifiers, with possible applications to microarray data.

The method SICOMORE presented in this chapter can be summarized as follows: (1) it uses a hierarchical clustering algorithm to identify a group structure within the data; (2) it compresses the hierarchical structure by averaging the groups as in HCAR; (3) it performs a lasso procedure on the compressed variables as in HCAR with a penalty factor weighted by the length of the gap between two successive levels of the hierarchy as in MLGL; (4) it performs multiple hypothesis testing in a linear model with interactions.
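Step (3) relies on a lasso in which each compressed variable receives its own penalty factor. The sketch below shows one generic way to fit such a weighted lasso by coordinate descent on centred data. It is an illustration only and not the SICOMORE implementation: the objective scaling, the function names and the weight values are assumptions, and the weights are placeholders rather than values derived from the gaps of a hierarchy.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (lasso proximal operator)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, weights, n_iter=200):
    """Minimize (1 / 2n) * ||y - X b||^2 + lam * sum_j weights[j] * |b_j|
    by cyclic coordinate descent. X is a list of rows; data assumed centred."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual of the current fit (beta starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            scale = sum(x * x for x in col) / n
            if scale == 0.0:
                continue  # constant (all-zero) column carries no signal
            # Correlation of column j with the partial residual excluding j.
            rho = sum(x * (r + x * beta[j]) for x, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam * weights[j]) / scale
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - x * delta for r, x in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Toy example: two orthogonal, centred supervariables; only the first matters.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]  # equals 2 * first column

beta_equal = weighted_lasso(X, y, lam=0.5, weights=[1.0, 1.0])
beta_heavy = weighted_lasso(X, y, lam=0.5, weights=[10.0, 1.0])
print(beta_equal, beta_heavy)
```

A larger penalty factor makes a variable harder to select: in the toy example the first coefficient is shrunk but retained with unit weights, and is set to zero when its weight is raised to 10.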

References

Lin, Xinyi, Seunggeun Lee, David C. Christiani, and Xihong Lin. 2013. “Test for Interactions Between a Genetic Marker Set and Environment in Generalized Linear Models.” Biostatistics 14 (4): 667–81.

Mary-Huard, Tristan, and Stephane Robin. 2009. “Tailored Aggregation for Classification.” IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (11): 2098–2105.

Mukerji, Krisha Gopal, C Manoharachary, and BP Chamola. 2002. Techniques in Mycorrhizal Studies. Springer Science & Business Media.

Park, Mee Young, Trevor Hastie, and Robert Tibshirani. 2007. “Averaged Gene Expressions for Regression.” Biostatistics 8 (2): 212–27.

Pinton, Roberto, Zeno Varanini, and Paolo Nannipieri. 2007. The Rhizosphere: Biochemistry and Organic Substances at the Soil-Plant Interface. CRC press.

Qin, Junjie, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang, Suisha Liang, et al. 2012. “A Metagenome-Wide Association Study of Gut Microbiota in Type 2 Diabetes.” Nature 490 (7418): 55–60.

Segata, Nicola, Jacques Izard, Levi Waldron, Dirk Gevers, Larisa Miropolsky, Wendy S Garrett, and Curtis Huttenhower. 2011. “Metagenomic Biomarker Discovery and Explanation.” Genome Biology 12 (6): R60.

Srinivas, Girish, Steffen Möller, Jun Wang, Sven Künzel, Detlef Zillikens, John F Baines, and Saleh M Ibrahim. 2013. “Genome-Wide Mapping of Gene–Microbiota Interactions in Susceptibility to Autoimmune Skin Blistering.” Nature Communications 4.

Wang, Baohong, Mingfei Yao, Longxian Lv, Zongxin Ling, and Lanjuan Li. 2017. “The Human Microbiota in Health and Disease.” Engineering 3 (1): 71–82. https://doi.org/10.1016/J.ENG.2017.01.008.

Wang, Jun, and Huijue Jia. 2016. “Metagenome-Wide Association Studies: Fine-Mining the Microbiome.” Nature Reviews Microbiology 14 (8): 508–22.

Wang, Jun, Louise B. Thingholm, Jurgita Skiecevičienė, Philipp Rausch, Martin Kummen, Johannes R Hov, Frauke Degenhardt, et al. 2016. “Genome-Wide Association Analysis Identifies Variation in Vitamin D Receptor and Other Host Factors Influencing the Gut Microbiota.” Nature Genetics.


  1. The rhizosphere is the zone of intense activity around the roots of legumes (Fabaceae), which contains a considerable diversity of microbial and mycorrhizal species.

  2. The metagenome corresponds to all the genetic material present in an environmental sample, consisting of the genomes of many individual organisms.