3.7 Multi-locus analysis

As previously mentioned, in GWAS it is necessary for the SNP to be correlated to the causal polymorphisms in order to have an efficient disease mapping and, in complex disease, each single SNP have small effects on the phenotype. In this section we will show that we can take benefit from performing joint association tests of multiple SNP flanking a causal polymorphism to increase power in the case of rare-variant analysis or when the genetic effects are too small to be detected by single-locus approaches.

Several ways of grouping SNP together for multi-locus analysis are possible; we may consider to group SNP that fall within an established biological context such as a biochemical pathway, protein family, or gene. We can also consider working at haplotypes level rather than the genotypes and used the haplotype structure of the genome to define relevant groups (Section 3.7.2 and 3.7.3).

Performing multi-locus analysis is not as straightforward as single marker analysis and presents some computational and statistical challenges. The most commonly used model to regress multiple SNP is the multiple linear regression with which we can simultaneously fit all SNP in the same gene or small genomic region. To reduce the problem of collinearity and overfitting that may arise, we can resort to penalized approaches, as described in Section 2.3.2, such as ridge, lasso or group-lasso regression models. Furthermore, with such models, it is also possible to examine statistical interactions among genetic variants and so to investigate epistatic effects as in (Stanislas, Dalmasso, and Ambroise 2017).

Other methods using multiple linear regression take into account the linkage disequilibrium within the genes to improve power (Yoo et al. 2016) or cluster variants with weak association around known loci to increase the percentage of variance explained in complex traits (ParÃ©, Asma, and Deng 2015). Finally, other approaches will focus on the aggregation of summary statistics of single SNP within a same gene with for instance the data-driven aggregation of summary statistics described in (Kwak and Pan 2016) or the procedures of \(p\)-value combination in (Petersen et al. 2013).

3.7.1 Haplotype-based approaches

One approach to multi-locus analysis is to focus on haplotype effects. As seen in Section 1.7, the human genome can be partitioned into haplotype blocks where most of the intra-block variability is imputable to mutation rather than recombination. As a result, much of common genetic variation can also be structured into haplotypes within blocks of LD that are rarely disturbed by meiosis.

It is common to assume that each of the pair of haplotypes, \(H_{i1}\) and \(H_{i2}\), labelled according to their relative frequency in the population and forming the diplotype \(H_i\) of the \(i^{th}\) individual, contributes independently to the disease risk. Under this assumption the logistic regression model can be parametrised in terms of the log odds of disease for each haplotype (Balding, Bishop, and Cannings 2008). The linear predictor \(\eta_i\) of the \(i^{th}\) individual, as defined in Section 3.4.3, can thus be defined as:

\[\eta_i = \beta_0 + \sum_{j=1}^p \alpha_j x_{ij} + \beta_{H_{i1}} + \beta_{H_{i2}},\] where \(x_{ij}\) is the response of the \(i^{th}\) individual to the \(j^{th}\) covariate and \(\alpha_j\) its corresponding coefficient. \(\beta_k\) denotes the log-OR of the \(k^{th}\) most frequent haplotype, relative to the baseline haplotype, usually the most common, so that \(\beta_1 = 0\).

One major issue with this approach is that we do not directly observe the diplotype \(H\) from the unphased genotype data. One solution is to take a point estimate of the diplotype for each individual, using statistical methodology, such as PHASE (Stephens, Smith, and Donnelly 2001) or by maximum likelihood using the expectation-maximisation algorithm (Excoffier and Slatkin 1995). However, due to the uncertainty in the haplotype reconstruction process, the variances of the model parameters are under-estimated leading to an inflation of type I error (Balding, Bishop, and Cannings 2008).

3.7.2 Rare-variant association analysis

In the context of rare-variant association analysis, a number of region- or gene-based multimarker tests have been proposed as burden tests (Asimit et al. 2012), variance-component tests (Wu et al. 2011) or combined burden and variance-component tests (Lee, Wu, and Lin 2012). Instead of testing each variant individually, these methods evaluate the cumulative effects of multiple genetic variants in a gene or a region, increasing power when multiple variants in the group are associated with a given disease or trait.

We first introduce the statistical model used in various rare-variant tests and that is again based on the logistic regression framework defined in Section 3.4.3. We assume that \(n\) individuals have been genotyped in a region comprising \(M\) genetic markers and defined the linear predictor \(\eta_i\) of the \(i^{th}\) individual as:

\[\eta_i = \beta_0 + \sum_{j=1}^p \alpha_j x_{ij} + \sum_{m=1}^M \beta_m z_{im}, \label{eq:linearpred}\]

where \(z_{im} = z_{(A)im}\) is the variable representing the additive component of the \(i^{th}\) individual for the \(m^{th}\) marker and \(x_{ij}\) the response of the \(i^{th}\) individual to the \(j^{th}\) covariate. We define the score statistic of the model for variant \(m\) as \[S_m = \sum_{i=1}^n z_{im} (y_i - \eta_i).\] Note that \(S_m\) is positive when marker \(m\) is associated with increased disease risk or trait values and negative when marker \(m\) is associated with decreased risk or trait values.

Burden tests

Burden tests (Asimit et al. 2012; Li and Leal 2008) compute a single genetic score from multiple genetic markers and test for association between this score and a phenotype of interest. A simple approach summarizes genotype information by counting the number of minor alleles across all variants in the set. The summary genetic score is then: \[C_i = \sum_{m=1}^M \omega_m z_{im}, \label{eq:burden_fun}\] where \(w_m\) is a weight attributed to marker \(m\). The linear predictor can thus be written as: \[\eta_i = \beta_0 + \sum_{j=1}^p \alpha_j x_{ij} + \beta_1 C_i.\]

To compute a \(\textit{p}\)-value for a set of \(M\) markers, the specific test statistic \(Q_{burden}\) is calculated and compared to a \(\chi^2\) distribution with 1 d.f.: \[Q_{burden} = \left[ \sum_{m=1}^M \omega_m S_m \right]^2.\]

This framework is flexible in the sense that we can attribute different weights to the markers or define the genetic score \(C_i\) to accommodate for different assumptions about disease mechanism. For instance, the cohort allelic sums test (CAST, Morgenthaler and Thilly 2007) assumes that the presence of any rare variant increases disease risk and sets the genetic score \(C_i = 0\) if there are no minor alleles in a region and \(C_i = 1\) otherwise. Furthermore, to focus on rarer variants, we can assign \(w_m = 1\) when the MAF of variant \(m\) is smaller than a prespecified threshold and \(w_m = 0\) otherwise. Alternatively, a continuous weight function can be used to upweight rare variants with for instance \(\omega_m = 1 / \sqrt{\text{MAF}_m (1 - \text{MAF}_m)}\) as proposed by (Madsen and Browning 2009).

The burden methods make a strong assumption that all rare variants in a set are causal and associated with a trait with the same direction and magnitude of effect (after adjustment for the weights) which may in a substantial loss of power if these assumptions prove to be false (Lee et al. 2014).

Sequence Kernel Association Test (SKAT)

In (Wu et al. 2010), the authors proposed to group SNP into sets on the basis of their proximity to genomic features such as genes or haplotype blocks and then to identify the joint effect of each set via a logistic kernel-machine-based test. This approach lays the foundation for the Sequence Kernel Association Test method (SKAT, Wu et al. 2011).

SKAT uses the same logistic regression framework and the linear predictor as with burden tests but instead of testing the null hypothesis \(H_0: \beta_1, \dots, \beta_M = 0\), it assumes that each \(\beta_m\) follows an arbitrary distribution with a mean of zero and variance of \(\omega_m \tau\) where \(\tau\) is a variance component and \(\omega_m\) the weight attributed to marker \(m\). With this assumption, we can see that \(H_0: \beta_1, \dots, \beta_M = 0\) is equivalent to test \(H_0: \tau = 0\) which can be efficiently tested with a variance-component score test as used in generalized linear mixed model (GLMM) and is known to be a locally most powerful test (Lin 1997). An advantage of this score test is that it requires to fit only the null model and to compute the following variance-component score statistic: \[Q_{SKAT} = \sum_{i=1}^n \sum_{i'=1}^n (y_i - \eta_i) K(\mathbf{z}_i,\mathbf{z}_{i'}) (y_{i'} - \eta_{i'})\] where \(\eta_i = \beta_0 + \sum_{j=1}^p \alpha_j x_{ij}\) is the linear predictor of the null model including only the \(p\) covariates for individual \(i\), and where \(K(\mathbf{z}_{i},\mathbf{z}_{i'}) = \sum_{m=1}^M \omega_m z_{im} z_{i'm}.K(.,.)\) is called the kernel function and measures the genetic similarity between individuals \(i\) and \(i'\), weighted by a factor \(\omega_m\), via the \(M\) genetic markers in the region of interest. This particular form of \(K(.,.)\) is called the weighted linear kernel function and can take several forms to accommodate for epistatic effects for instance. In fact, any positive semi-definite function can be used as a kernel function and in their paper, (Wu et al. 2011) tailored the following commonly used kernels specifically for the purpose of rare-variant analysis:

The weighted linear kernel: \[K(\mathbf{z}_{i},\mathbf{z}_{i'}) = \sum_{m=1}^M \omega_m z_{im} z_{i'm}\] implies a linear relationship between the trait of interest and the genetic variants and is equivalent to the classical linear and logistic model described in Section 3.4.3.
The weighted quadratic kernel: \[K(\mathbf{z}_{i},\mathbf{z}_{i'}) = (1 + \sum_{m=1}^M \omega_m z_{im} z_{i'm})^2\] assumes that the model depends on the main effects and quadratic terms for the gene variants and the first-order variant by variant interactions.
The weighted identity by state (IBS¹¹) kernel: \[K(\mathbf{z}_{i},\mathbf{z}_{i'}) = \sum_{m=1}^M \omega_m IBS(z_{im},z_{i'm})\] defines similarity between individuals as the number of alleles that share IBS.

Under the null hypothesis, \(Q_{SKAT}\) follows a mixture of chi-square distributions, which can be closely approximated with the computationally efficient Davies method (Davies 1980).

3.7.3 LD based approach to variable selection in GWAS

Region-based multi-marker analysis necessarily need that we define a group structure among SNP either by using the gene definition or biochemical pathway. However, these approaches limit the search for association to coding region only and therefore potential interesting associations located in non-coding region¹² are set aside.

One way to circumvent this issue is to use non-supervised clustering techniques such as hierarchical clustering described in Section 2.5.1. In their paper, (A. Dehman, Ambroise, and Neuvial 2015) proposed an approach where they used a modified version of the hierarchical clustering combined with a group-lasso regression to select groups of markers associated with phenotype of interest. The clustering method used is a spatially constrained agglomerative hierarchical clustering based on Ward’s criterion in which the measure of dissimilarity is not based on the Euclidean distance but rather on the linkage disequilibrium level between two markers: \(1 - r^2(m,m')\). The algorithm also makes use of the fact that the LD matrix can be modelled as block-diagonal by allowing only groups of variables that are adjacent on the genome to be merged, which significantly reduces the computation cost (adjclust, Alia Dehman 2015 b).

The number of groups is then determined using a modified version of the gap statistic defined in Section 2.5.1:

\[Gap(g) = \frac{1}{B} \sum_{b=1}^B(I_{W_g}^b - I_{W_g}),\] where for \(b = 1,\dots,B\), \(I_{W_g}^b\) denotes the within-cluster dispersion of clustering the reference dataset \(b\) in \(g\) groups. They decided to use the \(I_{W_g}\) instead of \(\log(I_{W_g})\) in estimation since they noticed that it led to better estimation of the number of groups in the simulation studies, which were performed under a variety of parameters and on several data sets.

Finally, once the LD-defined groups structure have been determined, a group-lasso regression is performed in order to select groups of SNP associated with the phenotype. Given a phenotype vector \(\mathbf{y}\) and a scaled matrix \(\mathbf{Z}\) of additively coded SNP, the group-lasso estimate is defined as:

\[\hat{\beta}^{GL} = \underset{\boldsymbol{\beta}}{\text{argmin}} \left[ ||\mathbf{y} - \mathbf{Z}\boldsymbol{\beta}||^2_2 + \lambda \sum_{g=1}^G \sqrt{p_g}||\boldsymbol{\beta}^g||_2 \right],\] where \(||.||\) denotes the euclidean norm, \(\lambda > 0\) is a penalty factor and \(\boldsymbol{\beta}_g\) the vector of regression coefficients corresponding to the \(g^{th}\) group, so that \(\boldsymbol{\beta} = (\boldsymbol{\beta}^1,\dots,\boldsymbol{\beta}^G)\).

References

Asimit, J. L., A. G. Day-Williams, A. P. Morris, and E. Zeggini. 2012. “ARIEL and AMELIA: Testing for an Accumulation of Rare Variants Using Next-Generation Sequencing Data.” Human Heredity 73 (2): 84–94.

Balding, David J, Martin Bishop, and Chris Cannings. 2008. Handbook of Statistical Genetics. John Wiley & Sons.

Davies, Robert B. 1980. “Algorithm as 155: The Distribution of a Linear Combination of \(\chi\) 2 Random Variables.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 29 (3): 323–33.

Dehman, A., C. Ambroise, and P. Neuvial. 2015. “Performance of a Blockwise Approach in Variable Selection Using Linkage Disequilibrium Information.” BMC Bioinformatics 16: 148.

Dehman, Alia. 2015b. “Spatial Clustering of Linkage Disequilibrium blocks for Genome-Wide Association Studies.” Theses, Université d’Evry Val d’Essonne ; Université Paris-Saclay ; Laboratoire de Mathématiques et Modélisation d’Evry. https://tel.archives-ouvertes.fr/tel-01288568.

Excoffier, Laurent, and Montgomery Slatkin. 1995. “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population.” Molecular Biology and Evolution 12 (5): 921–27.

Kwak, Il-Youp, and Wei Pan. 2016. “Adaptive Gene- and Pathway-Trait Association Testing with GWAS Summary Statistics.” Bioinformatics 32: 1178–84. https://doi.org/10.1093/bioinformatics/btv719.

Lee, S., G. R. Abecasis, M. Boehnke, and X. Lin. 2014. “Rare-Variant Association Analysis: Study Designs and Statistical Tests.” American Journal of Human Genetics 95 (1): 5–23.

Lee, S., M. C. Wu, and X. Lin. 2012. “Optimal Tests for Rare Variant Effects in Sequencing Association Studies.” Biostatistics 13 (4): 762–75.

Li, Bingshan, and Suzanne M Leal. 2008. “Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data.” The American Journal of Human Genetics 83 (3): 311–21.

Lin, Xihong. 1997. “Variance Component Testing in Generalised Linear Models with Random Effects.” Biometrika 84 (2): 309–26.

Madsen, Bo Eskerod, and Sharon R Browning. 2009. “A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic.” PLoS Genetics 5 (2): e1000384.

Morgenthaler, Stephan, and William G Thilly. 2007. “A Strategy to Discover Genes That Carry Multi-Allelic or Mono-Allelic Risk for Common Diseases: A Cohort Allelic Sums Test (Cast).” Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 615 (1): 28–56.

ParÃ©, Guillaume, Senay Asma, and Wei Q. Deng. 2015. “Contribution of Large Region Joint Associations to Complex Traits Genetics.” PLOS Genetics 11. https://doi.org/10.1371/journal.pgen.1005103.

Petersen, Ashley, Carolina Alvarez, Scott DeClaire, and Nathan L. Tintle. 2013. “Assessing Methods for Assigning SNPs to Genes in Gene-Based Tests of Association Using Common Variants.” PLOS ONE 8. https://doi.org/10.1371/journal.pone.0062161.

Stanislas, Virginie, Cyril Dalmasso, and Christophe Ambroise. 2017. “Eigen-Epistasis for Detecting Gene-Gene Interactions.” BMC Bioinformatics 18 (1): 54.

Stephens, Matthew, Nicholas J Smith, and Peter Donnelly. 2001. “A New Statistical Method for Haplotype Reconstruction from Population Data.” The American Journal of Human Genetics 68 (4): 978–89.

Wu, M. C., P. Kraft, M. P. Epstein, D. M. Taylor, S. J. Chanock, D. J. Hunter, and X. Lin. 2010. “Powerful SNP-Set Analysis for Case-Control Genome-Wide Association Studies.” American Journal of Human Genetics 86 (6): 929–42.

Wu, M. C., S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. 2011. “Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test.” American Journal of Human Genetics 89 (1): 82–93.

Yoo, Yun Joo, Lei Sun, Julia G. Poirier, Andrew D. Paterson, and Shelley B. Bull. 2016. “Multiple Linear Combination (MLC) Regression Tests for Common Variants Adapted to Linkage Disequilibrium Structure: Yoo et Al.” Genetic Epidemiology 41. https://doi.org/10.1002/gepi.22024.

A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment.↩
Non-coding DNA (formerly referred as ’junk’ DNA) represents 99\(\%\) of the genome and does not provide instructions for making proteins. However, recent studies have shown that non-coding DNA sequences can act as regulatory elements like sites for transcription factors implied in the control of gene transcription (source: https://ghr.nlm.nih.gov/primer/basics/noncodingdna).↩