4.3 Numerical simulations

The performance evaluation described below was designed to assess the ability of our method to retrieve causal SNP or causal clusters of SNP under different simulation scenarios. For each scenario, we use a matrix \(\mathbf{Z}_{\text{HAPGEN}}\) of SNP generated by the software (Su, Marchini, and Donnelly 2011) with a sample size of 1000 individuals. This software allows to simulate an entire chromosome conditionally on a reference set of population haplotypes (from HapMap3) and an estimate of the fine-scale recombination rate across the region, so that the simulated data share similar patterns with the reference data. We generate the chromosome 1 (103 457 SNP) using the haplotype structure of CEU population (Utah residents with Northern and Western European ancestry from the CEPH collection) as reference set. The software allows to generate a controls-only matrix of SNP (no disease allele). We filtered this matrix according to the minor allele frequency to only keep SNP with a MAF greater than \(5\%\) thus reducing the size of \(\mathbf{Z}_{\text{HAPGEN}}\) to 60 179 SNP.

We generate a posteriori the phenotype using the logit model with a given set of causal SNP or cluster of SNP. The main difference between the different scenarios is to be found in the way that the case-control phenotype \(\mathbf{y}\) is simulated.

4.3.1 Simulation of the case-control phenotype

For each scenario, we simulated a case-control phenotype \(\mathbf{y}\) under a logistic regression model. The case-control phenotype is generated following a Bernoulli distribution function, following the conditional probability \(\mathbb{P}(\mathrm{Y}=1|\mathbf{H})\) with \(\mathbf{H} \in \mathbb{R}^{n \times \ell}\) a matrix constructed by sampling \(\ell\) causal variables from \(\mathbf{Z}_{\text{HAPGEN}}\).

The conditional probability is calculated using the logit model: \[\mathbb{P}(\mathrm{Y}=1|\mathbf{H}) = \frac{\exp(\beta_0 + \boldsymbol{\beta}^T \mathbf{H})} {1 + \exp(\beta_0 + \boldsymbol{\beta}^T \mathbf{H})},\] where \(\boldsymbol{\beta} = [\beta_1, \dots, \beta_\ell]\) is the vector of coefficients corresponding to the \(\ell\) predictors and \(\beta_0\) is the intercept defined as \({ \log \left(\frac{\pi}{(1-\pi)}\right)}\), with \(\pi\) the true prevalence of the disease in the population. The predictors are centered to have zero-mean before generating the vector of probability.

One way to have an association between the response and the predictors strong enough to be detected is to set large \(\boldsymbol{\beta}\) coefficients on the predictors. Indeed, there is a direct relationship between the odd ratio of a covariate and its corresponding coefficients in the logistic regression model given by \(OR_i = e^{(\boldsymbol\beta_i)}\) (Diaz-Quijano 2012). In our simulations, the difficulty of the problem, i.e. the power to detect an association, is linked to the number of causal predictors used to generate \(\mathbf{y}\) and the \(OR\) set to each predictor.

To simulate different scenarios, we considered the following parameters:

  1. Nature of the causal predictors:

    • Clusters of SNP: For each replicate, \(\ell = \lbrace 1, 2, 3 \rbrace\) genomic regions have been identified to be causal. These regions have been chosen among the matrix \(\mathbf{Z}_{\text{HAPGEN}}\) to have different levels of LD among the SNP that compose them. The average correlation coefficient among the SNP in these regions varies from \(r^2=0.6\) to \(r^2=0.85\) and the size of the region varies from 20 SNP to 60 SNP. Once identified, the causal regions were aggregated using the function described in Step 4.2.2 to construct a matrix \(\tilde{\mathbf{H}}\) of aggregated-SNP predictors. This matrix was then used to generate the case-control phenotype following \(\mathbb{P}(\mathrm{Y}=1|\tilde{\mathbf{H}})\). We will refer to this scenario as the SNPclus scenario.

    • Single SNP: In this scenario the phenotype was simulated by directly sampling SNP from the same causal regions identified in the SNPclus scenario. For each replicate, we chose 10 individuals SNP among each of these regions to construct a matrix \(\mathbf{H}\) with \(\ell = \lbrace 10 ,20 ,30 \rbrace\) single SNP predictors, depending on the number of causal regions . This matrix is then used to generate the case-control phenotype. The chosen SNP have a MAF varying from \(10 \%\) to \(30\%\). We will refer to this scenario as the singleSNP scenario.

  2. Number of causal predictors \(\ell\) and number of replicates:

    We performed \(5\) replicates for each combination \(\ell \times 2\) scenarios and we evaluate the average performance over these \(5\) replicates. For each scenario we considered from 1 to 3 causal genomic regions, thus, for SNPclus scenario, we used up to 3 causal aggregated-SNP predictors, and for the singleSNP scenario, up to \(10 \times 3 = 30\) causal single-SNP predictors to generate the phenotype.

  3. Odds ratio (\(\beta\) coefficients) of the causal predictors:

    For the SNPclus scenario we chose an equal OR of 2.7 for each causal aggregated predictor, corresponding to a \(\beta\) coefficient equal to 1. For the singleSNP scenario we chose an equal OR of 1.1 for each causal predictor, corresponding to a \(\beta\) coefficient equal to \(0.1\). The rationale behind these coefficients arises from the hypothesis that the combined effect of several low-effect SNP on the phenotype is stronger than the effects of each individual SNP.

As previously mentioned, we generated the phenotype using causal SNP simulated with the software. However, as commercial micro-arrays such as Affymetrix and Illumina arrays do not genotype the full sequence of the genome, some SNP are thereby unmapped and the marker density is in general lower than the HapMap marker density. That is why we chose, in our numerical simulation, to generate the phenotype with causal variables chosen from \(\mathbf{Z}_{\text{HAPGEN}}\) and to assess the performance of the methods using only those SNP which are mapped on a standard Affymetrix micro-array (about 23 823 mapped SNP in our case). By doing so, some causal SNP are not mapped on the commercial SNP set and thus simulations are more similar to real genome-wide analysis conditions.

4.3.2 Performance evaluation

4.3.2.1 Competitors

The objective of our method being to identify the optimal scale at which to perform association studies, we compared our proposal with several methods working at different genomic scales. The purpose is to assess the ability of each method to retrieve true causal genomic regions in the different simulation scenarios.

For each scenario, four approaches have been considered:

  • SKATtree, a SKAT model, as described in Section (rare-variant), which use our group definition,

  • SKATnotree, a SKAT model using an alternative group definition produced by successive chunks of 20 SNP,

  • SMA, the classical Single Marker Analysis described in Section 3.4,

  • SASA (Single Aggregated-SNP Analysis) a method close to SMA, where instead of testing the genotype-phenotype association using each single SNP, we are testing it using aggregated-SNP variables.

We chose to consider two different group definitions for SKAT in order to evaluate the impact of the group structure on the association findings. The comparison with SMA allows to highlight the advantage of working at a group scale. We hypothesize that grouping low-effect SNP should have a better statistical power than testing the main effects at single-SNP level.

For all methods, we compare the results using 2 types of multiple testing corrections: the methods of Holm-Bonferroni (Holm 1979b) and (Benjamini and Hochberg 1995).

4.3.2.2 True and False Positive definitions

The problem of retrieving true causal associations can be represented as a binary decision problem where the compared methods are considered as classifiers. The decision made by a binary classifier can be summarized using four numbers: True Positives (\(TP\)), False Positive (\(FP\)), True Negatives (\(TN\)) and False Negatives (\(FN\)). We represent True Positive Rate (\(\text{Recall or Power} ={TP/(FN+TP)}\)) versus Precision (\(\text{Precision}={TP/(FP+TP)}\)). In this context, a true positive corresponds to a true causal genomic region associated to significant p-value. The definition of what can be considered as the true causal genomic region may nevertheless be subject to some ambiguity. In GWAS, the presence of LD between SNP often leads to consider the signal associated to multiple neighbouring SNP as indicating the existence of a single genomic locus with possible influence on the phenotype.

In our simulations, a causal genomic region is defined a priori as a causal predictor in the logit model. However, since the clusters of SNP identified by our algorithm are not totally independent, some residual correlation may remain between clusters. This leads to question the notion of relevant variable when the variables are structured into strongly correlated groups. Should all the variables of the block be considered as explanatory, or should we define as only true positives the causal variables used to generate the phenotype? In order to circumvent this issue, we chose to relax the definition of a false positive joining the work of (Brzyski et al. 2017) and (Yi et al. 2015) where they propose to control the FDR in GWAS by considering significant SNP correlated to the true causal variables as true positives. For the simulation of the phenotype, we hypothesize an underlying multivariate regression model, but test for univariate model as it is the usual practice, which leads to reconsider the definition of true positive. As in (Yi et al. 2015) we consider the set of true positive as the union of the causal true positive and the linked true positive, which are regions adjacent to the causal regions and correlated with them at a level of at least 0.5. Regarding the single-marker analysis approach, since it works at the single SNP level, we compare it with the others in the singleSNP scenario only.

References

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B 57 (1): 289–300.

Brzyski, Damian, Christine B. Peterson, Piotr Sobczyk, Emmanuel J. CandÚs, Malgorzata Bogdan, and Chiara Sabatti. 2017. “Controlling the Rate of GWAS False Discoveries.” Genetics 205 (January). https://doi.org/10.1534/genetics.116.193987.

Diaz-Quijano, Fredi A. 2012. “A Simple Method for Estimating Relative Risk Using Logistic Regression".” BMC Medical Research Methodology 1 (1): 14.

Holm, Sture. 1979b. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6 (2): 65–70.

Su, Z., J. Marchini, and P. Donnelly. 2011. “HAPGEN2: Simulation of Multiple Disease Snps.” Bioinformatics 27 (16): 2304.

Yi, Hui, Patrick Breheny, Netsanet Imam, Yongmei Liu, and Ina Hoeschele. 2015. “Penalized Multimarker Vs. Single-Marker Regression Methods for Genome-Wide Association Studies of Quantitative Traits.” Genetics 199 (1): 205–22.