3.5 Limitations

The classical single marker analysis approach is subject to false positives (i.e. SNPs that are falsely identified as significant variables) because of the large number of tests performed simultaneously. One way around this problem is to apply a correction for multiple comparisons, as described in Section 2.6.5. Unfortunately, this increases the risk of missing true associations that have only a small effect on the phenotype, which is usually the case in GWAS. Indeed, simultaneously testing \(1 \times 10^5\) SNPs with single marker analysis at a family-wise error rate of 0.05 would require a p-value below \(5 \times 10^{-7}\) under a Bonferroni correction for a SNP to be considered significant; the threshold is somewhat higher with an FDR control method.
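The difference between the two corrections can be illustrated with a minimal Python sketch. The p-values below are made up for illustration (four small "signal" p-values plus an evenly spaced grid standing in for null SNPs), the function names are hypothetical, and the FDR procedure shown is the Benjamini-Hochberg step-up rule, assumed here as the FDR control method.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg_cutoff(p_values, alpha):
    """Largest p-value rejected by the Benjamini-Hochberg step-up procedure.

    Returns 0.0 when no hypothesis is rejected.
    """
    m = len(p_values)
    cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest p-value that satisfies p <= alpha * rank / m.
        if p <= alpha * rank / m:
            cutoff = p
    return cutoff


# Hypothetical data: 4 small p-values among 100,000 tests (not real GWAS results).
n_snps = 100_000
signals = [1e-9, 2e-7, 8e-7, 1.2e-6]
n_nulls = n_snps - len(signals)
nulls = [(i + 0.5) / n_nulls for i in range(n_nulls)]  # evenly spaced on (0, 1)
p_values = signals + nulls

alpha = 0.05
bonf = bonferroni_threshold(alpha, n_snps)
bh = benjamini_hochberg_cutoff(p_values, alpha)

n_bonf = sum(p <= bonf for p in p_values)  # SNPs declared significant by Bonferroni
n_bh = sum(p <= bh for p in p_values)      # SNPs declared significant by BH

print("Bonferroni threshold:", bonf, "hits:", n_bonf)
print("BH cutoff:", bh, "hits:", n_bh)
```

On these made-up values, the Bonferroni correction retains two of the four small p-values, whereas the Benjamini-Hochberg procedure retains all four, which illustrates how a stringent per-test threshold can miss weak associations.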

Another commonly used approach to multiple testing in GWAS relies on the concept of genome-wide significance. It is based on the distribution of LD in the genome for a specific population and considers that there is an “effective” number of independent genomic regions, and thus an effective number of statistical tests that should be corrected for. For European-descent populations, this threshold has been estimated at \(7.2 \times 10^{-8}\) (Dudbridge and Gusnanto 2008). This approach should however be used with caution, since the correction is only appropriate when hypotheses are tested at the genome scale. Candidate gene studies or replication studies with a focused hypothesis do not require correction to this level, as the number of effective, independent statistical tests is much lower than what is assumed for genome-wide significance (Bush and Moore 2012).
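As a rough worked example, suppose that this threshold corresponds to a family-wise error rate of \(\alpha = 0.05\) (an assumption made here for illustration; \(M_{\text{eff}}\) and \(\alpha_{\text{gw}}\) are notation introduced for this example only). The implied effective number of independent tests is then:

\[
M_{\text{eff}} = \frac{\alpha}{\alpha_{\text{gw}}} = \frac{0.05}{7.2 \times 10^{-8}} \approx 6.9 \times 10^{5}
\]

A study that tests far fewer independent hypotheses than this, such as a candidate gene study, would therefore be over-corrected if it applied the genome-wide threshold.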

As stated in Maher (2008), single marker analysis also faces other limitations:

  • It does not directly account for correlations among the predictors, although these correlations can be very strong as a result of linkage disequilibrium (LD). SNPs can be correlated even when they are not physically linked, because of population structure or epistasis (gene-by-gene interactions).

  • It does not account for epistasis, i.e. causal effects that are only observed when certain combinations of mutations are present in the genome.

  • It does not directly provide predictive models for estimating the genetic risk of the disease.

  • It focuses on identifying common markers with minor allele frequency (MAF) above \(5\%\), although it is likely that analysing low-frequency (\(0.5\% <\) MAF \(< 5\%\)) and rare (MAF \(<0.5\%\)) variants could explain additional disease risk or trait variability (Lee et al. 2014).
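The frequency classes mentioned in the last point can be made concrete with a short Python sketch. It assumes biallelic SNPs with genotypes coded as 0, 1 or 2 copies of one allele; the function names and the handling of values falling exactly on a boundary are illustrative choices, not part of the cited definitions.

```python
def minor_allele_frequency(genotypes):
    """MAF of a biallelic SNP from genotypes coded as 0, 1 or 2 allele copies."""
    # Each individual carries two alleles, hence the factor 2 in the denominator.
    allele_freq = sum(genotypes) / (2 * len(genotypes))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(allele_freq, 1 - allele_freq)


def maf_class(maf):
    """Classify a variant as rare, low-frequency or common from its MAF."""
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


# Hypothetical genotype vectors (not real data).
common_snp = [0] * 90 + [1] * 8 + [2] * 2   # 12 counted alleles out of 200
low_freq_snp = [0] * 97 + [1] * 3           # 3 counted alleles out of 200
rare_snp = [0] * 999 + [1] * 1              # 1 counted allele out of 2000
flipped_snp = [2] * 95 + [1] * 5            # counted allele is the major one

for name, snp in [("common_snp", common_snp), ("low_freq_snp", low_freq_snp),
                  ("rare_snp", rare_snp), ("flipped_snp", flipped_snp)]:
    maf = minor_allele_frequency(snp)
    print(name, round(maf, 4), maf_class(maf))
```

Under the usual MAF filter of \(5\%\), only the first of these hypothetical variants would be retained for single marker analysis.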

Uncovering some of the missing heritability can sometimes be achieved by taking into account correlations among variables, interaction with the environment, and epistasis, but this is rarely feasible in the context of GWAS because of the multiple testing burden and the high computational cost of such analyses (Manolio and Visscher 2009). Given these limitations, we propose in Chapter 4 a new approach that takes advantage of the correlation structure among SNPs to improve statistical power in GWAS.

References

Bush, William S, and Jason H Moore. 2012. “Genome-Wide Association Studies.” PLoS Computational Biology 8 (12): e1002822.

Dudbridge, Frank, and Arief Gusnanto. 2008. “Estimation of Significance Thresholds for Genomewide Association Scans.” Genetic Epidemiology 32 (3): 227–34.

Lee, S., G. R. Abecasis, M. Boehnke, and X. Lin. 2014. “Rare-Variant Association Analysis: Study Designs and Statistical Tests.” American Journal of Human Genetics 95 (1): 5–23.

Maher, B. 2008. “Personal Genomes: The Case of the Missing Heritability.” Nature News 456 (7218): 18–21.

Manolio, T. A., and P. M. Visscher. 2009. “Finding the Missing Heritability of Complex Diseases.” Nature 461 (7265): 747–53.