3.6 Population structure

One of the most important covariates to consider in GWAS is a measure of population structure which, if not accounted for, can inflate the false positive rate. As stated in Section @ref(#originLD), population stratification has an important impact on patterns of LD, and allele frequencies are highly variable across human subpopulations. In a sample containing multiple strata, SNPs whose allele frequencies differ between strata will therefore appear to be associated with the trait whenever the trait prevalence also differs between strata, even if there is no association within any stratum. Several methods have been developed to identify and adjust for population stratification, of which the most commonly used are genomic control, structured association and principal components correction (Balding, Bishop, and Cannings 2008).

3.6.1 Genomic control

Under the null hypothesis of no disease association, the Cochran–Armitage test statistic \(\chi_{CA}^2\) follows a \(\chi^2\) distribution with 1 d.f. However, in a stratified population, we expect allele frequencies to differ between strata at many SNPs and hence an excess of false positive signals of association. As a result, the observed distribution of association statistics will be inflated by a genomic inflation factor \(\lambda\) (Devlin and Roeder 1999). This factor is defined as the ratio of the median of the empirically observed distribution of the test statistic to its expected median under the null, which is approximately 0.456 for a \(\chi^2\) distribution with 1 d.f.: \[\lambda=\text{median}(\chi_{CA}^2)/0.456.\]

The genomic control method accounts for structure by linearly rescaling the observed test statistics to approximately restore the \(\chi^2\) null distribution with 1 d.f.: \[\chi^2_{CA-adj}=\chi_{CA}^2/\lambda.\]
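The two formulas above can be illustrated with a minimal sketch. It assumes the test statistics are already available as a plain list of 1 d.f. statistics; the function name `genomic_control` and the simulated statistics (null draws artificially inflated by a factor of 1.3) are hypothetical and serve only as an illustration.

```python
import random
import statistics

# Expected median of a chi-square distribution with 1 d.f., as used in the text.
EXPECTED_MEDIAN = 0.456


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for a list of 1 d.f. test statistics."""
    lam = statistics.median(chi2_stats) / EXPECTED_MEDIAN
    adjusted = [s / lam for s in chi2_stats]
    return lam, adjusted


# Hypothetical data: squared standard normal draws are chi-square(1) under the
# null; multiplying by 1.3 mimics inflation caused by stratification.
random.seed(1)
null_stats = [random.gauss(0.0, 1.0) ** 2 for _ in range(50_000)]
inflated = [1.3 * s for s in null_stats]

lam, adjusted = genomic_control(inflated)
print(f"estimated lambda: {lam:.3f}")
print(f"median after adjustment: {statistics.median(adjusted):.3f}")
```

Because the same \(\lambda\) divides every statistic, the ranking of the SNPs is unchanged; only the scale of the statistics, and hence the p-values, is affected.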

3.6.2 Structured association

The method known as structured association, implemented in the STRUCTURE software (Pritchard, Stephens, and Donnelly 2000), uses an admixture model¹⁰ in which the proportion of an individual’s genome originating from each of \(K\) ancestral strata is treated as unknown. The posterior distribution of ancestry for each individual is then approximated using Bayesian Markov chain Monte Carlo (MCMC) methods based on genotype information from several hundred genome-wide SNPs, and the estimated ancestry proportions are included as covariates in a logistic regression framework. This approach has two main drawbacks: the number of ancestral subpopulations \(K\) must be inferred using an ad hoc estimation procedure, and the computational load of the MCMC algorithm is such that it cannot accommodate the number of markers commonly used in GWAS.

3.6.3 Principal components correction

This method uses Principal Component Analysis (PCA) to detect and correct for population structure. In PCA, the first few principal components, calculated from the eigen-decomposition of a covariance matrix computed from the genotypes, explain the greatest amount of variation in the data, and the technique has long been used to study population structure in genetic data (Reich, Price, and Patterson 2008). In GWAS, PCA has been used to explicitly model ancestry differences between cases and controls along continuous axes of variation, and the first principal components may be used as covariates in a logistic regression model to adjust for the effect of population structure. Because PCA is computationally efficient, this approach has the advantage that it can be applied to datasets with more than \(10^5\) SNPs.
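As a rough illustration of how the leading principal components capture ancestry, the sketch below simulates genotypes for two hypothetical subpopulations with different allele frequencies and computes the top components with a singular value decomposition. The function name `top_principal_components`, the sample sizes and the allele-frequency model are assumptions made for this example; they are not taken from any particular software.

```python
import numpy as np


def top_principal_components(genotypes, k):
    """Scores of the top-k PCs for an (individuals x SNPs) matrix coded 0/1/2."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)              # centre each SNP
    sd = g.std(axis=0)
    keep = sd > 0                       # drop monomorphic SNPs
    g = g[:, keep] / sd[keep]           # scale each SNP to unit variance
    # The left singular vectors of g are the eigenvectors of g @ g.T, the
    # (unnormalised) covariance matrix between individuals.
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :k] * s[:k]


# Hypothetical data: two subpopulations whose allele frequencies differ slightly.
rng = np.random.default_rng(0)
n_per_pop, n_snp = 100, 500
freq_a = rng.uniform(0.1, 0.9, n_snp)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snp), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, (n_per_pop, n_snp)),
    rng.binomial(2, freq_b, (n_per_pop, n_snp)),
])
pop = np.repeat([0, 1], n_per_pop)

pcs = top_principal_components(geno, k=2)
pc1 = pcs[:, 0]
print("mean PC1, subpopulation 0:", pc1[pop == 0].mean())
print("mean PC1, subpopulation 1:", pc1[pop == 1].mean())
```

In this toy setting the first component separates the two simulated subpopulations, which is what makes it usable as a covariate for ancestry.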

The software EIGENSTRAT (Price et al. 2006) uses this approach by computing an adjusted test statistic defined as follows: \[\chi^2_{eigen} = (n-k-1) r^2(\mathbf{z}_m^{adj},\mathbf{y}^{adj}),\] where \(n\) is the number of individuals, \(r^2\) is the squared correlation and \(\mathbf{z}_m^{adj}\) is the adjusted genotype at marker \(m\), defined as the residuals after regressing genotypes on the top \(k\) principal components. The adjusted phenotype \(\mathbf{y}^{adj}\) is similarly defined. The test statistic \(\chi^2_{eigen}\) approximately follows a \(\chi^2\) distribution with 1 d.f. under the null hypothesis of no association. It has been shown that the EIGENSTRAT method has higher power than genomic control because the correction in EIGENSTRAT is specific to each candidate marker's variation in frequency across ancestral populations, which minimizes spurious associations while maximizing power to detect true associations (Price et al. 2006).
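The adjusted statistic can be sketched in a few lines. This is an illustration of the formula above and not the EIGENSTRAT implementation: the function names `residualize` and `eigenstrat_stat` are hypothetical, an intercept is assumed in the regressions, and a known ancestry indicator stands in for the first principal component in the toy data.

```python
import numpy as np


def residualize(v, pcs):
    """Residuals of v after least-squares regression on the PCs plus an intercept."""
    design = np.column_stack([np.ones(len(v)), pcs])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef


def eigenstrat_stat(z, y, pcs):
    """(n - k - 1) * r^2 between ancestry-adjusted genotype and phenotype."""
    n, k = pcs.shape
    z_adj = residualize(np.asarray(z, dtype=float), pcs)
    y_adj = residualize(np.asarray(y, dtype=float), pcs)
    r = np.corrcoef(z_adj, y_adj)[0, 1]
    return (n - k - 1) * r ** 2


# Hypothetical data: genotype and phenotype both depend on ancestry,
# but the SNP has no effect on the phenotype within either stratum.
rng = np.random.default_rng(0)
n = 1000
ancestry = np.repeat([0.0, 1.0], n // 2)
z = rng.binomial(2, 0.2 + 0.5 * ancestry)   # allele frequency 0.2 vs 0.7
y = rng.binomial(1, 0.2 + 0.5 * ancestry)   # prevalence 0.2 vs 0.7
pcs = ancestry.reshape(-1, 1)               # stand-in for the first PC

naive = n * np.corrcoef(z, y)[0, 1] ** 2    # no adjustment for ancestry
adjusted = eigenstrat_stat(z, y, pcs)
print(f"unadjusted statistic: {naive:.1f}")
print(f"adjusted statistic:   {adjusted:.1f}")
```

In this simulated example the unadjusted statistic is large purely because of stratification, whereas the adjusted statistic should fall in the range expected for a \(\chi^2\) variable with 1 d.f.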

References

Balding, David J, Martin Bishop, and Chris Cannings. 2008. Handbook of Statistical Genetics. John Wiley & Sons.

Devlin, Bernie, and Kathryn Roeder. 1999. “Genomic Control for Association Studies.” Biometrics 55 (4): 997–1004.

Price, Alkes L, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. 2006. “Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies.” Nature Genetics 38 (8): 904–9.

Pritchard, Jonathan K, Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59.

Reich, David, Alkes L Price, and Nick Patterson. 2008. “Principal Component Analysis of Genetic Data.” Nature Genetics 40 (5): 491.

¹⁰ An admixture model is a statistical model that takes into account the phenomenon known as population admixture (see Section @ref(#originLD)).