3.1 Introduction
Linkage analysis (Section 1.4) was the traditional approach to disease gene mapping: it studies the co-segregation of marker alleles with disease within large pedigrees or smaller families. This approach is efficient for locating genes that contribute to simple Mendelian disorders, where there is a strong relationship between the phenotype and the genotypes at the underlying functional polymorphisms. However, it proved less reliable for mapping complex diseases, because multiple interacting genes may underlie these phenotypes and the effects of these genes may vary with exposure to environmental and other non-genetic risk factors.
Whole Genome Association studies (WGA) focus on identifying genetic markers that occur with different frequencies between samples of unrelated affected individuals and unaffected controls. They exploit the fact that it is easier to establish large cohorts of affected individuals sharing a genetic risk factor for a complex disease across the whole population than within individual families, as is required for traditional linkage analysis. WGA rely on two types of association study: direct association and indirect association. On the one hand, direct association focuses on directly genotyping and studying polymorphisms that have a relatively high prior probability of functional relevance, such as non-synonymous polymorphisms [6], splice-site variants [7], and copy number polymorphisms (CNP [8]). On the other hand, indirect association, also referred to as Genome-Wide Association Study (GWAS), focuses on both functional SNP, such as non-synonymous SNP, and those flanking them. Even if the flanking SNP are themselves unlikely to be directly associated with the phenotype, at sufficiently high density one or more is likely to be correlated (i.e. in linkage disequilibrium, see Section 1.6) with the underlying causal variants.
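To make the two ideas above concrete, the following minimal sketch tests whether an allele occurs with different frequencies in cases and controls, and measures how strongly two loci are correlated. It is an illustration only, not the method used later in this work: the function names `allelic_association` and `ld_r2`, the choice of a Pearson chi-square test on allele counts, and all the counts are assumptions made for the example.

```python
import math


def allelic_association(case_risk, case_other, ctrl_risk, ctrl_other):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Rows are cases and controls, columns are the two alleles of one SNP.
    Returns (chi2, p_value, odds_ratio).
    """
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c > 0 else math.inf
    return chi2, p_value, odds_ratio


def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Squared correlation r^2 between two biallelic loci, from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = n_AB / n - p_A * p_B  # linkage disequilibrium coefficient D
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Hypothetical counts: 1000 cases and 1000 controls, two alleles per individual.
chi2, p_value, odds_ratio = allelic_association(600, 1400, 500, 1500)
print(f"chi2={chi2:.2f} p={p_value:.2g} OR={odds_ratio:.2f}")

# Hypothetical haplotype counts for a flanking SNP and a causal variant.
print(f"r2={ld_r2(40, 10, 10, 40):.2f}")
```

In an indirect association study the first test would be repeated for every genotyped SNP, so the resulting p-values need a correction for multiple testing. A flanking SNP with a high r^2 to the causal variant is expected to show a signal close to that of the causal variant itself.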
Furthermore, recent breakthroughs in micro-array technology have meant that hundreds of thousands of SNP can now be densely genotyped at moderate cost. As a result, it has become possible to characterize the genome of an individual with up to a million genetic markers. Rapid advances in DNA sequencing technologies have also made it possible to carry out exome and whole-genome sequencing studies of complex diseases. In this context, Genome-Wide Association Studies have been widely used to identify causal genomic variants [9] involved in the expression of different human diseases (rare, Mendelian or multifactorial diseases). Thanks to Next Generation Sequencing techniques, it is now possible to obtain the complete DNA sequence of an individual at a moderate cost, around $1000 in 2016 (Wetterstrand 2016), and in a very short time. Consequently, it is reasonable to think that genotyping a selected set of SNP will be abandoned in favour of the complete genotype, and it is therefore necessary to develop statistical methods that can handle this kind of massive data.
References
Wetterstrand, KA. 2016. “DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP).” www.genome.gov/sequencingcostsdata.
[6] A non-synonymous SNP is a SNP that modifies the protein sequence, as opposed to a synonymous SNP, which does not.
[7] A splice-site variant is a genetic alteration in the DNA sequence that occurs at the boundary of an exon and an intron (splice site). This change can disrupt RNA splicing, resulting in the loss of exons or the inclusion of introns, leading to an altered protein-coding sequence.
[8] A CNP is a normal variation in DNA due to the varying number of copies of a sequence within the DNA. Large-scale copy number polymorphisms are common and widely distributed throughout the genome.
[9] In the remainder of the paper, the terms variant, marker, locus, SNP and polymorphism are used interchangeably to refer to the variable studied in GWAS.