Statistical learning for omics association and interaction studies based on blockwise feature compression
Abstract
General introduction
Notations
Abbreviations
1
Basic concepts of molecular genetics
1.1
Genome description
1.2
Genome sequencing
1.2.1
DNA sequencing
1.2.2
Sequence assembly
1.3
DNA polymorphism
1.3.1
Restriction Fragment Length Polymorphisms (RFLP)
1.3.2
Simple Sequence Length Polymorphisms (SSLP)
1.3.3
Single Nucleotide Polymorphisms (SNP)
1.4
Linkage and partial linkage for genetic mapping
1.5
Basic concepts in population genetics
1.5.1
Hardy-Weinberg equilibrium in large populations
1.5.2
Genetic drift in small populations
1.5.3
Concept of heritability
1.6
Linkage disequilibrium
1.6.1
Definition
1.6.2
Measure of LD
1.6.3
Estimation of linkage disequilibrium
1.6.4
Origins of linkage disequilibrium
1.7
Structure of haplotype blocks in the human genome
Definition of haplotype blocks
Patterns in human genome
2
Statistical context
2.1
Notations
2.2
Concepts of statistical learning
2.2.1
Prediction
2.3
Parametric methods
2.3.1
Linear models
2.3.2
Penalized linear regression
2.3.3
Generalized linear models
2.4
Splines and generalized additive models: Moving beyond linearity
2.4.1
Introduction
2.4.2
Regression splines
2.4.3
\(\mathrm{B}\)-splines
2.4.4
Cubic smoothing splines
2.4.5
Generalized additive models (GAM)
2.4.6
High-dimensional generalized additive models (HGAM)
2.5
Combining cluster analysis and variable selection
2.5.1
Hierarchical clustering
2.5.2
Hierarchical Clustering and Averaging Regression
2.5.3
Multi-Layer Group-Lasso (MLGL)
2.6
Statistical testing of significance
2.6.1
Introduction
2.6.2
\(\chi^2\) test
2.6.3
Likelihood ratio test
2.6.4
Calculation of p-values in GAM
2.6.5
Multiple testing comparison
3
Genome-Wide Association Studies
3.1
Introduction
3.2
Genotype quality control
3.2.1
Deviation from HWE
3.2.2
Missing data
3.2.3
Distribution of test statistics
3.3
Disease penetrance and odds ratio
3.4
Single Marker Analysis
3.4.1
Pearson’s \(\chi^2\) statistic
3.4.2
Cochran-Armitage trend test
3.4.3
Logistic regression and likelihood ratio test
3.5
Limitations
3.6
Population structure
3.6.1
Genomic control
3.6.2
Structured association
3.6.3
Principal components correction
3.7
Multi-locus analysis
3.7.1
Haplotype-based approaches
3.7.2
Rare-variant association analysis
3.7.3
LD-based approach to variable selection in GWAS
4
Learning the Optimal Scale in GWAS through hierarchical SNP aggregation
4.1
Related work
4.2
Method
4.2.1
Step 1. Constrained-HAC
4.2.2
Step 2. Dimension reduction function
4.2.3
Step 3. Optimal number of groups estimation
4.2.4
Step 4. Multiple testing on aggregated-SNP variables
4.3
Numerical simulations
4.3.1
Simulation of the case-control phenotype
4.3.2
Performance evaluation
4.4
Results
4.4.1
Results and discussions of the numerical simulations
4.4.2
Performance results for simulated data
4.4.3
Application in Wellcome Trust Case Control Consortium (WTCCC) and Ankylosing Spondylitis (AS) studies
4.4.4
Results in WTCCC and AS studies
4.5
Generalized additive models in GWAS
4.5.1
Comparison of predictive power
4.5.2
Results of univariate smoothing splines on aggregated-SNP
4.6
Discussion
5
Selection of interaction effects in compressed multiple omics representation
5.1
Introduction
5.1.1
Background
5.1.2
Combining genome and metagenome analyses
5.1.3
Taking structures into account in association studies
5.2
Learning with complementary datasets
5.2.1
Setting and notations
5.2.2
Interactions in linear models
5.2.3
Compact model
5.2.4
Recovering relevant interactions
5.3
Method
5.3.1
Preprocessing of the data
5.3.2
Preprocessing of metagenomic data
5.3.3
Structuring the data
5.3.4
Using the structure efficiently
5.3.5
Identification of relevant supervariables
5.4
Numerical simulations
5.4.1
Data generation
5.4.2
Generation of the phenotype
5.4.3
Comparison of methods
5.4.4
Evaluation metrics
5.4.5
Performance results
5.4.6
Computational time
5.5
Application on real data: rhizosphere of Medicago truncatula
5.5.1
Material
5.5.2
Analysis
5.5.3
Results
5.5.4
Results on Root Shoot Ratio
5.5.5
Results on Specific Nitrogen Uptake
5.6
Discussions
Conclusions
Discussions on LEOS algorithm
Discussions on SICOMORE algorithm
Perspectives
Annexes
A
Derivation of the MSE bias-variance decomposition
B
Linear smoother (Buja, Hastie, and Tibshirani 1989)
C
Smoothing parameter \(\lambda\) for smoothing splines
D
\(\mathit{B}\)-spline basis
References