• Statistical learning for omics association and interactions studies based on blockwise feature compression
  • Abstract
  • General introduction
  • Notations
  • Abbreviations
  • 1 Basic concepts of molecular genetics
    • 1.1 Genome description
    • 1.2 Genome sequencing
      • 1.2.1 DNA sequencing
      • 1.2.2 Sequence assembly
    • 1.3 DNA polymorphism
      • 1.3.1 Restriction Fragment Length Polymorphisms (RFLP)
      • 1.3.2 Simple Sequence Length Polymorphisms (SSLP)
      • 1.3.3 Single Nucleotide Polymorphisms (SNP)
    • 1.4 Linkage and partial linkage for genetic mapping
    • 1.5 Basic concepts in population genetics
      • 1.5.1 Hardy-Weinberg equilibrium in large population
      • 1.5.2 Genetic drift in small population
      • 1.5.3 Concept of heritability
    • 1.6 Linkage disequilibrium
      • 1.6.1 Definition
      • 1.6.2 Measure of LD
      • 1.6.3 Estimation of linkage disequilibrium
      • 1.6.4 Origins of linkage disequilibrium
    • 1.7 Structure of haplotype blocks in the human genome
      • Definition of haplotype blocks
      • Patterns in human genome
  • 2 Statistical context
    • 2.1 Notations
    • 2.2 Concepts of statistical learning
      • 2.2.1 Prediction
    • 2.3 Parametric methods
      • 2.3.1 Linear models
      • 2.3.2 Penalized linear regression
      • 2.3.3 Generalized linear models
    • 2.4 Splines and generalized additive models: Moving beyond linearity
      • 2.4.1 Introduction
      • 2.4.2 Regression splines
      • 2.4.3 \(\mathrm{B}\)-splines
      • 2.4.4 Cubic smoothing splines
      • 2.4.5 Generalized additive models (GAM)
      • 2.4.6 High-dimensional generalized additive models (HGAM)
    • 2.5 Combining cluster analysis and variable selection
      • 2.5.1 Hierarchical clustering
      • 2.5.2 Hierarchical Clustering and Averaging Regression
      • 2.5.3 Multi-Layer Group-Lasso (MLGL)
    • 2.6 Statistical testing of significance
      • 2.6.1 Introduction
      • 2.6.2 \(\chi^2\) test
      • 2.6.3 Likelihood ratio test
      • 2.6.4 Calculation of p-values in GAM
      • 2.6.5 Multiple testing comparison
  • 3 Genome-Wide Association Studies
    • 3.1 Introduction
    • 3.2 Genotype quality control
      • 3.2.1 Deviation from HWE
      • 3.2.2 Missing data
      • 3.2.3 Distribution of test statistics
    • 3.3 Disease penetrance and odds ratio
    • 3.4 Single Marker Analysis
      • 3.4.1 Pearson’s \(\chi^2\) statistic
      • 3.4.2 Cochran-Armitage trend test
      • 3.4.3 Logistic regression and likelihood ratio test
    • 3.5 Limitations
    • 3.6 Population structure
      • 3.6.1 Genomic control
      • 3.6.2 Structured association
      • 3.6.3 Principal components correction
    • 3.7 Multi-locus analysis
      • 3.7.1 Haplotype-based approaches
      • 3.7.2 Rare-variant association analysis
      • 3.7.3 LD based approach to variable selection in GWAS
  • 4 Learning the Optimal Scale in GWAS through hierarchical SNP aggregation
    • 4.1 Related work
    • 4.2 Method
      • 4.2.1 Step 1. Constrained-HAC
      • 4.2.2 Step 2. Dimension reduction function
      • 4.2.3 Step 3. Optimal number of groups estimation
      • 4.2.4 Step 4. Multiple testing on aggregated-SNP variables
    • 4.3 Numerical simulations
      • 4.3.1 Simulation of the case-control phenotype
      • 4.3.2 Performance evaluation
    • 4.4 Results
      • 4.4.1 Results and discussions of the numerical simulations
      • 4.4.2 Performance results for simulated data
      • 4.4.3 Application in Wellcome Trust Case Control Consortium (WTCCC) and Ankylosing Spondylitis (AS) studies
      • 4.4.4 Results in WTCCC and AS studies
    • 4.5 Generalized additive models in GWAS
      • 4.5.1 Comparison of predictive power
      • 4.5.2 Results of univariate smoothing splines on aggregated-SNP
    • 4.6 Discussion
  • 5 Selection of interaction effects in compressed multiple omics representation
    • 5.1 Introduction
      • 5.1.1 Background
      • 5.1.2 Combining genome and metagenome analyses
      • 5.1.3 Taking structures into account in association studies
    • 5.2 Learning with complementary datasets
      • 5.2.1 Setting and notations
      • 5.2.2 Interactions in linear models
      • 5.2.3 Compact model
      • 5.2.4 Recovering relevant interactions
    • 5.3 Method
      • 5.3.1 Preprocessing of the data
      • 5.3.2 Preprocessing of metagenomic data
      • 5.3.3 Structuring the data
      • 5.3.4 Using the structure efficiently
      • 5.3.5 Identification of relevant supervariables
    • 5.4 Numerical simulations
      • 5.4.1 Data generation
      • 5.4.2 Generation of the phenotype
      • 5.4.3 Comparison of methods
      • 5.4.4 Evaluation metrics
      • 5.4.5 Performance results
      • 5.4.6 Computational time
    • 5.5 Application on real data: rhizosphere of Medicago truncatula
      • 5.5.1 Material
      • 5.5.2 Analysis
      • 5.5.3 Results
      • 5.5.4 Results on Root Shoot Ratio
      • 5.5.5 Results on Specific Nitrogen Uptake
    • 5.6 Discussions
  • Conclusions
    • Discussions on LEOS algorithm
    • Discussions on SICOMORE algorithm
    • Perspectives
  • Annexes
  • A Derivation of the MSE bias-variance decomposition
  • B Linear smoother (Buja, Hastie, and Tibshirani 1989)
  • C Smoothing parameter \(\lambda\) for smoothing splines
  • D \(\mathit{B}\)-spline basis
  • References