2.6 Statistical testing of significance

2.6.1 Introduction

In statistical hypothesis testing, statistical significance refers to the acceptance or rejection of the null hypothesis and corresponds to the likelihood that the difference between a given variation and the baseline is not due to random chance. For a given study, the defined level of significance \(\alpha\) is the probability of rejecting the null hypothesis when it is true, and the p-value, \(p\), is the probability of obtaining a result at least as extreme as the one observed, given that \(H_0\) is true. We can therefore state that the result is statistically significant, by the standard of the study, if \(p < \alpha\).

Ronald Fisher first advanced the idea of statistical hypothesis testing in his famous publication Statistical Methods for Research Workers (Fisher 1935). He suggested a probability of \(5\%\) as an acceptable threshold for rejecting the null hypothesis, and this cut-off was later adopted by Jerzy Neyman and Egon Pearson in (Neyman and Pearson 1933), where they named it the significance level \(\alpha\).

They proposed the following hypothesis testing procedure:

  1. Before getting the experimental measures:
  • Define the null hypothesis \(H_0\) and the alternative hypothesis \(H_1\).

  • Choose a level \(\alpha\).

  • Choose a test statistic, \(T\), that tends to be larger under \(H_1\) than under \(H_0\), so that the test takes the form: \[\text{Reject } H_0 \Leftrightarrow T \geq u.\]

  • Study the distribution of \(T\) under \(H_0\) and set the following condition:

\[\mathbb{P}(T \geq u) \leq \alpha .\]

  • Deduce the threshold \(u\).

  • State the test with the retained value of \(u\) and its actual level:

\[ \text{Reject } H_0 \Leftrightarrow T \geq u .\]

  2. Once the measures are done:
  • Compute the observed value \(t_{obs}\) of the statistic and conclude whether to accept or reject \(H_0\) based on the \(p\text{-value} = \mathbb{P}(T \geq t_{obs} \mid H_0)\).

with

  • Type I error: \(\alpha = \mathbb{P}(\text{accept } H_1 \mid H_0 \text{ is true})\),

  • Type II error: \(\beta = \mathbb{P}(\text{accept } H_0 \mid H_1 \text{ is true})\),

  • Power of the test: \(1 - \beta = \mathbb{P}(\text{accept } H_1 \mid H_1 \text{ is true})\).

and the confusion matrix defined in Table 2.1.

Table 2.1: Confusion matrix
                     \(H_0\) true       \(H_1\) true
\(H_0\) accepted     True Negative      False Negative
\(H_1\) accepted     False Positive     True Positive
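A minimal numerical sketch of this procedure, assuming a one-sided test on the mean of a normal sample with known variance (all values below are illustrative), could look as follows:

```python
import numpy as np
from scipy import stats

alpha = 0.05                      # chosen significance level
n, sigma = 50, 1.0                # sample size and known standard deviation

# Under H0: mu = 0, the statistic T = sqrt(n) * mean(x) / sigma is standard normal.
u = stats.norm.ppf(1 - alpha)     # threshold u such that P(T >= u | H0) = alpha

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=sigma, size=n)   # simulated data drawn under H1 (true mean 0.3)

t_obs = np.sqrt(n) * x.mean() / sigma
p_value = stats.norm.sf(t_obs)    # p-value = P(T >= t_obs | H0)

print(f"u = {u:.3f}, t_obs = {t_obs:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "accept H0")
```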

2.6.2 \(\chi^2\) test

The chi-squared test, also written as \(\chi^2\) test, is a statistical hypothesis test developed by Karl Pearson and first published in (Pearson 1900). It is used when the sampling distribution of the test statistic under the null hypothesis follows a chi-squared distribution.

The \(\chi^2\) distribution with \(D\) degrees of freedom is the distribution of a sum of the squares of \(D\) independent standard normal random variables. If \(\mathrm{X}_1, ..., \mathrm{X}_D\) are independent, standard normal random variables, then the sum of their squares: \[Z =\sum _{d=1}^{D} \mathrm{X}_d^2,\] is distributed according to the \(\chi^2\) distribution with \(D\) degrees of freedom. This is usually denoted as \(Z \sim \chi^{2}(D)\) or \(Z \sim \chi_D^2\). The chi-squared distribution has one parameter: \(D\), a positive integer that specifies the number of degrees of freedom.

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. Test statistics that follow a chi-squared distribution arise from an assumption of independent normally distributed data, which is valid in many cases due to the central limit theorem.
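As an illustration, a goodness-of-fit \(\chi^2\) test comparing observed and expected category counts can be carried out with scipy; the counts below are made up for the example:

```python
from scipy import stats

observed = [44, 56, 50, 50]            # hypothetical observed counts in 4 categories
expected = [50, 50, 50, 50]            # counts expected under H0 (uniform distribution)

# The statistic sum((O - E)^2 / E) follows a chi-squared distribution with 3 degrees
# of freedom (number of categories minus one) under H0.
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}")
```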

2.6.3 Likelihood ratio test

The likelihood ratio test is used for comparing the goodness of fit of two statistical models, a null model against an alternative model. The log-likelihood ratio statistic is generally used to compute a p-value to decide whether or not to reject the null model.

Given the null \(H_0 : \theta = \theta_0\) and the alternative hypothesis \(H_1 : \theta = \theta_1\) for a statistical model \(f(\boldsymbol{x}|\theta)\), the likelihood ratio is defined as

\[\Lambda(\boldsymbol{x}) = \frac{l(\theta_0 | \boldsymbol{x})}{l(\theta_1 | \boldsymbol{x})},\] where \(\theta \mapsto l(\theta|\boldsymbol{x})\) is the likelihood function and with \(\alpha = \mathbb{P}(\Lambda(\boldsymbol{x}) \leq u | H_0)\) the significance level at a threshold \(u\).

In practice we define the test statistic as \[\begin{aligned} T & = -2 \log \left( \frac{l(\theta_0 | \boldsymbol{x})}{l(\theta_1 | \boldsymbol{x})} \right) \\ & = 2 \times [\log(l(\theta_1 | \boldsymbol{x})) - \log(l(\theta_0 | \boldsymbol{x}))]\end{aligned}\]

The Neyman-Pearson lemma, introduced in (Neyman and Pearson 1933), states that the likelihood ratio test is the most powerful test at significance level \(\alpha\) for testing a simple null against a simple alternative hypothesis.
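A minimal sketch of such a test for a simple null against a simple alternative, assuming i.i.d. \(\mathcal{N}(\theta, 1)\) observations with illustrative values \(\theta_0 = 0\) and \(\theta_1 = 1\), and a rejection threshold calibrated by simulation under \(H_0\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0, theta1, n, alpha = 0.0, 1.0, 30, 0.05

def T(x):
    """T = 2 * [log l(theta1 | x) - log l(theta0 | x)] for an i.i.d. N(theta, 1) model."""
    ll0 = stats.norm.logpdf(x, loc=theta0, scale=1.0).sum()
    ll1 = stats.norm.logpdf(x, loc=theta1, scale=1.0).sum()
    return 2.0 * (ll1 - ll0)

# Calibrate the threshold u so that P(T >= u | H0) = alpha, by Monte Carlo under H0.
T_null = np.array([T(rng.normal(theta0, 1.0, n)) for _ in range(10_000)])
u = np.quantile(T_null, 1 - alpha)

x_obs = rng.normal(0.6, 1.0, n)        # illustrative observed sample
t_obs = T(x_obs)
p_value = (T_null >= t_obs).mean()     # Monte Carlo estimate of P(T >= t_obs | H0)
print(f"u = {u:.2f}, t_obs = {t_obs:.2f}, p-value = {p_value:.4f}")
```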

2.6.4 Calculation of p-values in GAM

Let \(\boldsymbol{\beta}_j \in \mathbb{R}^K\) be the vector of the \(K\) coefficients of a single smooth term \(j\) and \(\mathbf{V}_{\boldsymbol{\beta}_j}\) the covariance matrix of \(\hat{\boldsymbol{\beta}_j}\). In the context of generalized additive models, if the covariates of the smooth are uncorrelated with the other smooth terms in the model, then under the null hypothesis below \(\mathbb{E}(\hat{\boldsymbol{\beta}_j}) = 0\); otherwise the bias is small and \(\mathbb{E}(\hat{\boldsymbol{\beta}_j}) \simeq 0\).

Under the null hypothesis \(H_0: \boldsymbol{\beta}_j = 0\) we have \[\hat{\boldsymbol{\beta}_j} \thicksim \mathcal{N}(0, \mathbf{V}_{\boldsymbol{\beta}_j}).\]

It follows that if \(\mathbf{V}_{\boldsymbol{\beta}_j}\) is of full rank, then under the null hypothesis

\[\hat{\boldsymbol{\beta}_j}^T \mathbf{V}_{\boldsymbol{\beta}_j}^{-1} \hat{\boldsymbol{\beta}_j} \thicksim \chi^2_K.\]

However, applying a penalty on the coefficients of the smooth, as is the case with smoothing splines, often suppresses some dimensions of the parameter space, and consequently the covariance matrix \(\mathbf{V}_{\boldsymbol{\beta}_j}\) is not of full rank. If so, the test is performed using the rank-\(r\), \(r = \text{rank}(\mathbf{V}_{\boldsymbol{\beta}_j})\), pseudo-inverse \(\mathbf{V}_{\boldsymbol{\beta}_j}^{r-}\) of the covariance matrix and, under the null, \[\hat{\boldsymbol{\beta}_j}^T \mathbf{V}_{\boldsymbol{\beta}_j}^{r-} \hat{\boldsymbol{\beta}_j} \thicksim \chi^2_r.\]
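A minimal sketch of this Wald-type test, assuming illustrative values for \(\hat{\boldsymbol{\beta}_j}\) and a rank-deficient \(\mathbf{V}_{\boldsymbol{\beta}_j}\) (in practice both quantities come from the fitted model):

```python
import numpy as np
from scipy import stats

beta_hat = np.array([0.8, -0.3, 0.1])                # illustrative smooth coefficients
V = np.array([[0.20, 0.05, 0.00],
              [0.05, 0.10, 0.00],
              [0.00, 0.00, 0.00]])                   # rank-deficient covariance matrix

r = np.linalg.matrix_rank(V)                          # rank r of the covariance matrix
V_pinv = np.linalg.pinv(V)                            # rank-r pseudo-inverse
T_stat = float(beta_hat @ V_pinv @ beta_hat)          # Wald statistic beta^T V^- beta
p_value = stats.chi2.sf(T_stat, df=r)                 # P(chi2_r >= T_stat)
print(f"rank = {r}, T = {T_stat:.3f}, p-value = {p_value:.4f}")
```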

As stated in (Wood 2006), as long as the p-values give a clear-cut result it is usually safe to rely on them, but when they are close to the threshold for accepting or rejecting the null, they must be treated carefully. Indeed, since the uncertainty in the smoothing parameter estimation has been neglected in the reference distributions used for testing, these distributions are typically too narrow and attribute too low a probability to moderately high values of the test statistic. In that case, to obtain more accurate p-values, it may be preferable to perform the test on overspecified unpenalized models, even if this induces a cost in terms of statistical power.

2.6.5 Multiple testing comparison

In some contexts, as is the case with the analysis of gene expression data or in Genome-Wide Association Studies (GWASs) for instance, we may need to perform a very large number \(D\) of tests simultaneously, indexed by \(d \in \lbrace 1,\dots,D \rbrace\), and therefore obtain the same large number of p-values. If we reject, for the \(d^{th}\) test, the null hypothesis \(H_{0,d}\) when its associated p-value \(\hat{p}_d\) is not larger than \(\alpha\), then for each test \(d\), the probability of wrongly rejecting \(H_{0,d}\) is at most \(\alpha\). Nevertheless, if we consider the \(D\) tests simultaneously, the number of hypotheses \(H_{0,d}\) wrongly rejected (false positives or Type I errors) can be very large. Actually, the expectation of the number of false positives is given by:

\[\mathbb{E}[\text{False Positives}] = \sum_{d \,:\, H_{0,d} \text{ true}} \mathbb{P}_{H_{0,d}}(T_d \geq u_{\alpha}) = \text{card} \lbrace d:H_{0,d} \text{ is true} \rbrace \times \alpha,\] if the threshold \(u_{\alpha}\) is such that \(\mathbb{P}_{H_{0,d}}(T_d \geq u_{\alpha}) = \alpha\) for every \(d\). For instance, for a typical value of \(\alpha = 5 \%\) and \(\text{card} \lbrace d:H_{0,d} \text{ is true} \rbrace = 1000\), we obtain on average 50 false positives. It is therefore necessary to adjust the threshold \(u_{\alpha}\) at which we reject the null hypothesis in order to control the number of false positives while not losing too much power.
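A quick simulation of this order of magnitude, assuming all 1000 null hypotheses are true so that the p-values are uniform on \([0,1]\):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, D = 0.05, 1000

# Under H0 the p-values are uniform on [0, 1], so each one is rejected with probability alpha.
p_values = rng.uniform(size=D)
false_positives = np.sum(p_values <= alpha)
print(f"false positives: {false_positives} (expected {alpha * D:.0f})")
```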

Controlling the Family-Wise Error Rate

There exist many adjustment methods for multiple testing, including controls of the Family-Wise Error Rate (FWER), i.e. the probability of wrongly rejecting at least one true null hypothesis, noted \[\text{FWER} = \mathbb{P}(\text{card(False Positives)} \geq 1).\]

  • Bonferroni procedure:

The most commonly used method for controlling the FWER is the Bonferroni method (Bonferroni 1936). The test of each \(H_{0,d}\) is controlled so that the probability of a Type I error is less than or equal to \(\alpha/D\), ensuring that the overall FWER is less than or equal to a given \(\alpha\); equivalently, the adjusted p-values are \(p_d^{adj} = \min(1, D \, p_d)\).

  • Šidák method:

The method of (Šidák 1967) is closely related to Bonferroni’s procedure, where the p-values are adjusted as: \[p_d^{adj} = 1 - (1-p_d)^D,\] where \(p_d\) is the unadjusted p-value for the \(d^{th}\) test.

  • Holm method:

A less conservative adjustment method is the (Holm 1979a) method, which orders the p-values and makes successively smaller adjustments. Let the ordered p-values be denoted by \(p_{1} \leq p_{2} \leq \dots \leq p_{D}\). Then, the Holm method calculates the adjusted p-values by \[\begin{aligned} & p_{1}^{adj} = D \times p_{1}, & \nonumber\\ & p_{d}^{adj} = \text{max} \lbrace p_{d-1}^{adj},(D-d+1) \times p_{d} \rbrace, \ 2 \leq d \leq D.&\end{aligned}\]
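The three adjustments can be sketched as follows on an illustrative set of raw p-values; equivalent adjustments are also available in statsmodels via multipletests with methods 'bonferroni', 'sidak' and 'holm':

```python
import numpy as np

p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])  # raw p-values
D = len(p)

p_bonferroni = np.minimum(1.0, D * p)               # p_d_adj = min(1, D * p_d)
p_sidak = 1.0 - (1.0 - p) ** D                      # p_d_adj = 1 - (1 - p_d)^D

# Holm: order the p-values, apply decreasing factors D, D-1, ..., 1,
# enforce monotonicity of the adjusted values, then restore the original order.
order = np.argsort(p)
factors = D - np.arange(D)
p_holm_sorted = np.minimum(1.0, np.maximum.accumulate(factors * p[order]))
p_holm = np.empty(D)
p_holm[order] = p_holm_sorted

print(np.round(p_bonferroni, 3))
print(np.round(p_sidak, 3))
print(np.round(p_holm, 3))
```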

The principal issue with these approaches is that they control the probability of at least one false positive regardless of the number of hypotheses being tested. They reduce the number of Type I errors but tend to be very conservative, in the sense that the number of Type II errors is increased, resulting in a loss of power. That is why less conservative methods are preferred in high-dimensional settings.

Controlling the False Discovery Rate

The False Discovery Proportion (FDP) corresponds to the proportion of false positives among the positives, FP/(FP+TP). The False Discovery Rate, introduced in the seminal paper of (Benjamini and Hochberg 1995), is defined as the expected value of the FDP:

\[\text{FDR} = \mathbb{E} \left[ \frac{\text{FP}}{\text{FP+TP}} \mathbb{1}_{\text{FP+TP} \geq 1} \right]. \label{eq:FDR}\]

Controlling the FDR offers a less conservative multiple-testing criterion than controlling the FWER. (Benjamini and Hochberg 1995) proved that their approach, referred to as the BH procedure, controls the FDR at level \(\alpha\) under the condition that the p-values following the null distribution are independent and uniformly distributed.

The BH procedure can be described as follows:

Step 1 : Let \(p_{1} \leq p_{2} \leq \dots \leq p_{D}\) be the ordered observed p-values.

Step 2 : Calculate \[\hat{k} = \max \lbrace 1\leq k \leq D : p_k \leq \alpha k/D \rbrace.\]

Step 3 : If \(\hat{k}\) exists, then reject the null hypotheses corresponding to \(p_1, \dots, p_{\hat{k}}\). If not, accept the null hypothesis for all tests.
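A minimal sketch of this step-up procedure on an illustrative set of p-values (statsmodels provides the same adjustment via multipletests(..., method='fdr_bh')):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array: True where the null hypothesis is rejected."""
    p = np.asarray(p_values)
    D = p.size
    order = np.argsort(p)                           # Step 1: order the p-values
    thresholds = alpha * np.arange(1, D + 1) / D    # alpha * k / D for k = 1, ..., D
    below = p[order] <= thresholds
    reject = np.zeros(D, dtype=bool)
    if below.any():
        k_hat = np.max(np.nonzero(below)[0])        # Step 2: largest k with p_(k) <= alpha*k/D
        reject[order[: k_hat + 1]] = True           # Step 3: reject H0 for p_(1), ..., p_(k_hat)
    return reject

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p, alpha=0.05))
```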

(Benjamini and Hochberg 1995) have shown that the FDR is upper-bounded by: \[\text{FDR} \leq \alpha d_0/D,\] with \(d_0\) the number of true null hypotheses, and this upper bound also holds for positively dependent test statistics, i.e. when the distribution of the p-values fulfils the Weak Positive Regression Dependency Property (WPRDS).

Since the BH procedure controls the FDR at a level of \(\alpha d_0/D\) instead of \(\alpha\), a lot of work has been devoted to achieving a sharper control, mainly by trying to estimate \(d_0\) (see (Roquain 2010) and references therein for more details).

References

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B 57 (1): 289–300.

Bonferroni, C. 1936. “Teoria Statistica Delle Classi E Calcolo Delle Probabilita.” Pubblicazioni Del R Istituto Superiore Di Scienze Economiche E Commericiali Di Firenze 8: 3–62.

Fisher, Ronald A. 1935. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Holm, Sture. 1979a. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics, 65–70.

Neyman, Jerzy, and Egon S Pearson. 1933. “The Testing of Statistical Hypotheses in Relation to Probabilities a Priori.” In Mathematical Proceedings of the Cambridge Philosophical Society, 29:492–510. Cambridge University Press.

Pearson, Karl. 1900. “On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50 (302): 157–75.

Roquain, Etienne. 2010. “Type I Error Rate Control for Testing Many Hypotheses: A Survey with Proofs.” arXiv Preprint arXiv:1012.4078.

Šidák, Zbyněk. 1967. “Rectangular Confidence Regions for the Means of Multivariate Normal Distributions.” Journal of the American Statistical Association 62 (318): 626–33.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC. https://www.crcpress.com/Generalized-Additive-Models-An-Introduction-with-R/Wood/p/book/9781584884743.