5.2 Learning with complementary datasets

This section introduces the setting and notations, then sketches our approach for defining a compact model of interactions between complementary datasets.

5.2.1 Setting and notations

Let us consider observations stemming from two complementary views, \(\mathit{G}\) (for Genomic data) and \(\mathit{M}\) (for Metagenomic data), which are gathered into a training set \(\mathcal{T} = \{(\mathbf{x}^\mathit{G}_i, \mathbf{x}^\mathit{M}_i, y_i)\}_{i=1}^N\), where \((\mathbf{x}^\mathit{G}_i, \mathbf{x}^\mathit{M}_i, y_i) \in \mathbb{R}^{D_\mathit{G}} \times \mathbb{R}^{D_\mathit{M}} \times \mathbb{R}\).

We assume that underlying biological information on \(\mathit{G}\) and \(\mathit{M}\) is encoded as groups of variables. The group structure over \(\mathit{G}\) is defined by \(N_{\mathit{G}}\) groups of variables \(\mathcal{G}=\{\mathcal{G}_{g} \}_{g=1}^{N_{\mathit{G}}}\). We denote by \(\mathbf{x}_i^{g} \in \mathbb{R}^{D_{g}}\) the sample \(i\) restricted to the variables of \(\mathit{G}\) from group \(\mathcal{G}_{g}\). Similarly, the group structure over \(\mathit{M}\) is defined by \(N_{\mathit{M}}\) groups of variables \(\mathcal{M}=\{\mathcal{M}_{m}\}_{m=1}^{N_{\mathit{M}}}\), and \(\mathbf{x}_i^{m} \in \mathbb{R}^{D_{m}}\) is the sample \(i\) restricted to the variables of \(\mathit{M}\) from group \(\mathcal{M}_{m}\).

We also introduce \(D_I = D_{\mathit{G}} \cdot D_{\mathit{M}}\) and \(N_I = N_{\mathit{G}} \cdot N_{\mathit{M}}\), respectively the number of pairs of variables and the number of pairs of groups that may interact.

Finally, we use the following convention: vectors of observations indexed with \(i\), such as \(\mathbf{x}_i\), will usually be row vectors\(^{1}\) while vectors of coefficients, such as \(\boldsymbol{\beta}\), will usually be column vectors.

5.2.2 Interactions in linear models

Interactions between data stemming from views \(\mathit{G}\) and \(\mathit{M}\) may be captured in the model \[\label{eq:classical_interaction_model} y_i = \mathbf{x}^{\mathit{G}}_i \boldsymbol{\gamma}_{\mathit{G}} + \mathbf{x}^{\mathit{M}}_i \boldsymbol{\gamma}_{\mathit{M}} + \mathbf{x}^{\mathit{G}}_i \boldsymbol{\Delta}_{\mathit{G}\mathit{M}} (\mathbf{x}^{\mathit{M}}_i)^T + \epsilon_i \,,\] where the vectors \(\boldsymbol{\gamma}_{\mathit{G}} \in \mathbb{R}^{D_{\mathit{G}}}\) and \(\boldsymbol{\gamma}_{\mathit{M}} \in \mathbb{R}^{D_{\mathit{M}}}\) respectively denote the linear effects related to \(\mathit{G}\) and \(\mathit{M}\), the matrix \(\boldsymbol{\Delta}_{\mathit{G}\mathit{M}} \in \mathbb{R}^{D_{\mathit{G}} \times D_{\mathit{M}}}\) contains the interactions between all pairs of variables of \(\mathit{G}\) and \(\mathit{M}\), and \(\epsilon_i \in \mathbb{R}\) is a residual error.
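To make the model concrete, here is a minimal Python sketch that simulates data according to Problem \eqref{eq:classical_interaction_model}; the sample size, dimensions, distributions and variable names (`X_G`, `Delta`, etc.) are illustrative assumptions, not part of the method.

```python
# Minimal sketch of the full interaction model, on simulated data (all values illustrative)
import numpy as np

rng = np.random.default_rng(0)
N, D_G, D_M = 50, 10, 8

X_G = rng.normal(size=(N, D_G))          # genomic view, one row vector x_i^G per sample
X_M = rng.normal(size=(N, D_M))          # metagenomic view, one row vector x_i^M per sample
gamma_G = rng.normal(size=D_G)           # linear effects on G
gamma_M = rng.normal(size=D_M)           # linear effects on M
Delta = rng.normal(size=(D_G, D_M))      # pairwise interaction matrix Delta_GM
eps = rng.normal(scale=0.1, size=N)      # residual errors

# y_i = x_i^G gamma_G + x_i^M gamma_M + x_i^G Delta (x_i^M)^T + eps_i, computed for all i at once
y = X_G @ gamma_G + X_M @ gamma_M + np.einsum('ij,jk,ik->i', X_G, Delta, X_M) + eps
print(y.shape)  # (N,)
```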

Two notions underlie models of interactions: strong dependency (SD) and weak dependency (WD), the former being more common (see for instance (Bien, Taylor, and Tibshirani 2013) and the discussion therein). Under the strong dependency hypothesis, an interaction can be effective only if both corresponding main effects are effective, while under the weak dependency hypothesis an interaction can be effective only if at least one of the main effects is effective. Formally, for every variable \(j\) of \(\mathbf{x}^{\mathit{G}}\) and every variable \(j'\) of \(\mathbf{x}^{\mathit{M}}\), if \(\gamma_{j}\), \(\gamma_{j'}\) and \(\delta_{jj'}\) denote the corresponding coefficients of \(\boldsymbol{\gamma}_{\mathit{G}}\), \(\boldsymbol{\gamma}_{\mathit{M}}\) and \(\boldsymbol{\Delta}_{\mathit{G}\mathit{M}}\), then \[\begin{aligned} & (SD) && \delta_{jj'} \neq 0 \qquad \Rightarrow \qquad \gamma_{j} \neq 0 && \text{ and } && \gamma_{j'} \neq 0 \,, \\ & (WD) && \delta_{jj'} \neq 0 \qquad \Rightarrow \qquad \gamma_{j} \neq 0 && \text{ or } && \gamma_{j'} \neq 0 \,.\end{aligned}\]
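As an illustration, the sketch below checks whether a given set of coefficients satisfies (SD) or (WD); the helper functions `satisfies_sd` and `satisfies_wd` are ours, assuming NumPy arrays shaped like \(\boldsymbol{\gamma}_{\mathit{G}}\), \(\boldsymbol{\gamma}_{\mathit{M}}\) and \(\boldsymbol{\Delta}_{\mathit{G}\mathit{M}}\).

```python
import numpy as np

def satisfies_sd(gamma_G, gamma_M, Delta, tol=0.0):
    """(SD): delta_{jj'} != 0 implies gamma_j != 0 and gamma_{j'} != 0."""
    active = np.abs(Delta) > tol
    both = (np.abs(gamma_G) > tol)[:, None] & (np.abs(gamma_M) > tol)[None, :]
    return bool(np.all(~active | both))

def satisfies_wd(gamma_G, gamma_M, Delta, tol=0.0):
    """(WD): delta_{jj'} != 0 implies gamma_j != 0 or gamma_{j'} != 0."""
    active = np.abs(Delta) > tol
    either = (np.abs(gamma_G) > tol)[:, None] | (np.abs(gamma_M) > tol)[None, :]
    return bool(np.all(~active | either))

# Example: one interaction between variable 0 of G and variable 1 of M, both main effects present
gamma_G = np.array([1.0, 0.0])
gamma_M = np.array([0.0, 0.5])
Delta = np.array([[0.0, 2.0],
                  [0.0, 0.0]])
print(satisfies_sd(gamma_G, gamma_M, Delta))  # True
print(satisfies_wd(gamma_G, gamma_M, Delta))  # True, since (SD) implies (WD)
```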

In this context, (Bien, Taylor, and Tibshirani 2013) proposed a sparse model of interactions which, according to (Lim and Hastie 2015) and (She, Wang, and Jiang 2016), faces computational limitations on high-dimensional problems. (Lim and Hastie 2015) introduce a method for learning pairwise interactions in a regression model by solving a constrained overlapping Group-Lasso (Jacob, Obozinski, and Vert 2009) in a manner that satisfies strong dependencies, while (She, Wang, and Jiang 2016) propose a formulation with an overlapping regularization that accommodates both kinds of hypotheses and provide theoretical insights on the resulting estimators\(^{2}\).

Yet, the dimension \(D_{\mathit{G}} + D_{\mathit{M}} + D_I\) involved in Problem \eqref{eq:classical_interaction_model} to estimate \(\boldsymbol{\gamma}_{\mathit{G}}\), \(\boldsymbol{\gamma}_{\mathit{M}}\) and \(\boldsymbol{\Delta}_{\mathit{G}\mathit{M}}\) may be large, especially for applications with a large number of variables, such as genomic and metagenomic data in biology. To reduce the dimension, we propose to compress the data according to an underlying structure, which may be defined from prior knowledge or uncovered with clustering algorithms.

5.2.3 Compact model

Assuming we are given a compression function for each group of \(\mathit{G}\) and \(\mathit{M}\), we can recast Problem \eqref{eq:classical_interaction_model} in a compact form \[\label{eq:compact_detailled_model} y_i = \sum_{g \in \mathcal{G}} \tilde{x}_i^g \beta_{g} + \sum_{m \in \mathcal{M}} \tilde{x}_i^m \beta_{m} + \sum_{g \in \mathcal{G}} \sum_{m \in \mathcal{M}} \underbrace{\left(\tilde{x}_i^g \cdot \tilde{x}_i^m\right)}_{\boldsymbol{\phi}^{gm}_i} {\theta_{gm}} + \, \epsilon_i \,,\] where \(\tilde{x}_i^g \in \mathbb{R}\) is the \(i^{th}\) compressed sample of the variables that belong to group \(g\) of view \(\mathit{G}\) and \(\beta_{g} \in \mathbb{R}\) is its corresponding coefficient. The counterparts for group \(m\) of view \(\mathit{M}\) are \(\tilde{x}_i^m \in \mathbb{R}\) and \(\beta_{m} \in \mathbb{R}\). Finally, \(\theta_{gm} \in \mathbb{R}\) is the interaction between groups \(g\) and \(m\).
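For illustration, the sketch below builds such compressed features on simulated data, taking the within-group mean as one arbitrary example of compression function; the group structures `groups_G` and `groups_M`, like all numerical values, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D_G, D_M = 50, 10, 8
X_G = rng.normal(size=(N, D_G))   # genomic view
X_M = rng.normal(size=(N, D_M))   # metagenomic view

# Illustrative group structures (lists of column indices), with N_G = 3 and N_M = 3 groups
groups_G = [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
groups_M = [[0, 1], [2, 3, 4], [5, 6, 7]]

# One compressed scalar per group and per sample; the within-group mean is one possible choice
Xt_G = np.column_stack([X_G[:, g].mean(axis=1) for g in groups_G])   # shape (N, N_G)
Xt_M = np.column_stack([X_M[:, m].mean(axis=1) for m in groups_M])   # shape (N, N_M)

# phi_i^{gm} = xtilde_i^g * xtilde_i^m, flattened with g varying slowest (as in the text)
Phi = np.einsum('ig,im->igm', Xt_G, Xt_M).reshape(N, -1)             # shape (N, N_G * N_M)
```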

We can reformulate Problem \eqref{eq:compact_detailled_model} in a vector form. Let \(\tilde{\mathbf{x}}_{i}^{\mathit{G}} \in \mathbb{R}^{N_{\mathit{G}}}\), \(\boldsymbol{\beta}_{\mathit{G}} \in \mathbb{R}^{N_{\mathit{G}}}\), \(\tilde{\mathbf{x}}_{i}^{\mathit{M}} \in \mathbb{R}^{N_{\mathit{M}}}\) and \(\boldsymbol{\beta}_{\mathit{M}} \in \mathbb{R}^{N_{\mathit{M}}}\) be \[\begin{aligned} \tilde{\mathbf{x}}_{i}^{\mathit{G}} & = (\tilde{x}_i^1 \cdots \tilde{x}_i^g \cdots \tilde{x}_i^{N_{\mathit{G}}})\,, & \boldsymbol{\beta}_{\mathit{G}} & = (\beta_{1} \cdots \beta_{g} \cdots \beta_{N_{\mathit{G}}})^{T}\,,\\ \tilde{\mathbf{x}}_{i}^{\mathit{M}} & = (\tilde{x}_i^1 \cdots \tilde{x}_i^m \cdots \tilde{x}_i^{N_{\mathit{M}}}) \,, & \boldsymbol{\beta}_{\mathit{M}} & = (\beta_{1} \cdots \beta_{m} \cdots \beta_{N_{\mathit{M}}})^{T}\,.\end{aligned}\]

We denote by \(\boldsymbol{\phi}_{i} \in \mathbb{R}^{N_I}\) the vector whose general component is \(\boldsymbol{\phi}^{gm}_i\) as defined in Equation \eqref{eq:compact_detailled_model}, that is \[\begin{aligned} \boldsymbol{\phi}_{i} & = \left( \boldsymbol{\phi}^{11}_i \cdots \boldsymbol{\phi}^{1N_{\mathit{M}}}_i \cdots \boldsymbol{\phi}^{g m}_i \cdots \boldsymbol{\phi}^{{N_{\mathit{G}}} 1}_i \cdots \boldsymbol{\phi}^{{N_{\mathit{G}}} {N_{\mathit{M}}}}_i \right) \,,\end{aligned}\] and \(\boldsymbol{\theta} \in \mathbb{R}^{N_I}\), the corresponding vector of coefficients, by \[\begin{aligned} \boldsymbol{\theta} = & \left(\theta_{11} \cdots \theta_{1 {N_{\mathit{M}}}} \cdots \theta_{g m} \cdots \theta_{{N_{\mathit{G}}} 1} \cdots \theta_{{N_{\mathit{G}}} {N_{\mathit{M}}}} \right)^T\,. \end{aligned}\]

Finally, Problem \eqref{eq:compact_detailled_model} reads as a classical linear regression problem \[\label{eq:compact_vector_model} y_i = \tilde{\mathbf{x}}_{i}^{\mathit{G}} \boldsymbol{\beta}_{\mathit{G}} + \tilde{\mathbf{x}}_{i}^{\mathit{M}} \boldsymbol{\beta}_{\mathit{M}} + \boldsymbol{\phi}_{i} \boldsymbol{\theta} + \epsilon_i \,,\] of dimension \(N_{\mathit{G}} + N_{\mathit{M}} + N_I\).
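Continuing the previous sketch, the following lines check numerically that this flattened vector form coincides with the double sum of Equation \eqref{eq:compact_detailled_model}; the coefficients are drawn arbitrarily for illustration.

```python
# Reuses Xt_G, Xt_M, Phi and rng from the compression sketch above
beta_G = rng.normal(size=Xt_G.shape[1])
beta_M = rng.normal(size=Xt_M.shape[1])
theta = rng.normal(size=Phi.shape[1])

# Vector form: compressed views and flattened interaction features
y_vec = Xt_G @ beta_G + Xt_M @ beta_M + Phi @ theta

# Double-sum form: theta reshaped into an (N_G, N_M) matrix of group interactions
Theta = theta.reshape(Xt_G.shape[1], Xt_M.shape[1])
y_sum = Xt_G @ beta_G + Xt_M @ beta_M + np.einsum('ig,gm,im->i', Xt_G, Theta, Xt_M)

assert np.allclose(y_vec, y_sum)
```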

5.2.4 Recovering relevant interactions

Compared to Problem \eqref{eq:classical_interaction_model}, and provided that \(N_\mathit{G}\) and \(N_\mathit{M}\) are substantially smaller than \(D_{\mathit{G}}\) and \(D_{\mathit{M}}\), the dimension of Problem \eqref{eq:compact_vector_model} decreases drastically, so that it may be solved with an appropriate optimization algorithm and reasonable computational resources. For instance, (Donoho and Tsaig 2008) give an overview of \(\ell_1\)-regularized algorithms for solving sparse problems such as the Lasso, which in our case could take the form:

\[\begin{aligned} \underset{\boldsymbol{\beta}_{\mathit{G}}, \boldsymbol{\beta}_{\mathit{M}}, \boldsymbol{\theta}}{\text{argmin}} & \quad \sum_{i=1}^N \left(y_i - \tilde{\mathbf{x}}_{i}^{\mathit{G}} \boldsymbol{\beta}_{\mathit{G}} - \tilde{\mathbf{x}}_{i}^{\mathit{M}} \boldsymbol{\beta}_{\mathit{M}} - \boldsymbol{\phi}_{i} \boldsymbol{\theta} \right)^2 + \lambda_{\mathit{G}} \sum_{g=1}^{N_{\mathit{G}}}|\beta_{g}| + \lambda_{\mathit{M}} \sum_{m=1}^{N_{\mathit{M}}} |\beta_{m}| + \lambda_I \sum_{g=1}^{N_{\mathit{G}}} \sum_{m=1}^{N_{\mathit{M}}} |\theta_{gm}| \,, \end{aligned}\]

with \(\lambda_\mathit{G}\), \(\lambda_\mathit{M}\) and \(\lambda_I\) being the positive hyperparameters that respectively control the amount of sparsity of the coefficients \(\boldsymbol{\beta}_\mathit{G}\), \(\boldsymbol{\beta}_\mathit{M}\) and \(\boldsymbol{\theta}\). Still, the dimension \(N_\mathit{G} + N_\mathit{M} + N_I\) may remain large compared to the number of observations \(N\). Also, note that without additional constraints, such a formulation would not enforce the dependency hypotheses (SD) and (WD). For that purpose, one could adapt the works of (Bien, Taylor, and Tibshirani 2013; Lim and Hastie 2015) or (She, Wang, and Jiang 2016) mentioned above. In the next section, we present another way to further reduce the dimension while fulfilling the strong dependency hypothesis.
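As a hedged illustration, the sketch below solves this weighted Lasso with scikit-learn: the distinct penalties \(\lambda_\mathit{G}\), \(\lambda_\mathit{M}\) and \(\lambda_I\) are emulated by rescaling the corresponding blocks of columns, a standard reparametrization; the data, group sizes and penalty values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, N_G, N_M = 50, 3, 3
N_I = N_G * N_M

# Compressed design, e.g. obtained as in the earlier compression sketch
Xt_G = rng.normal(size=(N, N_G))
Xt_M = rng.normal(size=(N, N_M))
Phi = np.einsum('ig,im->igm', Xt_G, Xt_M).reshape(N, -1)
y = Xt_G @ np.array([1.0, 0.0, -1.0]) + Phi[:, 0] + rng.normal(scale=0.1, size=N)

# Block-wise penalties lambda_G, lambda_M, lambda_I (illustrative values)
lam_G, lam_M, lam_I = 0.05, 0.05, 0.2
weights = np.concatenate([np.full(N_G, lam_G), np.full(N_M, lam_M), np.full(N_I, lam_I)])

# Reparametrization: dividing each column by its penalty weight turns the weighted
# l1 penalty into a plain one, solvable with a single regularization parameter
Z = np.hstack([Xt_G, Xt_M, Phi]) / weights
# scikit-learn minimizes (1/(2N)) * RSS + alpha * ||w||_1, hence alpha = 1/(2N)
model = Lasso(alpha=1.0 / (2 * N), fit_intercept=False).fit(Z, y)

coef = model.coef_ / weights                   # map back to the original parametrization
beta_G, beta_M, theta = coef[:N_G], coef[N_G:N_G + N_M], coef[N_G + N_M:]
```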

References

Bien, Jacob, Jonathan Taylor, and Robert Tibshirani. 2013. “A Lasso for Hierarchical Interactions.” Annals of Statistics 41 (3): 1111.

Donoho, David L, and Yaakov Tsaig. 2008. “Fast Solution of \(\ell_1\)-Norm Minimization Problems When the Solution May Be Sparse.” IEEE Transactions on Information Theory 54 (11): 4789–4812.

Jacob, Laurent, Guillaume Obozinski, and Jean-Philippe Vert. 2009. “Group Lasso with Overlap and Graph Lasso.” In Proceedings of the 26th Annual International Conference on Machine Learning, 433–40. Montreal, Quebec, Canada.

Lim, Michael, and Trevor Hastie. 2015. “Learning Interactions via Hierarchical Group-Lasso Regularization.” Journal of Computational and Graphical Statistics 24 (3): 627–54.

She, Yiyuan, Zhifeng Wang, and He Jiang. 2016. “Group Regularized Estimation Under Structural Hierarchy.” Journal of the American Statistical Association 113 (521): 445–54.


  1. For the sake of clarity, we use these lightened notations which slightly differ from those used in the previous chapters.

  2. To our knowledge, their implementation based on alternating direction method of multipliers is not publicly available.