5.6 Discussions
Although detecting interaction effects in a high-dimensional setting remains a difficult problem, on the one hand because of the multiple testing burden and on the other because of the small effect sizes, which make significance hard to reach, our approach has demonstrated the ability to recover interaction effects with high statistical power. In our simulations, whether we varied the sample size, the noise level or the number of true interactions, SICOMORE consistently achieved the highest recall among the compared methods (MLGL, HCAR and glinternet). This is mainly explained by the fact that SICOMORE combines the strengths of several methods in a single algorithm.
Regarding precision, all methods perform poorly, mainly because the algorithms select groups that are too high in the hierarchy, i.e. the selected supervariables (or groups of single variables for MLGL) contain too many variables. As a result, interactions between the complementary datasets are detected with good power but poor resolution. One solution would be to constrain the algorithm to work only on the lowest levels of the two hierarchies, at a potential cost in recall.
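The trade-off between power and resolution can be illustrated with a minimal Python sketch. It assumes, for illustration only, that precision and recall are counted at the level of individual variable pairs after expanding each selected pair of groups; the function names and the toy groups are hypothetical and are not the evaluation code used in the simulations.

```python
from itertools import product


def expand(group_pairs):
    """Expand selected (genomic group, metagenomic group) pairs
    into the set of variable-level pairs they cover."""
    pairs = set()
    for genomic_group, metagenomic_group in group_pairs:
        pairs.update(product(genomic_group, metagenomic_group))
    return pairs


def precision_recall(selected_group_pairs, true_pairs):
    """Variable-level precision and recall (assumed counting convention)."""
    selected = expand(selected_group_pairs)
    true_pairs = set(true_pairs)
    true_positives = len(selected & true_pairs)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Toy example: a single true interaction between SNP 3 and taxon 2.
true_pairs = {(3, 2)}

# Selection high in the hierarchy: 10 SNPs x 5 taxa = 50 candidate pairs.
high_level = [(range(10), range(5))]

# Selection low in the hierarchy: 2 SNPs x 1 taxon = 2 candidate pairs.
low_level = [((2, 3), (2,))]

coarse = precision_recall(high_level, true_pairs)
fine = precision_recall(low_level, true_pairs)
```

In this toy case both selections recover the true interaction, so recall is identical, but the coarse selection spreads the signal over 50 candidate pairs while the fine one spreads it over 2, which is the loss of resolution described above.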
As for the application of our method to the Medicago truncatula dataset, we were able to find significant interactions between genomic and metagenomic features for three phenotypes. In particular, we notice that one microbial taxon, ‘Ramlibacter’, appears to interact strongly with the genome of the plant. We detected many interactions for the RTR phenotype, involving genomic regions that may merit closer examination. The results for the SNU phenotype are more difficult to interpret because the interaction with the genome involves a very large group of microbial species. Furthermore, these results show that the variable selection step suffers from instability. Since we used the same metagenomic data across the different options, the number of selected groups should remain the same, but this is not the case. This instability may stem from the cross-validation step used to estimate the hyperparameters, which would need some adjustment to correct it.
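One possible way to quantify this instability, offered here only as a sketch and not as part of SICOMORE, is to compare the sets of selected groups across runs with a mean pairwise Jaccard index. The group labels below are made up for illustration and do not correspond to results from the Medicago truncatula analysis.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index between two sets of selected groups."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are considered identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections):
    """Average Jaccard index over all pairs of runs (1.0 = perfectly stable)."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("at least two selections are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical groups selected in three runs on the same metagenomic data.
runs = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g1", "g2", "g3", "g4"},
]

stability = mean_pairwise_jaccard(runs)
```

A value close to 1 would indicate that the selection is essentially unchanged from run to run, whereas lower values would point to the kind of variability attributed above to the cross-validation step.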
To conclude, SICOMORE is able to find significant metagenomic-genomic interactions in a high-dimensional context within a reasonable computational time. The algorithm remains fast even on large genomic datasets: an analysis of the full genomic data takes only a few hours, and only a few tens of minutes on a small subset of the data.