Abstract
Unsupervised learning, particularly clustering, plays a pivotal role in disease subtyping and patient stratification, especially given the abundance of large-scale multi-omics data. Deep learning models, such as variational autoencoders (VAEs), can enhance clustering algorithms by leveraging inter-individual heterogeneity. However, the impact of confounders (external factors unrelated to the condition, e.g. batch effects or age) on clustering is often overlooked, introducing bias and leading to spurious biological conclusions. In this work, we introduce four novel VAE-based deconfounding frameworks tailored for clustering multi-omics data. These frameworks mitigate confounding effects while preserving genuine biological patterns. The deconfounding strategies employed include (i) removal of latent features correlated with confounders, (ii) a conditional VAE, (iii) adversarial training, and (iv) adding a regularization term to the loss function. Using real-life multi-omics data from The Cancer Genome Atlas, we simulated various confounding effects (linear, nonlinear, categorical, mixed) and assessed model performance across 50 repetitions based on reconstruction error, clustering stability, and deconfounding efficacy. Our results demonstrate that our novel models, particularly the conditional multi-omics VAE (cXVAE), successfully handle simulated confounding effects and recover biologically driven clustering structures. cXVAE accurately identifies patient labels and unveils meaningful pathological associations among cancer types, validating the deconfounded representations. Furthermore, our study suggests that some of the proposed strategies, such as adversarial training, prove insufficient for confounder removal. In summary, our study contributes innovative frameworks for simultaneous multi-omics data integration, dimensionality reduction, and deconfounding in clustering. Benchmarking on open-access data offers guidance to end-users, facilitating meaningful patient stratification for optimized precision medicine.
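To make strategy (i) concrete, the following is a minimal sketch of removing latent features that correlate with a confounder. It is not the authors' implementation: the function name `drop_confounded_latents`, the use of Pearson correlation, the cut-off of 0.3, and the simulated data are all assumptions made for illustration, and the latent matrix here stands in for the output of a trained VAE encoder.

```python
import numpy as np


def drop_confounded_latents(z, confounder, threshold=0.3):
    """Drop latent columns whose |Pearson r| with the confounder is >= threshold.

    z          : (n_samples, n_latent) latent embedding (hypothetical encoder output)
    confounder : (n_samples,) continuous confounder, e.g. age
    threshold  : hypothetical cut-off, chosen here for illustration only
    """
    z = np.asarray(z, dtype=float)
    c = np.asarray(confounder, dtype=float)

    # Centre both so the dot product below is proportional to the covariance.
    zc = z - z.mean(axis=0)
    cc = c - c.mean()

    # Pearson correlation of every latent column with the confounder.
    denom = np.sqrt((zc ** 2).sum(axis=0) * (cc ** 2).sum())
    r = np.divide(zc.T @ cc, denom, out=np.zeros(z.shape[1]), where=denom > 0)

    keep = np.abs(r) < threshold
    return z[:, keep], keep, r


# Simulated example: five latent features, one of which is linearly
# driven by a confounder (age).
rng = np.random.default_rng(0)
n = 200
age = rng.normal(60, 10, size=n)
z = rng.normal(size=(n, 5))
z[:, 2] += 0.2 * (age - 60)  # inject a linear confounding effect into feature 2

z_clean, keep, r = drop_confounded_latents(z, age)
print("kept latent features:", np.flatnonzero(keep))
print("correlations with age:", np.round(r, 2))
```

The filtered matrix `z_clean` would then be passed to the clustering step. A purely linear filter such as this one would not be expected to catch nonlinear or categorical confounding, which is one reason the abstract also considers the conditional, adversarial, and regularization-based strategies.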
| Original language | English |
|---|---|
| Article number | bbae512 |
| Journal | Briefings in Bioinformatics |
| Volume | 25 |
| Issue number | 6 |
| DOIs | |
| State | Published - Nov 1 2024 |
Bibliographical note
Publisher Copyright: © The Author(s) 2024.
Funding
The authors thank the supporters of this study, namely the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 860895 TranSYS. Furthermore, we thank the members of the Computational Population Biology group at Erasmus Medical Center for their critical and creative input to this work. We would also like to extend our gratitude to the PhD candidates enrolled in the “Frontières de l’Innovation en Recherche et Éducation” (FIRE) doctoral school and thank them for their critical review of and feedback on the preprint of this study. This work was supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement [860895 to Z.L., S.K., and K.V.S.]; E.S. acknowledges the funding received from The Netherlands Organisation for Health Research and Development (ZonMw) through the PERMIT project (Personalized Medicine in Infections: from Systems Biomedicine and Immunometabolism to Precision Diagnosis and Stratification Permitting Individualized Therapies, project number 456008002) under the PerMed Joint Transnational call JTC 2018 (Research projects on personalized medicine – smart combination of pre-clinical and clinical research with data and ICT solutions).
| Funders | Funder number |
|---|---|
| Horizon 2020 Framework Programme | |
| H2020 Marie Skłodowska-Curie Actions | 860895 |
| ZonMw Memorabel | 456008002 |
Keywords
- autoencoder
- clustering
- confounders
- deep learning
- fairness
- multi-omics
ASJC Scopus subject areas
- Information Systems
- Molecular Biology