Many real-world Electronic Health Record (EHR) datasets contain a large proportion of missing values. Leaving a substantial portion of missing information unaddressed usually causes significant bias, leading to invalid conclusions. On the other hand, training a machine learning model on a much smaller, nearly complete subset can drastically reduce the reliability and accuracy of model inference. Data imputation algorithms, which attempt to replace missing data with meaningful values, inevitably increase the variability of effect estimates as missingness increases, making them unreliable for hypothesis validation. We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, an effective approach that constructs multiple subsets of the original EHR data with much lower missing rates and mobilizes dedicated support data for ensemble learning, in order to reduce the bias caused by substantial missing values. ELMV has been evaluated on real-world healthcare data for critical feature identification and on simulation data with different missing rates for outcome prediction. In both experiments, ELMV outperforms conventional missing value imputation methods and traditional ensemble learning models. The source code of ELMV is available at https://github.com/lucasliu0928/ELMV.
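The general idea in the abstract (train on several subsets that each have a much lower missing rate, then combine the resulting models) can be illustrated with a small sketch. This is not the ELMV algorithm from the paper or its repository: the subset rule (keep only rows fully observed in a chosen feature group), the nearest-centroid base learner, the majority vote, the toy data, and all function names are assumptions made for illustration, and the "support data" mechanism is not modeled because the abstract does not describe it.

```python
from collections import Counter
from statistics import mean

def missing_rate(rows):
    """Fraction of cells that are None."""
    cells = [v for row in rows for v in row]
    return sum(v is None for v in cells) / len(cells)

def complete_subset(rows, labels, cols):
    """Keep the chosen columns and only the rows fully observed in them."""
    return [([row[c] for c in cols], y)
            for row, y in zip(rows, labels)
            if all(row[c] is not None for c in cols)]

def fit_centroids(subset):
    """Nearest-centroid base learner (assumed): one mean vector per class."""
    by_class = {}
    for x, y in subset:
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_class.items()}

def predict_one(centroids, x):
    """Return the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

def ensemble_predict(models, row):
    """Majority vote over the models whose features are observed in `row`."""
    votes = [predict_one(centroids, [row[c] for c in cols])
             for cols, centroids in models
             if all(row[c] is not None for c in cols)]
    return Counter(votes).most_common(1)[0][0] if votes else None

# Toy data (made up): three features, None marks a missing value.
rows = [
    [1.0, None, 0.9],
    [1.2, 0.8, None],
    [None, 1.1, 1.0],
    [5.0, 5.2, None],
    [None, 4.9, 5.1],
    [5.1, None, 4.8],
]
labels = [0, 0, 0, 1, 1, 1]

# Hypothetical feature groups; each one yields a subset with no missing cells.
feature_groups = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]
models = []
for cols in feature_groups:
    subset = complete_subset(rows, labels, cols)
    if len({y for _, y in subset}) == 2:  # train only if both classes appear
        models.append((cols, fit_centroids(subset)))

# A new record with feature 1 missing is scored by the models that can use it.
prediction = ensemble_predict(models, [4.7, None, 5.0])
```

In this sketch no row of the toy data is fully observed, so a complete-case analysis would have nothing to train on, while every per-group subset is fully observed. A record with missing features is scored only by the models whose feature group it covers.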
|Title of host publication||Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020|
|State||Published - Sep 21 2020|
|Event||11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020 - Virtual, Online, United States|
|Duration||Sep 21 2020 → Sep 24 2020|
Bibliographical note
Funding Information:
This project is supported by the NIH National Cancer Institute (grant no. 1R21CA231911) and the Kentucky Lung Cancer Research Program (grant no. KLCR-3048113817) to JL and JC and by the Clinical Retrospective Study of Shanghai Jiaotong University Affiliated 6th People’s Hospital (grant no. YNHG201912) to HZ and JD. Chinese Clinical Trial Registry Number: ChiCTR-ONN-17012895.
© 2020 ACM.
Keywords
- Electronic Health Record (EHR)
- Ensemble learning
- Machine Learning
- Missing Values
- Multiple classifier system (MCS)
ASJC Scopus subject areas
- Computer Science Applications
- Biomedical Engineering
- Health Informatics