ELMV: An Ensemble-Learning Approach for Analyzing Electrical Health Records with Significant Missing Values

Lucas Jing Liu, Hongwei Zhang, Jianzhong Di, Jin Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Many real-world Electronic Health Record (EHR) datasets contain a large proportion of missing values. Leaving a substantial portion of missing information unaddressed usually causes significant bias, which leads to invalid conclusions. On the other hand, training a machine learning model on a much smaller, nearly complete subset can drastically degrade the reliability and accuracy of model inference. Data imputation algorithms, which attempt to replace missing data with meaningful values, inevitably increase the variability of effect estimates as missingness increases, making them unreliable for hypothesis validation. We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, an effective approach that constructs multiple subsets of the original EHR data with much lower missing rates and mobilizes dedicated support data for ensemble learning, with the aim of reducing the bias caused by substantial missing values. ELMV has been evaluated on real-world healthcare data for critical feature identification and on simulation data with different missing rates for outcome prediction. In both experiments, ELMV outperforms conventional missing value imputation methods and traditional ensemble learning models. The source code of ELMV is available at https://github.com/lucasliu0928/ELMV.
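The general idea of building lower-missingness subsets and combining models trained on them can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that), and it omits the support-data component entirely. It assumes, purely for illustration, that subsets are formed from fixed-size feature combinations, that each subset keeps only samples fully observed on those features, and that a nearest-centroid classifier with majority voting stands in for the ensemble members. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def missing_rate(rows, features):
    """Fraction of missing (None) cells over the chosen feature columns."""
    cells = [row[f] for row in rows for f in features]
    return sum(v is None for v in cells) / len(cells)


def complete_subset(rows, labels, features):
    """Keep only samples in which every chosen feature is observed."""
    return [(tuple(r[f] for f in features), y)
            for r, y in zip(rows, labels)
            if all(r[f] is not None for f in features)]


def fit_centroids(samples):
    """Nearest-centroid model: the per-class mean of each feature."""
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    return {y: tuple(mean(col) for col in zip(*xs))
            for y, xs in by_class.items()}


def predict_one(centroids, x):
    """Return the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2
                                 for a, b in zip(x, centroids[y])))


def fit_ensemble(rows, labels, n_features, subset_size=2, min_samples=2):
    """Train one model per feature subset that has enough complete samples."""
    ensemble = []
    for features in combinations(range(n_features), subset_size):
        samples = complete_subset(rows, labels, features)
        if len(samples) >= min_samples and len({y for _, y in samples}) > 1:
            ensemble.append((features, fit_centroids(samples)))
    return ensemble


def predict_ensemble(ensemble, row):
    """Majority vote over the models whose features are observed in `row`."""
    votes = Counter()
    for features, centroids in ensemble:
        if all(row[f] is not None for f in features):
            x = tuple(row[f] for f in features)
            votes[predict_one(centroids, x)] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: three features, one third of the cells missing.
rows = [
    [1.0, 2.0, None],
    [1.2, None, 0.5],
    [None, 2.1, 0.4],
    [5.0, 6.0, None],
    [5.2, None, 3.5],
    [None, 6.1, 3.4],
]
labels = [0, 0, 0, 1, 1, 1]

ensemble = fit_ensemble(rows, labels, n_features=3)
print(missing_rate(rows, range(3)))                  # rate in the full table
print(predict_ensemble(ensemble, [1.1, None, 0.6]))  # query with a missing value
```

In this sketch no row is complete across all three features, so a model restricted to complete cases could not be trained at all, whereas each two-feature subset is fully observed. A query with missing values is scored only by the ensemble members whose features it actually has.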

Original language: English
Title of host publication: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
ISBN (Electronic): 9781450379649
State: Published - Sep 21 2020
Event: 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020 - Virtual, Online, United States
Duration: Sep 21 2020 to Sep 24 2020

Publication series

Name: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020

Conference

Conference: 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
Country/Territory: United States
City: Virtual, Online
Period: 9/21/20 to 9/24/20

Bibliographical note

Funding Information:
This project is supported by the NIH National Cancer Institute (grant no. 1R21CA231911) and the Kentucky Lung Cancer Research Program (grant no. KLCR-3048113817) to JL and JC and by the Clinical Retrospective Study of Shanghai Jiaotong University Affiliated 6th People’s Hospital (grant no. YNHG201912) to HZ and JD. Chinese Clinical Trial Registry Number: ChiCTR-ONN-17012895.

Publisher Copyright:
© 2020 ACM.

Keywords

  • Electronic Health Record (EHR)
  • Ensemble learning
  • Machine Learning
  • Missing Values
  • Multiple classifier system (MCS)

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics
