Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports

Hong Jun Yoon, Hilda B. Klasky, John P. Gounley, Mohammed Alawad, Shang Gao, Eric B. Durbin, Xiao Cheng Wu, Antoinette Stroup, Jennifer Doherty, Linda Coyle, Lynne Penberthy, J. Blair Christian, Georgia D. Tourassi

Research output: Contribution to journal › Article › peer-review

5 Scopus citations


Objective: In machine learning, it is well established that classification performance improves when bootstrap aggregation (bagging) is applied. However, bagging deep neural networks requires tremendous computational resources and training time. The research question we aimed to answer is whether dividing a problem into sub-problems can both raise task performance scores and accelerate training.

Materials and Methods: The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a large problem into 20 sub-problems, resampled the training cases 2,000 times, and trained a deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We trained many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL).

Results: We demonstrated that aggregating the models improves task performance compared with the single-model approach, which is consistent with other research studies, and that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, a task with more than 500 class labels; these results show that data partitioning may alleviate the complexity of the task. In contrast, the methods did not achieve superior scores for the site and subsite classification tasks. Because data partitioning was based on the primary cancer site, accuracy depended on how the partitions were determined, which needs further investigation and improvement.

Conclusion: The results of this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) training was accelerated by leveraging the high-performance Summit supercomputer at ORNL.
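The partitioned bagging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-voting classifier below is a toy stand-in for MT-CNN or MT-HCAN, majority voting is an assumed aggregation rule (the abstract does not specify one), and every function name here is hypothetical. The sketch assumes each case already carries a known partition, such as a primary cancer site.

```python
import random
from collections import Counter, defaultdict


def bootstrap_sample(cases, rng):
    """Draw len(cases) cases with replacement (one bootstrap replicate)."""
    return [rng.choice(cases) for _ in cases]


def train_toy_model(cases):
    """Toy stand-in for a deep classifier: count labels seen with each token."""
    votes = defaultdict(Counter)
    for tokens, label in cases:
        for tok in tokens:
            votes[tok][label] += 1
    # Fallback label for reports that share no token with the training cases.
    fallback = Counter(label for _, label in cases).most_common(1)[0][0]
    return votes, fallback


def predict(model, tokens):
    """Predict the label with the most token-level votes."""
    votes, fallback = model
    tally = Counter()
    for tok in tokens:
        tally.update(votes.get(tok, {}))
    return tally.most_common(1)[0][0] if tally else fallback


def train_partitioned_bagging(cases_by_partition, n_bootstrap, seed=0):
    """Train n_bootstrap models per partition (sub-problem).

    With 20 partitions and 2,000 bootstrap samples, this would yield
    20 * 2,000 = 40,000 models, matching the count in the abstract.
    """
    rng = random.Random(seed)
    return {
        part: [
            train_toy_model(bootstrap_sample(cases, rng))
            for _ in range(n_bootstrap)
        ]
        for part, cases in cases_by_partition.items()
    }


def predict_bagged(ensembles, partition, tokens):
    """Aggregate one partition's models by majority vote (an assumption)."""
    votes = Counter(predict(m, tokens) for m in ensembles[partition])
    return votes.most_common(1)[0][0]
```

Each model depends only on its own bootstrap sample and partition, so the training calls are independent of one another. That independence is what makes it possible to train many models concurrently, as the abstract describes; this sketch simply runs them one after another.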

Original language: English
Article number: 103564
Journal: Journal of Biomedical Informatics
State: Published - Oct 2020

Bibliographical note

Funding Information:
LTR data were collected using funding from NCI and the SEER Program (HHSN261201800007I), the NPCR (NU58DP006332-02-00), and the State of Louisiana.

Funding Information:
This research used resources of the OLCF at ORNL, which is supported by the DOE Office of Science under Contract No. DE-AC05-00OR22725.

Funding Information:
The Utah Cancer Registry is funded by the NCI’s SEER Program, Contract No. HHSN261201800016I, and the NPCR, Cooperative Agreement No. NU58DP0063200, with additional support from the University of Utah and Huntsman Cancer Foundation.

Funding Information:
KCR data were collected with funding from the NCI SEER Program (HHSN261201800013I), the CDC National Program of Cancer Registries (NPCR) (U58DP00003907), and the Commonwealth of Kentucky.

Funding Information:
NJSCR data were collected using funding from NCI and the SEER Program (HHSN261201300021I), the NPCR (NU58DP006279-02-00), the State of New Jersey, and the Rutgers Cancer Institute of New Jersey.

Funding Information:
This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan ( ).

Funding Information:
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy (DOE) Office of Science and the National Nuclear Security Administration. This work has been supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by DOE and the National Cancer Institute of the National Institutes of Health. This work was performed under the auspices of DOE by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and ORNL under Contract DE-AC05-00OR22725.

Publisher Copyright:
© 2020 Elsevier Inc.


Keywords

  • Bootstrap aggregation
  • Convolutional neural networks
  • Data partitioning
  • Deep learning
  • Hierarchical self-attention networks
  • High-performance computing
  • Natural language processing

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

