Background: With the use of artificial intelligence and machine learning techniques for biomedical informatics, security and privacy concerns over the data and subject identities have also become an important issue and essential research topic. Without intentional safeguards, machine learning models may find patterns and features to improve task performance that are associated with private personal information. Objective: The privacy vulnerability of deep learning models for information extraction from medical textural contents needs to be quantified since the models are exposed to private health information and personally identifiable information. The Objective of the study is to quantify the privacy vulnerability of the deep learning models for natural language processing and explore a proper way of securing patients' information to mitigate confidentiality breaches. METHODS: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, where the data for training the model are from multiple state population-based cancer registries. This study proposes the following schemes to collect vocabularies from the cancer pathology reports; (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. Results: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
Bibliographical noteFunding Information:
The Utah Cancer Registry is funded by the National Cancer Institute’s SEER Program, Contract No. HHSN261201800016I, and the US Centers for Disease Control and Prevention’s National Program of Cancer Registries, Cooperative Agreement No. NU58DP0063 200, with additional support from the University of Utah and Huntsman Cancer Foundation.
This work was performed under the auspices of DOE by Argonne National Laboratory under Contract DE-AC02-06-CH11357, Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344, Los Alamos National Laboratory under Contract DE-AC5206NA25396, and Oak Ridge National Laboratory under Contract DE-AC05-00OR22725. This work also was supported by the Laboratory Directed Research and Development (LDRD) program of Oak Ridge National Laboratory, under LDRD project 9831.
The collection of cancer incidence data used in this study was supported by the California Department of Public Health pursuant to California Health and Safety Code Section 103885; Centers for Disease Control and Prevention’s (CDC) National Program of Cancer Registries, under cooperative agreement 5NU58DP006344; the National Cancer Institute’s Surveillance, Epidemiology and End Results Program under contract HHSN261201800032I awarded to the University of California, San Francisco; contract HHSN261201800015I awarded to the University of Southern California; and contract HHSN261201800009I awarded to the Public Health Institute. The ideas and opinions expressed herein are those of the author(s) and do not necessarily reflect the opinions of the State of California, Department of Public Health, the National Cancer Institute, and the Centers for Disease Control and Prevention or their contractors and subcontractors.
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy (DOE) Office of Science and the National Nuclear Security Administration. This work was also supported in part by the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program established by DOE and the NCI of the National Institutes of Health.
New Mexico Tumor Registry’s participation in this project was supported by Contract HHSN26120180001 4I, Task Order HHSN26100001 from the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) Program.
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the DOE Office of Science under Contract No. DE-AC05-00OR22725.
New Jersey State Cancer Registry data were collected using funding from NCI and the Surveillance, Epidemi- ology and End Results (SEER) Program (HHSN26120 1300021I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006279-02-00) as well as the State of New Jersey and the Rutgers Cancer Institute of New Jersey.
Louisiana Tumor Registry data were collected using funding from NCI and the Surveillance, Epidemiology and End Results (SEER) Program (HHSN26120180000 7I), the CDC’s National Program of Cancer Registries (NPCR) (NU58DP006332-02-00) as well as the State of Louisiana.
Kentucky Cancer Registry data were collected with funding from NCI Surveillance, Epidemiology and End Results (SEER) Program (HHSN261201800013I), the CDC National Program of Cancer Registries (NPCR) (U58DP00003907) and the Commonwealth of Kentucky.
The Cancer Surveillance System is supported by the National Cancer Institute’s SEER Program (Contract Award HHSN261291800004I) and with additional funds provided by the Fred Hutchinson Cancer Research Center.
© 2022 - IOS Press. All rights reserved.
- artificial intelligence
- cancer epidemiology
- deep learning
- natural language processing
- privacy-preserving training
ASJC Scopus subject areas
- Cancer Research