TY - JOUR
T1 - An Open Natural Language Processing (NLP) Framework for EHR-based Clinical Research
T2 - A Case Demonstration Using the National COVID Cohort Collaborative (N3C)
AU - Liu, Sijia
AU - Wen, Andrew
AU - Wang, Liwei
AU - He, Huan
AU - Fu, Sunyang
AU - Miller, Robert
AU - Williams, Andrew
AU - Harris, Daniel
AU - Kavuluru, Ramakanth
AU - Liu, Mei
AU - Abu-El-Rub, Noor
AU - Schutte, Dalton
AU - Zhang, Rui
AU - Rouhizadeh, Masoud
AU - Osborne, John D
AU - He, Yongqun
AU - Topaloglu, Umit
AU - Hong, Stephanie S
AU - Saltz, Joel H
AU - Schaffter, Thomas
AU - Pfaff, Emily
AU - Chute, Christopher G
AU - Duong, Tim
AU - Haendel, Melissa A
AU - Fuentes, Rafael
AU - Szolovits, Peter
AU - Xu, Hua
AU - Liu, Hongfang
N1 - © The Author(s) 2023. Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2023/8/9
Y1 - 2023/8/9
N2 - Despite recent methodological advances in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. These factors also dramatically increase the difficulty of developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we report on our experience developing an NLP solution for COVID-19 sign and symptom extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort Collaborative (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, as well as the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.
AB - Despite recent methodological advances in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. These factors also dramatically increase the difficulty of developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we report on our experience developing an NLP solution for COVID-19 sign and symptom extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort Collaborative (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, as well as the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.
U2 - 10.1093/jamia/ocad134
DO - 10.1093/jamia/ocad134
M3 - Article
C2 - 37555837
SN - 1067-5027
JO - Journal of the American Medical Informatics Association : JAMIA
JF - Journal of the American Medical Informatics Association : JAMIA
ER -