TY - JOUR
T1 - Development and validation of natural language processing algorithms in the national ENACT network
AU - Wang, Yanshan
AU - Hilsman, Jordan
AU - Li, Chenyu
AU - Morris, Michele
AU - Heider, Paul M
AU - Fu, Sunyang
AU - Kwak, Min Ji
AU - Wen, Andrew
AU - Applegate, Joseph R
AU - Wang, Liwei
AU - Bernstam, Elmer
AU - Liu, Hongfang
AU - Chang, Jack
AU - Harris, Daniel R
AU - Corbeau, Alexandria
AU - Henderson, Darren
AU - Osborne, John
AU - Kennedy, Richard E
AU - Garduno-Rapp, Nelly-Estefanie
AU - Rousseau, Justin F
AU - Yan, Chao
AU - Chen, You
AU - Patel, Mayur B
AU - Murphy, Tyler J
AU - Malin, Bradley A
AU - Park, Chan Mi
AU - Fan, Jungwei W
AU - Sohn, Sunghwan
AU - Pagali, Sandeep
AU - Peng, Yifan
AU - Pathak, Aman
AU - Wu, Yonghui
AU - Xia, Zongqi
AU - Loguercio, Salvatore
AU - Reis, Steven E
AU - Visweswaran, Shyam
N1 - © The Author(s) 2025.
PY - 2025
Y1 - 2025
N2 - OBJECTIVE: Electronic Health Record (EHR) data are critical for advancing translational research and AI technologies. The ENACT network offers access to structured EHR data across 57 CTSA hubs. However, substantial information is contained in clinical narratives, requiring natural language processing (NLP) for research. The ENACT NLP Working Group was formed to make NLP-derived clinical information accessible and queryable across the network. METHODS: We established the ENACT NLP Working Group with 13 sites selected based on criteria including clinical notes access, IT infrastructure, NLP expertise, and institutional support. We divided sites into five focus groups targeting clinical tasks within disease contexts. Each focus group consisted of two development sites and two validation sites. We extended the ENACT ontology to standardize NLP-derived data and conducted multisite evaluations using the Open Health Natural Language Processing (OHNLP) Toolkit. RESULTS: The working group achieved 100% site retention and deployed NLP infrastructure across all sites. We developed and validated NLP algorithms for rare disease phenotyping, social determinants of health, opioid use disorder, sleep phenotyping, and delirium phenotyping. Performance varied across sites (F1 scores 0.53-0.96), highlighting data heterogeneity impacts. We extended the ENACT common data model and ontology to incorporate NLP-derived data while maintaining Shared Health Research Informatics NEtwork (SHRINE) compatibility. CONCLUSION: This demonstrates feasibility of deploying NLP infrastructure across large, federated networks. The focus group approach proved more practical than general-purpose approaches. Key lessons include the challenge of data heterogeneity and importance of collaborative governance. This work also provides a foundation that other networks can build on to implement NLP capabilities for translational research.
AB - OBJECTIVE: Electronic Health Record (EHR) data are critical for advancing translational research and AI technologies. The ENACT network offers access to structured EHR data across 57 CTSA hubs. However, substantial information is contained in clinical narratives, requiring natural language processing (NLP) for research. The ENACT NLP Working Group was formed to make NLP-derived clinical information accessible and queryable across the network. METHODS: We established the ENACT NLP Working Group with 13 sites selected based on criteria including clinical notes access, IT infrastructure, NLP expertise, and institutional support. We divided sites into five focus groups targeting clinical tasks within disease contexts. Each focus group consisted of two development sites and two validation sites. We extended the ENACT ontology to standardize NLP-derived data and conducted multisite evaluations using the Open Health Natural Language Processing (OHNLP) Toolkit. RESULTS: The working group achieved 100% site retention and deployed NLP infrastructure across all sites. We developed and validated NLP algorithms for rare disease phenotyping, social determinants of health, opioid use disorder, sleep phenotyping, and delirium phenotyping. Performance varied across sites (F1 scores 0.53-0.96), highlighting data heterogeneity impacts. We extended the ENACT common data model and ontology to incorporate NLP-derived data while maintaining Shared Health Research Informatics NEtwork (SHRINE) compatibility. CONCLUSION: This demonstrates feasibility of deploying NLP infrastructure across large, federated networks. The focus group approach proved more practical than general-purpose approaches. Key lessons include the challenge of data heterogeneity and importance of collaborative governance. This work also provides a foundation that other networks can build on to implement NLP capabilities for translational research.
UR - https://www.scopus.com/pages/publications/105014128819
UR - https://www.scopus.com/inward/citedby.url?scp=105014128819&partnerID=8YFLogxK
U2 - 10.1017/cts.2025.10116
DO - 10.1017/cts.2025.10116
M3 - Article
C2 - 40979101
VL - 9
SP - e199
JO - Journal of Clinical and Translational Science
JF - Journal of Clinical and Translational Science
IS - 1
ER -