Using natural language processing to identify opioid use disorder in electronic health record data

Jade Singleton, Chengxi Li, Peter D. Akpunonu, Erin L. Abner, Anna M. Kucharska-Newton

Research output: Contribution to journal › Article › peer-review

11 Scopus citations


Background: As opioid prescriptions have risen, so have opioid use disorder (OUD) and its adverse outcomes. Accurate and complete epidemiologic surveillance of OUD, which is needed to inform prevention strategies, remains challenging. The objective of this study was to estimate the prevalence of OUD using two methods of identifying OUD in electronic health records (EHR): natural language processing (NLP) for text mining of unstructured clinical notes, and International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) diagnostic codes.

Methods: Data were drawn from EHR for hospital and emergency department patient visits to a large regional academic medical center from 2017 to 2019. ICD-10-CM discharge codes were extracted for each visit. The rule-based NLP algorithm was developed in a stepwise process. First, a small sample of visits from 2017 was used to develop initial dictionaries. Next, EHR corresponding to 30,124 visits from 2018 were used to develop and evaluate the rule-based algorithm. A random sample of the results was manually reviewed to identify and address shortcomings in the algorithm, and to estimate the sensitivity and specificity of the two methods of ascertainment. Last, the final algorithm was applied to 29,212 visits from 2019 to estimate OUD prevalence.

Results: Overall, 2,332 unique visits were identified by at least one method, with substantial overlap between the two (n = 1,381 [59.2 %] identified by both). Of the total unique visits, 430 (18.4 %) were identified only by ICD-10-CM codes, and 521 (22.3 %) were identified only by NLP. The prevalence of visits with evidence of an OUD diagnosis in this sample, ascertained using only ICD-10-CM codes, was 1,811/29,212 (6.1 %). Including the additional 521 visits identified only by NLP, the estimated prevalence of OUD was 2,332/29,212 (7.9 %), an increase of 29.5 % compared to the use of ICD-10-CM codes alone. The estimated sensitivity and specificity of the NLP-based OUD classification were 81.8 % and 97.5 %, respectively, relative to gold-standard manual review by an expert addiction medicine physician.

Conclusion: NLP-based algorithms can automate data extraction and identify evidence of opioid use disorder in unstructured electronic healthcare records. The most complete ascertainment of OUD in EHR was achieved by combining NLP with ICD-10-CM codes. NLP should be considered for epidemiological studies involving EHR data.
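To make the idea of a dictionary-driven, rule-based classifier concrete, here is a minimal Python sketch. It is not the study's algorithm: the abstract does not list the dictionaries or rules, so the term list `OUD_TERMS`, the negation cues `NEGATION_CUES`, and the function `note_mentions_oud` are all hypothetical. The only rule shown is that a dictionary term counts unless a negation cue appears earlier in the same sentence.

```python
import re

# Hypothetical dictionary of OUD-related phrases (illustrative only;
# the study's actual dictionaries are not given in the abstract).
OUD_TERMS = ["opioid use disorder", "opioid dependence", "opioid abuse", "heroin use"]

# Hypothetical negation cues; a term is ignored when one of these
# appears before it in the same sentence.
NEGATION_CUES = ["no history of", "denies", "negative for"]

TERM_RE = re.compile("|".join(re.escape(t) for t in OUD_TERMS), re.IGNORECASE)
NEG_RE = re.compile("|".join(re.escape(c) for c in NEGATION_CUES), re.IGNORECASE)


def note_mentions_oud(note: str) -> bool:
    """Return True if any sentence contains a dictionary term that is
    not preceded by a negation cue within that sentence."""
    for sentence in re.split(r"[.!?\n]+", note):
        for match in TERM_RE.finditer(sentence):
            preceding_text = sentence[:match.start()]
            if not NEG_RE.search(preceding_text):
                return True
    return False
```

A real clinical pipeline would need far richer dictionaries (misspellings, abbreviations, medication names) and more careful handling of negation, family history, and templated text; manual review of a sample, as described in the Methods, is how such gaps are usually found.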
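The counts reported in the Results can be cross-checked with a few lines of arithmetic. The Python sketch below uses hypothetical helper names (`ascertainment_summary`, `sensitivity_specificity`) and takes the three disjoint visit counts from the abstract as inputs. The sensitivity and specificity formulas are the standard definitions; the abstract does not report the underlying confusion-matrix counts, so none are supplied here.

```python
def ascertainment_summary(both: int, icd_only: int, nlp_only: int, total_visits: int) -> dict:
    """Combine disjoint visit counts from two ascertainment methods."""
    icd_total = both + icd_only            # visits flagged by ICD-10-CM codes
    union = both + icd_only + nlp_only     # visits flagged by either method
    return {
        "icd_total": icd_total,
        "union": union,
        "share_both": both / union,
        "share_icd_only": icd_only / union,
        "share_nlp_only": nlp_only / union,
        # Prevalence values are returned unrounded.
        "prevalence_icd": icd_total / total_visits,
        "prevalence_combined": union / total_visits,
    }


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple:
    """Standard definitions relative to a reference (gold-standard) label."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Counts taken from the abstract: 1,381 visits found by both methods,
# 430 by ICD-10-CM only, 521 by NLP only, out of 29,212 visits in 2019.
summary = ascertainment_summary(both=1381, icd_only=430, nlp_only=521, total_visits=29212)
```

With these inputs, `summary["icd_total"]` is 1,811 and `summary["union"]` is 2,332, matching the numerators reported in the Results.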

Original language: English
Article number: 104963
Journal: International Journal of Medical Informatics
State: Published - Feb 2023

Bibliographical note

Publisher Copyright:
© 2022


Keywords

  • Electronic healthcare records
  • Natural language processing
  • Opioid use disorder

ASJC Scopus subject areas

  • Health Informatics

