Abstract
Named entity recognition (NER) and entity normalization (EN) form an indispensable first step in many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort, we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second is to use additional counterfactual training examples to keep models from learning spurious correlations between surrounding context and normalized concepts in the training data. We conduct extensive experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that the first strategy performs better in entity normalization than the standard encoding scheme. The second strategy, a form of data augmentation, uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent in zero-shot settings, that is, for concepts never encountered during training.
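The tag decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `decouple_tags` and `recouple_tags`, the example sentence, and the choice to label outside tokens with "O" in both sequences are assumptions made for illustration only.

```python
from typing import List, Tuple


def decouple_tags(tags: List[str]) -> Tuple[List[str], List[str]]:
    """Split combined tags such as "B-Drug" into positional and type sequences.

    Assumption: tokens outside any entity carry "O" in both sequences.
    """
    positions: List[str] = []
    types: List[str] = []
    for tag in tags:
        if tag == "O":
            positions.append("O")
            types.append("O")
        else:
            # "B-Drug" -> position "B", type "Drug"
            position, _, entity_type = tag.partition("-")
            positions.append(position)
            types.append(entity_type)
    return positions, types


def recouple_tags(positions: List[str], types: List[str]) -> List[str]:
    """Rebuild combined tags from the two decoupled sequences."""
    return [p if p == "O" else f"{p}-{t}" for p, t in zip(positions, types)]


# Hypothetical example sentence and labels, not taken from MedMentions.
tokens = ["Aspirin", "reduces", "heart", "attack", "risk"]
combined = ["B-Drug", "O", "B-Disease", "I-Disease", "O"]

positions, types = decouple_tags(combined)
for token, position, entity_type in zip(tokens, positions, types):
    print(f"{token}\t{position}\t{entity_type}")
```

With decoupled sequences, a model could predict span boundaries and entity types as two separate outputs instead of one label drawn from every position-type combination.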
Original language | English |
---|---|
Title of host publication | Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 |
ISBN (Electronic) | 9781450384506 |
State | Published - Jan 18 2021 |
Event | 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States Duration: Aug 1 2021 → Aug 4 2021 |
Conference
Conference | 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 8/1/21 → 8/4/21 |
Bibliographical note
Publisher Copyright: © 2021 ACM.
Keywords
- biomedical natural language processing
- deep neural networks
- entity normalization
- information extraction
- named entity recognition
ASJC Scopus subject areas
- Computer Science Applications
- Software
- Biomedical Engineering
- Health Informatics