Abstract
Background: Recent natural language processing (NLP) research is dominated by neural network methods that use word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GloVe) on free-text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks through neural architectures designed to optimize task-specific objectives, which may further tune the embeddings.

Objective: Despite advances in contextualized language-model embeddings, static word embeddings remain an essential starting point in BioNLP research and applications. They are useful in low-resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications.

Methods: We jointly learn word and concept embeddings, first with the skip-gram method and then by fine-tuning them with correlational information manifested in co-occurring Medical Subject Headings (MeSH) concepts in biomedical citations. The fine-tuning uses the transformer-based BERT architecture in its two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We evaluate the tuned static embeddings on multiple word-relatedness datasets developed by previous efforts.

Results: In both qualitative and quantitative evaluations, our methods produce improved biomedical embeddings compared with other static embedding efforts. Without selectively culling concepts and terms (as pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date, with clear performance improvements across the board.

Conclusion: We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use in downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.
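As a rough illustration of the fine-tuning step described in the Methods, the sketch below feeds a pair of MeSH concept strings to a BERT-style encoder in the standard two-sentence input mode and trains it with a binary classification objective over co-occurrence labels. This is not the authors' released code (see the repository linked above); the checkpoint, training pairs, and labels are illustrative assumptions.

```python
# Minimal sketch: fine-tune a BERT-style encoder on MeSH concept pairs with a
# binary co-occurrence objective, then export the tuned input embeddings as
# static vectors. Checkpoint and data are placeholders, not the paper's setup.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical pairs: (concept A terms, concept B terms, co-occurrence label)
pairs = [
    ("myocardial infarction", "troponin", 1),
    ("myocardial infarction", "pollen allergy", 0),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text_a, text_b, label in pairs:
    # Two-sentence mode: [CLS] text_a [SEP] text_b [SEP]
    enc = tokenizer(text_a, text_b, return_tensors="pt")
    out = model(**enc, labels=torch.tensor([label]))
    out.loss.backward()  # cross-entropy loss on the co-occurrence label
    optimizer.step()
    optimizer.zero_grad()

# The fine-tuned input word embeddings can then be saved as static vectors.
static_vectors = model.bert.embeddings.word_embeddings.weight.detach()
```

The word-relatedness evaluation mentioned in the Methods is conventionally scored by correlating cosine similarities of the static vectors with human relatedness judgments; a hedged sketch with made-up embeddings and ratings:

```python
# Minimal sketch of intrinsic evaluation: Spearman correlation between model
# cosine similarities and human relatedness scores. All data is illustrative.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["renal", "kidney", "cardiac"]}
pairs = [("renal", "kidney", 9.1), ("renal", "cardiac", 3.2), ("kidney", "cardiac", 3.5)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [s for _, _, s in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman's rho: {rho:.3f}")
```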
| Original language | English |
|---|---|
| Article number | 103867 |
| Journal | Journal of Biomedical Informatics |
| Volume | 120 |
| DOIs | |
| State | Published - Aug 2021 |
Bibliographical note
Publisher Copyright: © 2021 Elsevier Inc.
Funding
Research reported in this publication was supported by the National Library of Medicine of the U.S. National Institutes of Health under Award Number R01LM013240. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
| Funders | Funder number |
|---|---|
| National Institutes of Health (NIH) | |
| U.S. National Library of Medicine | R01LM013240 |
Keywords
- Contextualized embeddings
- Fine-tuned embeddings
- Word embeddings
ASJC Scopus subject areas
- Health Informatics
- Computer Science Applications