Abstract
The rapid adoption of big data, machine learning (ML), and generative artificial intelligence (AI) in chemical discovery has heightened the importance of quantifying molecular similarity. Molecular similarity, commonly assessed as the distance between molecular fingerprints, is integral to applications such as database curation, diversity analysis, and property prediction. AI tools frequently rely on these similarity measures to cluster molecules under the assumption that structurally similar molecules exhibit similar properties. However, this assumption is not universally valid, particularly for continuous properties like electronic structure properties. Despite the prevalence of fingerprint-based similarity measures, their evaluation has largely depended on biological activity data sets and qualitative metrics, limiting their relevance for nonbiological domains. To address this gap, we propose a framework to evaluate the correlation between molecular similarity measures and molecular properties. Our approach builds on the concept of neighborhood behavior and incorporates kernel density estimation (KDE) analysis to quantify how well similarity measures capture property relationships. Using a data set of over 350 million molecule pairs with electronic structure, redox, and optical properties, we systematically evaluate the correlation between several molecular fingerprint generators, distance functions, and these properties. Both the curated data set and the evaluation framework are publicly available.
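As a rough illustration of the idea in the abstract, the sketch below pairs a fingerprint similarity with the absolute property difference for every molecule pair. This is the raw material a neighborhood-behavior or KDE analysis would work from. It is not the authors' framework: the fingerprints are hypothetical sets of "on" bit indices, Tanimoto similarity stands in for whichever distance function is being evaluated, and the function names and toy property values are assumptions made for this example only.

```python
from itertools import combinations


def tanimoto_similarity(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def pairwise_similarity_vs_property(fingerprints, properties):
    """Return (i, j, similarity, |property difference|) for every molecule pair."""
    rows = []
    for i, j in combinations(range(len(fingerprints)), 2):
        sim = tanimoto_similarity(fingerprints[i], fingerprints[j])
        diff = abs(properties[i] - properties[j])
        rows.append((i, j, sim, diff))
    return rows


# Toy data (hypothetical): three fingerprints and one continuous property each.
toy_fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
]
toy_properties = [2.0, 2.5, 5.0]

pairs = pairwise_similarity_vs_property(toy_fingerprints, toy_properties)
for i, j, sim, diff in pairs:
    print(f"pair ({i}, {j}): similarity={sim:.2f}, property difference={diff:.2f}")
```

If a similarity measure reflects a property well, high-similarity pairs should cluster at small property differences; a density estimate over the (similarity, difference) pairs is one way to quantify that tendency.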
| Original language | English |
|---|---|
| Pages (from-to) | 4311-4319 |
| Number of pages | 9 |
| Journal | Journal of Chemical Information and Modeling |
| Volume | 65 |
| Issue number | 9 |
| State | Published - May 12, 2025 |
Bibliographical note
Publisher Copyright: © 2025 American Chemical Society.
Funding
This work was generously supported by the National Science Foundation (NSF) under awards 2019574 and 2053760, the University of Kentucky Lyman T. Johnson fellowship, and the P.E.O. Scholars Award. Computational resources were provided through an NSF Extreme Science and Engineering Discovery Environment (XSEDE) Resource Allocation Award (CHE200119) and an Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) DISCOVER Allocation Award (PHY220121). We further acknowledge the University of Kentucky (UK) Center for Computational Sciences and Information Technology Services Research Computing for their fantastic support and collaboration and for the use of the Lipscomb Compute Cluster and associated research computing resources. We thank Nirav Merchant for technical discussions on cyberinfrastructure.
| Funders | Funder number |
|---|---|
| National Science Foundation | 2019574, 2053760 |
| NSF Extreme Science and Engineering Discovery Environment (XSEDE) | CHE200119 |
| Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) | PHY220121 |
| University of Kentucky | |
ASJC Scopus subject areas
- General Chemistry
- General Chemical Engineering
- Computer Science Applications
- Library and Information Sciences
Article title: Evaluating Molecular Similarity Measures: Do Similarity Measures Reflect Electronic Structure Properties?