Natural Language Processing Platform for Cancer Surveillance

Grants and Contracts Details


This UG3/UH3 proposal is in response to Research Area 1 of PAR 16-349, specifically addressing the development of natural language processing (NLP) tools to facilitate automatic, unsupervised, or minimally supervised extraction of specific discrete cancer-related data from various types of unstructured electronic medical records (EMRs). The current proposal builds on the work under a supplement funded by the NCI Informatics Tools for Cancer Research (ITCR) program (U24CA132672 parent project) on the advanced development of a software platform for performing deep phenotype extraction directly from the EMRs of patients with cancer (DeepPhe, May 1, 2014-April 30, 2019). We investigated and implemented methods for the extraction of (1) cancer containers with attributes for location, laterality, temporality, clinical stage, pTNM, cTNM, debulking (for OvCa), and degree of debulking (for OvCa), and (2) tumor containers with attributes for location, laterality, diagnosis, type, size, biomarkers, genes/proteins, calcification, clockface position (for BrCa), quadrant (for BrCa), cancer type, histologic type, tumor extent, Breslow depth (for melanoma), and ulceration (for melanoma). Value sets are normalized to NCIt, which provides mappings to ICD-O. The DeepPhe platform visualizes the output in individual and cohort views with direct links to source text. The development of the scientific methods and functionalities of the DeepPhe platform has been guided by the driving biology projects led by our clinical domain experts and SEER collaborators in melanoma, breast, and ovarian cancer. Our SEER collaborators are the Kentucky (KY) and Louisiana (LA) Cancer Registries, led by Drs. Durbin and Wu respectively, as well as the central SEER registry led by Dr. Penberthy. We will build on this work, advancing NLP methods for information extraction of clinical phenotyping data needed to fuel a new cancer surveillance paradigm benefitting hospital-based, state-based, and national registries.
In this new paradigm, surveillance programs would use software to enhance the speed, accuracy, and ease of cancer reporting. The proposed DeepPhe*CR software would be deployed at local sites or centrally, and could eventually be integrated into existing or new visualization and abstraction tools as needed by the cancer surveillance community. In the UG3 phase (July 1, 2019 – June 30, 2021), we will continue our partnership with the KY Cancer Registry led by Dr. Durbin. In the UH3 phase (July 1, 2021 – June 30, 2024), we will continue our collaboration with the LA Cancer Registry led by Dr. Wu and will establish a new partnership with the Massachusetts (MA) Cancer Registry led by Dr. Gershman. The MA Cancer Registry joined the SEER family in May 2018, and thus presents an excellent testbed for deploying tools to new SEER registries. For our use cases, we will focus on six types of cancer: breast, lung, melanoma, prostate, colorectal, and ovarian. Through the collaboration with three SEER registries, we will not only develop methods to assist abstraction but will also demonstrate the scalability of our tools through their integration in the SEER*DMS editor. In collaboration with NCI and SEER, we have established these performance goals: F1 > 0.75 for machine-assisted human validation and F1 > 0.95 for full automation. Other criteria include process efficiency improvements and registrars’ satisfaction with the tools and process. We propose the following specific aims:
Aim 1: Develop methods for the extraction of the cancer and tumor characteristics currently manually abstracted by the registries from a variety of data sources, e.g., pathology reports, which have been traditionally used by the registries; the growing volume of radiology reports, which are new data streams for most registries; and genomic test reports. (UG3 phase with continued improvements throughout UH3 phase)
Aim 2: Extract treatment information from various channels: (1) sources available to the registries, such as CDA transmissions, NAACCR data elements, and documentation generated at the clinical encounter (e.g., radiation treatment centers), and (2) EMR documentation at the point of care, by running the tool directly at the point of care and transmitting the extracted information to the registries. The extracted treatment information will be mapped to ontologies such as RxNorm, LOINC, and HemOnc. (UG3 phase with continued improvements throughout UH3 phase)
Aim 3: Investigate methods for the extraction of clinical genomics information from (1) XML data feeds from sequencing providers such as Foundation Medicine, (2) oncotype feeds, and (3) pathology notes. (UG3 phase with continued improvements throughout UH3 phase)
Aim 4: Develop software architectures and tools in support of integrating the best-performing DeepPhe methods from Specific Aims 1-3 into registry abstraction tools. We will develop REST APIs and containerized implementations of DeepPhe, to be used in consultation with IMS and in-house software teams to create links between DeepPhe and registry tools, allowing DeepPhe data and visualizations to be seamlessly incorporated into registry workflows.
Innovation and Potential Impact: This work continues the theme of the SEER supplement of the parent DeepPhe project to research and develop a novel platform for extracting deep phenotype information within emerging data science environments. Although there has been some previous work on automatic phenotype extraction from the clinical narrative for specific types of cancer or individual variables, the proposed work will be a step towards a generalizable information extraction framework that can be applied for both research and surveillance purposes. This generalizability enables extensibility and scalability.
Interoperability is reinforced through the modeling part of the proposed project, which is grounded in the most recent advances in biomedical ontologies, terminologies, and community conventions and standards. Our partnership with three SEER cancer registries provides our development processes with a solid foundation in large-scale cancer surveillance.
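As an illustration of the performance goals stated above (F1 > 0.75 for machine-assisted human validation, F1 > 0.95 for full automation), the following is a minimal sketch of how an extracted data element might be scored and assigned a deployment mode. It is not part of DeepPhe: the names `Counts`, `f1_score`, and `deployment_mode` are hypothetical, and the "manual abstraction" fallback for elements below both thresholds is an assumption not stated in the proposal.

```python
from dataclasses import dataclass

# Thresholds taken from the stated performance goals.
ASSISTED_THRESHOLD = 0.75    # machine-assisted human validation
AUTOMATION_THRESHOLD = 0.95  # full automation


@dataclass
class Counts:
    """Hypothetical evaluation counts for one extracted data element."""
    true_positives: int
    false_positives: int
    false_negatives: int


def f1_score(c: Counts) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    denominator = 2 * c.true_positives + c.false_positives + c.false_negatives
    if denominator == 0:
        return 0.0  # no predictions and no gold labels
    return 2 * c.true_positives / denominator


def deployment_mode(f1: float) -> str:
    """Map an F1 score to a mode; both thresholds are strict ('>')."""
    if f1 > AUTOMATION_THRESHOLD:
        return "full automation"
    if f1 > ASSISTED_THRESHOLD:
        return "machine-assisted human validation"
    # Assumption: elements below both goals stay with manual abstraction.
    return "manual abstraction"


if __name__ == "__main__":
    # Made-up counts for illustration only, not results from the project.
    for name, counts in {
        "laterality": Counts(98, 1, 1),
        "tumor size": Counts(90, 5, 5),
        "biomarkers": Counts(50, 30, 30),
    }.items():
        score = f1_score(counts)
        print(f"{name}: F1={score:.3f} -> {deployment_mode(score)}")
```

The per-element framing is a design choice for the sketch: it lets well-performing elements be automated while others stay under human review.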
Effective start/end date: 7/19/19 - 6/30/20


  • Boston Children’s Hospital: $82,554.00

