Automatic Extraction of ICD-O-3 Primary Sites from Cancer Pathology Reports

Research output: Article, peer-reviewed

Abstract

Although registry-specific requirements exist, cancer registries primarily identify reportable cases using combinations of particular ICD-O-3 topography and morphology codes assigned to cancer case abstracts, of which free-text pathology reports form a main component. The codes are generally extracted from pathology reports by trained human coders, sometimes with the help of software programs. Here we present results that improve on the state of the art in automatic extraction of 57 generic sites from pathology reports, using three representative machine learning algorithms for text classification. We use a dataset of 56,426 reports arising from 35 labs that report to the Kentucky Cancer Registry. Employing unigrams, bigrams, and named entities as features, our methods achieve a class-based micro F-score of 0.9 and a macro F-score of 0.72. To our knowledge, this is the best result on extracting ICD-O-3 codes from pathology reports when the number of possible codes is large. Because our dataset is large compared with those of similar efforts and draws on reports from 35 different labs, we also expect our final models to generalize better when extracting primary sites from previously unseen reports.
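The abstract reports both a micro and a macro F-score, and the gap between them is easier to see with a small worked example. The sketch below is not the authors' code: it assumes whitespace tokenization, and the function names (`ngram_features`, `micro_macro_f`) and the toy gold and predicted labels are hypothetical, chosen only for illustration. It shows how unigram and bigram counts might be derived from report text, and how micro averaging (pooling counts over all reports) differs from macro averaging (averaging per-site scores, so rare sites weigh as much as common ones).

```python
from collections import Counter


def ngram_features(text, n_max=2):
    """Count unigrams and bigrams in a lowercased, whitespace-tokenized text.

    Assumption: simple whitespace tokenization; the paper's preprocessing
    is not described in the abstract.
    """
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def f_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f(gold, pred):
    """Return (micro F, macro F) for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted site p was wrong
            fn[g] += 1  # true site g was missed
    # Micro: pool the counts across all classes, then compute one F-score.
    micro = f_score(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute an F-score per class, then take the unweighted mean.
    macro = sum(f_score(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels (hypothetical, not data from the paper): ICD-O-3 topography
# codes C50 (breast), C34 (bronchus and lung), C18 (colon).
gold = ["C50", "C50", "C50", "C34", "C18"]
pred = ["C50", "C50", "C50", "C50", "C18"]
micro, macro = micro_macro_f(gold, pred)
print(f"micro F = {micro:.3f}, macro F = {macro:.3f}")
```

In this toy case the single missed C34 report lowers the micro F-score only slightly but drives the per-class score for C34 to zero, which pulls the macro average down much further. A similar effect on infrequent sites is one plausible reading of the difference between the reported 0.9 and 0.72.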

Original language: English
Pages (from-to): 112-6
Number of pages: 5
Journal: AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science
Volume: 2013
Status: Published - 2013
