Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, C-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data.
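The AUC reported above is the probability that a randomly chosen participant who developed diabetes receives a higher predicted risk than a randomly chosen participant who did not. The following is a minimal sketch of that pairwise definition, not the study's own code: the function name `auc` and the toy labels and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: iterable of 0/1 outcomes (1 = developed the condition)
    scores: iterable of predicted risks, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count case/non-case pairs where the case has the higher score;
    # tied scores contribute half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the Jackson Heart Study.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
print(round(auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the 0.82 reported for the full RF model.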
State: Published - Oct 2016
Bibliographical note
Publisher Copyright:
© 2016 Casanova et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
ASJC Scopus subject areas
- Biochemistry, Genetics and Molecular Biology (all)
- Agricultural and Biological Sciences (all)