TY - JOUR
T1 - Prediction of incident diabetes in the Jackson Heart Study using high-dimensional machine learning
AU - Casanova, Ramon
AU - Saldana, Santiago
AU - Simpson, Sean L.
AU - Lacy, Mary E.
AU - Subauste, Angela R.
AU - Blackshear, Chad
AU - Wagenknecht, Lynne
AU - Bertoni, Alain G.
N1 - Publisher Copyright: © 2016 Casanova et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2016/10
Y1 - 2016/10
N2 - Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, c-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data.
AB - Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, c-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data.
UR - http://www.scopus.com/inward/record.url?scp=84991511077&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84991511077&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0163942
DO - 10.1371/journal.pone.0163942
M3 - Article
C2 - 27727289
AN - SCOPUS:84991511077
SN - 1932-6203
VL - 11
JO - PLoS ONE
JF - PLoS ONE
IS - 10
M1 - e0163942
ER -