Background: Over the last decade, metabolomics has evolved into a mainstream enterprise used by many laboratories globally. Like other "omics" data, metabolomics data typically have far fewer samples than features. Selecting an optimal subset of features for a supervised classifier is therefore imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of "omics" data, and proposed two such extensions, referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) in patients infected with hepatitis B virus (HBV) or hepatitis C virus (HCV) with those of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC.

Results: We applied the two multi-TGDR frameworks to the HCC metabolomics data, determining the TGDR thresholds either globally across classes or locally for each class. The multi-TGDR global model selected 45 metabolites with a 0% misclassification rate (the error rate on the training data) and a 3.82% 5-fold cross-validation (CV-5) predictive error rate. The multi-TGDR local model selected 48 metabolites with a 0% misclassification rate and a 5.34% CV-5 error rate.

Conclusions: One important advantage of multi-TGDR local is that it allows inference about which features are related specifically to a given class or classes. We therefore recommend multi-TGDR local: it has similar predictive performance and requires the same computing time as multi-TGDR global, but may also provide class-specific inference.
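The difference between global and local thresholding can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a softmax (multinomial logistic) model without intercepts, the usual TGDR rule of updating only coefficients whose gradient magnitude is at least `tau` times the largest gradient magnitude, and hypothetical names (`multi_tgdr`, `tau`, `step`, `n_iter`, `mode`) chosen for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def multi_tgdr(X, y, n_classes, tau=0.8, step=0.1, n_iter=200, mode="global"):
    """Thresholded gradient ascent on the multinomial log-likelihood.

    X: (n_samples, n_features) array; y: integer labels in [0, n_classes).
    mode="global": one cutoff shared by all classes.
    mode="local":  a separate cutoff for each class (each column of beta).
    Returns beta with shape (n_features, n_classes); entries that were never
    updated stay exactly zero, i.e. the feature is not selected for that class.
    """
    n, p = X.shape
    Y = np.eye(n_classes)[y]              # one-hot labels
    beta = np.zeros((p, n_classes))
    for _ in range(n_iter):
        P = softmax(X @ beta)
        grad = X.T @ (Y - P) / n          # gradient of the mean log-likelihood
        mag = np.abs(grad)
        if mode == "global":
            cutoff = tau * mag.max()                          # scalar
        elif mode == "local":
            cutoff = tau * mag.max(axis=0, keepdims=True)     # one per class
        else:
            raise ValueError("mode must be 'global' or 'local'")
        mask = mag >= cutoff              # only the steepest directions move
        beta += step * grad * mask
    return beta


def selected_features(beta):
    """Indices of features with a nonzero coefficient in at least one class."""
    return np.flatnonzero(np.any(beta != 0, axis=1))
```

In this sketch, the nonzero entries in column `k` of the returned `beta` indicate the features associated with class `k`, which is the kind of class-specific reading that a per-class threshold is meant to support. In practice `tau` and `n_iter` would be tuned, for example by cross-validation.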
State: Published - Apr 4 2014
Bibliographical note
Funding Information:
The study was supported by the Natural Science Foundation of China (Nos. 81172727 and 81202377). ST was also partially supported by a seed fund from Jilin University (No. 450060491885). We are grateful to two reviewers for their helpful comments and to Catherine Anthony for scientific editing. We especially thank Drs. Margaret MacDonald and Ype De Jong of the Rockefeller University for helpful discussion.
Keywords
- Feature selection
- Hepatocellular carcinoma (HCC)
- Metabolic profile
- Multi-class classification
- Omics data
- Threshold gradient descent regularization (TGDR)
ASJC Scopus subject areas
- Structural Biology
- Molecular Biology
- Computer Science Applications
- Applied Mathematics