Abstract
Recent advances in technology enable researchers to gather and store enormous data sets with ultra-high dimensionality. In bioinformatics, microarray and next-generation sequencing technologies can produce data with tens of thousands of predictors of biomarkers, while the corresponding sample sizes are often limited. For classification problems, to predict new observations with high accuracy and to better understand the effect of predictors on classification, it is desirable, and often necessary, to train the classifier with variable selection. In the literature, sparse regularized classification techniques have been popular owing to their ability to perform classification and variable selection simultaneously. Despite their success, such sparse penalized methods can be computationally slow when the dimension of the problem is ultra-high. To overcome this challenge, we propose a new sparse REgression based multicategory Classifier (REC). Our method uses a simplex to represent the different categories of the classification problem. A major advantage of REC is that the optimization can be decoupled into smaller, independent sparse penalized regression problems, which can therefore be solved using parallel computing. Consequently, REC is extraordinarily fast to compute. Moreover, REC is able to provide class conditional probability estimates. Simulated examples and applications to microarray and next-generation sequencing data suggest that REC is very competitive when compared with several existing methods.
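The decoupling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vertex formula is one standard equiangular simplex construction, the lasso solver is a plain coordinate descent, and the names `simplex_vertices`, `lasso_cd`, `fit_rec_sketch`, and `predict_rec_sketch` are hypothetical. Class labels are encoded as simplex vertices in k-1 dimensions, each coordinate of that encoding is fitted by its own lasso regression, and a new point is assigned to the class whose vertex has the largest inner product with the fitted vector. Probability estimation is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def simplex_vertices(k):
    """Return a (k, k-1) array of unit-norm, equiangular simplex vertices."""
    W = np.zeros((k, k - 1))
    W[0] = (k - 1) ** -0.5
    for j in range(1, k):
        W[j] = -(1 + np.sqrt(k)) / (k - 1) ** 1.5
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)  # keep resid = y - X @ beta
            beta[j] = new
    return beta


def fit_rec_sketch(X, labels, k, lam):
    """Fit k-1 independent lasso regressions onto simplex-coded labels."""
    W = simplex_vertices(k)
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    Y = W[labels]  # (n, k-1) regression targets
    y_mean = Y.mean(axis=0)
    # The k-1 problems share no parameters, so they can be dispatched
    # independently. Threads only illustrate the structure; a process pool
    # or a cluster would be needed for a real speedup with this pure-Python solver.
    with ThreadPoolExecutor() as pool:
        cols = list(pool.map(
            lambda q: lasso_cd(Xc, Y[:, q] - y_mean[q], lam), range(k - 1)))
    B = np.column_stack(cols)  # (p, k-1) sparse coefficient matrix
    return W, B, x_mean, y_mean


def predict_rec_sketch(model, X):
    """Assign each row to the class whose vertex best aligns with the fit."""
    W, B, x_mean, y_mean = model
    F = (X - x_mean) @ B + y_mean
    return np.argmax(F @ W.T, axis=1)


# Synthetic demo (made-up data): 3 classes, 50 predictors, 2 informative.
rng = np.random.default_rng(0)
k, n_per, p = 3, 30, 50
labels = np.repeat(np.arange(k), n_per)
X = rng.normal(size=(k * n_per, p))
X[:, :k - 1] += 3.0 * simplex_vertices(k)[labels]
model = fit_rec_sketch(X, labels, k, lam=0.1)
acc = np.mean(predict_rec_sketch(model, X) == labels)
print(f"training accuracy: {acc:.2f}")
```

Because each column of the simplex-coded response is fitted separately, the cost of adding workers is only the dispatch overhead; the penalty level `lam` is an arbitrary choice here and would normally be tuned, for example by cross-validation.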
Original language | English |
---|---|
Pages (from-to) | 175-185 |
Number of pages | 11 |
Journal | Statistics and its Interface |
Volume | 10 |
Issue number | 2 |
State | Published - 2017 |
Bibliographical note
Funding Information: The authors would like to thank the Editor, Prof. Heping Zhang, for helpful suggestions. The authors were supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), US National Science Foundation (NSF) grants DMS1407241 and IIS1054631, NIH grants CA149569, HG06272, CA142538, and P30CA177558, and the National Natural Science Foundation of China (NSFC 61472475).
Funding
Funders | Funder number |
---|---|
National Institutes of Health (NIH) | P30CA177558, CA149569, CA142538, HG06272 |
National Science Foundation (NSF) | IIS1054631, DMS1407241 |
Natural Sciences and Engineering Research Council of Canada (NSERC) | |
National Natural Science Foundation of China (NSFC) | NSFC 61472475 |
Keywords
- LASSO
- Parallel computing
- Probability estimation
- Simplex
- Variable selection
ASJC Scopus subject areas
- Statistics and Probability
- Applied Mathematics