TY - GEN
T1 - Applications of robust regression to “big” data problems
AU - Sheather, Simon J.
PY - 2016
Y1 - 2016
N2 - Robust regression methods have many potential applications in big data problems. In this paper, we consider two such applications using publicly available data. The first application looks at modeling taxi fares based on the trip distance of n = 49,800 taxi rides in New York City on Tuesday January 15, 2013. The second application focuses on modeling the airfare from the miles flown of n = 78,905 round trip itineraries for single passengers which consisted of 2 direct one-way flights within the contiguous domestic US market on Southwest Airlines in the fourth quarter of 2014. The robust estimates were obtained for both applications using PROC ROBUSTREG in SAS 9.4. In both cases, we find that the confidence intervals around the robust estimates of the parameters in the regression models are very narrow, typically $0.01 or lower. With these confidence intervals being so narrow, one is left with the impression that these robust estimates differ in some meaningful way across at least some of the robust methods. Finally, utilizing findings in Cox (Biometrika, 102:712–716, 2015) we argue that in such applications it is not surprising that the confidence intervals around the robust estimates are very narrow, thus producing the illusion of apparently very high precision.
AB - Robust regression methods have many potential applications in big data problems. In this paper, we consider two such applications using publicly available data. The first application looks at modeling taxi fares based on the trip distance of n = 49,800 taxi rides in New York City on Tuesday January 15, 2013. The second application focuses on modeling the airfare from the miles flown of n = 78,905 round trip itineraries for single passengers which consisted of 2 direct one-way flights within the contiguous domestic US market on Southwest Airlines in the fourth quarter of 2014. The robust estimates were obtained for both applications using PROC ROBUSTREG in SAS 9.4. In both cases, we find that the confidence intervals around the robust estimates of the parameters in the regression models are very narrow, typically $0.01 or lower. With these confidence intervals being so narrow, one is left with the impression that these robust estimates differ in some meaningful way across at least some of the robust methods. Finally, utilizing findings in Cox (Biometrika, 102:712–716, 2015) we argue that in such applications it is not surprising that the confidence intervals around the robust estimates are very narrow, thus producing the illusion of apparently very high precision.
KW - Big data
KW - Illusion of very high precision
KW - Robust regression
KW - SAS
UR - http://www.scopus.com/inward/record.url?scp=84990032217&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84990032217&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-39065-9_6
DO - 10.1007/978-3-319-39065-9_6
M3 - Conference contribution
AN - SCOPUS:84990032217
SN - 9783319390635
T3 - Springer Proceedings in Mathematics and Statistics
SP - 101
EP - 120
BT - Robust Rank-Based and Nonparametric Methods - Selected, Revised, and Extended Contributions
A2 - McKean, Joseph W.
A2 - Liu, Regina Y.
T2 - International Conference on Robust Rank-Based and Nonparametric Methods, 2015
Y2 - 9 April 2015 through 10 April 2015
ER -