CIF: Small: A Novel Paradigm of Information Extraction in Big Data Problems

  • Yin, Xiangrong (PI)

Grants and Contracts Details


OVERVIEW: Big data with huge sample sizes and ultrahigh dimensionality is now common in modern scientific fields. However, it poses many difficulties and challenges for analysis methods. In this proposal, the investigator will develop a novel information extraction system together with new sufficient dimension reduction (SDR) and sufficient variable selection (SVS) methods to broaden our understanding of such big data. The investigator proposes and develops a coherent collection of techniques for estimation, efficient computation, asymptotic studies, and statistical inference that overcome these new challenges. Theory and methodology developed in this proposal will lead to new research directions in SDR and SVS and, essentially, to data mining tools for big data. Keywords: data reduction and processing; information extraction; variable selection.
INTELLECTUAL MERIT: The proposed investigation synthesizes and deepens some of the most important recent advances in SDR and SVS with fast computing algorithms. It is a major step towards understanding SDR and SVS, their advantages, and how to overcome their disadvantages so that they meet the challenges of big data analysis. More importantly, this synthesis provides data scientists with a new platform for developing more flexible and efficient information extraction methods, and data modeling methodology in general. The overarching goal is to lead a new direction in SDR and SVS research and to investigate a set of innovative analytic tools for big data. In particular, the investigator aims to overcome the limitations of current SDR and SVS methods and to extend their scope in practice.
Although the proposed research is motivated by and directly addresses the drawbacks of existing SDR and SVS methods for big data analysis, the methodology and theory developed here will open new research frontiers and can be applied to a wide range of scientific applications such as bioinformatics, pattern recognition, marketing, and environmental studies. The investigator has four specific goals: (a) develop novel and efficient SDR and SVS methods for traditional data (n > p); (b) study the theoretical properties of the proposed approaches; (c) develop new algorithms implementing these methods for data with ultrahigh dimensionality (p >> n) and large samples, and apply them to various scientific problems; (d) develop an open-access R package to disseminate the knowledge to the scientific community. The investigator believes his efforts will not only extend the understanding of SDR and SVS methods, but also lead to fundamental advances in analyzing big data.
BROADER IMPACTS: The proposed research addresses emerging issues in information extraction from big data with huge sample sizes and ultrahigh dimensionality. Such data occur frequently in many scientific fields, such as bioinformatics, machine learning, pattern recognition, and environmental sciences. The proposed research will lead to a deeper understanding of modern data structures, help develop new data mining methods, and advance statistical theory. The principal investigator proposes a new data analysis strategy for big data, with new SDR and SVS methods, that can be flexibly adapted to modern data structures. The investigator has a successful publication record and experience in statistical research spanning theory, methodology, and applications.
The proposed investigations would provide an excellent opportunity for both undergraduate and graduate students (especially underrepresented minorities) to participate in cutting-edge statistical applications and methodology development, and thereby prepare them well for their future careers.
Effective start/end date: 10/1/18 to 8/11/20

