III: Small: Rural: Querying Rich Uncertain Data in Real Time

  • Ge, Tingjian (PI)

Grants and Contracts Details


Rural: Querying Rich Uncertain Data in Real Time

Summary: Compared to traditional deterministic data, uncertain data carries more information, indicating that the data cannot be completely cleansed. Forcing uncertain data to be deterministic (e.g., by taking the expectations) can cause significant information loss in query results, possibly leading to wrong judgments for queries that aid decision making. Furthermore, applications that demand fast, real-time responses can severely restrict the time allowed to acquire enough knowledge to cleanse the data and remove uncertainty. Therefore, obtaining the most precise and most informative answers to such queries within the given deadlines is a critical and challenging problem.

Unfortunately, query processing on uncertain data is still an immature field, and the techniques to date are ill-suited for processing real-time queries. Moreover, the problems of data cleansing, learning the distributions of uncertain data, using predictive models to infer distributions, and query processing on uncertain data are each studied independently. However, as raw data comes in continuously, all these components must operate in real time. Thus, we need to consider and optimize them as a whole in order to obtain the best execution plan and meet the real-time requirements.

Our proposed project, called Rural (querying rich uncertain data in real time), aims to fill this gap. Rural has a rich treatment of distributions. In fact, unlike previous work, distributions are regarded as objects in order to achieve optimization at the overall system level and to meet the goal of real-time and online query processing. Rural incorporates five new ideas:

  • A variety of query processing algorithms on uncertain data, some for general query types and others for more specific operators, such as the more expensive joins. We focus on (1) the online and fast processing aspects and (2) tight integration with data cleansing and distribution learning (e.g., through pipelining).
  • The treatment of distributions as objects: they can be compressed, shared, marginalized, and conditioned. The distributions themselves can be imprecise and evolving.
  • The use of Bayesian networks and forecasting models to meet real-time query needs, and the system techniques to manage those models.
  • A query optimizer that considers the various phases of data processing, as well as the use of models.
  • Answering top-k queries with consideration of the typicality of results in terms of their preference scores.
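The summary's two central claims, that collapsing uncertain data to its expectation loses information and that distributions can be handled as objects that are marginalized and conditioned, can be illustrated with a small sketch. This is not Rural's implementation: the `DiscreteDist` class, its method names, and the sensor data are all hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict
from fractions import Fraction


class DiscreteDist:
    """Hypothetical joint distribution: {outcome tuple: probability}."""

    def __init__(self, names, weights):
        self.names = tuple(names)
        total = sum(weights.values())
        # Normalize the weights so the probabilities sum to 1.
        self.table = {k: Fraction(w) / total for k, w in weights.items()}

    def marginalize(self, keep):
        """Sum out every variable not listed in `keep`."""
        idx = [self.names.index(n) for n in keep]
        out = defaultdict(Fraction)
        for outcome, p in self.table.items():
            out[tuple(outcome[i] for i in idx)] += p
        return DiscreteDist(keep, out)

    def condition(self, name, value):
        """Restrict to outcomes where `name == value`, then renormalize."""
        i = self.names.index(name)
        kept = {k: p for k, p in self.table.items() if k[i] == value}
        if not kept:
            raise ValueError("conditioning on a zero-probability event")
        return DiscreteDist(self.names, kept)

    def prob(self, predicate):
        """Probability that `predicate` holds for a record drawn from the distribution."""
        return sum(
            (p for k, p in self.table.items()
             if predicate(dict(zip(self.names, k)))),
            Fraction(0),
        )

    def expectation(self, name):
        i = self.names.index(name)
        return sum((k[i] * p for k, p in self.table.items()), Fraction(0))


# Made-up uncertain sensor reading: (temperature, sensor_ok) with weights.
reading = DiscreteDist(
    ["temp", "ok"],
    {(60, True): 4, (110, True): 1, (60, False): 1, (110, False): 4},
)

def too_hot(r):
    return r["temp"] > 100

p_hot = reading.prob(too_hot)                    # answer on the full distribution
collapsed = reading.expectation("temp") > 100    # answer after taking the expectation
p_hot_given_ok = reading.condition("ok", True).prob(too_hot)
temp_only = reading.marginalize(["temp"])
```

With these made-up weights, the expected temperature is below the threshold, so the collapsed query answers "no" even though the full distribution puts substantial probability on an overheated reading. Conditioning on the sensor being healthy changes the answer again, which is the kind of information a deterministic cleanse would discard.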
Effective start/end date: 9/1/10 → 1/20/12

