TY - GEN
T1 - Approximate clustering on distributed data streams
AU - Qi, Zhang
AU - Jinze, Liu
AU - Wei, Wang
PY - 2008
Y1 - 2008
N2 - We Investigate the problem of clustering on distributed data streams. In particular, we consider the k-median clustering on stream data arriving at distributed sites which communicate through a routing tree. Distributed clustering on high speed data streams is a challenging task due to limited communication capacity, storage space, and computing power at each site. In this paper, we propose a suite of algorithms for computing (1 + ε)-approximate k-medlan clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per node transmission to polylog N (opposed to Ω(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data. In practice, our algorithms are able to reduce the data transmission to a small fraction of the original data. Moreover, our results Indicate that the algorithms are scalable with respect to the data volume, approximation factor, and the number of sites.
AB - We Investigate the problem of clustering on distributed data streams. In particular, we consider the k-median clustering on stream data arriving at distributed sites which communicate through a routing tree. Distributed clustering on high speed data streams is a challenging task due to limited communication capacity, storage space, and computing power at each site. In this paper, we propose a suite of algorithms for computing (1 + ε)-approximate k-medlan clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per node transmission to polylog N (opposed to Ω(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data. In practice, our algorithms are able to reduce the data transmission to a small fraction of the original data. Moreover, our results Indicate that the algorithms are scalable with respect to the data volume, approximation factor, and the number of sites.
UR - http://www.scopus.com/inward/record.url?scp=52649152978&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=52649152978&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2008.4497522
DO - 10.1109/ICDE.2008.4497522
M3 - Conference contribution
AN - SCOPUS:52649152978
SN - 9781424418374
T3 - Proceedings - International Conference on Data Engineering
SP - 1131
EP - 1139
BT - Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
T2 - 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
Y2 - 7 April 2008 through 12 April 2008
ER -