Approximate clustering on distributed data streams

Zhang Qi, Liu Jinze, Wang Wei

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

31 Scopus citations

Abstract

We Investigate the problem of clustering on distributed data streams. In particular, we consider the k-median clustering on stream data arriving at distributed sites which communicate through a routing tree. Distributed clustering on high speed data streams is a challenging task due to limited communication capacity, storage space, and computing power at each site. In this paper, we propose a suite of algorithms for computing (1 + ε)-approximate k-medlan clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per node transmission to polylog N (opposed to Ω(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data. In practice, our algorithms are able to reduce the data transmission to a small fraction of the original data. Moreover, our results Indicate that the algorithms are scalable with respect to the data volume, approximation factor, and the number of sites.

Original languageEnglish
Title of host publicationProceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
Pages1131-1139
Number of pages9
DOIs
StatePublished - 2008
Event2008 IEEE 24th International Conference on Data Engineering, ICDE'08 - Cancun, Mexico
Duration: Apr 7 2008Apr 12 2008

Publication series

NameProceedings - International Conference on Data Engineering
ISSN (Print)1084-4627

Conference

Conference2008 IEEE 24th International Conference on Data Engineering, ICDE'08
Country/TerritoryMexico
CityCancun
Period4/7/084/12/08

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Fingerprint

Dive into the research topics of 'Approximate clustering on distributed data streams'. Together they form a unique fingerprint.

Cite this