A parallel hybrid web document clustering algorithm and its performance study

Shuting Xu, Jun Zhang

Research output: Contribution to journal › Article › peer-review

25 Scopus citations

Abstract

Clustering web documents is an important procedure in many web information retrieval systems. As the size of the Internet grows rapidly and the number of information requests increases exponentially, the use of parallel computing techniques in large-scale web document retrieval is unavoidable. We propose a parallel hybrid web document clustering algorithm that combines the Principal Direction Divisive Partitioning (PDDP) algorithm with the K-means algorithm. Computational experiments were conducted to test the performance of the hybrid algorithm on three real-life web document datasets, and the results were compared with those of the parallel PDDP algorithm and the parallel K-means algorithm. The experiments show that the quality of the clustering solutions obtained from the hybrid algorithm is better than that from the parallel PDDP or the parallel K-means. The parallel run time of the hybrid algorithm is similar to, and sometimes less than, that of the widely used K-means algorithm.
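The abstract does not spell out how PDDP and K-means are combined, so the following is only a minimal serial sketch under one common assumption: PDDP produces an initial partition, and its cluster centroids seed a K-means refinement. The function names (`pddp`, `kmeans`, `hybrid_cluster`), the scatter-based choice of which cluster to split, and the iteration cap are all illustrative choices, not details taken from the paper, and no parallelization is attempted.

```python
import numpy as np


def _scatter(C):
    # Total squared distance of the rows of C from their centroid.
    return ((C - C.mean(axis=0)) ** 2).sum() if len(C) > 1 else 0.0


def pddp(X, k):
    """Divide the rows of X into k clusters by repeated principal-direction splits."""
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, k):
        # Assumption: split the cluster with the largest scatter next.
        target = int(np.argmax([_scatter(X[labels == c]) for c in range(new_label)]))
        idx = np.flatnonzero(labels == target)
        centered = X[idx] - X[idx].mean(axis=0)
        # The leading right singular vector is the cluster's principal direction.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[0]
        # Points on the positive side of the centroid form the new cluster.
        labels[idx[proj > 0]] = new_label
    return labels


def kmeans(X, centroids, n_iter=20):
    """Standard Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old centroid if a cluster loses all of its points.
        updated = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


def hybrid_cluster(X, k):
    """Assumed hybrid: PDDP partition first, then K-means seeded by its centroids."""
    init = pddp(X, k)
    # Assumes X holds at least k distinct points, so no PDDP cluster is empty.
    centroids = np.array([X[init == c].mean(axis=0) for c in range(k)])
    return kmeans(X, centroids)
```

Here `X` is expected to be a dense float array with one row per document (for example, term-weight vectors). Seeding K-means this way makes the result deterministic, whereas randomly initialized K-means can vary from run to run.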

Original language: English
Pages (from-to): 117-131
Number of pages: 15
Journal: Journal of Supercomputing
Volume: 30
Issue number: 2
State: Published - Nov 2004

Bibliographical note

Funding Information:
∗The research work of S. Xu was supported in part by the U.S. National Science Foundation under grant CCR-0092532. †The research work of J. Zhang was supported in part by the U.S. National Science Foundation under grants CCR-9988165, CCR-0092532, and ACR-0202934, by the U.S. Department of Energy Office of Science under grant DE-FG02-02ER45961, by the Kentucky Science & Engineering Foundation under grant KSEF-02-264-RED-002, by the Japanese Research Organization for Information Science & Technology, and by the University of Kentucky Research Committee.

Keywords

  • Information retrieval
  • K-means
  • PDDP
  • Parallel document clustering

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Information Systems
  • Hardware and Architecture
