Abstract
Clustering web documents is an important step in many web information retrieval systems. As the Internet grows rapidly and the number of information requests increases exponentially, the use of parallel computing techniques in large-scale web document retrieval is unavoidable. We propose a parallel hybrid web document clustering algorithm that combines the Principal Direction Divisive Partitioning (PDDP) algorithm with the K-means algorithm. Computational experiments were conducted to test the performance of the hybrid algorithm on three real-life web document datasets, and the results were compared with those of the parallel PDDP algorithm and the parallel K-means algorithm. The experiments show that the clustering solutions obtained from the hybrid algorithm are of better quality than those from parallel PDDP or parallel K-means. The parallel run time of the hybrid algorithm is similar to, and sometimes less than, that of the widely used K-means algorithm.
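The abstract does not say how the two algorithms are combined. One plausible reading, offered here only as an assumption, is that PDDP supplies the initial centroids and K-means then refines them. The sketch below illustrates that reading in serial, pure Python. It is not the authors' parallel implementation: the function names (`pddp`, `kmeans`, `hybrid_cluster`), the Euclidean distance, the largest-scatter splitting rule, and the power-iteration settings are all illustrative choices.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sqdist(a, b):
    """Squared Euclidean distance (an assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter(vectors):
    """Total squared distance of the vectors to their centroid."""
    c = mean(vectors)
    return sum(sqdist(v, c) for v in vectors)


def principal_direction(vectors, iters=100):
    """Centroid and leading principal direction, found by power iteration."""
    c = mean(vectors)
    centered = [[x - m for x, m in zip(v, c)] for v in vectors]
    d = len(c)
    # Fixed asymmetric start; it fails only if orthogonal to the true direction.
    u = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    for _ in range(iters):
        # w = A^T (A u), where A is the centered data matrix.
        proj = [sum(a * b for a, b in zip(row, u)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        u = [x / norm for x in w]
    return c, u


def pddp(vectors, k):
    """Divisive step: split the cluster with the largest scatter until k exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        idx = max(range(len(clusters)),
                  key=lambda i: scatter([vectors[j] for j in clusters[i]]))
        members = clusters.pop(idx)
        c, u = principal_direction([vectors[j] for j in members])
        side = {j: sum((x - m) * ui for x, m, ui in zip(vectors[j], c, u))
                for j in members}
        left = [j for j in members if side[j] <= 0]
        right = [j for j in members if side[j] > 0]
        if not left or not right:  # cannot split further
            clusters.append(members)
            break
        clusters += [left, right]
    return clusters


def kmeans(vectors, centroids, iters=20):
    """Standard K-means refinement starting from the given centroids."""
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda c: sqdist(v, centroids[c]))
                  for v in vectors]
        updated = []
        for c in range(len(centroids)):
            pts = [v for v, lab in zip(vectors, labels) if lab == c]
            updated.append(mean(pts) if pts else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def hybrid_cluster(vectors, k):
    """Assumed hybrid: PDDP centroids seed the K-means refinement."""
    seeds = [mean([vectors[j] for j in c]) for c in pddp(vectors, k)]
    return kmeans(vectors, seeds)
```

Calling `hybrid_cluster(vectors, k)` on a list of numeric document vectors returns one cluster label per document together with the final centroids. Under this reading, seeding K-means from a deterministic divisive partition removes its dependence on random initial centroids.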
Original language | English |
---|---|
Pages (from-to) | 117-131 |
Number of pages | 15 |
Journal | Journal of Supercomputing |
Volume | 30 |
Issue number | 2 |
State | Published - Nov 2004 |
Bibliographical note
Funding Information: The research work of S. Xu was supported in part by the U.S. National Science Foundation under grant CCR-0092532. The research work of J. Zhang was supported in part by the U.S. National Science Foundation under grants CCR-9988165, CCR-0092532, and ACR-0202934, by the U.S. Department of Energy Office of Science under grant DE-FG02-02ER45961, by the Kentucky Science & Engineering Foundation under grant KSEF-02-264-RED-002, by the Japanese Research Organization for Information Science & Technology, and by the University of Kentucky Research Committee.
Keywords
- Information retrieval
- K-means
- PDDP
- Parallel document clustering
ASJC Scopus subject areas
- Theoretical Computer Science
- Software
- Information Systems
- Hardware and Architecture