Structural analysis and classification of search interfaces for the deep web

Vasilis Kolias, Ioannis Anagnostopoulos, Sherali Zeadally

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

The Web has been identified to consist of a large portion of content that cannot be crawled by general-purpose search engines because it is only generated after a valid submission to a search interface. Accessing such content, however, requires the location and identification of search interfaces. Towards the automation of this task, many approaches have been proposed that involve the manual definition of rules for the identification of query interfaces. In this paper, we propose a rule induction approach to automatically construct a set of rules by searching the most promising subspace of all possible rules with a brute-force method and information theoretic criteria. To specify the features for the rules, we initially make a descriptive analysis of Yahoo L11, a specialized dataset containing complex interfaces, which to the best of our knowledge has not been used in previous works. We perform a series of evaluations and present the rules constructed by running the algorithm on a random sample of the Yahoo L11 dataset and another dataset used in similar works. The resulting rules yield high classification accuracy in predicting the functionality of new, previously unseen forms and since humans can easily interpret them, they can be easily ported to any application as-is.
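The abstract's idea of searching a space of candidate rules with an information-theoretic criterion can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes information gain as the scoring criterion, restricts the search to single-condition rules of the form "feature >= threshold", and uses hypothetical feature names (`text_inputs`, `password_inputs`, `search_keyword`) and toy data invented purely for illustration.

```python
from math import log2


def entropy(labels):
    """Binary Shannon entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting on `feature >= threshold`."""
    covered = [y for x, y in zip(rows, labels) if x[feature] >= threshold]
    rest = [y for x, y in zip(rows, labels) if x[feature] < threshold]
    n = len(labels)
    remainder = (len(covered) / n) * entropy(covered) + (len(rest) / n) * entropy(rest)
    return entropy(labels) - remainder


def best_rule(rows, labels):
    """Exhaustively score every (feature, observed value) pair; keep the best."""
    candidates = sorted({(f, x[f]) for x in rows for f in x})
    return max(candidates, key=lambda c: information_gain(rows, labels, *c))


# Toy forms described by hypothetical structural features (assumed, not from the paper).
forms = [
    {"text_inputs": 1, "password_inputs": 0, "search_keyword": 1},
    {"text_inputs": 2, "password_inputs": 1, "search_keyword": 0},
    {"text_inputs": 1, "password_inputs": 0, "search_keyword": 0},
    {"text_inputs": 1, "password_inputs": 1, "search_keyword": 1},
]
labels = [1, 0, 1, 0]  # 1 = search interface, 0 = other form (made-up labels)

feature, threshold = best_rule(forms, labels)
print(f"split on {feature} >= {threshold}")
```

On this toy data the split with the highest gain is the presence of a password field, which perfectly separates the two classes. A rule of this shape stays human-readable, which is the property the abstract highlights.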

Original language: English
Pages (from-to): 386-398
Number of pages: 13
Journal: Computer Journal
Volume: 61
Issue number: 3
State: Published - Mar 1 2018

Bibliographical note

Publisher Copyright:
© 2018 The British Computer Society. All rights reserved.

Keywords

  • deep web
  • rule induction
  • search interfaces

ASJC Scopus subject areas

  • General Computer Science
