
Article Information

  • Title: Building Web Page Collections Efficiently Exploiting Local Surrounding Pages
  • Authors: Yuxin Wang, Keizo Oyama
  • Journal: Progress in Informatics
  • Print ISSN: 1349-8614
  • Online ISSN: 1349-8606
  • Publication year: 2009
  • Issue: 06
  • DOI: 10.2201/NiiPi.2009.6.4
  • Publisher: National Institute of Informatics
  • Abstract:

    This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher's homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure concept that is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering all potential researchers' homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure were used for merging the keywords from the surrounding pages. Although a lot of noise pages are included if we use keywords in the surrounding pages without considering the page group structure, the experimental results show that our method can reduce the increase of noise pages to an allowable level and can gather a significant number of the positive pages that could not be gathered using a single-page-based method. For the second process, we propose composing a three-grade classifier using two base classifiers: precision-assured and recall-assured. It classifies the input to assured positive, assured negative, and uncertain pages, where the uncertain pages need a manual assessment, so that the collection quality required by an application can be assured. Each of the base classifiers is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages and the EC classifies the entry pages using information from both the entry page and the likely component pages. An evident performance improvement of the base classifiers by the introduction of the SC is shown through experiments. Then, the reduction of the number of uncertain pages is evaluated and the effectiveness of the proposed method is shown.

  • Keywords: Web page collections; page group model; logical page group structure; three-grade classifier; quality assurance; precision and recall
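
The three-grade classification described in the abstract can be illustrated with a minimal sketch. It is one plausible reading of how a precision-assured and a recall-assured base classifier might be combined, not the authors' implementation. The function `three_grade_classify`, the keyword list, and the thresholds are all hypothetical; the toy keyword-count classifiers merely stand in for the paper's SC/EC-based classifiers, which the abstract does not specify in detail.

```python
from typing import Callable

# Hypothetical keyword list standing in for the paper's property-based lists.
KEYWORDS = ("research", "publications", "professor")


def keyword_hits(page: str) -> int:
    """Count how many of the toy keywords occur in the page text."""
    text = page.lower()
    return sum(1 for keyword in KEYWORDS if keyword in text)


def precision_assured(page: str) -> bool:
    """Strict toy classifier: accepts only when every keyword is present."""
    return keyword_hits(page) >= 3


def recall_assured(page: str) -> bool:
    """Lenient toy classifier: accepts when any keyword is present."""
    return keyword_hits(page) >= 1


def three_grade_classify(
    page: str,
    strict: Callable[[str], bool] = precision_assured,
    lenient: Callable[[str], bool] = recall_assured,
) -> str:
    """Assumed combination rule: pages accepted by the strict classifier are
    assured positive, pages rejected by the lenient classifier are assured
    negative, and everything in between is left for manual assessment."""
    if strict(page):
        return "assured positive"
    if not lenient(page):
        return "assured negative"
    return "uncertain"
```

Under this reading, only the pages that fall between the two classifiers need manual assessment, which is where the reduction in assessment cost would come from.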