
Article Information

  • Title: Combining Page Group Structure and Content for Roughly Filtering Researchers' Homepages with High Recall
  • Authors: Yuxin Wang; Keizo Oyama
  • Journal: Information and Media Technologies
  • Electronic ISSN: 1881-0896
  • Publication year: 2006
  • Volume: 1
  • Issue: 2
  • Pages: 1060-1072
  • DOI: 10.11185/imt.1.1060
  • Publisher: Information and Media Technologies Editorial Board
  • Abstract: This paper proposes a method for gathering researchers' homepages (or entry pages) by applying new, simple, and effective page group models that exploit the mutual relations between the structure and content of a page group, with the aim of narrowing down the candidates at very high recall. First, 12 property-based keyword lists that correspond to researchers' common properties are created, and each is classified as either organization-related or other. Next, several page group models (PGMs) are introduced that take the link structure and URL hierarchy into consideration. Because applying PGMs generally introduces a lot of noise, modified PGMs with two original techniques are introduced to reduce it. Then, based on the PGMs, the keywords are propagated to a potential entry page from its surrounding pages, composing a virtual entry page. Finally, the virtual entry pages whose scores reach a given threshold are selected. The effectiveness of the method is shown through experiments that compare it with a single-page-based method, using a 100 GB web data set and a manually created sample data set.
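
The keyword propagation and thresholding steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the contents of the 12 keyword lists, the exact PGM definitions, the two noise-reduction techniques, or the scoring function. Here the keyword lists, the page group rule (pages linked from the candidate entry page that sit at or below its URL directory), the score (number of distinct keyword lists matched), and all function names are assumptions chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical keyword lists; the paper uses 12 lists whose contents
# are not given in the abstract.
KEYWORD_LISTS = {
    "position": {"professor", "lecturer"},
    "publication": {"publications", "papers"},
    "organization": {"university", "laboratory"},
}


def matched_lists(text):
    """Return the names of keyword lists that have a keyword in the text."""
    words = set(text.lower().split())
    return {name for name, kws in KEYWORD_LISTS.items() if words & kws}


def directory_of(url):
    """Return the host plus the directory part of the URL path."""
    parsed = urlparse(url)
    return parsed.netloc + parsed.path.rsplit("/", 1)[0] + "/"


def page_group(entry_url, pages, links):
    """Assumed page group model: the entry page plus the pages it links
    to that lie at or below its own URL directory."""
    base = directory_of(entry_url)
    group = {entry_url}
    for target in links.get(entry_url, ()):
        if target in pages and directory_of(target).startswith(base):
            group.add(target)
    return group


def virtual_entry_score(entry_url, pages, links):
    """Propagate keyword matches from the group to the entry page and
    score the resulting virtual entry page (assumed: distinct lists hit)."""
    hit = set()
    for url in page_group(entry_url, pages, links):
        hit |= matched_lists(pages[url])
    return len(hit)


def select_candidates(pages, links, threshold):
    """Keep every page whose virtual entry page reaches the threshold."""
    return sorted(
        url for url in pages
        if virtual_entry_score(url, pages, links) >= threshold
    )


# Toy data: an entry page whose own text matches only one keyword list.
pages = {
    "http://example.org/~alice/": "Alice Professor",
    "http://example.org/~alice/pubs.html": "Publications list",
    "http://example.org/~alice/lab.html": "Our laboratory",
    "http://example.org/news.html": "campus news",
}
links = {
    "http://example.org/~alice/": [
        "http://example.org/~alice/pubs.html",
        "http://example.org/~alice/lab.html",
        "http://example.org/news.html",
    ],
}
selected = select_candidates(pages, links, threshold=2)
```

In this toy data the entry page alone matches a single keyword list, so a single-page filter with the same threshold would drop it. After propagation from the two pages in its group, the virtual entry page matches three lists and is kept, which is the recall benefit the abstract attributes to page groups.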