Article Information

  • Title: Region Based Data Extraction
  • Authors: Pui Leng Goh; Jer Lang Hong; Ee Xion Tan
  • Journal: Communications of the IBIMA
  • Electronic ISSN: 1943-7765
  • Publication year: 2013
  • Volume: 2013
  • DOI: 10.5171/2013.743515
  • Publisher: IBIMA Publishing
  • Abstract: Wrappers are tools that extract relevant information from HTML pages. Current approaches rely on the DOM tree, visual cues, or ontologies. DOM-tree-based techniques are fast and simple, but they are less accurate than visual-based wrappers because they lack the additional information needed for data extraction. Visual-based wrappers, on the other hand, are slow because of the extra processing needed to obtain visual cues from the underlying browser rendering engine. Ontology-based wrappers are accurate, but they are also slow and require manual tuning. In this paper, we propose a novel visual-based wrapper for extracting information from HTML pages. Our wrapper uses visual cues to eliminate unnecessary regions, which reduces the running time of the extraction task because only the relevant region needs to be considered. It then uses visual cues to remove irrelevant data from that region. Experimental results show that our wrapper outperforms the state-of-the-art wrapper WISH in data extraction.
  • Keywords: Automatic Wrapper; Search Engines; Deep Web
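
The two-stage idea in the abstract (first discard irrelevant regions, then discard irrelevant data inside the remaining region) can be illustrated with a minimal sketch. The abstract does not say which visual cues the wrapper uses, so everything below is an assumption made for illustration only: the `Box`, `Record`, and `Region` types, the "largest rendered area" rule for picking the relevant region, and the "width close to the median width" rule for filtering records are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Box:
    """Rendered bounding box of an element, in pixels (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


@dataclass
class Record:
    """A candidate data record inside a region (hypothetical type)."""
    text: str
    box: Box


@dataclass
class Region:
    """A rendered page region and the candidate records it contains."""
    name: str
    box: Box
    records: list[Record] = field(default_factory=list)


def select_relevant_region(regions: list[Region]) -> Region:
    # Stage 1 (assumed cue): treat the region with the largest rendered
    # area as the relevant one. Raises ValueError if `regions` is empty.
    return max(regions, key=lambda region: region.box.area)


def filter_records(region: Region, tolerance: float = 0.2) -> list[Record]:
    # Stage 2 (assumed cue): keep records whose rendered width is within
    # `tolerance` (as a fraction) of the median record width in the region.
    if not region.records:
        return []
    typical = median(record.box.width for record in region.records)
    return [
        record
        for record in region.records
        if abs(record.box.width - typical) <= tolerance * typical
    ]


def extract(regions: list[Region], tolerance: float = 0.2) -> list[str]:
    # Only the selected region is examined in stage 2, which is where the
    # running-time saving described in the abstract would come from.
    relevant = select_relevant_region(regions)
    return [record.text for record in filter_records(relevant, tolerance)]


# Made-up page layout used only to exercise the sketch.
page = [
    Region("header", Box(0, 0, 1000, 80), [Record("Site logo", Box(0, 0, 200, 80))]),
    Region("results", Box(0, 80, 800, 600), [
        Record("Result A", Box(0, 80, 780, 100)),
        Record("Result B", Box(0, 180, 780, 100)),
        Record("Sponsored", Box(0, 280, 300, 40)),
        Record("Result C", Box(0, 320, 780, 100)),
    ]),
    Region("sidebar", Box(800, 80, 200, 600), [Record("Ad", Box(800, 80, 200, 200))]),
]
extracted = extract(page)
```

With this made-up layout, the header and sidebar are discarded in stage 1 because the results region has the largest area, and the narrow "Sponsored" entry is dropped in stage 2 because its width is far from the median record width. A real wrapper would obtain the bounding boxes from a browser rendering engine and would likely combine several cues.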