Article Information

  • Title: A Case-Based Recognition of Semantic Structures in HTML Documents Which Constitutes a Document Series Toward Automated Transformation from HTML to XML
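  • Authors: Masayuki Umehara ; Koji Iwanuma ; Hidetomo Nabashima
  • Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
  • Print ISSN: 1346-0714
  • Electronic ISSN: 1346-8030
  • Publication year: 2002
  • Volume: 17
  • Issue: 6
  • Pages: 690-698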
  • DOI:10.1527/tjsai.17.690
  • Publisher: The Japanese Society for Artificial Intelligence
  • Abstract: Recognizing and extracting semantic/logical structures in HTML documents is an important and difficult task for intelligent document processing. In this paper, we show that the alignment technology is an appropriate tool, within a framework of case-based reasoning, for recognizing the semantic structures inherently embedded in a series of HTML documents. That is, given a series of HTML documents and an example document whose semantic structures are explicitly indicated by a user, the alignment can identify the semantic structures in the HTML document series by matching the text-block sequence in each HTML document against the text-block sequence in the example document. Several important properties of text documents, such as the continuity and sequentiality of texts, can be treated by the alignment in a quite natural way.

    The alignment technology can significantly improve the capability of the case-based transformation method, which transforms a spatial and/or temporal series of HTML documents into machine-readable XML formats. Moreover, the alignment dramatically eases the construction of transformation examples. Through an experimental evaluation on 47 pages from 8 series of HTML documents, we show that the case-based method using the alignment achieved a highly accurate transformation into XML formats.

  • 关键词:alignment ; case-based transformation ; semantic structure ; HTML ; XML
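
The matching step described in the abstract, aligning the text-block sequence of each page against the labelled text-block sequence of an example page, can be illustrated with a small dynamic-programming sketch. This is not the authors' algorithm: the abstract does not give the scoring scheme, so the token-overlap similarity, the `MIN_OVERLAP` threshold, the zero gap cost, and the names `pair_score` and `align_labels` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed threshold: a pair of blocks only earns a positive score when
# its token overlap (Jaccard index) exceeds this value.
MIN_OVERLAP = 0.1


def pair_score(a: str, b: str) -> float:
    """Hypothetical block similarity: Jaccard overlap of lowercase tokens minus MIN_OVERLAP."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return -MIN_OVERLAP
    return len(ta & tb) / len(ta | tb) - MIN_OVERLAP


def align_labels(example: List[Tuple[str, str]], target: List[str]) -> List[Optional[str]]:
    """Align target blocks to (label, text) example blocks in order.

    Returns one label per target block, or None where the block is left unmatched.
    Unmatched blocks cost nothing in this sketch (zero gap penalty).
    """
    n, m = len(example), len(target)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(example[i - 1][1], target[j - 1]),
                score[i - 1][j],  # skip an example block
                score[i][j - 1],  # skip a target block
            )

    # Trace back from the bottom-right cell, transferring labels on matched pairs.
    labels: List[Optional[str]] = [None] * m
    i, j = n, m
    while i > 0 and j > 0:
        s = pair_score(example[i - 1][1], target[j - 1])
        if s > 0 and score[i][j] == score[i - 1][j - 1] + s:
            labels[j - 1] = example[i - 1][0]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return labels


# Made-up data: one labelled example page and one unlabelled page from the same series.
example_page = [
    ("title", "Seminar title: Logic Programming"),
    ("date", "Date: 1 May 2002"),
    ("speaker", "Speaker: Taro Yamada"),
]
target_page = [
    "Seminar title: Case-Based Reasoning",
    "Sponsored by the AI lab",
    "Date: 3 June 2002",
    "Speaker: Hanako Sato",
]
print(align_labels(example_page, target_page))
```

Because the alignment preserves order, a block that appears only in the target page (here the sponsor line) is skipped without disturbing the labels of the blocks around it, which is one way to read the abstract's remark about continuity and sequentiality. The transferred labels could then serve as XML element names in a later transformation step.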