Basic article information

Title: A Case-Based Recognition of Semantic Structures in HTML Documents Which Constitutes a Document Series Toward Automated Transformation from HTML to XML
Authors: Masayuki Umehara; Koji Iwanuma; Hidetomo Nabashima et al.
Journal: 人工知能学会論文誌
Print ISSN: 1346-0714
Online ISSN: 1346-8030
Publication year: 2002
Volume: 17
Issue: 6
Pages: 690-698
DOI: 10.1527/tjsai.17.690
Publisher: The Japanese Society for Artificial Intelligence
Abstract: The recognition and extraction of semantic/logical structures in HTML documents are important and difficult tasks in intelligent document processing. In this paper, we show that alignment technology is an appropriate tool, within a framework of case-based reasoning, for recognizing the semantic structures inherently embedded in a series of HTML documents. That is, given a series of HTML documents and an example document whose semantic structures are explicitly indicated by a user, the alignment can identify the semantic structures in the HTML document series by matching the text-block sequence of each HTML document against the text-block sequence of the example document. Several important properties of text documents, such as the continuity and sequentiality of texts, can be handled by the alignment in a natural way.
The alignment technology can significantly improve the capability of the case-based transformation method, which transforms a spatial and/or temporal series of HTML documents into machine-readable XML formats. Moreover, the alignment greatly eases the construction of transformation examples. Through an experimental evaluation on 47 pages from 8 series of HTML documents, we show that the case-based method using the alignment achieved highly accurate transformation into XML formats.
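The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify a scoring scheme, so the sketch assumes a Needleman-Wunsch-style global alignment, a character-overlap similarity from Python's `difflib`, and a fixed gap penalty. The names `similarity`, `align_labels`, and `GAP` are hypothetical.

```python
from difflib import SequenceMatcher

GAP = -1.0  # assumed penalty for leaving a text block unaligned


def similarity(a: str, b: str) -> float:
    """Assumed block similarity in [-1, 1], derived from character overlap."""
    return 2.0 * SequenceMatcher(None, a, b).ratio() - 1.0


def align_labels(example, target):
    """Globally align a labelled example sequence with a target sequence.

    example: list of (text, label) pairs from the user-annotated document.
    target:  list of text blocks from another document in the same series.
    Returns a list of (text, label_or_None) pairs, one per target block.
    """
    n, m = len(example), len(target)
    # score[i][j] is the best score for aligning example[:i] with target[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(example[i - 1][0], target[j - 1]),
                score[i - 1][j] + GAP,  # example block has no counterpart
                score[i][j - 1] + GAP,  # target block has no counterpart
            )
    # Trace back from the end, copying labels along matched pairs only.
    labels = [None] * m
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + similarity(example[i - 1][0], target[j - 1])
        if score[i][j] == diag:
            labels[j - 1] = example[i - 1][1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    return list(zip(target, labels))


# Hypothetical example data: one annotated page and one unannotated page.
example = [
    ("Title: Weather report for Monday", "title"),
    ("Date: 2002-01-07", "date"),
    ("Sunny with light wind in the morning.", "body"),
]
target = [
    "Title: Weather report for Tuesday",
    "Date: 2002-01-08",
    "Cloudy with light rain in the evening.",
]
labelled = align_labels(example, target)
```

Because the alignment is monotone, labels are transferred in document order, which is one way to respect the sequentiality of text blocks mentioned in the abstract. Each labelled block could then be wrapped in an XML element named after its label.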
Keywords: alignment; case-based transformation; semantic structure; HTML; XML