Article Information

Title: AUTOMATIC DOCUMENT STRUCTURE ANALYSIS OF STRUCTURED PDF FILES
Authors: Rosmayati Mohemad; Abdul Razak Hamdan; Zulaiha Ali Othman et al.
Journal: International Journal of New Computer Architectures and their Applications
Print ISSN: 2220-9085
Publication year: 2011
Volume: 1
Issue: 2
Pages: 404-411
Publisher: Society of Digital Information and Wireless Communications
Abstract: Portable Document Format (PDF) is the most convenient way to publish information because it is operating-system independent. However, the information in a PDF document is unstructured and suited only to human readers. In addition, PDF files have a non-tagged internal structure, which makes the extraction task difficult. Automatically analyzing and recognizing PDF document structures in detail, especially paragraph and tabular areas, is vital for precisely extracting relevant information for use in other domain applications. The motivation of this study is to support knowledge extraction and to exploit the actual semantics of the extracted content to improve further analysis. This paper proposes an intelligent approach that automatically identifies and recognizes the layout and structure of PDF documents together with their text, and then structures the extracted information into an ontology-based representation. An experimental study was conducted on a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall, and F-measure scores show significant results in detecting tabular and paragraph structures.
Keywords: Document Analysis; Information Extraction; Portable Document Format