Basic Article Information

  • Title: AUTOMATIC DOCUMENT STRUCTURE ANALYSIS OF STRUCTURED PDF FILES
  • Authors: Rosmayati Mohemad; Abdul Razak Hamdan; Zulaiha Ali Othman
  • Journal: International Journal of New Computer Architectures and their Applications
  • Print ISSN: 2220-9085
  • Year of publication: 2011
  • Volume: 1
  • Issue: 2
  • Pages: 404-411
  • Publisher: Society of Digital Information and Wireless Communications
  • Abstract: Portable Document Format (PDF) is the most convenient way to publish information because it is independent of the operating system. However, the information in a PDF document is unstructured and suited only to human readers. In addition, PDF files have a non-tagged internal structure, which makes extraction difficult. Automatically analyzing and recognizing the structure of PDF documents in detail, especially paragraph and tabular areas, is vital for precisely extracting relevant information for use in other domain applications. The motivation of this study is to support knowledge extraction and to exploit the actual semantics of the content to improve further analysis. This paper proposes an intelligent approach that automatically identifies and recognizes the layout and structure of PDF documents together with their text, and then structures the extracted information into an ontology-based representation. An experimental study was conducted on a collection of construction tender documents in PDF to test the performance of the proposed approach. The precision, recall, and F-measure values showed significant results for the detection of tabular and paragraph structures.
  • Keywords: Document Analysis; Information Extraction; Portable Document Format