
Basic article information

  • Title: Automatic Table Recognition and Extraction from Heterogeneous Documents
  • Authors: Florence Folake Babatunde; Bolanle Adefowoke Ojokoh; Samuel Adebayo Oluwadare
  • Journal: Journal of Computer and Communications
  • Print ISSN: 2327-5219
  • Online ISSN: 2327-5227
  • Publication year: 2015
  • Volume: 03
  • Issue: 12
  • Pages: 100-110
  • DOI: 10.4236/jcc.2015.312009
  • Language: English
  • Publisher: Scientific Research Publishing
  • Abstract: This paper examines the automatic recognition and extraction of tables from a large collection of heterogeneous documents. The documents are first pre-processed and converted to HTML, after which an algorithm recognises the table portions of the documents. A Hidden Markov Model (HMM) is then applied to the HTML code to extract the tables. The model was trained and tested on 526 self-generated tables: 321 for training and 205 for testing. The Viterbi algorithm was implemented for the testing phase. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating that the method is effective for table extraction.
  • Keywords: Hidden Markov Model; Table Recognition and Extraction; Hypertext Markup Language; Heterogeneous Documents
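
The abstract mentions applying an HMM to HTML code and decoding with the Viterbi algorithm. As a rough illustration only, the sketch below shows generic Viterbi decoding over a sequence of HTML tag tokens. The two states ("other", "table"), the token vocabulary, and all probabilities are hypothetical values chosen for the example; they are not taken from the paper, whose actual state set and trained parameters are not given in this record.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a list of observations."""

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    # back[t][s]: predecessor of state s on that best path
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: label each HTML tag token as table or non-table content.
states = ["other", "table"]
start_p = {"other": 0.8, "table": 0.2}
trans_p = {
    "other": {"other": 0.7, "table": 0.3},
    "table": {"other": 0.3, "table": 0.7},
}
emit_p = {
    "other": {"p": 0.8, "tr": 0.1, "td": 0.1},
    "table": {"p": 0.1, "tr": 0.4, "td": 0.5},
}

tokens = ["p", "tr", "td", "td", "p"]
path = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, path)))
```

With these made-up parameters, the decoder labels the `tr` and `td` tokens as "table" and the surrounding `p` tokens as "other". In a trained system the probabilities would instead be estimated from labelled training tables.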