
Article Information

  • Title: Scripts and Numerals Identification From Printed Multilingual Document Images
  • Authors: Abirami S.; Murugappan S.
  • Journal: Computer Science & Information Technology
  • Electronic ISSN: 2231-5403
  • Year: 2011
  • Volume: 1
  • Issue: 3
  • Pages: 129-146
  • DOI: 10.5121/csit.2011.1312
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: Script identification in multi-script documents is an important step in designing an OCR system for successful analysis and recognition. Most optical character recognition (OCR) systems can recognize at most a few scripts, so large archives of document images containing different scripts need a way to categorize the documents automatically before the appropriate OCR is applied to them. Much work has already been reported in this area. In the Indian context, although some results have been reported, the task is still in its infancy. This paper presents research on identifying Tamil, English and Hindi scripts at the word level, irrespective of font face and size. It also identifies English numerals in multilingual document images. The proposed technique uses a document vectorization method that generates vectors from nine zones segmented over the characters, based on their shape, density and transition features. The script is then determined by rule-based classifiers and their sub-classifiers, which contain sets of classification rules derived from the vectors. The proposed system identifies scripts in document images even when they suffer from noise and other kinds of distortion. Results from experiments, simulations, and human visual inspection indicate that the proposed technique identifies scripts and numerals with minimal pre-processing and high accuracy. In future, the approach could be extended to other scripts.
  • Keywords: Document Images; Script Recognition; Classification; Document Image Understanding
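
The abstract describes vectors built from nine zones using density and transition features, followed by rule-based classification. The record gives no further detail, so the following is only a minimal sketch of that general idea, not the authors' method. It assumes a binarized glyph image given as a list of rows of 0/1 values, a 3 x 3 zone grid, and a single hypothetical rule (`has_top_bar`) that checks for a dense top band such as the headline stroke common in Devanagari text. The function names, the grid layout, and the 0.5 threshold are all illustrative assumptions.

```python
def zone_features(glyph, rows=3, cols=3):
    """Split a binary glyph (list of equal-length 0/1 rows) into
    rows x cols zones and return one (density, transitions) pair
    per zone, in row-major order.

    density     = fraction of ink pixels (value 1) in the zone
    transitions = number of 0<->1 changes along the zone's rows
    """
    height = len(glyph)
    width = len(glyph[0])
    features = []
    for r in range(rows):
        r0, r1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            c0, c1 = c * width // cols, (c + 1) * width // cols
            zone = [row[c0:c1] for row in glyph[r0:r1]]
            area = sum(len(row) for row in zone)
            ink = sum(sum(row) for row in zone)
            density = ink / area if area else 0.0
            transitions = sum(
                1
                for row in zone
                for a, b in zip(row, row[1:])
                if a != b
            )
            features.append((density, transitions))
    return features


def has_top_bar(features, cols=3, threshold=0.5):
    """Hypothetical rule (an assumption, not taken from the paper):
    every zone in the top band has ink density >= threshold."""
    return all(features[i][0] >= threshold for i in range(cols))


# Illustrative glyphs: a horizontal bar on the top row, and a vertical stripe.
glyph_bar = [[1] * 6] + [[0] * 6 for _ in range(5)]
glyph_stripe = [[0, 1, 0, 0, 0, 0] for _ in range(6)]

bar_features = zone_features(glyph_bar)
stripe_features = zone_features(glyph_stripe)
```

A real rule set would combine many such zone-level tests into classifiers and sub-classifiers; the single rule here only shows how a rule can be phrased over the zone vector.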