Journal: Advances in Science and Technology Research Journal
Print ISSN: 2080-4075
Electronic ISSN: 2299-8624
Publication year: 2020
Volume: 14
Issue: 3
Pages: 30-38
DOI: 10.12913/22998624/122567
Language: English
Publisher: Society of Polish Mechanical Engineers and Technicians
Abstract: In the modern world, fast and efficient processing of non-digital (handwritten or typed) texts is a task of extreme importance. Like many other fields, optical character recognition (OCR) benefits from the application of machine learning (ML), which allows effective and accurate methods to be developed. To achieve good performance, a machine learning algorithm requires a great amount of data. Nowadays, a large database of handwritten characters prepared by the National Institute of Standards and Technology (NIST), USA, can be used for training an ML model. However, significant differences exist between the manners of handwriting in the US and Poland. That fact, along with the absence of Polish diacritical marks, makes the NIST database less useful for developing an OCR model for the Polish language. To the best of the authors’ knowledge, no database with samples of Polish handwriting exists. The present research is focused on filling this gap, i.e. gathering and preparing an extensive database of Polish handwritten characters. The paper presents the very first database of Polish handwriting samples. The database is far larger than any of the datasets used in previous attempts to implement OCR for Polish handwriting. It is also the first fully publicly accessible database of Polish handwriting of this scale. The same method and the developed tools can be used to build handwritten character databases for other languages.
Keywords: OCR; Handwriting character samples; Database for optical character recognition; Polish handwritten characters database