OCR post processing based on character pattern matching.
Boiangiu, Costin-Anton; Cananau, Dan-Cristian; Petrescu, Serban et al.
1. INTRODUCTION
The domain of digital image processing is continuously expanding,
and for this reason new solutions to different problems appear all the
time. However, each solved problem opens the way for a new possible
improvement, and so new methods are constantly being developed. Digital
image processing can be divided into a large number of subcategories
that deal with different aspects of a digital image. Some algorithms
deal with conversions from one colour scheme to another. Others deal
with the recognition of different elements found in an image. The goal
of this second type of algorithm is to simulate the human eye by
matching different patterns against already known models. The matched
patterns can vary from simple lines to more complicated characters that
contain a variety of geometrical shapes.
One of the most important areas of interest in digital image
processing is optical character recognition (OCR). There are numerous
software programs, some of them open source, that implement such OCR
engines, and each has its advantages and disadvantages. For example,
some OCR engines are oriented towards letter recognition, while others
perform better at text line recognition. Whatever their characteristics,
all OCR engines have to deal with word recognition and correction.
The result of the OCR software is usually a page or a fragment in
which the characters, words and other elements found have been correctly
associated with an understandable language. For example, a deteriorated
newspaper page written in English should be reconstructed so that it
becomes readable.
Even though there are numerous programs with different features
that deal with different aspects of the problem, there is no such thing
as a perfect OCR. Each algorithm has its own flaws. Some algorithms may
fail on a certain type of document, and combining all of them to address
the problems that arise is simply not recommended because of the huge
running time such a combined algorithm would have. Hence the necessity
for pre- and post-processing.
2. PROBLEM STATEMENT
The two types of processing solve different problems before and
after a page or a fragment is passed through the OCR engine.
Preprocessing deals with everything connected to preparing the page or
fragment for the actual processing. The methods used at this stage are
mostly oriented towards image conversions and image processing such as
smoothing or noise removal. Post-processing techniques are far more
varied and are not restricted to a single aspect of the document: some
deal with image improvement, others with word recognition, and so on.
Almost all algorithms in this second category require a dictionary or an
alphabet to be used for word correction. Some of these algorithms can
work for any language as long as a dictionary is given (Kolak &
Resnik, 2005), (Perez-Cortes et al., 2000), (Zhuang & Zhu, 2005).
Others are used only for a given language, for example German
(Wiedenhofer et al., 1995) or Chinese (Long et al., 2006). A more
complex approach could use both pre- and post-processing together with a
feedback function that assesses the final result of the two processing
steps.
The aim of the algorithm presented in this paper is to improve the
results of the OCR in the post-processing step by using both
dictionary-dependent and dictionary-independent variants.
3. THE ALGORITHM
Most of the errors produced by the OCR engine are the result of
erroneous character recognition or of badly scanned documents. Most
words are detected, but a certain percentage of them are not real words;
they are variations of real words. Variations are words that are very
similar to the original but differ by a group of characters. The
dictionary approach presented in this paper improves the character
results obtained by the actual OCR processing.
3.1 The dictionary approach
In order to remove the variations and find the correct form of each
word, in the first stage the algorithm takes every word in the document
and inserts it into a hash table, with the hash function given by the
sum of the squares of the Unicode values of the characters in the word.
The square has been chosen for optimization purposes: the hash function
assures a good spreading of the values in the hash table, which improves
the data allocation of the entire algorithm. In the same stage an input
dictionary, indexed in a similar hash table by the same insertion
procedure, is required for pattern recognition. Once this stage is
finished, the hash values of the document words are compared with those
of the dictionary. If the values match, the word is checked against the
correct form and, if that form exists, the word is removed from the hash
table because it needs no correction. At the end of this step the hash
table contains only the erroneous words.
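A minimal sketch of this indexing stage, in Python, is given below; the
table size, the function names and the dict-of-sets bucket layout are
illustrative assumptions, not the authors' implementation.

```python
TABLE_SIZE = 1 << 16  # assumed table size; the paper does not specify one

def hash_word(word):
    # Hash function from the paper: the sum of the squares of the
    # Unicode values of the characters, reduced to the table size.
    return sum(ord(ch) ** 2 for ch in word) % TABLE_SIZE

def build_table(words):
    # Bucket words by hash value; a dict of sets stands in for the
    # hash table, and collisions simply share a bucket.
    table = {}
    for w in words:
        table.setdefault(hash_word(w), set()).add(w)
    return table

def find_erroneous_words(document_words, dictionary_words):
    # A word is erroneous when its exact form is absent from the
    # dictionary bucket that has the same hash value.
    dict_table = build_table(dictionary_words)
    return {w for w in document_words
            if w not in dict_table.get(hash_word(w), set())}
```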
The next step is the actual correction. Another table holds all the
letters of the given alphabet and all possible combinations of two or
three letters; this will be called the alphabet of the language. Each
erroneous word is taken in turn, every letter or group of two or three
letters in it is replaced with a value from the alphabet, and the result
is cross-referenced with the dictionary for a match. If only one match
is found, the word is corrected with a high degree of certainty. If
several matches are found, the algorithm computes a certainty degree and
chooses the word with the highest certainty as the solution. The
certainty degree can be computed in several ways depending on the
desired output. For example, a fast but unreliable solution would be to
simply choose the first word found in the dictionary and accept it as
the correct solution. This approach is not recommended, however, because
errors are likely to occur. The certainty degree should also take
semantic information into account where possible.
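The substitution operation described above can be sketched as follows;
the alphabet, the names and the brute-force enumeration are assumptions
made for illustration, with the dictionary taken to be a set of words.

```python
from itertools import product

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed lower-case English alphabet

def substitution_candidates(word, dictionary, max_group=3):
    # Replace every group of one to three consecutive characters with
    # every same-length combination from the alphabet, and keep only
    # the results that cross-reference successfully with the dictionary.
    found = set()
    for size in range(1, max_group + 1):
        for start in range(len(word) - size + 1):
            for group in product(ALPHABET, repeat=size):
                candidate = word[:start] + "".join(group) + word[start + size:]
                if candidate != word and candidate in dictionary:
                    found.add(candidate)
    return found
```

The enumeration of 26 + 26² + 26³ groups per position is only meant to
make the operation concrete; a practical implementation would index the
dictionary instead of generating every combination.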
By using the algorithm presented so far, the OCR correction has
been improved by 60%. Further improvements can be made, however. The
operation of replacing a character or group of characters with another
character or group from the alphabet is called a substitution.
To improve the correction further, two more operations are
introduced: the addition and the subtraction. The subtraction is the
elimination of one character from the word, followed by a
cross-reference with the dictionary. The addition inserts one character
from the alphabet into the word, again followed by a cross-reference
with the dictionary. This is done for each position in the word in turn
and for each character in the alphabet. Using all three operations makes
a large variety of choices available for one word, so the certainty
degree now has to be calculated taking into account that there are three
different operations instead of just one. Again, a fast but less
reliable approach would be to compute the degree by assigning a higher
value to simple operations, with the highest for substitution, and a
lower value to combined operations. An algorithm that takes semantic
information into account is, once more, recommended.
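The two extra operations and the simple weighting just described can be
sketched as below, reusing ALPHABET and substitution_candidates from the
previous sketch; the weight values are illustrative assumptions.

```python
def subtraction_candidates(word, dictionary):
    # Subtraction: eliminate one character at each position in turn.
    return {word[:i] + word[i + 1:] for i in range(len(word))} & dictionary

def addition_candidates(word, dictionary):
    # Addition: insert one alphabet character at each position in turn.
    return {word[:i] + ch + word[i:]
            for i in range(len(word) + 1) for ch in ALPHABET} & dictionary

# Assumed weighting: simple substitutions rank highest, as suggested
# above; a real certainty degree should also use semantic context.
WEIGHT = {"substitution": 3, "addition": 2, "subtraction": 1}

def best_correction(word, dictionary):
    scored = [(WEIGHT["substitution"], w)
              for w in substitution_candidates(word, dictionary)]
    scored += [(WEIGHT["addition"], w)
               for w in addition_candidates(word, dictionary)]
    scored += [(WEIGHT["subtraction"], w)
               for w in subtraction_candidates(word, dictionary)]
    return max(scored)[1] if scored else word
```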
To illustrate how this algorithm works, consider the following
example. In an English document the word "cat" is not
recognized correctly and is approximated with "cal" by the OCR
engine. With the dictionary approach the erroneous word is easily found
when cross-referencing it with the English dictionary. Using only the
substitution operation, the variants "car" and "cat"
are found. Using all three operations, including combinations of them, a
much larger variety of solutions is provided: "car",
"cat", "calf", "cold", "fall",
"ball", "mall", "hall" and many other
variants. From all of these variants a good semantic
certainty-allocation algorithm will assign "cat" the highest
probability, and so the word will be corrected. A similar approach has
been presented in (Reynaert, 2008).
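With the sketches above, the single-operation part of this example can
be reproduced; note that variants such as "cold" or
"fall" require combined operations, which the sketches do not
implement.

```python
dictionary = {"car", "cat", "calf", "cold", "fall", "ball", "mall", "hall"}
print(substitution_candidates("cal", dictionary))  # {'car', 'cat'}
print(addition_candidates("cal", dictionary))      # {'calf'}
print(best_correction("cal", dictionary))          # 'cat', but only through
# an arbitrary tie-break between 'car' and 'cat': exactly the kind of tie
# that the recommended semantic certainty degree is meant to resolve
```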
3.2 An independent approach
The algorithm presented so far is based on a given dictionary or
alphabet, and so it assumes that the language used in the document is
known. But what happens when the language is unknown, or when the
purpose of passing the document through the OCR is to discover a new
language? A new, independent approach has to be developed.
The next algorithm is therefore used. Like the previous one, it
starts from the words found by the OCR. It adds all of them into a table
and computes the frequency of each one. The algorithm then finds all the
similar variants in the table; at this stage, only words that differ by
one character are considered variants.
After all the variants have been found, the most frequent one is
considered correct and all the others are corrected to this value.
Consider the following example. The words "cat",
"cas" and "ca" are found in a document, with
"cat" having the highest frequency. All three variants are
initially included in the new dictionary; after processing, the
occurrences of "ca" and "cas" in the document are
replaced with "cat", and these variants are removed from the
dictionary. The end result is a newly created dictionary for the given
document.
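A minimal sketch of this dictionary-independent pass, reproducing the
example above; the variant test and the names are illustrative
assumptions.

```python
from collections import Counter

def differ_by_one(a, b):
    # The paper's initial variant test: the two words differ by exactly
    # one character (one substitution, or one extra character).
    if len(a) == len(b):
        return a != b and sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

def independent_correction(words):
    # Map every word to the most frequent member of its group of
    # one-character variants, building a document-specific dictionary.
    freq = Counter(words)
    return {w: max((v for v in freq if v == w or differ_by_one(v, w)),
                   key=lambda v: freq[v])
            for w in freq}

words = ["cat"] * 5 + ["cas"] * 2 + ["ca"]
print(independent_correction(words))
# {'cat': 'cat', 'cas': 'cat', 'ca': 'cat'}
```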
Such an approach can correct a document written in an unknown
language and can also create a new dictionary for that language.
However, this approach has its downside. If the language is known by the
human reader but not by the OCR, the variant found as correct by the
post-processing may be erroneous because of the OCR's character
detection. If, in the previous example, "cas" had been found
with the highest frequency, "cat" would have been removed and
replaced by "cas", which would have led to an erroneous
result.
4. CONCLUSIONS
The algorithm presents an interesting approach to OCR improvement
in the post-processing stage. The main advantage is that the dictionary
approach is reliable and fast, improving the OCR correction by up to
90%. On the other hand, the independent approach is more intuitive and,
even though it can prove unreliable in some cases, it is a very good
algorithm for unknown languages.
The problems with the independent approach appear on documents
written in a known language. The first improvement that can be made to
this algorithm is some form of language recognition for documents: when
the correction is applied, the algorithm should signal that the language
is known and that a dictionary is required for proper functioning. This
could be done by adding to the algorithm a small database containing a
few of the most common words found in different languages.
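A minimal sketch of such a check, under the assumption of a tiny, purely
illustrative common-word database (a real one would hold far more words
per language):

```python
# Hypothetical mini-database of frequent words; every entry below is an
# illustrative assumption, not data from the paper.
COMMON_WORDS = {
    "English": {"the", "and", "of", "to", "in", "is"},
    "German":  {"der", "die", "und", "das", "ist", "ein"},
    "French":  {"le", "la", "et", "les", "des", "est"},
}

def detect_known_language(document_words, min_hits=3):
    # Signal a known language when enough of its common words occur in
    # the document, so that a dictionary can be requested instead of
    # running the dictionary-independent correction blindly.
    words = set(document_words)
    language, common = max(COMMON_WORDS.items(),
                           key=lambda item: len(words & item[1]))
    return language if len(words & common) >= min_hits else None
```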
However, the independent approach should perform well in the cases
where discovering a new language or dictionary for a language is a must.
As stated in the paper, both approaches require a certainty degree
association function. One improvement and direction of further research
could be the creation of such a function with respect to the semantic
knowledge found in the sentences.
Further improvements can be made to the operations themselves:
introducing a new operation, or creating a selection algorithm that
decides whether it is necessary to apply an operation at all. An
algorithm for finding a stopping point in the approaches may also prove
to be a good optimization.
To sum up, the algorithm is very useful in its current state and
can prove to be a good direction for further research in the OCR
post-processing domain.
5. REFERENCES
Kolak, O. & Resnik, P. (2005). OCR post-processing for low
density languages, Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, pp.
867-874, Vancouver, Canada, 2005, Association for Computational
Linguistics, Morristown, USA
Long, C.; Zhu, X.; Huang, K.; Sun, J.; Hotta, Y. & Naoi, S.
(2006). An efficient post-processing approach for off-line handwritten
Chinese address recognition, Proceedings of the 8th International
Conference on Signal Processing, pp. 16-20, ISBN: 0-7803-9736-3, 2006,
Beijing
Perez-Cortes, J.C.; Amengual, J.C.; Arlandis, J. & Llobet, R.
(2000). Stochastic error-correcting parsing for OCR post-processing,
Proceedings of the 15th International Conference on Pattern Recognition,
ICPR, vol. 4, ISSN: 1051-4651, 2000, IEEE Computer Society, Washington,
DC, USA
Reynaert, M. (2008). Lecture Notes in Computer Science, Vol. 4919,
pp. 617-630, ISSN: 0302-9743, 2008, Springer-Verlag
Wiedenhofer, L.; Hein, H.-G. & Dengel, A. (1995).
Post-processing of OCR results for automatic indexing, Proceedings of
The Third International Conference on Document Analysis and Recognition
(ICDAR'95), ICDAR, vol. 2, pp. 592-596, ISBN: 0-8186-7128-9, 1995,
IEEE Computer Society, Washington, DC, USA
Zhuang, L. & Zhu, X. (2005). Lecture Notes in Computer Science,
Vol. 3681, pp. 346-352, ISSN: 0302-9743, 2005, Springer-Verlag