OCR post processing based on character pattern matching.
Boiangiu, Costin-Anton; Cananau, Dan-Cristian; Petrescu, Serban et al.
1. INTRODUCTION
The domain of digital image processing is continuously expanding,
and for this reason new solutions to different problems appear all the
time. However, each solved problem opens the way for a new possible
improvement, and so new methods are constantly being developed. Digital
image processing can be divided into a large number of subcategories
that deal with different aspects of a digital image. Some algorithms
deal with conversions from one colour scheme to another. Others deal
with the recognition of different elements found in an image. The goal
of this second type of algorithm is to simulate the human eye by
matching different patterns against already known models. The matched
patterns can vary from simple lines to more complicated characters that
contain a variety of geometrical shapes.
One of the most important areas of interest in digital image
processing is optical character recognition (OCR). There are numerous
software programs, some of them open source, that implement such OCR
engines, and each has its advantages and disadvantages. For example,
some OCR engines are oriented towards letter recognition, while others
perform better at text line recognition. Whatever their characteristics,
all OCR engines have to deal with word recognition and correction.
The result of the OCR software is usually a page or a fragment in
which the characters, words and other elements found have been correctly
associated with an understandable language. For example, a deteriorated
newspaper page written in English should be reconstructed so that it
becomes readable.
Even though there are numerous programs with different features
that deal with different aspects of the problem, there is no such thing
as a perfect OCR. Each algorithm has its own flaws. Some algorithms may
fail on a certain type of document, and combining all of them to address
the problems that arise is simply not recommended because of the huge
running time such a combined algorithm would have. Hence the necessity
for pre- and post-processing.
2. PROBLEM STATEMENT
The two types of processing solve different problems before and
after a page or a fragment is passed through the OCR engine.
Preprocessing deals with everything connected to preparing the page or
fragment for the actual processing. The methods used at this stage are
mostly oriented towards image conversions and image processing such as
smoothing or noise removal. Post-processing techniques are far more
varied and are not restricted to a single aspect of the document: some
deal with image improvement, others with word recognition, and so on.
Almost all algorithms in this second category require a dictionary or an
alphabet to be used for word correction. Some of these algorithms can
work for any language as long as a dictionary is given (Kolak &
Resnik, 2005), (Perez-Cortes et al., 2000), (Zhuang & Zhu, 2005).
Others are used only for a given language, for example German
(Wiedenhofer et al., 1995) or Chinese (Long et al., 2006). A more
complex approach could use both pre- and post-processing together with a
feedback function that assesses the final result of the two processing
steps.
The aim of the algorithm presented in this paper is to improve the
results of the OCR in the post-processing step by using both
dictionary-dependent and dictionary-independent variants.
3. THE ALGORITHM
Most of the errors produced by the OCR engine are the result of
erroneous character recognition or of badly scanned documents. Most
words are detected, but a certain percentage of them are not real words;
they are variations of real words. Variations are words that are very
similar to the original but differ by a group of characters. The
dictionary approach presented in this paper improves the character
results obtained by the actual OCR processing.
3.1 The dictionary approach
In order to remove the variations and find the correct form of each
word, in the first stage the algorithm takes every word in the document
and inserts it into a hash table, with the hash function given by the
sum of the squares of the Unicode values of the characters in the word.
The square has been chosen for optimization purposes: the hash function
assures a good spreading of the values in the hash table, which improves
the data allocation of the entire algorithm. In the same stage an input
dictionary, indexed in a similar hash table by the same insertion
procedure, is required for pattern recognition. Once this stage is
finished, the hash values of the document words are compared with those
of the dictionary. If the values match, the word is checked against the
correct form and, if that form exists, the word is removed from the hash
table because it needs no correction. At the end of this step the hash
table contains only the erroneous words.
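A minimal sketch of this indexing stage, in Python, is given below; the
table size, the function names and the dict-of-sets bucket layout are
illustrative assumptions, not the authors' implementation.

```python
TABLE_SIZE = 1 << 16  # assumed table size; the paper does not specify one

def hash_word(word):
    # Hash function from the paper: the sum of the squares of the
    # Unicode values of the characters, reduced to the table size.
    return sum(ord(ch) ** 2 for ch in word) % TABLE_SIZE

def build_table(words):
    # Bucket words by hash value; a dict of sets stands in for the
    # hash table, and collisions simply share a bucket.
    table = {}
    for w in words:
        table.setdefault(hash_word(w), set()).add(w)
    return table

def find_erroneous_words(document_words, dictionary_words):
    # A word is erroneous when its exact form is absent from the
    # dictionary bucket that has the same hash value.
    dict_table = build_table(dictionary_words)
    return {w for w in document_words
            if w not in dict_table.get(hash_word(w), set())}
```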
The next step is the actual correction. Another table holds all the
letters of the given alphabet and all possible combinations of two or
three letters; this will be called the alphabet of the language. Each
erroneous word is taken in turn, every letter or group of two or three
letters in it is replaced with a value from the alphabet, and the result
is cross-referenced with the dictionary for a match. If only one match
is found, the word is corrected with a high degree of certainty. If
several matches are found, the algorithm computes a certainty degree and
chooses the word with the highest certainty as the solution. The
certainty degree can be computed in several ways depending on the
desired output. For example, a fast but unreliable solution would be to
simply choose the first word found in the dictionary and accept it as
the correct solution. This approach is not recommended, however, because
errors are likely to occur. The certainty degree should also take
semantic information into account where possible.
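The substitution operation described above can be sketched as follows;
the alphabet, the names and the brute-force enumeration are assumptions
made for illustration, with the dictionary taken to be a set of words.

```python
from itertools import product

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed lower-case English alphabet

def substitution_candidates(word, dictionary, max_group=3):
    # Replace every group of one to three consecutive characters with
    # every same-length combination from the alphabet, and keep only
    # the results that cross-reference successfully with the dictionary.
    found = set()
    for size in range(1, max_group + 1):
        for start in range(len(word) - size + 1):
            for group in product(ALPHABET, repeat=size):
                candidate = word[:start] + "".join(group) + word[start + size:]
                if candidate != word and candidate in dictionary:
                    found.add(candidate)
    return found
```

The enumeration of 26 + 26² + 26³ groups per position is only meant to
make the operation concrete; a practical implementation would index the
dictionary instead of generating every combination.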
By using the algorithm presented so far, the OCR correction has
been improved by 60%. Further improvements can be made, however. The
operation of replacing a character or group of characters with another
character or group from the alphabet is called a substitution.
To improve the correction further, two more operations are
introduced: the addition and the subtraction. The subtraction is the
elimination of one character from the word, followed by a
cross-reference with the dictionary. The addition inserts one character
from the alphabet into the word, again followed by a cross-reference
with the dictionary. This is done for each position in the word in turn
and for each character in the alphabet. Using all three operations makes
a large variety of choices available for one word, so the certainty
degree now has to be calculated taking into account that there are three
different operations instead of just one. Again, a fast but less
reliable approach would be to compute the degree by assigning a higher
value to simple operations, with the highest for substitution, and a
lower value to combined operations. An algorithm that takes semantic
information into account is, once more, recommended.
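The two extra operations and the simple weighting just described can be
sketched as below, reusing ALPHABET and substitution_candidates from the
previous sketch; the weight values are illustrative assumptions.

```python
def subtraction_candidates(word, dictionary):
    # Subtraction: eliminate one character at each position in turn.
    return {word[:i] + word[i + 1:] for i in range(len(word))} & dictionary

def addition_candidates(word, dictionary):
    # Addition: insert one alphabet character at each position in turn.
    return {word[:i] + ch + word[i:]
            for i in range(len(word) + 1) for ch in ALPHABET} & dictionary

# Assumed weighting: simple substitutions rank highest, as suggested
# above; a real certainty degree should also use semantic context.
WEIGHT = {"substitution": 3, "addition": 2, "subtraction": 1}

def best_correction(word, dictionary):
    scored = [(WEIGHT["substitution"], w)
              for w in substitution_candidates(word, dictionary)]
    scored += [(WEIGHT["addition"], w)
               for w in addition_candidates(word, dictionary)]
    scored += [(WEIGHT["subtraction"], w)
               for w in subtraction_candidates(word, dictionary)]
    return max(scored)[1] if scored else word
```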
To illustrate how this algorithm works, consider the following
example. In an English document the word "cat" is not
recognized correctly and is approximated with "cal" by the OCR
engine. With the dictionary approach the erroneous word is easily found
when cross-referencing it with the English dictionary. Using only the
substitution operation, the variants "car" and "cat"
are found. Using all three operations, including combinations of them, a
much larger variety of solutions is provided: "car",
"cat", "calf", "cold", "fall",
"ball", "mall", "hall" and many other
variants. From all of these variants a good semantic
certainty-allocation algorithm will assign "cat" the highest
probability, and so the word will be corrected. A similar approach has
been presented in (Reynaert, 2008).
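With the sketches above, the single-operation part of this example can
be reproduced; note that variants such as "cold" or
"fall" require combined operations, which the sketches do not
implement.

```python
dictionary = {"car", "cat", "calf", "cold", "fall", "ball", "mall", "hall"}
print(substitution_candidates("cal", dictionary))  # {'car', 'cat'}
print(addition_candidates("cal", dictionary))      # {'calf'}
print(best_correction("cal", dictionary))          # 'cat', but only through
# an arbitrary tie-break between 'car' and 'cat': exactly the kind of tie
# that the recommended semantic certainty degree is meant to resolve
```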
3.2 An independent approach
The algorithm presented so far is based on a given dictionary or
alphabet, and so it assumes that the language used in the document is
known. But what happens when the language is unknown, or when the
purpose of passing the document through the OCR is to discover a new
language? A new, independent approach has to be developed.
The next algorithm is therefore used. Like the previous one, it
starts from the words found by the OCR. It adds all of them into a table
and computes the frequency of each one. The algorithm then finds all the
similar variants in the table; at this stage, only words that differ by
one character are considered variants.
After all the variants have been found, the most frequent one is
considered correct and all the others are corrected to this value.
Consider the following example. The words "cat",
"cas" and "ca" are found in a document, with
"cat" having the highest frequency. All three variants are
initially included in the new dictionary; after processing, the
occurrences of "ca" and "cas" in the document are
replaced with "cat", and these variants are removed from the
dictionary. The end result is a newly created dictionary for the given
document.
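A minimal sketch of this dictionary-independent pass, reproducing the
example above; the variant test and the names are illustrative
assumptions.

```python
from collections import Counter

def differ_by_one(a, b):
    # The paper's initial variant test: the two words differ by exactly
    # one character (one substitution, or one extra character).
    if len(a) == len(b):
        return a != b and sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

def independent_correction(words):
    # Map every word to the most frequent member of its group of
    # one-character variants, building a document-specific dictionary.
    freq = Counter(words)
    return {w: max((v for v in freq if v == w or differ_by_one(v, w)),
                   key=lambda v: freq[v])
            for w in freq}

words = ["cat"] * 5 + ["cas"] * 2 + ["ca"]
print(independent_correction(words))
# {'cat': 'cat', 'cas': 'cat', 'ca': 'cat'}
```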
Such an approach can correct a document written in an unknown
language and can also create a new dictionary for that language.
However, this approach has its downside. If the language is known by the
human reader but not by the OCR, the variant found as correct by the
post-processing may be erroneous because of the OCR's character
detection. If, in the previous example, "cas" had been found
with the highest frequency, "cat" would have been removed and
replaced by "cas", which would have led to an erroneous
result.
4. CONCLUSIONS
The algorithm presents an interesting approach to OCR improvement
in the post-processing stage. The main advantage is that the dictionary
approach is reliable and fast, improving the OCR correction by up to
90%. On the other hand, the independent approach is more intuitive and,
even though it can prove unreliable in some cases, it is a very good
algorithm for unknown languages.
The problems with the independent approach appear on documents
written in a known language. The first improvement that can be made to
this algorithm is some form of language recognition for documents: when
the correction is applied, the algorithm should signal that the language
is known and that a dictionary is required for proper functioning. This
could be done by adding to the algorithm a small database containing a
few of the most common words found in different languages.
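A minimal sketch of such a check, under the assumption of a tiny, purely
illustrative common-word database (a real one would hold far more words
per language):

```python
# Hypothetical mini-database of frequent words; every entry below is an
# illustrative assumption, not data from the paper.
COMMON_WORDS = {
    "English": {"the", "and", "of", "to", "in", "is"},
    "German":  {"der", "die", "und", "das", "ist", "ein"},
    "French":  {"le", "la", "et", "les", "des", "est"},
}

def detect_known_language(document_words, min_hits=3):
    # Signal a known language when enough of its common words occur in
    # the document, so that a dictionary can be requested instead of
    # running the dictionary-independent correction blindly.
    words = set(document_words)
    language, common = max(COMMON_WORDS.items(),
                           key=lambda item: len(words & item[1]))
    return language if len(words & common) >= min_hits else None
```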
However, the independent approach should perform well in the cases
where discovering a new language or dictionary for a language is a must.
As stated in the paper, both approaches require a certainty degree
association function. One improvement and direction of further research
could be the creation of such a function with respect to the semantic
knowledge found in the sentences.
Further improvements can be made to the operations themselves:
introducing a new operation, or creating a selection algorithm that
decides whether it is necessary to apply an operation at all. An
algorithm for finding a stopping point in the approaches may also prove
to be a good optimization.
To sum up, the algorithm is very useful in its current state and
can prove to be a good direction for further research in the OCR
post-processing domain.
5. REFERENCES
Kolak, O. & Resnik, P. (2005). OCR post-processing for low
density languages, Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing, pp.
867-874, Vancouver, Canada, 2005, Association for Computational
Linguistics, Morristown, USA
Long, C.; Zhu, X.; Huang, K.; Sun, J.; Hotta, Y. & Naoi, S.
(2006). An efficient post-processing approach for off-line handwritten
Chinese address recognition, Proceedings of the 8th International
Conference on Signal Processing, pp. 16-20, ISBN: 0-7803-9736-3, 2006,
Beijing
Perez-Cortes, J.C.; Amengual, J.C.; Arlandis, J. & Llobet, R.
(2000). Stochastic error-correcting parsing for OCR post-processing,
Proceedings of the 15th International Conference on Pattern Recognition,
ICPR, vol. 4, ISSN: 1051-4651, 2000, IEEE Computer Society, Washington,
DC, USA
Reynaert, M. (2008). Lecture Notes in Computer Science, Vol. 4919,
pp. 617-630, ISSN: 0302-9743, 2008, Springer-Verlag
Wiedenhofer, L.; Hein, H.-G. & Dengel, A. (1995).
Post-processing of OCR results for automatic indexing, Proceedings of
The Third International Conference on Document Analysis and Recognition
(ICDAR'95), ICDAR, vol. 2, pp. 592-596, ISBN: 0-8186-7128-9, 1995,
IEEE Computer Society, Washington, DC, USA
Zhuang, L. & Zhu, X. (2005). Lecture Notes in Computer Science,
Vol. 3681, pp. 346-352, ISSN: 0302-9743, 2005, Springer-Verlag