Modern preprocessing techniques for automatic content conversion systems.
Boiangiu, Costin Anton ; Dvornic, Andrei Iulian
1. INTRODUCTION
Document image analysis has become a very active research area,
driven by the need both to convert old documents to digital media and
to obtain useful, timely information in electronic and searchable
format. To this end, faster and better methods for automatic content
conversion need to be developed, as variations in layout, orientation,
size, quality and characteristics of printed documents (both in paper
and electronic form) make this task a very complicated one. In order to
cope with all these aspects, recent developments in this domain require
image binarization before the actual feature extraction in order to
reduce the computational load and simplify the analysis methods (Chang,
2001). As a result, in this paper we propose a series of conversion
techniques for different types of documents, as well as some
preprocessing algorithms which can increase the quality of bitonal
transformations.
Global thresholding and adaptive binarization (Sheikh et al., 2005)
have proven to be the most popular binarization methods so far.
Unfortunately, apart from having their own limitations (it is very hard
to find a global threshold or a predefined analysis window size that
suits all types of documents), these techniques do not address critical
issues specific to electronic documents (detecting text inside images or
on uneven backgrounds, converting areas where both the text and the
background are multi-coloured, etc.) and hence are not a viable solution
for this type of conversion.
With this in mind, we propose three methods aimed at solving some of
the problem areas on which current methods fail.
2. CONTRAST PREPROCESSING TECHNIQUE
Bitonal conversion using contrast preprocessing is a two-stage
method which tries to solve problems like large brightness variations
and low contrast ratios in scanned documents. These issues, which are
the direct result of both document degradation and poor calibration of
automatic scanners, require an adaptive local approach because any
global thresholding attempt will most probably fail on some parts of the
input document. The first stage of the algorithm is an auto-stretch
contrast step which is performed using two successive passes over the
input document: the first pass computes the contrast stretch bounds and
the second performs the actual transformation.
The stretch bounds are computed by applying a horizontal threshold
to the intensity histogram(s) of the document. For grayscale images, the
histogram is computed from the frequency of each pixel value (in the
case of color documents, an individual histogram is considered for each
color channel). The bounds of the longest horizontal segment cutting the
histogram(s) at the threshold level are used as references for the
contrast stretch. If some color values are missing from the histogram, a
triangular filter is applied repeatedly before the longest segment
estimation, until all the color values have a representation in the
histogram.
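The bound computation described above can be sketched as follows. This
is only an illustration under stated assumptions: the function names,
the [1, 2, 1]/4 filter weights and the edge handling are hypothetical
choices, and "cutting the histogram at the threshold level" is
interpreted as the longest run of consecutive bins whose count reaches
the level. For color documents the same function would be called once
per channel.

```python
def triangular_smooth(hist):
    """One pass of a [1, 2, 1]/4 triangular filter (edges replicated)."""
    n = len(hist)
    out = []
    for i in range(n):
        left = hist[max(i - 1, 0)]
        right = hist[min(i + 1, n - 1)]
        out.append((left + 2 * hist[i] + right) / 4.0)
    return out


def stretch_bounds(hist, level):
    """Return (lo, hi), the ends of the longest run of bins >= level.

    Empty bins are first filled by repeated triangular smoothing.
    Returns None when no bin reaches the level.
    """
    if not any(hist):
        raise ValueError("histogram is empty")
    hist = [float(v) for v in hist]
    while min(hist) == 0:
        hist = triangular_smooth(hist)

    best, start = None, None
    # The -inf sentinel closes a run that reaches the last bin.
    for i, v in enumerate(hist + [float("-inf")]):
        if v >= level:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - 1 - start) > (best[1] - best[0]):
                best = (start, i - 1)
            start = None
    return best


def stretch_value(v, lo, hi, max_value=255):
    """Linearly map [lo, hi] to [0, max_value], clamping outside values."""
    if hi <= lo:
        return v
    scaled = round((v - lo) * max_value / (hi - lo))
    return max(0, min(max_value, scaled))
```

The first pass over the document would build the histogram and call
stretch_bounds; the second pass would apply stretch_value to every
pixel.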
In terms of effectiveness, experiments performed on scanned books
and newspapers from the British Library have shown that documents with a
good contrast ratio might not need this preprocessing step: for them,
the increase in conversion precision does not compensate for the time
cost of such a preliminary transformation. Apart from that, this
technique increases the contrast ratio between the background and the
foreground pixels without performing a contrast equalization. Hence, if
an effective black and white conversion is desired, the second stage of
the binarization process must be a local one.
Taking all of the above into consideration, the second stage of the
proposed technique is an actual binarization process, based on adaptive
leveling of the preprocessed document. By this, we mean applying a
threshold to the difference between the image obtained after the first
stage (involving the stretch of the document's contrast) and a
leveled version of it, obtained using one of the following three
methods: Gaussian unsharp (Dobrescu et al., 2006),
downsampling/upsampling (Fischer, 2000) or local contrast stretch. The
basic idea behind this is that areas in the image which belong to the
background will have intensity values below the leveled ones, whereas
the pixels belonging to the foreground will have intensity values above
them.
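A minimal sketch of this second stage is given below, using the
downsampling/upsampling variant. It rests on several assumptions that
are not taken from the experiments: the image is a list of rows of
grayscale values, downsampling is done by block averaging, and
upsampling is nearest-neighbour rather than one of the interpolation
modes listed in the text. The sketch follows the convention stated
above (foreground pixels lie above the leveled image), so an image with
dark text on light paper is assumed to have been inverted beforehand.
The names level_by_resampling and binarize_by_leveling are hypothetical.

```python
def level_by_resampling(img, block):
    """Leveled image: block-average downsample, nearest-neighbour upsample."""
    h, w = len(img), len(img[0])
    leveled = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    leveled[y][x] = mean
    return leveled


def binarize_by_leveling(img, block, threshold):
    """Mark as foreground (1) the pixels exceeding the leveled image
    by more than the threshold."""
    leveled = level_by_resampling(img, block)
    return [
        [1 if img[y][x] - leveled[y][x] > threshold else 0
         for x in range(len(img[0]))]
        for y in range(len(img))
    ]


# Two regions with very different brightness: no single global
# threshold separates 60 and 160 from both backgrounds (10 and 100).
image = [
    [10, 10, 100, 100],
    [10, 60, 100, 160],
]
result = binarize_by_leveling(image, block=2, threshold=20)
```

Because the comparison is made against a local average, the pixel of
value 60 is detected in the dark region even though it is darker than
the background of the bright region.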
In the following example we show the results of all three leveling
methods. In the case of downsampling/upsampling the following
interpolation modes have been tested: Lanczos, Hermite, Triangle,
Mitchell, Bell and B-Spline. The last one yielded the best results.
[FIGURE 1 OMITTED]
3. MULTI-THRESHOLD CONVERSION
Multi-threshold conversion is a binarization technique suited to
documents which are very sensitive to noise. This includes both
documents which have large background variations and documents where OCR
detection of text needs to be very precise (Wenzel & Grigat, 2005).
Four different intermediary threshold-based conversions are used in
order to obtain the final black and white document: a threshold applied
to the grayscale transformation of the image; a threshold applied to the
K component of the CMYK color model; thresholding of each component of
the RGB color model with the minimum average intensity of the channels;
and a variable thresholding technique based on the hue component. The
intermediary conversion which outputs the fewest black pixels is
considered to be the least accurate one. Starting from it, the proposed
technique recursively adds undetected foreground pixels, provided that
they are detected in one of the other intermediary images and that they
neighbour a foreground pixel which has already been detected. The final
conversion is obtained when no more pixels can be classified as
belonging to the foreground based on the proposed algorithm.
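The growing step described above can be sketched as a breadth-first
propagation. The sketch assumes that the intermediary conversions are
already available as equally sized binary images (1 = black), and that
"neighbouring" means 8-connectivity; both the connectivity and the
function name merge_conversions are illustrative assumptions.

```python
from collections import deque


def merge_conversions(conversions):
    """Grow the sparsest conversion using pixels found in the others."""
    h, w = len(conversions[0]), len(conversions[0][0])

    # Seed: the conversion with the fewest black pixels.
    counts = [sum(map(sum, c)) for c in conversions]
    seed_idx = counts.index(min(counts))
    others = [c for i, c in enumerate(conversions) if i != seed_idx]

    result = [row[:] for row in conversions[seed_idx]]
    queue = deque((y, x) for y in range(h) for x in range(w) if result[y][x])

    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                    # Add a pixel only if another conversion detected it
                    # and it touches an already accepted foreground pixel.
                    if not result[ny][nx] and any(c[ny][nx] for c in others):
                        result[ny][nx] = 1
                        queue.append((ny, nx))
    return result


safe = [[0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0]]
noisy = [[0, 1, 1, 0, 1],
         [0, 0, 0, 0, 1]]
merged = merge_conversions([safe, noisy])
```

In this toy input the pixel next to the seed is recovered from the
second image, while the two isolated pixels on the right, which touch
no accepted foreground pixel, are left out.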
Experiments have been performed in order to compare the
multi-threshold method with both adaptive and global thresholding. The
noise level was noticeably lower when using the proposed method, because
the threshold levels of the intermediary conversions were chosen so as
to be "safe": each intermediary conversion contains the least possible
noise, even if some relevant information is lost, since the missing
information is recovered from the other intermediary versions. In this
way the final black and white document is assembled like a puzzle, using
pieces from the previously computed intermediary images, in which the
level of undesired pixels is very low.
4. ELECTRONIC DOCUMENTS CONVERSION
In the case of electronic documents (like PDFs and online
newspapers) the focus of attention during the binarization process is
shifted from aspects like contrast or noise to translating the colour
combinations of texts, images and backgrounds into a meaningful black
and white version. To this end, three different approaches to the black
and white transformation of electronic documents have been tried.
The first method is the "outside-in" local technique. This algorithm
starts from the outer white background and converts everything which
neighbours it to black; then everything that neighbours the previously
detected area is set to white, and so on until the whole document is
processed. This approach works better than the threshold-based global
ones, avoiding the disappearance of letters printed with colours that
are scarce in the picture. Apart from that, using this technique removes
the risk of errors generated by the same colour playing the role of both
background and foreground in different parts of the document.
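One way to sketch the "outside-in" layering is to count, for every
pixel, the minimum number of colour changes on a path from the outer
background, and to alternate white and black on the parity of that
count. The sketch below makes several assumptions: the top-left pixel
belongs to the outer background, regions are compared by exact colour
equality, and neighbours are taken with 4-connectivity. The function
name outside_in is hypothetical.

```python
from collections import deque


def outside_in(img):
    """Return 0 (white) / 1 (black) per pixel from layer parity.

    img is a sequence of rows of comparable colour values.
    """
    h, w = len(img), len(img[0])
    inf = float("inf")
    depth = [[inf] * w for _ in range(h)]
    depth[0][0] = 0
    dq = deque([(0, 0)])

    # 0-1 breadth-first search: staying inside a colour region costs 0,
    # crossing into a different colour costs 1.
    while dq:
        y, x = dq.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                step = 0 if img[ny][nx] == img[y][x] else 1
                nd = depth[y][x] + step
                if nd < depth[ny][nx]:
                    depth[ny][nx] = nd
                    if step == 0:
                        dq.appendleft((ny, nx))
                    else:
                        dq.append((ny, nx))

    return [[d % 2 for d in row] for row in depth]


# W = white page, R = coloured box, Y = glyph printed on the box.
page = [
    "WWWWW",
    "WRRRW",
    "WRYRW",
    "WRRRW",
    "WWWWW",
]
layers = outside_in(page)
```

The box becomes black and the glyph inside it white, regardless of how
rare the glyph colour is in the whole image.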
The "alternate colours" approach is also a local conversion method,
which tries to solve the binarization decision using horizontal scan
lines. This iterative process takes one pixel at a time (on each
individual horizontal line in the input document) and decides its final
value based on the previously made decisions. Conversion to white or
black is alternated each time a colour shift occurs. Whenever a white or
black pixel is encountered in the original document, the output colour
is automatically set to that value, regardless of the alternation rule.
This algorithm manages to solve some cases in which the "outside-in"
method would fail, like bi-coloured areas in which the background and
foreground are combinations of two distinct colours. A threshold is used
in order to decide when a colour shift has occurred and the output
colour must be changed.
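The scan-line rule can be sketched for a single row as follows. Pixels
are assumed to be RGB tuples; the colour distance (largest per-channel
difference), the initial white state at the start of each line and the
function names are illustrative assumptions rather than details given
in the text.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def colour_distance(a, b):
    """Largest absolute per-channel difference between two RGB tuples."""
    return max(abs(p - q) for p, q in zip(a, b))


def alternate_colours_row(row, threshold):
    """Binarize one scan line: 0 = white, 1 = black."""
    out = []
    current = 0  # assumption: each line starts in the white state
    prev = None
    for pixel in row:
        if pixel == WHITE:
            current = 0            # pure white overrides the alternation
        elif pixel == BLACK:
            current = 1            # pure black overrides the alternation
        elif prev is not None and colour_distance(pixel, prev) > threshold:
            current = 1 - current  # colour shift: flip the output colour
        out.append(current)
        prev = pixel
    return out


blue, yellow, red = (0, 0, 200), (230, 230, 0), (200, 0, 0)
row = [blue, blue, yellow, yellow, blue, WHITE, BLACK, red]
bits = alternate_colours_row(row, threshold=50)
```

A full conversion would apply alternate_colours_row to every horizontal
line of the document.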
[FIGURE 2 OMITTED]
The third and final proposed method is an edge-detection based
technique. This algorithm uses a threshold in order to decide whether or
not a pixel is part of an edge between foreground and background. In
this way, the problems resulting from background variations can be
reduced by adjusting the threshold. Research is still being carried out
in order to find a way to decide which of the detected contours belong
to the foreground and how they could be filled with black pixels in
order to generate a more relevant conversion.
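The thresholded edge decision can be illustrated with the sketch below.
The text does not specify the edge operator, so the central-difference
gradient and the sum of absolute differences used here are assumptions,
as is the function name edge_map; the contour filling step, which is
still under research, is not attempted.

```python
def edge_map(gray, threshold):
    """Mark pixels whose local gradient exceeds the threshold (1 = edge)."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, with coordinates clamped at the borders.
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            if abs(gx) + abs(gy) > threshold:
                edges[y][x] = 1
    return edges


sharp = [[10, 10, 200, 200],
         [10, 10, 200, 200]]
gradual = [[10, 20, 30, 40]]
sharp_edges = edge_map(sharp, threshold=100)
gradual_edges = edge_map(gradual, threshold=100)
```

Raising the threshold suppresses slow background variations such as
the gradual row, while the sharp transition is still reported.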
5. CONCLUSIONS
Due to the increasing interest in document content conversion, there
is a need for new image binarization methods that can cope with
problem areas in a large variety of documents. As the best technique to
be chosen depends on the desired output and the set of images to be
converted, we have tried to present in this paper a series of
alternatives to current algorithms.
Further research must still be conducted in order to improve
current methods, as well as develop new techniques that can cope with
the new challenges of modern document analysis.
6. REFERENCES
Chang, F. (2001). Retrieving information from document images:
problems and solutions, International Journal on Document Analysis and
Recognition, Vol. 4, No. 1, (August 2001) pp. 46-55, ISSN 1433-2833
Dobrescu, R.; Dobrescu, M.; Mocanu, S. & Taralunga, S. (2006).
Development platform for parallel image processing, Proceedings of the
6th WSEAS International Conference on Signal, Speech and Image
Processing, pp. 31-36, ISBN 123-4567-99-3, Portugal, September 2006,
WSEAS Press, Lisbon
Fischer, S. (2000). Digital Image Processing: Skewing and
Thresholding, Master of Science thesis, University of New South Wales,
Sydney, Australia
Sheikh, L.M.; Hassan, I.; Sheikh, N.Z.; Bashir, R.A.; Khan, S.A.
& Khan, S.S. (2005). An Adaptive Multi-Thresholding Technique for
Binarization of Color Images, Proceedings of the 9th WSEAS International
Conference on Computers, Article No. 104, ISBN 960-8457-29-7, Greece,
July 2005, WSEAS Press, Athens
Wenzel, F. & Grigat, R.R. (2005). A Framework for Developing
Image Processing Algorithms with Minimal Overhead, Proceedings of the
5th WSEAS International Conference on Signal, Speech and Image
Processing, pp. 185-190, ISBN 940-4271-11-5, Greece, August 2005, WSEAS
Press, Corfu Island