Article information

Title: Modern preprocessing techniques for automatic content conversion systems.
Authors: Boiangiu, Costin Anton ; Dvornic, Andrei Iulian
Journal: Annals of DAAAM & Proceedings
Print ISSN: 1726-9679
Publication year: 2008
Issue: January
Language: English
Publisher: DAAAM International Vienna
Abstract: Document image analysis has become a very active research area, driven by the need both to convert old documents to digital media and to acquire useful, timely information in an electronic, searchable format. To this end, faster and better methods for automatic content conversion need to be developed, since variations in the layout, orientation, size, quality and characteristics of printed documents (both in physical and electronic form) make this task very complicated. To cope with all these aspects, recent developments in this domain require image binarization before the actual feature extraction, in order to reduce the computational load and simplify the analysis methods (Chang, 2001). In this paper we therefore propose a series of conversion techniques for different types of documents, as well as some preprocessing algorithms that can increase the quality of bitonal transformations.
Keywords: Algorithms

1. INTRODUCTION

Document image analysis has become a very active research area, driven by the need both to convert old documents to digital media and to acquire useful, timely information in an electronic, searchable format. To this end, faster and better methods for automatic content conversion need to be developed, since variations in the layout, orientation, size, quality and characteristics of printed documents (both in physical and electronic form) make this task very complicated. To cope with all these aspects, recent developments in this domain require image binarization before the actual feature extraction, in order to reduce the computational load and simplify the analysis methods (Chang, 2001). In this paper we therefore propose a series of conversion techniques for different types of documents, as well as some preprocessing algorithms that can increase the quality of bitonal transformations.

Global thresholding and adaptive binarization (Sheikh et al., 2005) have proven to be the most popular binarization methods so far. Unfortunately, apart from their inherent limitations (it is very hard to find a global threshold or a predefined analysis window size that suits all types of documents), these techniques do not address critical issues specific to electronic documents (detecting text inside images or on uneven backgrounds, converting areas where both the text and the background are multi-coloured, etc.) and hence are not a viable solution for this type of conversion.

With this in mind, we propose three methods aimed at solving some of the problem areas in which current methods fail.

2. CONTRAST PREPROCESSING TECHNIQUE

Bitonal conversion using contrast preprocessing is a two-stage method that addresses problems such as large brightness variations and low contrast ratios in scanned documents. These issues, which are the direct result of both document degradation and poor calibration of automatic scanners, require an adaptive local approach, because any global thresholding attempt will most probably fail on some parts of the input document. The first stage of the algorithm is an auto-stretch contrast step performed in two successive passes over the input document: the first pass computes the contrast stretch bounds and the second performs the actual transformation.

The stretch bounds are computed by applying a horizontal threshold to the intensity histogram(s) of the document. For grayscale images, the histogram records the frequency of each pixel value; for color documents, an individual histogram is considered for each color channel. The bounds of the longest horizontal segment cutting the histogram(s) at the threshold level are used as references for the contrast stretch. If some color indexes are missing from the histogram, a triangular filter is applied repeatedly before the longest-segment estimation, until every color value is represented in the histogram.
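
The first stage can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the [1, 2, 1] / 4 filter weights, the edge handling and the reading of "longest horizontal segment" as the longest run of consecutive bins at or above the threshold level are all assumptions, and the code handles a single channel.

```python
from typing import List, Tuple


def smooth_until_filled(hist: List[float]) -> List[float]:
    """Apply a triangular filter repeatedly until no histogram bin is empty.

    Assumption: weights [1, 2, 1] / 4, with edge bins reusing their own
    value in place of the missing neighbour.
    """
    if sum(hist) == 0:
        raise ValueError("histogram is empty")
    h = [float(v) for v in hist]
    while min(h) == 0.0:
        padded = [h[0]] + h + [h[-1]]
        h = [(padded[i - 1] + 2.0 * padded[i] + padded[i + 1]) / 4.0
             for i in range(1, len(padded) - 1)]
    return h


def stretch_bounds(hist: List[float], level: float) -> Tuple[int, int]:
    """Return (low, high) of the longest run of bins at or above `level`.

    Assumption: this run is what the text calls the longest horizontal
    segment cutting the histogram at the threshold level.
    """
    best = (0, len(hist) - 1)  # fall back to the full range if no bin qualifies
    best_len = 0
    start = None
    # A sentinel below any level closes a run that reaches the last bin.
    for i, v in enumerate(list(hist) + [float("-inf")]):
        if v >= level and start is None:
            start = i
        elif v < level and start is not None:
            if i - start > best_len:
                best_len, best = i - start, (start, i - 1)
            start = None
    return best


def stretch(pixels: List[int], low: int, high: int, max_value: int = 255) -> List[int]:
    """Linearly map [low, high] onto [0, max_value], clamping values outside."""
    if high <= low:
        return list(pixels)
    scale = max_value / (high - low)
    return [min(max_value, max(0, round((p - low) * scale))) for p in pixels]
```

For a color document, the same three steps would be run once per channel histogram, as the text describes.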

In terms of effectiveness, experiments performed on scanned books and newspapers from the British Library have shown that documents with a good contrast ratio might not need this preprocessing step: the increase in conversion precision does not compensate for the time cost of such a preliminary transformation. Apart from that, this technique increases the contrast ratio between the background and the foreground pixels without performing a contrast equalization. Hence, if an effective black and white conversion is desired, the second stage of the binarization process must be a local one.

Taking all of the above into consideration, the second stage of the proposed technique is an actual binarization process, based on adaptive leveling of the preprocessed document. By this, we mean applying a threshold to the difference between the image obtained after the first stage (involving the stretch of the document's contrast) and a leveled version of it, obtained using one of the following three methods: Gaussian unsharp (Dobrescu et al., 2006), downsampling/upsampling (Fischer, 2000) or local contrast stretch. The basic idea behind this is that areas in the image which belong to the background will have intensity values below the leveled ones, whereas the pixels belonging to the foreground will have intensity values above them.
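
A minimal sketch of the second stage, using the downsampling/upsampling variant, might look like the following. It is an assumption-laden illustration: block averaging followed by pixel replication stands in for the interpolation modes tested in the paper, the names `level_by_resampling` and `binarize_by_leveling` are hypothetical, and the code follows the convention stated above that foreground pixels lie above the leveled values (an image with dark text on light paper would have to be inverted first).

```python
from typing import List

Image = List[List[float]]


def level_by_resampling(img: Image, block: int) -> Image:
    """Downsample by block averaging, then upsample by pixel replication.

    Assumption: this crude resampling replaces the interpolation filters
    (Lanczos, B-Spline, ...) mentioned in the text.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            ys = range(by, min(by + block, h))
            xs = range(bx, min(bx + block, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def binarize_by_leveling(img: Image, block: int, threshold: float) -> List[List[int]]:
    """Mark a pixel as foreground (1) when it exceeds its leveled value by more than `threshold`."""
    leveled = level_by_resampling(img, block)
    return [[1 if img[y][x] - leveled[y][x] > threshold else 0
             for x in range(len(img[0]))]
            for y in range(len(img))]


# Two regions with very different brightness, each containing one foreground pixel.
page = [[10, 10, 100, 100],
        [10, 50, 100, 140]]
result = binarize_by_leveling(page, block=2, threshold=15)
```

In this toy page no single global threshold separates both marked pixels from both backgrounds, whereas the difference against the leveled image isolates them in each region.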

Figure 1 shows the results of all three leveling methods. In the case of downsampling/upsampling, the following interpolation modes were tested: Lanczos, Hermite, Triangle, Mitchell, Bell and B-Spline. The last one yielded the best results.

[FIGURE 1 OMITTED]

3. MULTI-THRESHOLD CONVERSION

Multi-threshold conversion is a binarization technique suited to documents that are very sensitive to noise. This includes both documents with large background variations and documents for which OCR detection of text needs to be very precise (Wenzel & Grigat, 2005).

Four different intermediary threshold-based conversions are used to obtain the final black and white document: a threshold applied to the grayscale transformation of the image; a threshold applied to the K component of the CMYK color model; a threshold applied to each component of the RGB color model, using the minimum average intensity of the channels; and a variable thresholding technique based on the hue component. The least accurate intermediary conversion is considered to be the one that outputs the fewest black pixels. Starting from this conversion, the proposed technique recursively adds undetected foreground pixels, provided that they are detected in one of the other intermediary images and neighbor a foreground pixel that has already been detected. The final conversion is obtained when no more pixels can be classified as foreground by this rule.
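
The merging step described above can be sketched as a region-growing pass over the intermediary conversions. The sketch below assumes the conversions are already available as boolean masks and that "neighboring" means 8-connectivity; the name `merge_conversions` is hypothetical, and the four thresholding steps themselves are not reproduced.

```python
from collections import deque
from typing import List

Mask = List[List[bool]]  # True marks a foreground (black) pixel


def merge_conversions(masks: List[Mask]) -> Mask:
    """Grow the sparsest conversion using pixels found in the other conversions."""
    h, w = len(masks[0]), len(masks[0][0])
    # Seed: the conversion that outputs the fewest black pixels.
    seed = min(masks, key=lambda m: sum(sum(row) for row in m))
    # Candidates: pixels detected in at least one intermediary conversion.
    union = [[any(m[y][x] for m in masks) for x in range(w)] for y in range(h)]
    result = [row[:] for row in seed]
    queue = deque((y, x) for y in range(h) for x in range(w) if result[y][x])
    while queue:
        y, x = queue.popleft()
        # Assumption: 8-connected neighbourhood.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and union[ny][nx] and not result[ny][nx]):
                    result[ny][nx] = True
                    queue.append((ny, nx))
    return result


safe = [[False, True, False, False, False, False]]
noisy = [[True, True, True, False, True, False]]
merged = merge_conversions([safe, noisy])
```

In this one-row example the stroke around the seed pixel is completed from the second mask, while the isolated pixel at index 4, which touches no confirmed foreground, is left out as noise.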

Experiments were performed to compare the multi-threshold method with both adaptive and global thresholding. The noise level was noticeably lower with the proposed method, because the threshold levels of the intermediary conversions were chosen to be "safe": each intermediary conversion contains as little noise as possible, even if some relevant information is lost, since the missing information is recovered from the other intermediary versions. In this way the final black and white document is assembled like a puzzle, using pieces from the previously computed intermediary images, in which the level of undesired pixels is very low.

4. ELECTRONIC DOCUMENTS CONVERSION

In the case of electronic documents (such as PDFs and online newspapers), the focus of the binarization process shifts from aspects like contrast or noise to translating the colour combinations of texts, images and backgrounds into a meaningful black and white version. To this end, three different approaches to the black and white transformation of electronic documents have been tried.

The first method is the "outside-in" local technique. This algorithm starts from the outer white background and converts everything that neighbours it to black; then everything that neighbours the previously detected area is set to white, and so on until the whole document is processed. This approach works better than the threshold-based global ones, since it avoids the disappearance of letters printed in colours that are scarce in the picture. In addition, this technique removes the risk of errors caused by the same colour playing the role of both background and foreground in different parts of the document.
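
One possible reading of the "outside-in" rule is that a pixel's output depends on how many colour boundaries separate it from the outer background: an even count gives white, an odd count gives black. The sketch below computes that count with a 0-1 breadth-first search. It assumes that the top-left pixel belongs to the outer background, that regions are 4-connected and that colours are compared for exact equality; none of these details are specified in the text, and `outside_in` is a hypothetical name.

```python
from collections import deque
from typing import List, Tuple

Colour = Tuple[int, int, int]


def outside_in(img: List[List[Colour]]) -> List[List[int]]:
    """Return 0 (white) or 1 (black) per pixel from the parity of its nesting depth."""
    h, w = len(img), len(img[0])
    unreached = h * w + 1
    depth = [[unreached] * w for _ in range(h)]
    depth[0][0] = 0  # assumption: the top-left pixel is outer background
    queue = deque([(0, 0)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                # Staying inside a region is free; crossing a colour change costs 1.
                step = 0 if img[ny][nx] == img[y][x] else 1
                if depth[y][x] + step < depth[ny][nx]:
                    depth[ny][nx] = depth[y][x] + step
                    if step == 0:
                        queue.appendleft((ny, nx))
                    else:
                        queue.append((ny, nx))
    return [[d % 2 for d in row] for row in depth]


W, R, B = (255, 255, 255), (200, 0, 0), (0, 0, 200)
page = [[W, W, W, W, W],
        [W, R, R, R, W],
        [W, R, B, R, W],
        [W, R, R, R, W],
        [W, W, W, W, W]]
converted = outside_in(page)
```

Here the red frame becomes black and the blue pixel inside it becomes white, regardless of how frequent either colour is in the page as a whole.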

The "alternate colours" approach is also a local conversion method, which makes the binarization decision along horizontal scan lines. This iterative process takes one pixel at a time (on each individual horizontal line of the input document) and decides its final value based on the previously made decisions. Conversion to white or black is alternated each time a colour shift occurs. Whenever a white or black pixel is encountered in the original document, the output colour is automatically set to that value, regardless of the alternation rule. This algorithm manages to solve some cases in which the "outside-in" method would fail, such as bi-coloured areas in which the background and foreground are combinations of two distinct colours. A threshold is used in order to decide when a colour shift has occurred and the output colour must be changed.
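
The scan-line rule can be sketched for a single row as follows. The colour distance (sum of absolute channel differences), the choice of white as the starting output of each line and the function names are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Colour = Tuple[int, int, int]
WHITE: Colour = (255, 255, 255)
BLACK: Colour = (0, 0, 0)


def colour_distance(a: Colour, b: Colour) -> int:
    """Sum of absolute channel differences (an assumed distance measure)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def alternate_colours_row(row: List[Colour], threshold: int) -> List[int]:
    """Convert one scan line to 0 (white) / 1 (black) by alternating on colour shifts."""
    out: List[int] = []
    state = 0  # assumption: each line starts with white as the output colour
    prev: Optional[Colour] = None
    for pixel in row:
        if pixel == WHITE:
            state = 0  # pure white overrides the alternation rule
        elif pixel == BLACK:
            state = 1  # pure black overrides the alternation rule
        elif prev is not None and colour_distance(pixel, prev) > threshold:
            state = 1 - state  # colour shift detected: alternate the output
        out.append(state)
        prev = pixel
    return out


blue, yellow = (0, 0, 200), (230, 230, 0)
line = [blue, blue, yellow, yellow, blue, WHITE, yellow]
converted_line = alternate_colours_row(line, threshold=100)
```

The yellow text on a blue background comes out black on white even though neither colour is close to black or white, which is the bi-coloured case mentioned above.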

[FIGURE 2 OMITTED]

The third and final proposed method is an edge-detection-based technique. This algorithm uses a threshold to decide whether or not a pixel is part of an edge between foreground and background. In this way, the problems resulting from background variations can be reduced by adjusting the threshold. Research is still being carried out to find a way to decide which of the detected contours belong to the foreground and how they could be filled with black pixels in order to generate a more relevant conversion.
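
The text does not name the edge operator, so the sketch below simply assumes a forward-difference gradient on a grayscale image and thresholds its magnitude; `edge_mask` is a hypothetical name. It only illustrates how the threshold trades sensitivity against robustness to background variation, and it does not attempt the contour-filling step, which the text leaves open.

```python
from typing import List


def edge_mask(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Mark a pixel as an edge (1) when its local gradient exceeds `threshold`.

    Assumption: gradient strength is the sum of the absolute differences
    to the right-hand and lower neighbours (clamped at the image border).
    """
    h, w = len(gray), len(gray[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = gray[y][min(x + 1, w - 1)]
            down = gray[min(y + 1, h - 1)][x]
            strength = abs(right - gray[y][x]) + abs(down - gray[y][x])
            mask[y][x] = 1 if strength > threshold else 0
    return mask


# Slightly uneven light background on the left, dark foreground on the right.
patch = [[200, 200, 40, 40],
         [210, 210, 40, 40]]
strict = edge_mask(patch, threshold=50)
loose = edge_mask(patch, threshold=5)
```

With the higher threshold only the boundary between background and foreground is marked; with the lower one the small background variation is also reported as an edge.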

5. CONCLUSIONS

Due to the increasing interest in document content conversion, there is a need for new image binarization methods that can cope with problem areas in a large variety of documents. As the best technique depends on the desired output and on the set of images to be converted, we have presented in this paper a series of alternatives to current algorithms.

Further research must still be conducted in order to improve current methods, as well as to develop new techniques that can cope with the new challenges of modern document analysis.

6. REFERENCES

Chang, F. (2001). Retrieving information from document images: problems and solutions, International Journal on Document Analysis and Recognition, Vol. 4, No. 1, (August 2001), pp. 46-55, ISSN 1433-2833

Dobrescu, R.; Dobrescu, M.; Mocanu, S. & Taralunga, S. (2006). Development platform for parallel image processing, Proceedings of the 6th WSEAS International Conference on Signal, Speech and Image Processing, pp. 31-36, ISBN 123-4567-99-3, Portugal, September 2006, WSEAS Press, Lisbon

Fischer, S. (2000). Digital Image Processing: Skewing and Thresholding, Master of Science thesis, University of New South Wales, Sydney, Australia

Sheikh, L.M.; Hassan, I.; Sheikh, N.Z.; Bashir, R.A.; Khan, S.A. & Khan, S.S. (2005). An Adaptive Multi-Thresholding Technique for Binarization of Color Images, Proceedings of the 9th WSEAS International Conference on Computers, Article No. 104, ISBN 960-8457-29-7, Greece, July 2005, WSEAS Press, Athens

Wenzel, F. & Grigat, R.R. (2005). A Framework for Developing Image Processing Algorithms with Minimal Overhead, Proceedings of the 5th WSEAS International Conference on Signal, Speech and Image Processing, pp. 185-190, ISBN 940-4271-11-5, Greece, August 2005, WSEAS Press, Corfu Island