文章基本信息

标题：Systematic printing space recognition.
作者：Boiangiu, Costin-Anton ; Petrescu, Serban Barbu ; Moldoveanu, Alin Dragos Bogdan 等
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2010
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：Playing a major role in high volume document digitization (Yacoub et al., 2005), print space recognition is very important both from the qualitative point of view, as well as from the overall time-performance and costs of the digitization process perspective (Di Zenzo et al., 1996). As shown by the tests which have been performed, the technique shown in this paper is suitable for most types of input documents.
关键词：Imaging;Imaging systems;Optical scanners;Scanning devices

Systematic printing space recognition.

Boiangiu, Costin-Anton ; Petrescu, Serban Barbu ; Moldoveanu, Alin Dragos Bogdan 等

1. INTRODUCTION

Playing a major role in high volume document digitization (Yacoub et al., 2005), print space recognition is very important both from the qualitative point of view, as well as from the overall time-performance and costs of the digitization process perspective (Di Zenzo et al., 1996). As shown by the tests which have been performed, the technique shown in this paper is suitable for most types of input documents.

Further research will be aimed at improving the proposed algorithm by constraining the detection of fixed-dimensional cropping frames for a series of related scanned documents. Moreover, an extension of the algorithm which takes into consideration the particularities of high-quality color digital images of documents (as opposed to greyscale) will be studied.

2. ALGORITHM OUTLINE

Scanning books and documents flat against the glass of a scanner cannot be a viable solution for many document digitization projects (Breuel, 2003). Many documents are quite fragile and pressing them firmly in order to ensure that the entire page is flat on the glass of a scanner might lead to further, unacceptable damage. Furthermore, many books, particularly textbooks, have gutters (the inner margins next to the book's spine) very close to the binding. (Zhang & Tan, 2005). This makes it very difficult to scan the pages without unbinding the book (so the pages are separated and then can lie flat on the scanning glass).

[FIGURE 1 OMITTED]

As cutting off of the bindings of books and magazines was not an option for very old and uncommon books (Zheng et al., 2003), document-digitization companies had to turn to software driven machines and robots, developed to automatically scan books without the need of disbinding them (Kirtas, ATIZ, ScanRobot SR301, etc.). This type of machines allow for both the content and the state of a document to be preserved, by capturing high-quality digital images of each individual page. The proposed algorithm is trying to achieve print space recognition and cropping of digital images of documents, obtained using automatic scanning machines.

The first step of the algorithm is the removal of outer noise (such as scanning-machine's details or exterior margins outside the document in focus). This is performed by computing the sum of the grayscale values for each individual column/row, and then applying two thresholds as follows: all lines/columns having the sum of grayscale values lower than 60%/30% from the maximum sum of grayscale values for an individual line/column are filled with black (all pixels are turned to black).

After the first stage has been completed, a score is computed for each individual line/column by adding up the grayscale values of all pixels. These sums are altered by adding an extra value of 200 for each pixel with value greater than 200 in order to increase the score-gap between line/columns intersecting text and lines/column representing white spaces. This gap needs to be high in order to cope with brightness variations in input documents (resulting from the position of light sources in automatic scanning devices for example).

The left/rightmost margins of the document are investigated in order to determine whether the document is a single/double page one. For example, if the right margin is made a series of black columns (while the left margin is not), the document is classified as a single page one, situated on the left side. Similarly, images can be classified as single page documents sittuated on the right or double-page documents.

[FIGURE 2 OMITTED]

The upper/lower bounds of the cropping rectangle are determined by scanning the document from top/bottom and stopping whenever a line with a score higher than 30% of the maximum score of all lines is detected. In order to cope with a possible slight inclination of the document in the image, an extra correction of:

0.005 * (bottom bound--top bound) (1)

is added/subtracted to/from the previously identified cropping.

If the input document contains a single page, the following method is used in order to determine the vertical cropping bounds. Suppose the page is situated on the right. Starting from the right margin, the image is scanned column by column until a column with a score higher than 90% from the maximum score of all columns is found--this is set as the right cropping bound. The scan than continues either:

* for further 10% of the image's width

* until a column with a very low score is encountered. If condition of the second case is met, the scan is restarted from the current index (in this case there is a very high probability that part of the gutter of the book was encountered). Otherwise the right cropping bound remains unchanged. The same process is then repeated correspondingly in order to detect the left cropping bound, but without the second (gutter) condition. For images containing two pages, the bound detection is performed similarly.

For single page documents the final cropping is performed bounds detected so far. For the double page case, the gutter detection algorithm presented in the following section is performed on an intermediary version of the input image, obtained by cropping the input document using the cropping bounds detected so far.

3. DOUBLE-PAGE GUTTER DETECTION

If the scanned document represents a double-page, it has to be split in order to crop the two pages individually. In order to do this, the gutter of the book/magazine from between the two distinct pages must be detected and the cropping must be performed at the left/right of the gutter.

Gutter detection is performed using the histogram of the sum of grayscale values of pixels for each individual column. This histogram is first processed using a triangle filter in order to remove some of the local minimum/maximum which is the result of randomness (for example a column with a very large sum of values compared to its neighbours, within a compact paragraph). As it is trivial that the gutter is somewhere around the center of the image, we will only consider the part of the histogram 20% percent left to 20% right of the middle of the histogram. For this area we will compute the set of local minima with values lower than the mid-histogram value (the arithmetic mean between the maximum and the minimum values of the entire histogram); one of them is to be used in order to detect the gutter.

Filtering the set of local minima (from the part of the histogram corresponding to the middle of the image's width) is done using the following criteria: for each local minimum a pair of numbers (GI1, GI2) called the gutter-interval is computed as follows:

* GI1 represents the index in the histogram of the first pixel left of the considered minimum, which has a histogram value higher than the mid-histogram value (the arithmetic mean between the maximum and the minimum values of the entire histogram)

* GI2 is similar to GI1, but computed at right of the local

From the set of distinct pairs (GI1, GI2), we choose one as the gutter by selecting the pair with the largest gutter score:

GS = 1/avVal * iLength (2)

[FIGURE 3 OMITTED]

where:

* avVal is the average histogram value from GI1 to GI2;

* iLength is GI2-GI1.

Finally, the two final distinct pages are used by cropping the document over the columns (0: GI1) for the left page and (GI2: imageWidth-1) for the right page.

4. CONCLUSION

In this paper we have shown an effective and practical method of cropping the print space from a digital image of a document. This algorithm is an essential pre-processing stage in the process of document digitization, and its success is critical for the content recognition and conversion methods that follow.

5. ACKNOWLEDGEMENTS

The research presented in this paper is supported by the national project "Excelenta in cercetare prin programe postdoctorale in domenii prioritare ale societatii bazate pe cunoastere (EXCEL)", Project POSDRU/89/1.5/S/62557.

6. REFERENCES

Breuel, T.M. (2003). "An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis", Proceedings of the Seventh International Conference on Document Analysis and Recognition, Vol. 1, pp. 66, ISBN 0-7695-1960-1, Scotland, August 2003, Edinburgh

Di Zenzo, S; Cinque, L. & Levialdi, S. (1996). "Run-Based Algorithms for Binary Image Analysis and Processing", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 1, January 1996, pp. 83-89, ISSN: 0162-8828

Yacoub, S.; Saxena, V. & Sami, S. N. (2005). PerfectDoc: A Ground Truthing Environment for Complex Documents, Proceedings of International Conference on Document Analysis and Recognition, pp. 452-456, ISBN ISBN: 07695-2420-6, Seoul, August 2005

Zhang L. & Tan, C. L. (2005). Warped Image Restoration with Applications to Digital Libraries, Proceedings of International Conference on Document Analysis and Recognition, pp. 192-196, ISBN ISBN: 0-7695-2420-6, Seoul, August 2005

Zheng, Y; Li, H. & Doermann, D (2003). "A Model-based Line Detection Algorithm in Documents", Proceedings of the Seventh International Conference on Document Analysis and Recognition, Vol. 1, pp. 44-48, ISBN 0-7695-1960-1, Scotland, August 2003, Edinburgh