Article information

Title: Printing space recognition in document digitization projects.
Authors: Boiangiu, Costin-Anton ; Dvornic, Andrei-Iulian ; Petrescu, Serban et al.
Journal: Annals of DAAAM & Proceedings
Print ISSN: 1726-9679
Publication year: 2009
Issue: January
Language: English
Publisher: DAAAM International Vienna
Abstract: Print-space recognition plays a major role in high-volume document digitization projects. The techniques used to detect relevant areas of information in scanned images are critical both for the quality of the result and for the overall time performance and cost of the digitization process (Le Bourgeois et al., 2004). The capture of digital data varies throughout the industry, from black-and-white or grayscale flatbed scanning to photographic and digital imaging that produces high-resolution digital pictures. The first method is popular in projects involving disbound books or single-leaf documents such as monographs, journals, newspapers or manuscript pages, whereas continuous-tone true-color image capture is used mostly for pictures, books which cannot be disbound, papyrus and archaeological artifacts (Yacoub et al., 2005). This paper takes a systematic view of print space recognition and presents a new technique which addresses this problem and is suitable for most types of input documents. However, print space recognition has so far been regarded mainly as a front-end problem, and as a little too fuzzy to be addressed algorithmically because of the huge diversity in document appearance. None of the existing approaches is practical enough for modern, fully automatic scanners, because it was assumed that the scanner vendor's software would cover this aspect. There is a problem, however: the scanner software is not fully integrated into an automated content conversion pipeline, and it may prove difficult (if not impossible) to programmatically control the internal parameters of this third-party software.
Keywords: Document processing; Image processing; Printing

Printing space recognition in document digitization projects.

Boiangiu, Costin-Anton ; Dvornic, Andrei-Iulian ; Petrescu, Serban et al.

1. INTRODUCTION

Print-space recognition plays a major role in high-volume document digitization projects. The techniques used to detect relevant areas of information in scanned images are critical both for the quality of the result and for the overall time performance and cost of the digitization process (Le Bourgeois et al., 2004). The capture of digital data varies throughout the industry, from black-and-white or grayscale flatbed scanning to photographic and digital imaging that produces high-resolution digital pictures. The first method is popular in projects involving disbound books or single-leaf documents such as monographs, journals, newspapers or manuscript pages, whereas continuous-tone true-color image capture is used mostly for pictures, books which cannot be disbound, papyrus and archaeological artifacts (Yacoub et al., 2005). This paper takes a systematic view of print space recognition and presents a new technique which addresses this problem and is suitable for most types of input documents. However, print space recognition has so far been regarded mainly as a front-end problem, and as a little too fuzzy to be addressed algorithmically because of the huge diversity in document appearance. None of the existing approaches is practical enough for modern, fully automatic scanners, because it was assumed that the scanner vendor's software would cover this aspect. There is a problem, however: the scanner software is not fully integrated into an automated content conversion pipeline, and it may prove difficult (if not impossible) to programmatically control the internal parameters of this third-party software.

2. OVERVIEW OF THE ALGORITHM

For years, the only way to reduce the cost of the scanning stage of a digitization project was to cut off the bindings of books and magazines and feed the individual sheets to an automatic document feeder. While this was clearly not a desirable solution for very old and uncommon books, it remains useful for book and magazine scanning where the scanned material is easy and inexpensive to replace (Zhang & Tan, 2005; Simske & Lin, 2004). In recent years, however, most companies have turned to software-driven machines and robots that scan books automatically without the need to disbind them (Kirtas, ATIZ, ScanRobot SR301, etc.). This equipment allows both the contents of the document and a digital photo archive of its current state to be preserved. High-quality digital archive images are captured in a short time with no damage to the document (Thoma & Ford, 2002).

[FIGURE 1 OMITTED]

The proposed algorithm aims to achieve print space recognition for documents obtained from automatic scanning machines by taking into account the particularities of such devices.

2.1 Deletion of scanner details

A typical automatic scanner consists of a high-quality digital camera with light sources on either side of it, mounted on a frame that gives a person or machine easy access to flip the pages of the book. Some models include V-shaped book cradles, which support the book spine and also center the book automatically. This kind of configuration produces 24 bpp (bits per pixel) images that contain both the contents of the target document and some additional details of the device. Hence, the first step of the algorithm aims to remove most of this extra information, which is essentially noise and can interfere with the print space recognition algorithm.

The sum of the grayscale values of all pixels is computed for each individual column and row of the input image. Then, two different thresholds are used in order to filter out the mechanical details as follows:

* all lines whose sum of grayscale values is below 60% of the maximum sum of grayscale values for an individual line are removed;

* all columns whose sum of grayscale values is below 30% of the maximum sum of grayscale values for an individual column are removed.

The lines and columns which fulfill the conditions above are removed by painting them entirely black in the original image. The two threshold parameters (30% and 60%) are the result of tests on more than 500 different book pages obtained with a Kirtas scanner.
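The filtering step above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the image has already been converted to grayscale and is held as a list of rows of values in the range 0-255, and the function name `remove_scanner_details` and its parameter names are hypothetical.

```python
def remove_scanner_details(image, line_ratio=0.60, column_ratio=0.30):
    """Return a copy of `image` with low-sum lines and columns painted black.

    `image` is a list of rows, each a list of grayscale values (0-255).
    The 60% / 30% defaults are the thresholds quoted in the text.
    """
    height, width = len(image), len(image[0])

    # Sum of grayscale values for every line (row) and every column.
    line_sums = [sum(row) for row in image]
    column_sums = [sum(image[y][x] for y in range(height)) for x in range(width)]

    max_line, max_column = max(line_sums), max(column_sums)

    # Lines and columns darker than the threshold are treated as device details.
    dark_lines = {y for y, s in enumerate(line_sums) if s < line_ratio * max_line}
    dark_columns = {x for x, s in enumerate(column_sums) if s < column_ratio * max_column}

    # "Removal" means painting the whole line or column black (value 0).
    return [
        [0 if (y in dark_lines or x in dark_columns) else image[y][x]
         for x in range(width)]
        for y in range(height)
    ]


# Tiny example: a dark device strip along the top and the left edge.
page = [
    [10, 10, 10, 10],
    [10, 200, 200, 200],
    [10, 200, 200, 200],
    [10, 200, 200, 200],
]
cleaned = remove_scanner_details(page)
```

Both sums are taken from the unmodified input before anything is painted, so the line filter and the column filter do not influence each other; whether the original implementation does the same is an assumption.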

2.2 Cost computation

Using the image obtained from the previous step, an individual score (cost) is computed for each line and column in the document. Again, the sum of the grayscale values is considered, but this time the sum is altered by adding an extra 0.78% for each pixel whose grayscale value is greater than 0.39% (which means that it is quite safe to say that the pixel probably belongs to the background). This addition to the initial score is performed in order to widen the score gap between scan lines/columns which intersect the text and those which contain no foreground data. Without this computation, two different scan lines, one intersecting text and one not (but having a darker background), might end up with the same score, which could lead to serious errors in print space recognition. Because of their design and the position of their light sources, most automatic scanning devices generate images in which the background of the document is not uniform. Apart from that, in the case of old books and newspapers the original physical background is not uniform to begin with; hence this recalculation of the line and column scores is a critical stage of the algorithm.
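The scoring rule can be sketched as below. The text gives the bonus and the background cutoff as percentages without stating the reference scale, so this sketch takes both as explicit parameters instead of fixing them; the function name `line_scores` and the numbers in the usage lines are illustrative assumptions only.

```python
def line_scores(image, background_threshold, bonus):
    """Return (line_scores, column_scores) for a grayscale image.

    Each pixel contributes its grayscale value, plus `bonus` when the value
    exceeds `background_threshold` (i.e. the pixel is probably background).
    Both parameters are in grayscale units and must be chosen by the caller.
    """
    height, width = len(image), len(image[0])
    per_line = [0] * height
    per_column = [0] * width
    for y in range(height):
        for x in range(width):
            value = image[y][x]
            cost = value + (bonus if value > background_threshold else 0)
            per_line[y] += cost
            per_column[x] += cost
    return per_line, per_column


# Illustrative values only: one dark (text) pixel among bright background pixels.
sample = [
    [50, 150],
    [150, 150],
]
rows, cols = line_scores(sample, background_threshold=100, bonus=200)
plain_rows = [sum(r) for r in sample]
```

With the bonus, the line that crosses the dark pixel scores 400 against 700 for the clean line, whereas the plain sums are 200 and 300: the gap between text and background lines grows from 100 to 300.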

The maximum score is then computed for lines and for columns respectively. Afterwards, the leftmost and rightmost columns of the picture are inspected in order to detect whether the scanned page is an odd or an even page of a book/magazine. An odd page is a page on the right-hand side of an open book, and corresponds to a scanned image in which the score of the right-most column will probably be 0. This is because the right-most part of the picture contains only details of the scanning device, which should have been eliminated in the first part of the algorithm. An even page is exactly the opposite.
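A minimal sketch of this odd/even test, assuming the column scores are available as a list. The function name `page_side` and the `"unknown"` fallback for ambiguous images are additions made for illustration; the text only describes the two clear-cut cases.

```python
def page_side(column_scores):
    """Classify a scanned page from its outermost column scores.

    Returns "odd" when only the right-most column scores 0, "even" when only
    the left-most column scores 0, and "unknown" otherwise (hypothetical
    fallback, not described in the text).
    """
    left, right = column_scores[0], column_scores[-1]
    if right == 0 and left != 0:
        return "odd"
    if left == 0 and right != 0:
        return "even"
    return "unknown"


side_a = page_side([500, 700, 650, 0])   # device details removed on the right
side_b = page_side([0, 650, 700, 500])   # device details removed on the left
side_c = page_side([0, 650, 700, 0])     # both edges black: cannot decide
```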

2.3 Horizontal bounds detection

The document is scanned line by line from top to bottom until a line whose score exceeds 30% of the maximum score of all lines is found. This line is set as the upper bound of the cropping rectangle. The same procedure is repeated to detect the lower bound, this time scanning the image from the bottom upwards. A correction of:

(0.005 * (bottom bound - upper bound)) (1)

is then added to or subtracted from the two previously detected limits in order to cope with slight inclinations of the document in the scanner.
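The search for the upper and lower bounds, including correction (1), can be sketched as follows. The text does not say in which direction the correction is applied; this sketch assumes it widens the cropping rectangle (subtracted from the upper bound, added to the lower bound). The function name `horizontal_bounds` and the rounding of the correction to whole lines are also assumptions.

```python
def horizontal_bounds(line_scores, ratio=0.30, correction_factor=0.005):
    """Return (upper, lower) line indices of the cropping rectangle."""
    max_score = max(line_scores)
    if max_score <= 0:
        raise ValueError("image contains no non-black lines")
    threshold = ratio * max_score
    last = len(line_scores) - 1

    # First line from the top, then from the bottom, that exceeds the threshold.
    upper = next(y for y in range(last + 1) if line_scores[y] > threshold)
    lower = next(y for y in range(last, -1, -1) if line_scores[y] > threshold)

    # Correction (1): 0.005 * (bottom bound - upper bound), in whole lines.
    correction = round(correction_factor * (lower - upper))

    # Assumption: the correction enlarges the rectangle, clamped to the image.
    return max(0, upper - correction), min(last, lower + correction)


# 100 black lines, 400 page lines, 100 black lines.
scores = [0] * 100 + [1000] * 400 + [0] * 100
bounds = horizontal_bounds(scores)
```

In this example the raw bounds are lines 100 and 499, the correction is round(0.005 * 399) = 2 lines, and the returned bounds are (98, 501).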

2.4 Vertical bounds detection

To detect the left and right cropping bounds, two search buffers are used, whose sizes depend on the type of page. If the page is an odd one (as previously described), the left buffer is set to 10% of the page width and the right buffer to 5% of the page width. For an even page, the two values are swapped. A vertical scan line passes through the document starting from the right-most column. The first column whose score exceeds 90% of the maximum score of all columns is set as the right cropping bound. The scan then continues through the corresponding search buffer (the right buffer) until either the buffer is exhausted or a totally black column or an obstacle is found. A totally black column can result from the removal of columns representing the tab between the book's pages (which sometimes occurs during the first step of the algorithm, especially if the tab area is very dark). An obstacle is defined as a column with a very low score which is followed by columns with much higher scores. If either of these two situations (zero-score column or obstacle) is encountered, the right bound is reset to the index of that particular column.

The same process is then repeated correspondingly in order to detect the left cropping bound. Finally, the actual cropping is performed using the four detected limits.
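The column search can be sketched as below. Several details are not quantified in the text and are filled in here as labeled assumptions: the scan is taken to continue inwards (towards the page) after the first hit, a "very low" score is taken as below `low_ratio` of the maximum, and "followed by much higher scores" is taken to mean that the next `lookahead` columns all exceed the 90% threshold. The names `search_buffers`, `right_bound`, `left_bound`, `low_ratio` and `lookahead` are hypothetical.

```python
def search_buffers(side):
    """Return (left_ratio, right_ratio) of the page width for "odd"/"even"."""
    return (0.10, 0.05) if side == "odd" else (0.05, 0.10)


def right_bound(column_scores, buffer_ratio, high_ratio=0.90,
                low_ratio=0.10, lookahead=3):
    """Return the index of the right cropping bound."""
    width = len(column_scores)
    max_score = max(column_scores)
    if max_score <= 0:
        raise ValueError("image contains no non-black columns")
    high = high_ratio * max_score
    low = low_ratio * max_score          # assumed cutoff for "very low"

    # First column from the right whose score exceeds 90% of the maximum.
    bound = next(x for x in range(width - 1, -1, -1) if column_scores[x] > high)

    # Keep scanning inwards for at most `buffer_ratio * width` columns.
    stop = max(bound - int(buffer_ratio * width), 0)
    for x in range(bound - 1, stop - 1, -1):
        score = column_scores[x]
        if score == 0:                   # totally black column
            return x
        following = column_scores[max(x - lookahead, 0):x]
        if score < low and following and min(following) > high:
            return x                     # obstacle
    return bound


def left_bound(column_scores, buffer_ratio, **options):
    """Mirror of right_bound: run the same search on the reversed scores."""
    mirrored = right_bound(column_scores[::-1], buffer_ratio, **options)
    return len(column_scores) - 1 - mirrored


# Odd page: the ten right-most columns were blacked out in the first step.
left_ratio, right_ratio = search_buffers("odd")
clean = [1000] * 90 + [0] * 10
with_black = clean[:86] + [0] + clean[87:]
with_obstacle = clean[:87] + [50] + clean[88:]
```

Running the left-hand search on the mirrored score list is one simple way to reuse the same logic for both sides; the text only states that the process is repeated correspondingly.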

3. EXPERIMENTAL RESULTS

The proposed algorithm has been tested successfully on a large number of books. It can work on both grayscale and color input images and its parameters can be slightly adjusted depending on the particularities of the scanning device and/or the target document.

[FIGURE 2 OMITTED]

Further research will aim to improve the current algorithm so that, for a document (book, magazine or newspaper) represented by a series of scanned images (one for each page), the resulting cropping frames all have the same dimensions. The current implementation generates a differently sized frame for each individual page.

Apart from that, the upper and lower clips used to hold a page in place are not removed either. Improvements aimed at solving this particular problem could also be considered. Although it does not look very difficult at first sight, this problem is quite tricky, because there is often relevant information on the page at the same level as the hold-in-place clips (for example page numbers), and this data must not be lost during clip removal.

4. CONCLUSIONS

This paper presented a new algorithm for print space recognition, based on assigning a score to each individual line and column and then using scan lines to detect the cropping bounds. In the research leading to this algorithm a very large number of tests were completed, and the proposed algorithm proved successful in more than 89% of cases. As further research, we intend to extend and improve the current algorithm in order to keep pace with the constantly changing developments in the document scanning industry.

5. REFERENCES

Le Bourgeois, F.; Trinh, E.; Allier, B.; Eglin, V. & Emptoz, H. (2004). Document Image Analysis Solutions for Digital Libraries, Proceedings of First International Workshop on Document Image Analysis for Libraries, pp. 2-24, ISBN 0-7695-2088-X, Palo Alto, January 2004

Simske, S. & Lin, X. (2004). Creating Digital Libraries: Content Generation and Re-mastering, Proceedings of First International Workshop on Document Image Analysis for Libraries, pp. 33-45, ISBN 0-7695-2088-X, Palo Alto, January 2004

Thoma, G. & Ford, G. (2002). Automated Data Entry System: Performance Issues, Proceedings of SPIE Conference on Document Recognition and Retrieval IX, pp. 181-190, San Jose, January 2002

Yacoub, S.; Saxena, V. & Sami, S. N. (2005). PerfectDoc: A Ground Truthing Environment for Complex Documents, Proceedings of International Conference on Document Analysis and Recognition, pp. 452-456, ISBN 0-7695-2420-6, Seoul, August 2005

Zhang, L. & Tan, C. L. (2005). Warped Image Restoration with Applications to Digital Libraries, Proceedings of International Conference on Document Analysis and Recognition, pp. 192-196, ISBN 0-7695-2420-6, Seoul, August 2005