Printing space recognition in document digitization projects.
Boiangiu, Costin-Anton; Dvornic, Andrei-Iulian; Petrescu, Serban et al.
1. INTRODUCTION
Print-space recognition plays a major role in high volume document
digitization projects. The techniques used to detect relevant areas
of information in scanned images are critical both for the quality of
the output and for the overall time performance and cost of the
digitization process (Le Bourgeois et al., 2004). The
capture of digital data varies throughout the industry from black and
white or grayscale flatbed scanning to photographic and digital imaging
producing high resolution digital pictures. The first method is common
in projects involving disbound books or single-leaf documents
like monographs, journals, newspapers or manuscript pages, whereas
continuous-tone true color image capture is used mostly for pictures,
books which cannot be disbound, papyrus and archaeological artifacts
(Yacoub et al., 2005). This paper takes a systematic view of print
space recognition and presents a new technique which tries to solve this
issue and is suitable for most types of input documents. Until now,
however, print space recognition has been regarded mainly as a
front-end problem, and as too fuzzy to be addressed algorithmically
because of the huge diversity in document appearance. None of the
existing approaches is practical enough for modern, fully automatic
scanners, because it was assumed that the scanner vendor's software
would cover this aspect. There is a problem, however: the scanner
software is not fully integrated into an automated content conversion
pipeline, and it may prove difficult (if not impossible) to
programmatically control the internal parameters of this third-party
software.
2. OVERVIEW OF THE ALGORITHM
For years, the only way to reduce the cost of the document scanning
stage of a digitization project was to cut the bindings off books and
magazines and then feed the individual sheets to an automatic document
feeder. While this was definitely not a desirable solution for very old
and uncommon books, it remains a useful approach for book and magazine
scanning where replacing the scanned copy is easy and inexpensive
(Zhang & Tan, 2005; Simske & Lin, 2004). However, in recent years, most
companies have turned to software-driven machines and robots developed
to scan books automatically without the need to disbind them (Kirtas,
ATIZ, ScanRobot SR301, etc.). Such equipment allows both the contents
of the document and a digital photo archive of its current state to be
preserved. High quality digital archive images are captured in a
short time with no damage to the document (Thoma & Ford, 2002).
[FIGURE 1 OMITTED]
The proposed algorithm aims to achieve print space recognition for
documents obtained with automatic scanning machines, taking into
account the particularities of such devices.
2.1 Deletion of scanner details
A typical automatic scanner consists of a high quality digital camera
with light sources on either side of it, mounted on some sort of frame
that gives a person or machine easy access to flip the pages of the
book. Some models include V-shaped book cradles, which support the book
spine and also center the book automatically. This kind of
configuration results in 24BPP images that contain both the contents of
the target document and some additional details of the device. Hence,
the first step of the algorithm is aimed at removing most of this extra
information, which is basically noise and can interfere with the print
space recognition algorithm.
The sum of the grayscale values of all pixels is computed for each
individual column and line (row) of the input image. Then, two
different thresholds are used to filter out the mechanical details as
follows:
* all lines whose sum of grayscale values is lower than 60% of
the maximum sum of grayscale values for an individual line are removed;
* all columns whose sum of grayscale values is lower than 30%
of the maximum sum of grayscale values for an individual column are
removed.
The lines and columns which fulfill the conditions described above are
removed by painting them entirely black in the original image. The two
threshold parameters (30% and 60%) are the result of tests on more than
500 different book pages, obtained with a Kirtas scanner.
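The filtering step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the image has already been converted to grayscale and is stored as a list of equal-length rows with 0 meaning black, and the function name remove_scanner_details is hypothetical.

```python
def remove_scanner_details(gray, row_frac=0.60, col_frac=0.30):
    """Return a copy of `gray` with dark lines and columns painted black.

    `gray` is a list of equal-length rows of grayscale values (0 = black).
    The defaults are the 60% (lines) and 30% (columns) thresholds from the
    text; the data layout and the function name are assumptions.
    """
    # Sums are taken on the original image, before anything is blacked out.
    row_sums = [sum(row) for row in gray]
    col_sums = [sum(col) for col in zip(*gray)]

    row_limit = row_frac * max(row_sums)
    col_limit = col_frac * max(col_sums)
    dark_rows = [s < row_limit for s in row_sums]
    dark_cols = [s < col_limit for s in col_sums]

    # "Removing" a line or column means painting every pixel in it black.
    return [
        [0 if dark_rows[y] or dark_cols[x] else value
         for x, value in enumerate(row)]
        for y, row in enumerate(gray)
    ]


page = [
    [200, 200, 200, 10],
    [200, 200, 200, 10],
    [10, 10, 10, 10],
]
cleaned = remove_scanner_details(page)
```

In this toy image the last line and the last column fall below their thresholds, so both are painted black while the bright 2 x 3 area is left unchanged.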
2.2 Cost computation
Using the image obtained after the transformation from the previous
step, an individual score (cost) is computed for each line and column
in the document. Again, the sum of the grayscale values is
considered, but this time the sum is altered by adding an extra 0.78%
for each pixel whose grayscale value is greater than 0.39% (which means
that it is quite safe to say that it belongs to the background).
This addition to the initial score is performed in order to increase the
score gap between scan lines/columns which intersect the text and those
which do not contain any foreground data. Without this adjustment, two
different scan lines, one intersecting text and one not (but having a
darker background), might have the same score, and this could
lead to serious errors in print space recognition. Due to their design
and the position of the light sources, most automatic scanning
devices generate images in which the background of the documents is not
uniform. Apart from that, in the case of old books and newspapers, the
original physical background is not uniform to begin with; hence this
recalculation of the line and column scores is a critical stage of
this algorithm.
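The effect of the background bonus can be shown with a small sketch. The text gives the bonus and the background threshold as percentages without stating their reference value, so both are left as parameters here; the numbers used in the example (threshold 100, bonus 200) are illustrative assumptions, as are the function and parameter names.

```python
def line_and_column_scores(gray, background_threshold, bonus):
    """Return (line_scores, column_scores) for a grayscale image.

    Each score is the sum of the grayscale values, plus `bonus` for every
    pixel brighter than `background_threshold`. Both parameters are
    assumptions of this sketch; the text does not fix their units.
    """
    def boosted(value):
        # Pixels that are likely background receive the extra bonus.
        return value + bonus if value > background_threshold else value

    line_scores = [sum(boosted(v) for v in row) for row in gray]
    column_scores = [sum(boosted(v) for v in col) for col in zip(*gray)]
    return line_scores, column_scores


# Two lines with the same plain sum (600): a darker empty background
# and a brighter background crossed by one black text pixel.
image = [
    [150, 150, 150, 150],
    [200, 200, 200, 0],
]
line_scores, column_scores = line_and_column_scores(
    image, background_threshold=100, bonus=200)
```

Both lines have a plain grayscale sum of 600, but after the bonus the empty line scores 1400 and the line crossing text scores 1200, which is the score gap the step is meant to create.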
The maximum score for lines, and respectively for columns, is
computed. Afterwards, the leftmost and rightmost columns of the picture
are examined in order to detect whether the scanned page is an odd
or an even page of a book/magazine. An odd page is the right-hand page
of an open book, and corresponds to a scanned image in which the
right-most column score will probably be 0. This is because the
right-most part of the picture contains only the scanning device's
details, which should have been eliminated in the first step of the
algorithm. An even page is exactly the opposite.
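A minimal sketch of this page-side test is given below. The text only says that the outer column score "will probably be 0", so the handling of the ambiguous case (both or neither edge column at 0) is an assumption of this sketch, as is the function name page_side.

```python
def page_side(column_scores):
    """Classify a page as 'odd', 'even' or 'unknown' from its column scores.

    Odd (right-hand) page: the right-most column score is 0.
    Even (left-hand) page: the left-most column score is 0.
    Returning 'unknown' when the edges do not disagree is an assumption.
    """
    left_is_black = column_scores[0] == 0
    right_is_black = column_scores[-1] == 0
    if right_is_black and not left_is_black:
        return "odd"
    if left_is_black and not right_is_black:
        return "even"
    return "unknown"


side = page_side([500, 900, 900, 0])
```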
2.3 Horizontal bounds detection
The document is scanned from top to bottom, line by line, until a
line with a score higher than 30% of the maximum score of all lines is
detected. This line is set as the upper bound of the cropping rectangle.
The same procedure is repeated in order to detect the bottom bound, by
scanning the image starting from the bottom this time. A correction of:
(0.005 * (bottom bound - upper bound)) (1)
is then added/subtracted from the two previously detected limits in
order to cope with slight inclinations of the document in the scanner.
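The search for the two horizontal bounds, together with correction (1), might look like the sketch below. The text does not say in which direction the correction is applied; this sketch assumes the rectangle is widened (the upper bound moves up, the bottom bound moves down) and clamps the result to the image. The function name and the rounding of the correction to whole lines are also assumptions.

```python
def horizontal_bounds(line_scores, frac=0.30, correction=0.005):
    """Return (upper, bottom) line indices of the cropping rectangle.

    `frac` is the 30% threshold from the text and `correction` is the
    0.005 factor of equation (1). Widening the rectangle by the
    correction is an assumption; the text leaves the direction open.
    """
    best = max(line_scores)
    if best <= 0:
        raise ValueError("all line scores are zero")
    limit = frac * best
    last = len(line_scores) - 1

    # First line from the top, then from the bottom, that exceeds the limit.
    upper = next(i for i, s in enumerate(line_scores) if s > limit)
    bottom = next(i for i in range(last, -1, -1) if line_scores[i] > limit)

    # Equation (1): 0.005 * (bottom bound - upper bound), in whole lines.
    margin = round(correction * (bottom - upper))
    return max(0, upper - margin), min(last, bottom + margin)


scores = [0] * 10 + [100] * 400 + [0] * 10
bounds = horizontal_bounds(scores)
```

For this 420-line example the raw bounds are lines 10 and 409, the correction is 2 lines, and the sketch returns (8, 411).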
2.4 Vertical bounds detection
In order to detect the left and the right cropping bounds, two
search buffers are used, depending on the type of page. If the page is
an odd one (as previously described), the left buffer is set to 10% of
the page width, whereas the right buffer is set to 5% of the page width.
In the case of an even page, the values of the search buffers are
swapped. A vertical scan line is passed through the document
starting from the right-most column. Once a column with a score
higher than 90% of the maximum score of all columns is found, this
column is set as the right cropping bound. Using the corresponding
search buffer set previously (the right buffer), the scan
continues until the buffer is exhausted or until either a totally black
column or an obstacle is found. A totally black column could be the
result of the removal of some columns representing the tab between the
book's pages (which sometimes occurs during the first step of the
algorithm, especially if the tab area is very dark). An obstacle is
defined as a column with a very low score which is followed by columns
with much higher scores. If either of the two situations described
above (0-score column/obstacle) is encountered, the right bound is reset
to the index of that particular column.
The same process is then repeated correspondingly in order to
detect the left cropping bound. Finally, the actual cropping is
performed using the four detected limits.
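The vertical bound search can be sketched as one routine that is run once from each side. Several details are assumptions of this sketch, because the text does not quantify them: an obstacle is taken to be a column below 10% of the maximum score that is directly followed by a column above 50% of it, the scan is assumed to continue inward from the first detected bound, and the page width is taken to be the image width. All names are hypothetical.

```python
def find_bound(column_scores, start, step, buffer_cols,
               frac=0.90, low_frac=0.10, high_frac=0.50):
    """Scan from `start` in direction `step` (+1 or -1) and return a bound.

    `frac` is the 90% threshold from the text. `low_frac` and `high_frac`
    define an "obstacle" and are assumptions of this sketch.
    """
    n = len(column_scores)
    best = max(column_scores)

    # 1. First column whose score exceeds 90% of the maximum.
    i = start
    while 0 <= i < n and column_scores[i] <= frac * best:
        i += step
    if not 0 <= i < n:
        raise ValueError("no column exceeds the threshold")

    # 2. Keep scanning inside the search buffer for a black column
    #    or an obstacle; if one is found, the bound is reset to it.
    for k in range(1, buffer_cols + 1):
        j = i + k * step
        if not 0 <= j < n:
            break
        nxt = j + step
        black = column_scores[j] == 0
        obstacle = (column_scores[j] < low_frac * best
                    and 0 <= nxt < n
                    and column_scores[nxt] > high_frac * best)
        if black or obstacle:
            return j
    return i


def vertical_bounds(column_scores, page):
    """Return (left, right) bounds; `page` is 'odd' or 'even'."""
    width = len(column_scores)
    # Odd page: left buffer 10%, right buffer 5%; swapped for even pages.
    left_buf, right_buf = (0.10, 0.05) if page == "odd" else (0.05, 0.10)
    right = find_bound(column_scores, width - 1, -1, int(right_buf * width))
    left = find_bound(column_scores, 0, +1, int(left_buf * width))
    return left, right


# Odd page, 100 columns: a black column at index 5 near the left edge
# and 10 blacked-out scanner columns on the right.
scores = [300] * 3 + [1000] * 2 + [0] + [1000] * 84 + [0] * 10
left, right = vertical_bounds(scores, "odd")
```

In the example the left scan first stops at column 3, then meets the black column at index 5 inside the left buffer and resets the bound to it, while the right bound stays at column 89.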
3. EXPERIMENTAL RESULTS
The proposed algorithm has been tested successfully on a large
number of books. It can work on both grayscale and color input images
and its parameters can be slightly adjusted depending on the
particularities of the scanning device and/or the target document.
[FIGURE 2 OMITTED]
Further research will be aimed at improving the current algorithm
so that, for a document (book/magazine/newspaper) represented
by a series of scanned images (one for each page), the resulting
cropping frames all have the same dimensions. The current implementation
generates a different-sized frame for each individual page.
Apart from that, the upper and lower clips used to hold a page in
place are not removed. Improvements aimed at solving this particular
problem could also be taken into consideration. Although it does not
look very difficult at first sight, this problem is quite tricky, as
there is often relevant information on the page at the same level as the
hold-in-place clips (for example page numbers), and this data must not
be lost during clip removal.
4. CONCLUSIONS
This paper presented a new algorithm for print space recognition
based on assigning a score to each individual line and column, followed
by the use of scan lines to detect the cropping bounds. In the research
leading to this algorithm, a very large number of tests have been
completed, and the proposed algorithm proved to be successful in more
than 89% of cases. In further research we intend to extend and improve
the current algorithm in order to keep pace with the constantly changing
developments in the document scanning industry.
5. REFERENCES
Le Bourgeois, F.; Trinh, E.; Allier, B.; Eglin, V. & Emptoz, H.
(2004). Document Image Analysis Solutions for Digital Libraries,
Proceedings of First International Workshop on Document Image Analysis
for Libraries, pp. 2-24, ISBN 0-7695-2088-X, Palo Alto, January 2004
Simske, S. & Lin, X. (2004). Creating Digital Libraries:
Content Generation and Re-mastering, Proceedings of First International
Workshop on Document Image Analysis for Libraries, pp. 33-45, ISBN
0-7695-2088-X, Palo Alto, January 2004
Thoma, G. & Ford, G. (2002). Automated Data Entry System:
Performance Issues, Proceedings of SPIE Conference on Document
Recognition and Retrieval IX, pp. 181-190, San Jose, January 2002
Yacoub, S.; Saxena, V. & Sami, S. N. (2005). PerfectDoc: A
Ground Truthing Environment for Complex Documents, Proceedings of
International Conference on Document Analysis and Recognition, pp.
452-456, ISBN 0-7695-2420-6, Seoul, August 2005
Zhang, L. & Tan, C. L. (2005). Warped Image Restoration with
Applications to Digital Libraries, Proceedings of International
Conference on Document Analysis and Recognition, pp. 192-196, ISBN
0-7695-2420-6, Seoul, August 2005