Seamless image-page alignment and rectangular areas removal.
Boiangiu, Costin-Anton ; Spataru, Andrei-Cristian ; Petrescu, Serban et al.
1. INTRODUCTION
Document digitization and content conversion is a field that
encompasses many disciplines and tasks. For a digitization process to be
complete and yield satisfactory results, a number of steps have to be
performed. Generally, the first step is to subject the input image to a
preprocessing phase, where noise reduction, filtering and skew correction methods are applied. The next step is the actual content
extraction from the input image, also referred to as the hierarchical
segmentation, where the layout of the page is detected and put into a
logical context. This information is then used in combination with a
powerful OCR engine in order to obtain the information contained in the
image-page. Before offering output in a recognizable format, the
digitization process has to post-process the data obtained so far
through a number of methods, such as a dictionary check of the OCR
results, or visual standardization of the input image, also called image
beautification.
The algorithm presented in this paper is useful in two of the steps
of the digitization process: preprocessing and post-processing.
2. ISSUES ADDRESSED BY THE ALGORITHM
The rectangular area removal capability of the algorithm makes it
suitable for noise or artifact elimination on portions of the image,
while the seamless alignment capability helps in the post-processing
stage, where an image that has been scanned incorrectly (e.g. rotated)
needs to be restored.
A scanned image will also contain, besides the area of interest (the
document page), some additional elements that may be part of the
scanner, or of another origin. The area of interest has to be selected,
either manually or automatically, and cropped out of the initial image.
This operation results in a reduction of the image area.
When dealing with a digitization project, such as books for digital
libraries (Baird, 2003), the resulting images have to respect a given
standard for dimensions, translating into a new operation that has to be
performed within the post-processing phase, the image-page alignment.
The denoising process is vital to the outcome of all subsequent
operations and is highly dependent on the input (Pratt, 2001), but when
dealing with larger areas that have to be removed from the image, a
rectangular area removal tool should be used.
The following figure portrays a scenario in which only part of the
input image is needed, but the output image must keep the same size as
the input.
[FIGURE 1 OMITTED]
As can be seen in Fig. 1, only the part of the image inside
Rectangle (1) is needed, but the output has to remain at the same size
as the rest of the pages in the collection, represented by Rectangle
(2). The realization of this task is described in the following section.
3. ALGORITHM DESCRIPTION
The two distinct functionalities of the algorithm, image-page
alignment and rectangular area removal, will be detailed separately in
the next sections.
3.1 Seamless Image-Page Alignment
The image-page alignment algorithm is performed in a number of
steps and benefits from two distinct approaches when performing the
alignment, both having similar results in terms of output quality.
The first step in the algorithm is a color identification routine,
meant to obtain information about the predominant background of the
input image. The background may be "light" (the more common
situation, when a document is written "black on white"), or
"dark".
In order to make this decision, the algorithm finds all distinct
colors in the image and takes into consideration only the most common
ones. Then a threshold is applied on this set of common colors to
separate them into light and dark ones. The threshold is set at the
middle of the grayscale (50% gray); all values above the threshold are
considered light, while the values below are considered dark.
At this point the algorithm makes the decision whether the document
image is black on white (light background and dark foreground) or vice
versa. This decision is made by simply choosing the set of colors with
the most members. The set of chosen background colors will help
reconstruct the areas outside the input image by maintaining a
homogeneous and continuous hue in all regions.
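The background identification step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of "most common" colors kept (top_n), the Rec. 601 weights used to reduce a color to a gray value, and the tie-breaking rule are all assumptions, since the text does not specify them.

```python
from collections import Counter

def luminance(rgb):
    """Approximate gray value (0-255) of an (R, G, B) tuple (Rec. 601 weights, assumed)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def background_colors(pixels, top_n=16, threshold=127.5):
    """Return (kind, colors), where kind is 'light' or 'dark'.

    pixels    : iterable of (R, G, B) tuples
    top_n     : how many of the most common colors to keep (hypothetical value)
    threshold : 50% gray on a 0-255 scale
    """
    common = [color for color, _ in Counter(pixels).most_common(top_n)]
    light = [c for c in common if luminance(c) > threshold]
    dark = [c for c in common if luminance(c) <= threshold]
    # The set with the most members is taken as the background;
    # ties go to 'light' (an arbitrary choice made for this sketch).
    if len(light) >= len(dark):
        return "light", light
    return "dark", dark
```

For a typical "black on white" page, the light set contains several near-white shades while the dark set contains only a few ink colors, so the light set is selected.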
There are two alternative sub-routines available for the next step
in the algorithm, called "Random Dispersion" and
"Filtered Interpolation", both being approaches for drawing an
antialiased line. The difference in the approaches is in the way the
colors for the antialiased line are obtained.
The image is reconstructed by adding these lines to the top,
bottom, left and right areas of the input image, until the image reaches
the required size. The pixels for these antialiased lines are chosen so as
to closely resemble the pixels from a nearby location in the input
image.
When using the "Random Dispersion" approach, the
following steps are performed: a pixel P is taken from the initial image
and compared to the set of background colors obtained in the previous
step of the algorithm. The comparison is not made in RGB, but is
computed as an LUV color space distance between the operands. The reason
for using the LUV (or CIELUV) color space is that the computed distances
between colors in LUV are similar to the differences perceived by the
human eye (Fairchild, 1998). In other words, the colors are compared
based on how similar they appear to the observer, realizing a seamless
extension of the initial image. The closest background pixel is chosen
and put on the first constructed line. The procedure is repeated until
the pixels on the entire side of the image have been used.
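The LUV comparison described above can be illustrated with the standard sRGB to CIE XYZ to CIELUV conversion. The sketch below assumes sRGB input and a D65 reference white, neither of which is stated in the text; the function names are hypothetical.

```python
import math

# D65 reference white (an assumption; the illuminant is not given in the text).
XN, YN, ZN = 0.95047, 1.0, 1.08883
UN = 4 * XN / (XN + 15 * YN + 3 * ZN)
VN = 9 * YN / (XN + 15 * YN + 3 * ZN)

def _linear(c):
    """Undo the sRGB gamma curve for one 0-255 channel."""
    c /= 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def rgb_to_luv(rgb):
    """Convert an sRGB (R, G, B) tuple to CIELUV (L*, u*, v*)."""
    r, g, b = (_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    denom = x + 15 * y + 3 * z
    if denom == 0:  # pure black: chromaticity is undefined, L* is 0
        return (0.0, 0.0, 0.0)
    t = y / YN
    lightness = 116 * t ** (1 / 3) - 16 if t > (6 / 29) ** 3 else (29 / 3) ** 3 * t
    u = 13 * lightness * (4 * x / denom - UN)
    v = 13 * lightness * (9 * y / denom - VN)
    return (lightness, u, v)

def luv_distance(a, b):
    """Euclidean distance between two RGB colors, measured in LUV space."""
    return math.dist(rgb_to_luv(a), rgb_to_luv(b))

def closest_background(pixel, background):
    """Pick the background color that is perceptually nearest to pixel."""
    return min(background, key=lambda c: luv_distance(pixel, c))
```

A pixel sampled near the page edge, even if slightly tinted or darkened, is thereby replaced with the background color that looks most similar to it.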
For this approach, the algorithm uses a dispersion area when
choosing the pixel from previously constructed rows or from the input
image, so that with each added line the colors are randomly dispersed,
giving a realistic appearance to the new image sections.
[FIGURE 2 OMITTED]
In Fig. 2, rows of pixels are added to the right edge of
the input area, along the direction of growth (indicated by the arrows).
This is done by randomly taking a pixel P from the dispersion area,
comparing it to the background colors, choosing the closest background
color in the LUV space and placing it at the point P'. As rows are
added, the dispersion area grows equally on the vertical and horizontal
axes. The procedure is repeated until the entire target area is filled
with rows, in all directions.
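A possible reading of the "Random Dispersion" procedure, for growth toward the right only, is sketched below. Several details are assumptions: the dispersion radius grows by one pixel per added line up to a hypothetical cap (max_radius), and a plain squared RGB distance stands in for the LUV distance to keep the example short.

```python
import random

def rgb_dist2(a, b):
    """Squared RGB distance; a stand-in for the LUV distance used in the text."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def extend_right(image, background, new_cols, rng=None, max_radius=8):
    """Append new_cols lines of background-colored pixels to the right edge.

    image      : list of rows, each a list of (R, G, B) tuples
    background : list of background colors from the identification step
    max_radius : cap on the dispersion area (hypothetical parameter)
    """
    rng = rng or random.Random()
    out = [list(row) for row in image]
    height = len(out)
    for k in range(new_cols):
        radius = min(1 + k, max_radius)  # dispersion area grows with each line
        x_new = len(out[0])
        column = []
        for y in range(height):
            # Take a random pixel P from the dispersion area behind the new line.
            sx = rng.randint(max(0, x_new - radius), x_new - 1)
            sy = rng.randint(max(0, y - radius), min(height - 1, y + radius))
            p = out[sy][sx]
            # Place the closest background color at P'.
            column.append(min(background, key=lambda c: rgb_dist2(p, c)))
        for y in range(height):
            out[y].append(column[y])
    return out
```

Each new line is built completely before it is appended, so samples are drawn only from the input image and from previously constructed lines. The other three directions would follow the same pattern.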
When using the "Filtered Interpolation" approach, the
algorithm will obtain new rows of pixels by applying a filter of a
certain size to the previous row, in every direction. There is a wide
variety of filters that can be used (e.g. Box, Hermite, Triangle, etc.)
(Vaseghi, 2000). The pixel value obtained from the filter is then
compared, again in the LUV color space, to the background colors
detected at step one, and the closest background color is chosen.
Because this time there is no random factor to realistically extend the
image, a Perlin noise function is applied to the chosen background
color. The Perlin noise function creates a pseudo-random appearance
(similar to a cloud texture), having a great advantage over random noise
in terms of realism (Perlin, 2002).
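The "Filtered Interpolation" step can be sketched for a single new row grown below the previous one. This is an illustrative approximation: it uses a box filter of width 3, a compact 2D gradient noise written in the style of Perlin's improved noise, and a squared RGB distance in place of the LUV distance. The amplitude and scale values are hypothetical tuning parameters.

```python
import math
import random

def make_perlin(seed=0):
    """Build a 2D gradient-noise function in the style of Perlin noise."""
    rng = random.Random(seed)
    perm = list(range(256))
    rng.shuffle(perm)
    perm = perm + perm  # doubled table avoids index wrap-around checks
    grads = [(1, 1), (-1, 1), (1, -1), (-1, -1), (1, 0), (-1, 0), (0, 1), (0, -1)]

    def fade(t):
        return t * t * t * (t * (t * 6 - 15) + 10)

    def lerp(a, b, t):
        return a + t * (b - a)

    def dot(h, dx, dy):
        gx, gy = grads[h % 8]
        return gx * dx + gy * dy

    def noise(x, y):
        fx, fy = math.floor(x), math.floor(y)
        xi, yi = fx & 255, fy & 255
        dx, dy = x - fx, y - fy
        u, v = fade(dx), fade(dy)
        aa, ba = perm[perm[xi] + yi], perm[perm[xi + 1] + yi]
        ab, bb = perm[perm[xi] + yi + 1], perm[perm[xi + 1] + yi + 1]
        top = lerp(dot(aa, dx, dy), dot(ba, dx - 1, dy), u)
        bottom = lerp(dot(ab, dx, dy - 1), dot(bb, dx - 1, dy - 1), u)
        return lerp(top, bottom, v)

    return noise

def rgb_dist2(a, b):
    """Squared RGB distance; a stand-in for the LUV distance used in the text."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def box_filter_row(prev_row, size=3):
    """Average each pixel with its neighbors; edges are clamped."""
    half, n = size // 2, len(prev_row)
    out = []
    for x in range(n):
        window = [prev_row[min(max(i, 0), n - 1)] for i in range(x - half, x + half + 1)]
        out.append(tuple(sum(p[c] for p in window) / len(window) for c in range(3)))
    return out

def next_row(prev_row, background, noise, y, amplitude=6.0, scale=0.15):
    """Generate one new row: filter, snap to background, add smooth noise."""
    new = []
    for x, avg in enumerate(box_filter_row(prev_row)):
        base = min(background, key=lambda c: rgb_dist2(avg, c))
        delta = amplitude * noise(x * scale, y * scale)
        new.append(tuple(max(0, min(255, round(ch + delta))) for ch in base))
    return new
```

Because the noise varies smoothly with position, neighboring pixels receive similar offsets, which gives the cloud-like texture mentioned above rather than independent per-pixel speckle.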
After all rows have been generated, a final randomization factor is
added, in the form of a number of random permutations done between
adjacent pixels that have been generated in the previous steps. This
operation consists of obtaining three random values, for the X and Y
coordinates of the target pixel and for the permutation direction, and
switching the pixel at the (X, Y) coordinates with the adjacent pixel in
the random direction provided.
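The final randomization pass can be sketched as below. The text specifies three random values (X, Y and a direction); the choice of four axis-aligned directions and the restriction of swaps to the generated region are assumptions made for this example.

```python
import random

# Right, left, down, up (the set of directions is an assumption).
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def random_swaps(image, region, count, rng=None):
    """Swap randomly chosen generated pixels with a random neighbor, in place.

    image  : list of rows, each a list of (R, G, B) tuples
    region : (x0, y0, x1, y1), half-open bounds of the generated area
    count  : number of permutations to attempt
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = region
    for _ in range(count):
        x = rng.randrange(x0, x1)
        y = rng.randrange(y0, y1)
        dx, dy = rng.choice(DIRECTIONS)
        nx, ny = x + dx, y + dy
        # Skip swaps that would reach outside the generated area,
        # so that the original input pixels are never moved.
        if x0 <= nx < x1 and y0 <= ny < y1:
            image[y][x], image[ny][nx] = image[ny][nx], image[y][x]
    return image
```

Since pixels are only exchanged, the overall color distribution of the generated area is unchanged; only its local arrangement is perturbed.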
3.2 Rectangular Area Removal
As in the image-page alignment algorithm, the set of predominant
background colors is obtained. The pixels forming the boundary of the
given rectangular area will be used to generate an inner boundary, by
comparing each pixel to the set of background colors and choosing the
closest match in the LUV color space.
Using this newly generated boundary, every pixel (inside the
boundary) will receive a new RGB value based on a weighted average with
respect to the boundary pixels. This color is then converted into the
LUV space and matched against members of the background color set,
choosing the closest match. Then, as in the "Filtered
Interpolation" approach, Perlin noise is added to the pixel. The
sequence is repeated for every pixel contained inside the rectangular
boundary and, as a final step, a number of random permutations are
performed.
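The core of the rectangular area removal can be sketched as follows. The weighting scheme is not given in the text, so inverse squared distance to each boundary pixel is assumed here. The sketch also treats the outline of the rectangle as the boundary, uses a squared RGB distance in place of the LUV distance, and omits the Perlin noise and the final permutations for brevity.

```python
def rgb_dist2(a, b):
    """Squared RGB distance; a stand-in for the LUV distance used in the text."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fill_rectangle(image, rect, background):
    """Replace the contents of rect with background-colored pixels, in place.

    image      : list of rows, each a list of (R, G, B) tuples
    rect       : (x0, y0, x1, y1), inclusive bounds of the area to remove
    background : list of background colors from the identification step
    """
    x0, y0, x1, y1 = rect

    def snap(color):
        return min(background, key=lambda c: rgb_dist2(color, c))

    # Build the boundary: every outline pixel snapped to a background color.
    boundary = {}
    for x in range(x0, x1 + 1):
        for y in (y0, y1):
            boundary[(x, y)] = snap(image[y][x])
    for y in range(y0, y1 + 1):
        for x in (x0, x1):
            boundary[(x, y)] = snap(image[y][x])
    for (x, y), color in boundary.items():
        image[y][x] = color

    # Each interior pixel becomes a weighted average of the boundary pixels,
    # then is matched against the background set.
    for y in range(y0 + 1, y1):
        for x in range(x0 + 1, x1):
            weight_sum = 0.0
            acc = [0.0, 0.0, 0.0]
            for (bx, by), color in boundary.items():
                w = 1.0 / ((bx - x) ** 2 + (by - y) ** 2)  # assumed weighting
                weight_sum += w
                for i in range(3):
                    acc[i] += w * color[i]
            image[y][x] = snap(tuple(a / weight_sum for a in acc))
    return image
```

Interior pixels never coincide with boundary pixels, so the distance in the weight is always nonzero. The cost is proportional to the rectangle's area times its perimeter, which is acceptable for a sketch but would likely need optimizing for large areas.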
3.3 Results
Fig. 3 shows the results of the algorithms. It can
be seen that by using only colors from the background to generate new
rows of pixels, the dark regions (foreground or noise) right on the edge
of the input area are propagated as little as possible through the
generated lines.
[FIGURE 3 OMITTED]
4. CONCLUSIONS
The methods presented in this paper are useful for many content
conversion operations, from preprocessing to post-processing. The
seamless image-page alignment algorithm represents an efficient solution
to the problem of output standardization, and makes use of some novel
techniques, such as the Perlin noise function, for achieving a realistic
output image. A modification of the algorithm provides a solution to a
different problem, removing rectangular sections from the image,
which is especially useful when cleaning noise from the page.
5. REFERENCES
Baird, H.S. (2003). "Digital Libraries and Document Image
Analysis", Proceedings of the Seventh International Conference on
Document Analysis and Recognition, Vol. 1, pp. 2-15, ISBN 0-7695-1960-1,
Edinburgh, Scotland, August 2003
Fairchild, M. D. (1998). Color Appearance Models. Addison-Wesley,
ISBN 0-201-63464-3, Reading, MA, USA
Perlin, K. (2002). "Improving Noise", Proceedings of the
29th annual conference on Computer graphics and interactive techniques,
pp. 681-682, ISSN 0730-0301, San Antonio, Texas, USA, 2002
Pratt, W. K. (2001). Digital Image Processing: PIKS Inside, Third
Edition, John Wiley & Sons, Inc., ISBN 0-471-22132-5, New York, NY,
USA
Vaseghi, S. V. (2000). Advanced Digital Signal Processing and Noise
Reduction, Second Edition, John Wiley & Sons Ltd., ISBN
0-470-84162-1, New York, NY, USA