Article information

Title: Seamless image-page alignment and rectangular areas removal.
Authors: Boiangiu, Costin-Anton; Spataru, Andrei-Cristian; Petrescu, Serban; et al.
Journal: Annals of DAAAM & Proceedings
Print ISSN: 1726-9679
Publication year: 2009
Issue: January
Language: English
Publisher: DAAAM International Vienna
Abstract: Document digitization and content conversion is a field that encompasses many disciplines and tasks. For a digitization process to be complete and yield satisfactory results, a number of steps have to be performed. Generally, the first step is to subject the input image to a preprocessing phase, where noise reduction, filtering and skew correction methods are applied. The next step is the actual content extraction from the input image, also referred to as the hierarchical segmentation, where the layout of the page is detected and put into a logical context. This information is then used in combination with a powerful OCR engine in order to obtain the information contained in the image-page. Before offering output in a recognizable format, the digitization process has to post-process the data obtained so far through a number of methods, such as a dictionary check of the OCR results, or visual standardization of the input image, also called image beautification.
Keywords: Image processing

1. INTRODUCTION

Document digitization and content conversion is a field that encompasses many disciplines and tasks. For a digitization process to be complete and yield satisfactory results, a number of steps have to be performed. Generally, the first step is to subject the input image to a preprocessing phase, where noise reduction, filtering and skew correction methods are applied. The next step is the actual content extraction from the input image, also referred to as the hierarchical segmentation, where the layout of the page is detected and put into a logical context. This information is then used in combination with a powerful OCR engine in order to obtain the information contained in the image-page. Before offering output in a recognizable format, the digitization process has to post-process the data obtained so far through a number of methods, such as a dictionary check of the OCR results, or visual standardization of the input image, also called image beautification.

The algorithm presented in this paper is useful in two of the steps of the digitization process: preprocessing and post-processing.

2. ISSUES ADDRESSED BY THE ALGORITHM

The rectangular area removal capability of the algorithm makes it suitable for eliminating noise or artifacts from portions of the image, while the seamless alignment capability helps in the post-processing stage, where an image that has been scanned incorrectly (e.g. rotated) needs to be restored.

A scanned image will also contain, besides the area of interest (the document page), some additional elements that may be part of the scanner or come from another source. The area of interest has to be selected, either manually or automatically, and cropped out of the initial image. This operation results in a reduction of the image area.

In a digitization project, such as books for digital libraries (Baird, 2003), the resulting images have to conform to a given standard for dimensions. This introduces a further operation in the post-processing phase: the image-page alignment.

The denoising process is vital to the outcome of all subsequent operations and is highly dependent on the input (Pratt, 2001), but when dealing with larger areas that have to be removed from the image, a rectangular area removal tool should be used.

The following figure shows a scenario in which only part of the input image is needed, while the output image must keep the same size as the input.

[FIGURE 1 OMITTED]

As can be seen in Fig. 1, only the part of the image inside Rectangle (1) is needed, but the output has to keep the same size as the rest of the pages in the collection, represented by Rectangle (2). How this is achieved is described in the following section.

3. ALGORITHM DESCRIPTION

The two distinct functionalities of the algorithm, image-page alignment and rectangular area removal, will be detailed separately in the next sections.

3.1 Seamless Image-Page Alignment

The image-page alignment algorithm runs in several steps and offers two distinct approaches for performing the alignment, both of which produce output of similar quality.

The first step in the algorithm is a color identification routine, meant to obtain information about the predominant background of the input image. The background may be "light" (the more common situation, when a document is written "black on white") or "dark".

In order to make this decision, the algorithm finds all distinct colors in the image and takes into consideration only the most common ones. Then a threshold is applied on this set of common colors to separate them into light and dark ones. The threshold is set at the middle of the grayscale (50% gray); all values above the threshold are considered light, while the values below are considered dark.

At this point the algorithm makes the decision whether the document image is black on white (light background and dark foreground) or vice versa. This decision is made by simply choosing the set of colors with the most members. The set of chosen background colors will help reconstruct the areas outside the input image by maintaining a homogeneous and continuous hue in all regions.
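The background decision described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the number of "most common" colors kept (top_n) and the Rec. 601 weights used to map a color to a gray level are assumptions, since the paper does not specify them.

```python
from collections import Counter


def gray_level(rgb):
    """Approximate gray level (0-255) of an RGB triple (Rec. 601 weights, assumed)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def detect_background(pixels, top_n=16, threshold=127.5):
    """Classify the background as "light" or "dark".

    pixels: iterable of (r, g, b) tuples.
    top_n: how many of the most frequent colors to keep (hypothetical value).
    threshold: 50% gray on a 0-255 scale.
    Returns the label and the list of colors in the winning set.
    """
    common = [color for color, _ in Counter(pixels).most_common(top_n)]
    light = [c for c in common if gray_level(c) > threshold]
    dark = [c for c in common if gray_level(c) <= threshold]
    # The set with the most members wins; a tie is resolved as "light" (assumed).
    if len(light) >= len(dark):
        return "light", light
    return "dark", dark
```

Note that the vote counts distinct colors in each set, not pixels, following the wording "the set of colors with the most members".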

There are two alternative sub-routines available for the next step in the algorithm, called "Random Dispersion" and "Filtered Interpolation", both being approaches for drawing an antialiased line. The difference in the approaches is in the way the colors for the antialiased line are obtained.

The image is reconstructed by adding these lines to the top, bottom, left and right areas of the input image, until the image reaches the required size. The pixels for these antialiased lines are chosen so as to closely resemble the pixels from a nearby location in the input image.

When using the "Random Dispersion" approach, the following steps are performed: a pixel P is taken from the initial image and compared to the set of background colors obtained in the previous step of the algorithm. The comparison is not made in RGB, but is computed as an LUV color space distance between the operands. The reason for using the LUV (or CIELUV) color space is that the computed distances between colors in LUV are similar to the differences perceived by the human eye (Fairchild, 1998). In other words, the colors are compared based on how similar they appear to the observer, realizing a seamless extension of the initial image. The closest background pixel is chosen and put on the first constructed line. The procedure is repeated until the pixels on the entire side of the image have been used.
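The LUV comparison can be illustrated with a short sketch. It assumes the input pixels are 8-bit sRGB values and that the reference white is D65; the paper states neither, and the function names are hypothetical. The conversion goes from sRGB to linear RGB, then to CIE XYZ, then to CIELUV, and the distance is the Euclidean distance between the resulting triples.

```python
import math

# D65 reference white (an assumption; the paper does not name the illuminant).
XN, YN, ZN = 0.95047, 1.0, 1.08883


def _linear(channel):
    """Undo the sRGB transfer curve for one 0-255 channel."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def rgb_to_luv(rgb):
    """Convert an 8-bit sRGB triple to CIELUV (L*, u*, v*)."""
    r, g, b = (_linear(c) for c in rgb)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    denom = x + 15 * y + 3 * z
    if denom == 0:
        return (0.0, 0.0, 0.0)  # pure black: chromaticity is undefined
    u_prime, v_prime = 4 * x / denom, 9 * y / denom
    white_denom = XN + 15 * YN + 3 * ZN
    un_prime, vn_prime = 4 * XN / white_denom, 9 * YN / white_denom
    yr = y / YN
    if yr > (6 / 29) ** 3:
        lightness = 116 * yr ** (1 / 3) - 16
    else:
        lightness = (29 / 3) ** 3 * yr
    return (lightness,
            13 * lightness * (u_prime - un_prime),
            13 * lightness * (v_prime - vn_prime))


def luv_distance(a, b):
    """Euclidean distance between two sRGB colors, measured in CIELUV."""
    return math.dist(rgb_to_luv(a), rgb_to_luv(b))


def closest_background(pixel, background):
    """Return the background color that is perceptually closest to pixel."""
    return min(background, key=lambda c: luv_distance(pixel, c))
```

For example, closest_background((200, 200, 200), [(255, 255, 255), (0, 0, 0)]) selects white, because a light gray lies much nearer to white than to black on the lightness axis.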

For this approach, the algorithm uses a dispersion area when choosing the pixel from previously constructed rows or from the input image, so that with each added line the colors are randomly dispersed, giving a realistic appearance to the new image sections.

[FIGURE 2 OMITTED]

In Fig. 2, rows of pixels are added to the right edge of the input area, along the direction of growth (indicated by the arrows). This is done by randomly taking a pixel P from the dispersion area, comparing it to the background colors, choosing the closest background color in the LUV space and placing it at the point P'. As rows are added, the dispersion area grows equally along the vertical and horizontal axes. The procedure is repeated until the entire target area is filled with rows, in all directions.
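A possible reading of the "Random Dispersion" step, for growth towards the right edge only, is sketched below. Several details are assumptions made for illustration: the names extend_right, rgb_dist2 and max_radius are hypothetical, the dispersion radius grows by one pixel per added line up to a cap, and a plain squared RGB distance stands in for the LUV distance used in the paper to keep the sketch short.

```python
import random


def rgb_dist2(a, b):
    """Squared RGB distance; a stand-in for the paper's LUV distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def extend_right(image, background, extra_cols, rng=None, max_radius=4):
    """Return a copy of image with extra_cols new columns on the right.

    image: list of rows, each a list of (r, g, b) tuples.
    background: list of background colors detected beforehand.
    """
    rng = rng or random.Random()
    out = [list(row) for row in image]
    height = len(out)
    for step in range(extra_cols):
        width = len(out[0])
        # The dispersion area grows with each added line (capped, assumed).
        radius = min(step + 1, max_radius)
        new_col = []
        for y in range(height):
            # Pick a source pixel P at random from the dispersion area,
            # which covers the input image and previously built lines.
            sy = min(max(y + rng.randint(-radius, radius), 0), height - 1)
            sx = max(width - 1 - rng.randint(0, radius), 0)
            p = out[sy][sx]
            # Place the closest background color at the new position P'.
            new_col.append(min(background, key=lambda c: rgb_dist2(p, c)))
        for y in range(height):
            out[y].append(new_col[y])
    return out
```

Because every new pixel is drawn from the background set, a dark foreground pixel sitting on the edge of the input is never copied into the extension; at most it pulls the choice towards the darker background colors.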

When using the "Filtered Interpolation" approach, the algorithm will obtain new rows of pixels by applying a filter of a certain size to the previous row, in every direction. There is a wide variety of filters that can be used (e.g. Box, Hermite, Triangle, etc.) (Vaseghi, 2000). The pixel value obtained from the filter is then compared, again in the LUV color space, to the background colors detected at step one, and the closest background color is chosen. Because this time there is no random factor to realistically extend the image, a Perlin noise function is applied to the chosen background color. The Perlin noise function creates a pseudo-random appearance (similar to a cloud texture), having a great advantage over random noise in terms of realism (Perlin, 2002).
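The "Filtered Interpolation" step could look roughly like the sketch below, which builds one new line from the previous one. It is an illustration under stated assumptions: a box filter of radius 1 is used, the gradient noise is a compact two-dimensional Perlin-style function written for this example (with the fade curve from Perlin, 2002), the amplitude and scale values are hypothetical, and a squared RGB distance replaces the LUV distance for brevity.

```python
import math
import random


def make_noise2d(seed=0, size=256):
    """Build a 2D Perlin-style gradient noise function (illustrative)."""
    rng = random.Random(seed)
    perm = list(range(size))
    rng.shuffle(perm)
    angles = [rng.uniform(0, 2 * math.pi) for _ in range(size)]
    grads = [(math.cos(a), math.sin(a)) for a in angles]

    def fade(t):
        return t * t * t * (t * (t * 6 - 15) + 10)

    def grad_dot(ix, iy, dx, dy):
        gx, gy = grads[perm[(perm[ix % size] + iy) % size]]
        return gx * dx + gy * dy

    def noise(x, y):
        x0, y0 = math.floor(x), math.floor(y)
        dx, dy = x - x0, y - y0
        u, v = fade(dx), fade(dy)
        n00 = grad_dot(x0, y0, dx, dy)
        n10 = grad_dot(x0 + 1, y0, dx - 1, dy)
        n01 = grad_dot(x0, y0 + 1, dx, dy - 1)
        n11 = grad_dot(x0 + 1, y0 + 1, dx - 1, dy - 1)
        top = n00 + u * (n10 - n00)
        bottom = n01 + u * (n11 - n01)
        return top + v * (bottom - top)

    return noise


def box_filter(row, x, radius=1):
    """Average the pixels of row in a window centered on x."""
    window = row[max(0, x - radius):min(len(row), x + radius + 1)]
    return tuple(sum(p[i] for p in window) / len(window) for i in range(3))


def next_line(prev, background, line_index, noise, amplitude=6.0, scale=0.15):
    """Generate one new line of pixels from the previous line."""
    out = []
    for x in range(len(prev)):
        filtered = box_filter(prev, x)
        # Snap the filtered value to the closest background color.
        base = min(background,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(filtered, c)))
        # Perturb the chosen color with smooth noise, clamped to 0-255.
        delta = amplitude * noise(x * scale, line_index * scale)
        out.append(tuple(min(255, max(0, round(ch + delta))) for ch in base))
    return out
```

Calling next_line repeatedly, each time passing the line it just produced and an increasing line_index, yields the successive lines of the extension.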

After all rows have been generated, a final randomization factor is added, in the form of a number of random permutations done between adjacent pixels that have been generated in the previous steps. This operation consists of obtaining three random values, for the X and Y coordinates of the target pixel and for the permutation direction, and switching the pixel at the (X, Y) coordinates with the adjacent pixel in the random direction provided.
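The final randomization can be sketched as a series of swaps between neighboring pixels. The paper does not say how many directions are allowed or what happens when the neighbor falls outside the generated area; this sketch assumes four directions and simply skips such swaps. The function name and the region convention are hypothetical.

```python
import random

# Four-connected neighborhood (an assumption; eight directions would also fit).
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def permute_adjacent(image, region, count, rng=None):
    """Swap random pairs of adjacent pixels inside region, in place.

    image: list of rows, each a list of pixel values.
    region: (x0, y0, x1, y1), half-open, covering the generated pixels.
    count: number of permutation attempts.
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = region
    for _ in range(count):
        # Three random values: X, Y and the permutation direction.
        x = rng.randrange(x0, x1)
        y = rng.randrange(y0, y1)
        dx, dy = rng.choice(DIRECTIONS)
        nx, ny = x + dx, y + dy
        # Skip swaps that would reach outside the generated region.
        if x0 <= nx < x1 and y0 <= ny < y1:
            image[y][x], image[ny][nx] = image[ny][nx], image[y][x]
    return image
```

Since the operation only exchanges pixels, the set of colors in the generated area is unchanged; only their local arrangement is disturbed.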

3.2 Rectangular Area Removal

As in the image-page alignment algorithm, the set of predominant background colors is obtained first. The pixels forming the boundary of the given rectangular area are then used to generate an inner boundary, by comparing each pixel to the set of background colors and choosing the closest match in the LUV color space.

Using this newly generated boundary, every pixel (inside the boundary) will receive a new RGB value based on a weighted average with respect to the boundary pixels. This color is then converted into the LUV space and matched against members of the background color set, choosing the closest match. Then, as in the "Filtered Interpolation" approach, Perlin noise is added to the pixel. The sequence is repeated for every pixel contained inside the rectangular boundary and, as a final step, a number of random permutations are performed.
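The core of the rectangular area removal can be sketched as follows. The paper does not give the weighting scheme, so inverse squared distance to each boundary pixel is assumed here; the inner boundary is interpreted as the outline of the rectangle itself, a squared RGB distance stands in for the LUV match, and the Perlin noise and random permutation steps are left out. All names are hypothetical.

```python
def nearest(color, palette):
    """Closest palette color (squared RGB distance, standing in for LUV)."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(color, c)))


def fill_rectangle(image, rect, background):
    """Overwrite rect with background colors, in place.

    image: list of rows, each a list of (r, g, b) tuples.
    rect: (x0, y0, x1, y1), half-open.
    """
    x0, y0, x1, y1 = rect
    # Step 1: snap the outline of the rectangle to background colors.
    boundary = {}
    for y in range(y0, y1):
        for x in range(x0, x1):
            if x in (x0, x1 - 1) or y in (y0, y1 - 1):
                boundary[(x, y)] = nearest(image[y][x], background)
    for (x, y), color in boundary.items():
        image[y][x] = color
    # Step 2: each interior pixel is a weighted average of the boundary,
    # then matched against the background set.
    for y in range(y0 + 1, y1 - 1):
        for x in range(x0 + 1, x1 - 1):
            total = [0.0, 0.0, 0.0]
            weight_sum = 0.0
            for (bx, by), color in boundary.items():
                weight = 1.0 / ((x - bx) ** 2 + (y - by) ** 2)
                weight_sum += weight
                for i in range(3):
                    total[i] += weight * color[i]
            average = tuple(t / weight_sum for t in total)
            image[y][x] = nearest(average, background)
    return image
```

This direct form visits every boundary pixel for every interior pixel, which is acceptable for a sketch but would be slow on large rectangles.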

3.3 Results

Fig. 3 shows the image results of the algorithms. It can be seen that by using only colors from the background to generate new rows of pixels, the dark regions (foreground or noise) right on the edge of the input area are propagated as little as possible through the generated lines.

[FIGURE 3 OMITTED]

4. CONCLUSIONS

The methods presented in this paper are useful for many content conversion operations, from preprocessing to post-processing. The seamless image-page alignment algorithm represents an efficient solution to the problem of output standardization, and makes use of some novel techniques, such as the Perlin noise function, for achieving a realistic output image. A modification of the algorithm provides a solution to a different problem, removing rectangular sections from the image, which is especially useful when cleaning noise from the page.

5. REFERENCES

Baird, H.S. (2003). "Digital Libraries and Document Image Analysis", Proceedings of the Seventh International Conference on Document Analysis and Recognition, Vol. 1, pp. 2-15, ISBN 0-7695-1960-1, Scotland, August 2003, Edinburgh

Fairchild, M. D. (1998). Color Appearance Models. Addison-Wesley, ISBN 0-201-63464-3, Reading, MA, USA

Perlin, K. (2002). "Improving Noise", Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 681-682, ISSN 0730-0301, USA, 2002, San Antonio, Texas

Pratt, W. K. (2001). Digital Image Processing: PIKS Inside, Third Edition, John Wiley & Sons, Inc., ISBN 0-471-22132-5, New York, NY, USA

Vaseghi, S. V. (2000). Advanced Digital Signal Processing and Noise Reduction, Second Edition, John Wiley & Sons Ltd., ISBN 0-470-84162-1, New York, NY, USA