文章基本信息

标题：White-space detection techniques based on neighbourhood distance measurements.
作者：Boiangiu, Costin Anton ; Cananau, Dan Cristian ; Dvornic, Andrei Iulian 等
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2008
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：The idea of processing an electronic document layout is to determine the areas of interest, such as text areas, pictures, tables or many other collections of entities gathered as one based on simple mathematical conditions or characteristics (Di Zenzo et al., 1996).

White-space detection techniques based on neighbourhood distance measurements.

Boiangiu, Costin Anton ; Cananau, Dan Cristian ; Dvornic, Andrei Iulian 等

1. INTRODUCTION

The idea of processing an electronic document layout is to determine the areas of interest, such as text areas, pictures, tables or many other collections of entities gathered as one based on simple mathematical conditions or characteristics (Di Zenzo et al., 1996).

In this paper four algorithms are provided for solving the detection problem. All four are based on measurements of various distances (Breuel, 2003) in the document. The most important thing provided by the algorithm is the fact that is independent of the position in page.

2. THE SOLUTION

2.1 How to find the separators--A font size approach

First of all the white space separators surround the elements of the page such as text areas or pictures. These contain white spaces inside them, but they cannot be regarded as separators and a selection must be made (Saitoh et al., 1994).

This is the starting point for the first algorithm. It centres each pixel inside a square with a given size and checks for the percentage of white pixels inside the selected geometric figure. Then, if the percentage is close to 100% it means that the pixel is inside a white space, which will be surely a separator for letters, lines or areas. This approach will determine the spaces between lines and areas, but the most important thing is to determine the correct size for the square in order to remove undesired output. An example of the algorithm applied on a newspaper page can be seen in figure 2. As it can be seen, all the white spaces were found, using a search square of a fixed size.

[FIGURE 1 OMITTED]

So, it is clear that a better measure has to be found. First of all, the selection of the output has to satisfy a number of criteria. It needs to eliminate the space between letters and text lines and the most appropriate measure for this task is the font size (Mann, 2002).

In order to have a good result, the algorithm has to be applied either on a document with the same font size on the whole page, or to apply it on areas. The results can be seen in Fig. 1.

2.2 How to find the separators--The distances approach

Distance map:

* A distance value for each pixel;

* If p is a white pixel, the distance value equals the number of pixels between the closest two black pixels on a specified direction;

* If p is a black pixel, the distance value is 0;

Types of distance maps:

* Horizontal distance map (the horizontal direction is considered when searching for the closest two black pixels);

* Vertical distance map (the vertical direction is considered when searching for the closest two black pixels);

* Arbitrary direction distance map;

Combining two or more distance maps:

* Very restrictive:

** Final map = min(map0, map1, mapN);

* Less restrictive:

** Final map = max(map0, map1, ..., mapN);

* Advanced:

** Final map = max(alpha0 * map0, alpha1 * map1, ... , alphaN * mapN);

For what the final map is used:

* Identify a page main fragments

* Progressively lowering a threshold value, identify for each main fragment its sub-fragments

* Recursively continue this procedure until the smallest convenient fragment has been identified

How the final map is used:

* Determine a binary image with separators pixels and non-separator pixels (separator pixels will be those pixels with values above a certain threshold in the distance map; non-separator pixels will be those pixels with values below a certain threshold in the distance map).

* A set of connected pixels surrounded by page edges or separator pixels will form a fragment.

[FIGURE 2 OMITTED]

2.3. How to find the separators--Using a crosshair to improve the distance approach

As it has been presented above, the distance algorithm has a high rate of success, but it can be improved. A better approach towards the determination of white space separators has to take into consideration the orientation together with the horizontal and vertical distance between the centred white pixel and the first new black one (crosshair), found on the corresponding direction.

The algorithm computes the distances on all axes from the Cartesian system, rotated with 5 degrees to the left and 5 degrees to the right. In this manner, the white spaces will be found even if the page is not perfectly centred and has a small deflection from the normal orientation.

After computing the distances for each pixel, the maximum corresponding to both axes in an orientation is selected and assigned to the corresponding pixel. In the next step a threshold is applied successively to eliminate the white spaces found between letters and text rows. The maximum stored for each pixel is matched with the threshold and the selection is made. The final result and the steps towards achieving this result can be seen in Fig. 3.

2.4. How to find the separators--Another distance approach

The distance approach to white separator detection refers to the usage of a run-length routine that checks distances between white and black pixels on the input image (Bovik, 2000).

Step 1: The algorithm finds, for every white pixel in the image, the distances to the first black pixel on four directions, top, bottom, left and right, and inserts these values into a matrix.

Then, three threshold values are computed, adjusted to the dimensions of the input page and the distances to the nearest black pixels. These thresholds will decide if pixels will be united horizontally and/or vertically.

h = 1 - top_distance + bottom_distance/height + width (1)

w = 1 - left_distance + right_distance/height + width (2)

sum = w + h (3)

[FIGURE 3 OMITTED]

[FIGURE 4a OMITTED]

[FIGURE 4b OMITTED]

[FIGURE 4c OMITTED]

In relations (1) and (2), the formulas for the horizontal and vertical thresholds are presented, where top_distance and bottomdistance, leftdistance and rightdistance are the distances to the nearest black pixel in the respective direction, while height and width are the dimensions of the input image. Relation (3) presents the third threshold, as being the sum of the horizontal and vertical components. Based on these thresholds, the algorithm connects the black pixels in the image, thus completing the first step.

Step 2: As it can be observed in Fig. 4b, the first step of the algorithm results in an image that connects homogeneous text zones, but the zones are not totally covered (white patches within the paragraph areas may still appear). In the second step, these white patches are detected and covered (integrated into their respective paragraphs). This is done by finding the number of passes from white to black, for every white pixel, on all four directions. It is obvious that if a white pixel is surrounded by black ones in a paragraph of text, the number of passes from white to black has to be even.

Step 3: At this point, the text and white space separators are clearly distinguished, and all that remains to be done is to obtain the coordinates of these areas. This is done by applying any contour detection algorithm, and so, the end-result will be the coordinates of the paragraphs and text areas.

3. CONCLUSIONS

All of these algorithms have in a common a very useful characteristic of white spaces: the size of the space on a given axis. By taking into account this value and combining it with simple mathematical concepts, the results are a starting point for further improvements in document processing.

4. REFERENCES

Bovik, A. (2000). Handbook of Video and Image Processing, Academic Press, ISBN-10: 0-12-119792-1, New York, United States of America

Breuel, T.M. (2003). An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis, Proceedings of the Seventh International Conference on Document Analysis and Recognition, Vol. 1, pp. 66, ISBN 0-7695-1960-1, Scotland, August 2003, Edinburgh

Di Zenzo, S; Cinque, L. & Levialdi, S. (1996). Run-Based Algorithms for Binary Image Analysis and Processing, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 1, (January 1996) pp. 83-89, ISSN: 0162-8828.

Mann, S. (2002). Intelligent Image Processing, John Wiley & Sons, ISBN: 9780471221630, United States of America

Saitoh, T; Yamaai, T & Tachikawa, M. (1994). Document Image Segmentation and Layout Analysis, IEICE Trans. Information and Systems, Vol. 77, No. 7, (March 1994) pp. 778-784, ISSN: 0916-8532