White-space detection techniques based on neighbourhood distance measurements.
Boiangiu, Costin Anton ; Cananau, Dan Cristian ; Dvornic, Andrei Iulian 等
1. INTRODUCTION
The idea of processing an electronic document layout is to
determine the areas of interest, such as text areas, pictures, tables or
many other collections of entities gathered as one based on simple
mathematical conditions or characteristics (Di Zenzo et al., 1996).
In this paper four algorithms are provided for solving the
detection problem. All four are based on measurements of various
distances (Breuel, 2003) in the document. The most important thing
provided by the algorithm is the fact that is independent of the
position in page.
2. THE SOLUTION
2.1 How to find the separators--A font size approach
First of all the white space separators surround the elements of
the page such as text areas or pictures. These contain white spaces inside them, but they cannot be regarded as separators and a selection
must be made (Saitoh et al., 1994).
This is the starting point for the first algorithm. It centres each
pixel inside a square with a given size and checks for the percentage of
white pixels inside the selected geometric figure. Then, if the
percentage is close to 100% it means that the pixel is inside a white
space, which will be surely a separator for letters, lines or areas.
This approach will determine the spaces between lines and areas, but the
most important thing is to determine the correct size for the square in
order to remove undesired output. An example of the algorithm applied on
a newspaper page can be seen in figure 2. As it can be seen, all the
white spaces were found, using a search square of a fixed size.
[FIGURE 1 OMITTED]
So, it is clear that a better measure has to be found. First of
all, the selection of the output has to satisfy a number of criteria. It
needs to eliminate the space between letters and text lines and the most
appropriate measure for this task is the font size (Mann, 2002).
In order to have a good result, the algorithm has to be applied
either on a document with the same font size on the whole page, or to
apply it on areas. The results can be seen in Fig. 1.
2.2 How to find the separators--The distances approach
Distance map:
* A distance value for each pixel;
* If p is a white pixel, the distance value equals the number of
pixels between the closest two black pixels on a specified direction;
* If p is a black pixel, the distance value is 0;
Types of distance maps:
* Horizontal distance map (the horizontal direction is considered
when searching for the closest two black pixels);
* Vertical distance map (the vertical direction is considered when
searching for the closest two black pixels);
* Arbitrary direction distance map;
Combining two or more distance maps:
* Very restrictive:
** Final map = min(map0, map1, mapN);
* Less restrictive:
** Final map = max(map0, map1, ..., mapN);
* Advanced:
** Final map = max(alpha0 * map0, alpha1 * map1, ... , alphaN *
mapN);
For what the final map is used:
* Identify a page main fragments
* Progressively lowering a threshold value, identify for each main
fragment its sub-fragments
* Recursively continue this procedure until the smallest convenient
fragment has been identified
How the final map is used:
* Determine a binary image with separators pixels and non-separator
pixels (separator pixels will be those pixels with values above a
certain threshold in the distance map; non-separator pixels will be
those pixels with values below a certain threshold in the distance map).
* A set of connected pixels surrounded by page edges or separator
pixels will form a fragment.
[FIGURE 2 OMITTED]
2.3. How to find the separators--Using a crosshair to improve the
distance approach
As it has been presented above, the distance algorithm has a high
rate of success, but it can be improved. A better approach towards the
determination of white space separators has to take into consideration
the orientation together with the horizontal and vertical distance
between the centred white pixel and the first new black one (crosshair),
found on the corresponding direction.
The algorithm computes the distances on all axes from the Cartesian
system, rotated with 5 degrees to the left and 5 degrees to the right.
In this manner, the white spaces will be found even if the page is not
perfectly centred and has a small deflection from the normal
orientation.
After computing the distances for each pixel, the maximum
corresponding to both axes in an orientation is selected and assigned to
the corresponding pixel. In the next step a threshold is applied
successively to eliminate the white spaces found between letters and
text rows. The maximum stored for each pixel is matched with the
threshold and the selection is made. The final result and the steps
towards achieving this result can be seen in Fig. 3.
2.4. How to find the separators--Another distance approach
The distance approach to white separator detection refers to the
usage of a run-length routine that checks distances between white and
black pixels on the input image (Bovik, 2000).
Step 1: The algorithm finds, for every white pixel in the image,
the distances to the first black pixel on four directions, top, bottom,
left and right, and inserts these values into a matrix.
Then, three threshold values are computed, adjusted to the
dimensions of the input page and the distances to the nearest black
pixels. These thresholds will decide if pixels will be united
horizontally and/or vertically.
h = 1 - top_distance + bottom_distance/height + width (1)
w = 1 - left_distance + right_distance/height + width (2)
sum = w + h (3)
[FIGURE 3 OMITTED]
[FIGURE 4a OMITTED]
[FIGURE 4b OMITTED]
[FIGURE 4c OMITTED]
In relations (1) and (2), the formulas for the horizontal and
vertical thresholds are presented, where top_distance and
bottomdistance, leftdistance and rightdistance are the distances to the
nearest black pixel in the respective direction, while height and width
are the dimensions of the input image. Relation (3) presents the third
threshold, as being the sum of the horizontal and vertical components.
Based on these thresholds, the algorithm connects the black pixels in
the image, thus completing the first step.
Step 2: As it can be observed in Fig. 4b, the first step of the
algorithm results in an image that connects homogeneous text zones, but
the zones are not totally covered (white patches within the paragraph
areas may still appear). In the second step, these white patches are
detected and covered (integrated into their respective paragraphs). This
is done by finding the number of passes from white to black, for every
white pixel, on all four directions. It is obvious that if a white pixel
is surrounded by black ones in a paragraph of text, the number of passes
from white to black has to be even.
Step 3: At this point, the text and white space separators are
clearly distinguished, and all that remains to be done is to obtain the
coordinates of these areas. This is done by applying any contour detection algorithm, and so, the end-result will be the coordinates of
the paragraphs and text areas.
3. CONCLUSIONS
All of these algorithms have in a common a very useful
characteristic of white spaces: the size of the space on a given axis.
By taking into account this value and combining it with simple
mathematical concepts, the results are a starting point for further
improvements in document processing.
4. REFERENCES
Bovik, A. (2000). Handbook of Video and Image Processing, Academic
Press, ISBN-10: 0-12-119792-1, New York, United States of America
Breuel, T.M. (2003). An Algorithm for Finding Maximal Whitespace
Rectangles at Arbitrary Orientations for Document Layout Analysis,
Proceedings of the Seventh International Conference on Document Analysis
and Recognition, Vol. 1, pp. 66, ISBN 0-7695-1960-1, Scotland, August
2003, Edinburgh
Di Zenzo, S; Cinque, L. & Levialdi, S. (1996). Run-Based
Algorithms for Binary Image Analysis and Processing, IEEE Transactions
on Pattern Analysis and Machine Intelligence, Vol. 18, No. 1, (January
1996) pp. 83-89, ISSN: 0162-8828.
Mann, S. (2002). Intelligent Image Processing, John Wiley &
Sons, ISBN: 9780471221630, United States of America
Saitoh, T; Yamaai, T & Tachikawa, M. (1994). Document Image
Segmentation and Layout Analysis, IEICE Trans. Information and Systems,
Vol. 77, No. 7, (March 1994) pp. 778-784, ISSN: 0916-8532