文章基本信息

标题：Combined approaches in automatic page clustering for content conversion.
作者：Boiangiu, Costin Anton ; Cananau, Dan Cristian
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2008
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：The first step to determine the areas on a page (text areas, pictures, tables) is to emphasize what separates them from each other. And so, the problem that arises when attempting to detect separators is that most algorithms are dependent on the geometry of the separators.
关键词：Algorithms

Combined approaches in automatic page clustering for content conversion.

Boiangiu, Costin Anton ; Cananau, Dan Cristian

1. INTRODUCTION

The first step to determine the areas on a page (text areas, pictures, tables) is to emphasize what separates them from each other. And so, the problem that arises when attempting to detect separators is that most algorithms are dependent on the geometry of the separators.

A large number of such algorithms detect the line separators by using simple facts, such as a line has either the width or the height greater as a value with respect to the other. Other algorithms use the fact that lines are oriented on an axis, either vertical or horizontal or even rotated with a certain angle.

There are also approaches that use more advanced mathematical knowledge. An example can be seen in (Guru et al., 2004; Fortune, 1997), where small eigenvalues analysis stands at the basis of the idea.

Delaunay triangulation for various text areas detection has been presented in a number of papers, an example of such being (Xiao & Yan, 2004), where it has been used for author and title regions detection.

However, in this paper the solution to the already mentioned issue of separators is an algorithm that is independent on geometry or shape and uses only characteristics common to all separators.

2. THE SOLUTION

2.1 How to find the separators

The idea is the following: if we see each character, image or collection of pixels as an entity we can connect each of them with the neighboring entity by using the Delaunay triangulation. So, the entity is defined as the collection of neighboring pixels of same color, different from the background.

The purpose of this paper is not to present the Delaunay triangulation algorithm, but to point out how the results from such a mathematical approach can be used to further determine various output for an entirely different domain. For further information on the algorithm please check (Shewchuck, 2002).

First a set of vertexes has to be created for the algorithm to work. In order to obtain this input the contour of each entity is created. Every entity can be seen as a collection of horizontal segments of pixels. The contour is actually made out of the pixels on the outside border of the entity and the ones on the inside which can be seen from the exterior. To sum up, the contour is given by the extremities of the entity's segments.

[FIGURE 1 OMITTED]

Every entity can be seen as a collection of horizontal segments of pixels. The contour is actually made out of the pixels on the outside border of the entity and the ones on the inside which can be seen from the exterior. To sum up, the contour is given by some of the extremities of the segments contained in the entity.

Next, the constrained Delaunay algorithm is applied and some triangles are obtained between entities as in the example in Fig. 1.

However, a filtering is required, because the purpose of this first stage is to create connections between two entities.

The triangles that connect one or three entities are eliminated (Fig. 2 shows an example of applied constrained Delaunay filtered as mentioned above) and so, the input for the next step is created.

Most of the characters will be connected to up to eight entities which are the two entities on the same row and three above and three bellow, and so, if we compute the ratio between the connected points on the current entity (further referred as current points) and the connected points on all other entities connected by triangles (further referred as target points) it will give us a relatively high result, around one.

However, the separators do not generate such a value. The fact that they are a special case of entities with respect to the letters or symbols also influences the outcome of different approaches applied on them. And so, their unique properties separate them from the symbols.

[FIGURE 2 OMITTED]

[FIGURE 3a OMITTED]

[FIGURE 3b OMITTED]

[FIGURE 3c OMITTED]

Because they extend to a few rows, are connected to far more entities and have far more connectable Delaunay points, the number of connected points on the separator will be significantly greater than the number of connected points on the other entities connected by triangles with the line.

2.2 How to find the zones

Another part of the solution is to find the zones based on a simple resampling method. The initial image is taken and downsampled using a triangle filter with a fixed very small window and the obtained result is a blurred version of the initial image at a very small dimension (Fig. 3a).

Next, the image is upsampled by using the same filter and brought back to the initial size. By doing this operation, the size of the picture was maintained, but the content inside has been changed, the new image being a blurred version of the initial one (Fig. 3b).

The difference between the small version of the image after applying the downsampling and the reconstructed image created by upsampling is noticeable. Even though both of them seem to be a blurred image of the initial, the first one is actually a compact version. At a closer look on Fig. 3a the pixel can be seen entirely and their colors can be noticed without any zooming effects (see Fig. 4).

The fact that the second image, Fig. 3b, is blurred is a result of successively applying the filter. A simple transformation of the initial image (see: Pratt, 2002; Mann, 2002; for image processing) would not have given such a good effect and zones that should have been united would have been left singular.

On the blurred image, a black and white transformation based on a simple centre-thresholding technique is applied and so, the result from Fig. 3c is obtained. The goal is to obtain white pixels in almost white areas.

[FIGURE 4 OMITTED]

2.3. Correct zone detection

The black and white image provides an overview of the possible join probabilities of different zones.

The effect of the resampling before the actual conversion is to emphasize the white spaces with respect to the text or other elements, from which they can be clearly differentiated. This approach favors the large spaces that have a continuous distribution in the page.

[FIGURE 5a OMITTED]

[FIGURE 5b OMITTED]

This last concept of distribution does not refer to the spaces that are bounded nor have very large dimensions on both axes between the bounds.

As an example, in the obtained black and white image it can be clearly observed that text from the right is not in the same zone as the text from the middle or the picture and so, a difference is determined (Fig. 5a). However, the picture is also clearly different from the text in the middle. And here is where the Delaunay separator detection algorithm appears.

If all the separators on a page are known then all the connections to the separator are deleted and so the zones are differentiated. And in this case the Delaunay algorithm determines the image as a separator, because of the high ratio between source and destination points. The final and correct result is displayed in Fig. 5b.

3. CONCLUSION

There have been many approaches used for page clustering, but the presented algorithm brings a new view on this topic by starting with the detection of the basic and also most important elements in a document page: the separators. By using a combination of mathematical algorithms and common knowledge on scanned document pages, this approach yields good results with a high degree of certainty.

Because of its innovative character further improvements will be made and it can open new research paths for the future.

4. REFERENCES

Fortune, S. (1997). Voronoi Diagrams and Delaunay triangulations, In: Handbook of discrete and computational geometry, pp. 377-388, CRC Press, ISBN:0-8493-8524-5, Boca Raton, Florida

Guru, D. S.; Shekar, B. H. & Nagabhushan P. (2004). A simple and robust line detection algorithm based on small eigenvalues analysis, Pattern recognition Letters, Elsevier Science, Issue 1, Vol. 25, (January 2004) pp. 1-13, ISSN: 0167-8655

Mann, S. (2002). Intelligent Image Processing, John Wiley & Sons, ISBN: 9780471221630, United States of America

Pratt, W. K. (2002). Digital Image Processing, John Wiley & Sons, ISBN: 0-471-85766-1, United States of America

Shewchuck, J. R. (2002). Delaunay refinement algorithms for triangular mesh generation, Computational Geometry: Theory and Applications, Vol. 22, No. 1, pp. 21-74, May 2002

Xiao Y. & Yan H. (2004). Location of title and author regions in document images based on the Delaunay triangulation, Image and Vision Computing, Vol. 2, No. 4, April 2004