Combined approaches in automatic page clustering for content conversion.
Boiangiu, Costin Anton ; Cananau, Dan Cristian
1. INTRODUCTION
The first step to determine the areas on a page (text areas,
pictures, tables) is to emphasize what separates them from each other.
And so, the problem that arises when attempting to detect separators is
that most algorithms are dependent on the geometry of the separators.
A large number of such algorithms detect the line separators by
using simple facts, such as a line has either the width or the height
greater as a value with respect to the other. Other algorithms use the
fact that lines are oriented on an axis, either vertical or horizontal
or even rotated with a certain angle.
There are also approaches that use more advanced mathematical
knowledge. An example can be seen in (Guru et al., 2004; Fortune, 1997),
where small eigenvalues analysis stands at the basis of the idea.
Delaunay triangulation for various text areas detection has been
presented in a number of papers, an example of such being (Xiao &
Yan, 2004), where it has been used for author and title regions
detection.
However, in this paper the solution to the already mentioned issue
of separators is an algorithm that is independent on geometry or shape
and uses only characteristics common to all separators.
2. THE SOLUTION
2.1 How to find the separators
The idea is the following: if we see each character, image or
collection of pixels as an entity we can connect each of them with the
neighboring entity by using the Delaunay triangulation. So, the entity
is defined as the collection of neighboring pixels of same color,
different from the background.
The purpose of this paper is not to present the Delaunay
triangulation algorithm, but to point out how the results from such a
mathematical approach can be used to further determine various output
for an entirely different domain. For further information on the
algorithm please check (Shewchuck, 2002).
First a set of vertexes has to be created for the algorithm to
work. In order to obtain this input the contour of each entity is
created. Every entity can be seen as a collection of horizontal segments
of pixels. The contour is actually made out of the pixels on the outside
border of the entity and the ones on the inside which can be seen from
the exterior. To sum up, the contour is given by the extremities of the
entity's segments.
[FIGURE 1 OMITTED]
Every entity can be seen as a collection of horizontal segments of
pixels. The contour is actually made out of the pixels on the outside
border of the entity and the ones on the inside which can be seen from
the exterior. To sum up, the contour is given by some of the extremities
of the segments contained in the entity.
Next, the constrained Delaunay algorithm is applied and some
triangles are obtained between entities as in the example in Fig. 1.
However, a filtering is required, because the purpose of this first
stage is to create connections between two entities.
The triangles that connect one or three entities are eliminated
(Fig. 2 shows an example of applied constrained Delaunay filtered as
mentioned above) and so, the input for the next step is created.
Most of the characters will be connected to up to eight entities
which are the two entities on the same row and three above and three
bellow, and so, if we compute the ratio between the connected points on
the current entity (further referred as current points) and the
connected points on all other entities connected by triangles (further
referred as target points) it will give us a relatively high result,
around one.
However, the separators do not generate such a value. The fact that
they are a special case of entities with respect to the letters or
symbols also influences the outcome of different approaches applied on
them. And so, their unique properties separate them from the symbols.
[FIGURE 2 OMITTED]
[FIGURE 3a OMITTED]
[FIGURE 3b OMITTED]
[FIGURE 3c OMITTED]
Because they extend to a few rows, are connected to far more
entities and have far more connectable Delaunay points, the number of
connected points on the separator will be significantly greater than the
number of connected points on the other entities connected by triangles
with the line.
2.2 How to find the zones
Another part of the solution is to find the zones based on a simple
resampling method. The initial image is taken and downsampled using a
triangle filter with a fixed very small window and the obtained result
is a blurred version of the initial image at a very small dimension
(Fig. 3a).
Next, the image is upsampled by using the same filter and brought
back to the initial size. By doing this operation, the size of the
picture was maintained, but the content inside has been changed, the new
image being a blurred version of the initial one (Fig. 3b).
The difference between the small version of the image after
applying the downsampling and the reconstructed image created by
upsampling is noticeable. Even though both of them seem to be a blurred
image of the initial, the first one is actually a compact version. At a
closer look on Fig. 3a the pixel can be seen entirely and their colors
can be noticed without any zooming effects (see Fig. 4).
The fact that the second image, Fig. 3b, is blurred is a result of
successively applying the filter. A simple transformation of the initial
image (see: Pratt, 2002; Mann, 2002; for image processing) would not
have given such a good effect and zones that should have been united
would have been left singular.
On the blurred image, a black and white transformation based on a
simple centre-thresholding technique is applied and so, the result from
Fig. 3c is obtained. The goal is to obtain white pixels in almost white
areas.
[FIGURE 4 OMITTED]
2.3. Correct zone detection
The black and white image provides an overview of the possible join
probabilities of different zones.
The effect of the resampling before the actual conversion is to
emphasize the white spaces with respect to the text or other elements,
from which they can be clearly differentiated. This approach favors the
large spaces that have a continuous distribution in the page.
[FIGURE 5a OMITTED]
[FIGURE 5b OMITTED]
This last concept of distribution does not refer to the spaces that
are bounded nor have very large dimensions on both axes between the
bounds.
As an example, in the obtained black and white image it can be
clearly observed that text from the right is not in the same zone as the
text from the middle or the picture and so, a difference is determined
(Fig. 5a). However, the picture is also clearly different from the text
in the middle. And here is where the Delaunay separator detection
algorithm appears.
If all the separators on a page are known then all the connections
to the separator are deleted and so the zones are differentiated. And in
this case the Delaunay algorithm determines the image as a separator,
because of the high ratio between source and destination points. The
final and correct result is displayed in Fig. 5b.
3. CONCLUSION
There have been many approaches used for page clustering, but the
presented algorithm brings a new view on this topic by starting with the
detection of the basic and also most important elements in a document
page: the separators. By using a combination of mathematical algorithms
and common knowledge on scanned document pages, this approach yields
good results with a high degree of certainty.
Because of its innovative character further improvements will be
made and it can open new research paths for the future.
4. REFERENCES
Fortune, S. (1997). Voronoi Diagrams and Delaunay triangulations,
In: Handbook of discrete and computational geometry, pp. 377-388, CRC Press, ISBN:0-8493-8524-5, Boca Raton, Florida
Guru, D. S.; Shekar, B. H. & Nagabhushan P. (2004). A simple
and robust line detection algorithm based on small eigenvalues analysis,
Pattern recognition Letters, Elsevier Science, Issue 1, Vol. 25,
(January 2004) pp. 1-13, ISSN: 0167-8655
Mann, S. (2002). Intelligent Image Processing, John Wiley &
Sons, ISBN: 9780471221630, United States of America
Pratt, W. K. (2002). Digital Image Processing, John Wiley &
Sons, ISBN: 0-471-85766-1, United States of America
Shewchuck, J. R. (2002). Delaunay refinement algorithms for
triangular mesh generation, Computational Geometry: Theory and
Applications, Vol. 22, No. 1, pp. 21-74, May 2002
Xiao Y. & Yan H. (2004). Location of title and author regions
in document images based on the Delaunay triangulation, Image and Vision
Computing, Vol. 2, No. 4, April 2004