Abstract: In data-intensive web sites, pages are generated by scripts that embed data from
a backend database into HTML templates. There is usually a relationship between
the semantics of the data in a page and its corresponding template. For example,
in a web site about sports events, it is likely that pages with data about
athletes are associated with a template that differs from the template used to
generate pages about coaches or referees. This article presents a method to
classify web pages according to their associated template. Given a web page, the
goal of our method is to accurately find the pages that are about the same
topic. Our method leverages a simple yet effective model that abstracts some
structural features of a web page. We present the results of an extensive
experimental analysis that show the performance of our method in terms of both
recall and precision on a large number of real-world web pages.
Keywords: clustering, data extraction, web page classification