Journal: International Journal of Advanced Research In Computer Science and Software Engineering
Print ISSN: 2277-6451
Electronic ISSN: 2277-128X
Publication year: 2012
Volume: 2
Issue: 3
Publisher: S.S. Mishra
Abstract: This paper studies the problem of extracting structured data from Web pages. The objective of the proposed research is to automatically extract data items (fields) from data records and store the extracted data in a database. We formally define a template and propose a model that describes how values are encoded into pages using a template. For this purpose, we propose a new method that performs the task automatically. It consists of two steps: (1) automatically identifying the data records in a page, and (2) automatically aligning and extracting data items from those records. We apply partial tree alignment to DOM trees within the FiVaTech framework. Based on these two steps, an unsupervised, page-level data extraction approach is used to deduce the schema and template for each individual Deep Web site.
Keywords: Data Record Extraction; Partial Tree Alignment; Wrapper; Web Data Extraction
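
The two steps named in the abstract can be illustrated with a small sketch. This is not the paper's method or the FiVaTech implementation: it is a deliberately simplified stand-in that assumes a well-formed page, treats the largest group of same-tag siblings as the data records, and aligns fields by a hypothetical key of tag plus `class` attribute. The function names `find_data_records` and `align_records` and the sample page are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample page; the second record carries an optional field.
PAGE = """
<html><body><ul>
  <li><span class="title">Book A</span><span class="price">10</span></li>
  <li><span class="title">Book B</span><span class="author">Smith</span><span class="price">12</span></li>
  <li><span class="title">Book C</span><span class="price">8</span></li>
</ul></body></html>
"""

def find_data_records(root, min_repeat=2):
    """Step 1 (simplified): return the largest group of same-tag siblings."""
    best = []
    for parent in root.iter():
        counts = Counter(child.tag for child in parent)
        if not counts:
            continue
        tag, n = counts.most_common(1)[0]
        if n >= min_repeat and n > len(best):
            best = [child for child in parent if child.tag == tag]
    return best

def align_records(records):
    """Step 2 (simplified): grow a shared column list, then map records onto it."""
    columns = []
    for rec in records:
        pos = 0  # insertion point that keeps each record's field order
        for child in rec:
            key = (child.tag, child.get("class"))
            if key in columns:
                pos = columns.index(key) + 1
            else:
                # Unseen field: insert it after the last field matched so far.
                columns.insert(pos, key)
                pos += 1
    rows = []
    for rec in records:
        values = {(c.tag, c.get("class")): (c.text or "").strip() for c in rec}
        # Fields missing from a record become None in the aligned row.
        rows.append([values.get(col) for col in columns])
    return columns, rows

records = find_data_records(ET.fromstring(PAGE.strip()))
columns, rows = align_records(records)
for row in rows:
    print(row)
```

Each aligned row could then be written to a database table whose columns mirror `columns`. Real pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser and a genuine tree-matching step rather than the tag-and-class key assumed here.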