Abstract: The Web is the largest source of information and contains many entities and the relationships among them. Extracting these data from massive numbers of Web pages and integrating them into semi-structured data with rich semantics makes the data easier to manage and use. On this premise, a comprehensive method is proposed for extracting entities and relationships from Web pages. The method consists of two steps: 1) the target Web pages that contain the entities are located by combining visual information with keyword content, while the parent-child relationships between target pages are recorded; 2) the entities are extracted by analyzing the DOM tree structure of the retrieved pages and applying a set of defined extraction rules. Finally, the extracted data are organized into semi-structured data with specific relationships. Experiments on a large number of HTML pages show that the method achieves high accuracy and coverage.
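As a rough illustration of step 2 and the final organization step, the following minimal sketch applies one extraction rule to each page and nests the results by the recorded parent-child page links. It is not the method evaluated in the experiments: the rule form (a tag name plus a class name), the names `RuleExtractor`, `extract_entities`, and `build_tree`, and the sample pages are all assumptions made for illustration. Step 1 (locating target pages from visual information and keywords) is not modeled, and the page links are assumed to be acyclic.

```python
from html.parser import HTMLParser


class RuleExtractor(HTMLParser):
    """Collect the text of elements matching a hypothetical (tag, class) rule."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0        # > 0 while inside a matching element
        self.entities = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            # Same tag nested inside a match: track it so the close is paired.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Normalize whitespace before storing the entity text.
                text = " ".join("".join(self._buf).split())
                if text:
                    self.entities.append(text)

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def extract_entities(html, tag, css_class):
    """Apply one extraction rule to one page and return the matched texts."""
    parser = RuleExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.entities


def build_tree(pages, links, root, rule):
    """Nest extracted entities by page hierarchy.

    pages: url -> HTML string; links: parent url -> list of child urls.
    Assumes the links contain no cycles.
    """
    return {
        "page": root,
        "entities": extract_entities(pages[root], *rule),
        "children": [build_tree(pages, links, child, rule)
                     for child in links.get(root, [])],
    }


# Hypothetical sample data: one parent page and one child page.
pages = {
    "/dept": '<h1><span class="entity">Computer Science</span></h1>',
    "/dept/staff": ('<div><span class="entity">Alice <b>Smith</b></span>'
                    '<span class="other">x</span>'
                    '<span class="entity">Bob Lee</span></div>'),
}
links = {"/dept": ["/dept/staff"]}
tree = build_tree(pages, links, "/dept", ("span", "entity"))
```

The resulting nested dictionary can be serialized directly with `json.dumps(tree)`, which is one simple way to represent the semi-structured output; a real system would likely need different rules for different page templates.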