Journal: International Journal of Electronics and Computer Science Engineering
Electronic ISSN: 2277-1956
Publication year: 2012
Volume: 1
Issue: 3
Pages: 1292-1299
Publisher: Buldanshahr : IJECSE
Abstract: The Web continuously strives to become the prime source of knowledge and information, used in almost every sphere of life. As the volume and complexity of the information shared on the Web increase, various forms of data representation have emerged. To deal with these different forms of data, different technologies have been developed to deliver information efficiently to end users. With the advent of such technologies, web content is evolving from simple HTML pages into highly complex, sophisticated collections of data representations. A web page typically contains a mixture of many kinds of information, e.g. main content, advertisements, navigation panels, and copyright blocks. For a particular end user, only part of this information is useful, and the rest can be regarded as noise. The result is web applications that contain irrelevant and redundant information, which can seriously harm web mining. The goal of this paper is to explore the use of formal methods for filtering noise from web pages. Filtering noise from web pages is a difficult task, which in turn makes segmentation difficult. Various automatic techniques use segmentation algorithms that are mainly based on the web source code (HTML), including template-based analysis. Our insight is to use the DOM structures of web documents to efficiently implement a technique that removes irrelevant data in order to optimize the web mining process. In this approach, we first build a semantic tree to partition the web page into content parts/elements based on the web page tags. The main focus is the need to develop a technique that keeps the common navigation structure as it is but removes images and advertisements, improving surfing efficiency.
Keywords: Web Content Extractor; DOM tree; InnerHTML; outerHTML
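
The abstract describes building a tag-based tree from a web page and pruning noise (images, advertisements) while keeping the navigation structure. The following is a minimal sketch of that general idea, not the paper's actual algorithm. It uses Python's standard `html.parser`; the names `Node`, `TreeBuilder`, `is_noise`, `prune`, `serialize`, and `remove_noise`, as well as the `AD_HINTS` class/id heuristic, are assumptions made for illustration. The sketch ignores comments, doctypes, and the special handling that `<script>` and `<style>` content would need.

```python
import html
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}
# Assumed heuristic: class/id tokens that suggest an advertisement block.
AD_HINTS = {"ad", "ads", "advert", "advertisement", "banner", "sponsored"}


class Node:
    """One element of a simplified DOM-like tree."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from the page's tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def is_noise(node):
    """Return True for images and for elements whose class/id hints at ads."""
    if node.tag == "nav":
        return False  # keep the common navigation structure
    if node.tag == "img":
        return True
    tokens = (node.attrs.get("class") or "").lower().split()
    tokens.append((node.attrs.get("id") or "").lower())
    return any(token in AD_HINTS for token in tokens)


def prune(node):
    """Recursively drop noisy subtrees, in place."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if is_noise(child):
                continue
            prune(child)
        kept.append(child)
    node.children = kept


def serialize(node):
    """Turn the (pruned) tree back into an HTML string."""
    inner = "".join(
        html.escape(c, quote=False) if isinstance(c, str) else serialize(c)
        for c in node.children
    )
    if node.tag == "#document":
        return inner
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{html.escape(v, quote=True)}"'
        for k, v in node.attrs.items()
    )
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def remove_noise(html_text):
    """Parse, prune, and re-serialize a page."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    prune(builder.root)
    return serialize(builder.root)
```

Calling `remove_noise(page)` on a page string returns the markup with `<img>` elements and ad-like blocks removed and `<nav>` elements preserved. A real extractor would likely need a stronger noise test than class and id names, for example one based on templates shared across pages of the same site.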