Journal: International Journal of Computer Technology and Applications
Electronic ISSN: 2229-6093
Year: 2012
Volume: 3
Issue: 1
Pages: 211-215
Publisher: Technopark Publications
Abstract: With the amount of information available on the Internet growing exponentially, users urgently need an effective technique for separating useful information from irrelevant information. Cleaning web pages before web data extraction is therefore critical to the performance of information retrieval and information extraction. Rather than extracting the relevant content directly, we investigate removing the various noise patterns in web pages so that the main content remains. We propose a main content extraction method that first removes common noise and the candidate nodes that contain no main content, and then uses the relation between content text length, anchor text length, and the number of punctuation marks to extract the main content. This paper focuses on noise removal and on exploiting these content characteristics. Experiments show that the approach improves the generality and accuracy of body text extraction from web pages.
Keywords: information extraction; web page content extraction; removing noise content
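
The abstract names three signals (content text length, anchor text length, and punctuation count) but gives no formulas or thresholds. The sketch below is only an illustration of how such signals might be combined, not the authors' method: the noise tag list, the candidate tag list, the link-density ratio, and both thresholds are assumptions chosen for the example, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser

# All three sets are assumptions for this sketch; the paper does not list them.
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside", "form"}
CANDIDATE_TAGS = {"div", "p", "td", "article", "section"}
PUNCTUATION = set(",.;:!?")


class CandidateCollector(HTMLParser):
    """Collects the text pieces and anchor-text length of each candidate node."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # every candidate node, in opening order
        self.open_blocks = []  # candidate nodes that are currently open
        self.noise_depth = 0   # > 0 while inside a noise tag
        self.anchor_depth = 0  # > 0 while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a":
            self.anchor_depth += 1
        elif tag in CANDIDATE_TAGS:
            block = {"pieces": [], "anchor_len": 0}
            self.blocks.append(block)
            self.open_blocks.append(block)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag == "a":
            self.anchor_depth = max(0, self.anchor_depth - 1)
        elif tag in CANDIDATE_TAGS and self.open_blocks:
            self.open_blocks.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.noise_depth or not self.open_blocks:
            return
        block = self.open_blocks[-1]  # text is credited to the innermost node
        block["pieces"].append(text)
        if self.anchor_depth:
            block["anchor_len"] += len(text)


def extract_main_content(html, max_link_density=0.5, min_punctuation=2):
    """Keep candidate nodes with little anchor text and enough punctuation."""
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for block in collector.blocks:
        text_len = sum(len(piece) for piece in block["pieces"])
        if text_len == 0:
            continue  # candidate node without any text
        text = " ".join(block["pieces"])
        link_density = block["anchor_len"] / text_len
        punctuation_count = sum(ch in PUNCTUATION for ch in text)
        if link_density <= max_link_density and punctuation_count >= min_punctuation:
            kept.append(text)
    return "\n".join(kept)


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<div><a href='/a'>Link one</a> <a href='/b'>Link two</a></div>"
        "<div><p>Web pages mix articles with menus, ads, and footers. "
        "A cleaner keeps the article.</p></div>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_content(page))
```

In this sketch, text inside noise tags is dropped first, and each remaining candidate node is kept only if most of its text lies outside links and it contains sentence-like punctuation. Real pages would likely need tuned thresholds and a tolerant parser for malformed markup.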