Journal: International Journal of Computer Trends and Technology
Electronic ISSN: 2231-2803
Publication year: 2014
Volume: 17
Issue: 3
Pages: 125-132
DOI: 10.14445/22312803/IJCTT-V17P124
Publisher: Seventh Sense Research Group
Abstract: Information extraction is the automatic retrieval of structured information from online databases. Its main goal is to extract the correct text portions of documents accurately. The Web contains numerous lists of objects, such as conference programs and comment lists in blogs. Lists of objects are extracted from the Web by record extraction, which discovers a set of Web page segments. To extract data records, a method called tag path clustering is proposed. This method captures a list of objects more robustly, based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the document. A pair of tag path occurrence patterns, called visual signals, is compared to estimate how likely the two tag paths are to represent the same list of objects. A similarity measure then captures how closely the tag paths appear and interleave. Based on this measure, tag paths are clustered to extract the sets of tag paths that form the structure of the data records. A Bayesian learning framework is proposed for adapting information extraction knowledge previously learned from a source Web site to a new, unseen site, and for discovering previously unseen attributes. Bayesian learning techniques improved with expectation maximization are used to find new training data for learning the wrapper for the unseen site. This method effectively extracts attributes from the new, unseen Web site. Experimental results show that the framework achieves very promising performance.
Keywords: information extraction; data record extraction; clustering; wrapper adaptation
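
Illustrative sketch (not taken from the paper): the abstract describes recording where each tag path occurs in a page, scoring pairs of tag paths by how closely their occurrences appear and interleave, and clustering tag paths on that score. The abstract gives no formulas, so the code below is only a minimal Python approximation under stated assumptions. The names TagPathCollector, similarity, and cluster_paths, the scoring formula, and the threshold value are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records, for every root-to-node tag path, the positions where it occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []                        # currently open tags
        self.position = 0                      # running index over start tags
        self.occurrences = defaultdict(list)   # tag path -> list of positions

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.occurrences[path].append(self.position)
        self.position += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def similarity(a, b):
    """Hypothetical score in (0, 1]: high when two paths occur about equally
    often and their occurrences lie close together (an assumed stand-in for
    the similarity measure mentioned in the abstract)."""
    def mean_gap(xs, ys):
        return sum(min(abs(x - y) for y in ys) for x in xs) / len(xs)

    closeness = 1.0 / (1.0 + (mean_gap(a, b) + mean_gap(b, a)) / 2.0)
    balance = min(len(a), len(b)) / max(len(a), len(b))
    return closeness * balance


def cluster_paths(occurrences, threshold=0.3, min_count=2):
    """Single-link grouping of repeated tag paths; threshold is an assumption."""
    paths = [p for p, pos in occurrences.items() if len(pos) >= min_count]
    parent = {p: p for p in paths}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, p in enumerate(paths):
        for q in paths[i + 1:]:
            if similarity(occurrences[p], occurrences[q]) >= threshold:
                parent[find(p)] = find(q)

    groups = defaultdict(set)
    for p in paths:
        groups[find(p)].add(p)
    return [g for g in groups.values() if len(g) > 1]


# A toy page with one repeated list, in the spirit of a conference program.
PAGE = """
<html><body>
<h1>Program</h1>
<ul>
<li><b>Talk A</b><i>9:00</i></li>
<li><b>Talk B</b><i>10:00</i></li>
<li><b>Talk C</b><i>11:00</i></li>
</ul>
<p>Footer</p>
</body></html>
"""

collector = TagPathCollector()
collector.feed(PAGE)
clusters = cluster_paths(collector.occurrences)
for group in clusters:
    print(sorted(group))
```

On this toy page the three repeated paths under the list item end up in one group, while paths that occur only once (heading, footer) are filtered out before clustering. A real page would likely need a more careful similarity measure and threshold than the ones assumed here.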