Journal: International Journal of Grid and Distributed Computing
Print ISSN: 2005-4262
Publication year: 2009
Volume: 2
Issue: 3
Publisher: SERSC
Abstract: This paper presents a simple approach to converting a crude corpus into XML ("XMLification") in the field of opinion mining. The XMLification is based on regular expressions. "Corpora" is the plural form of "corpus"; a corpus is a collection of linguistic data. In the proposed work, the corpus consists of reviews posted on web sites, specifically product reviews. The reviews are contained in HTML files collected from sites such as Cnet.com, Epinions.com, Amazon.com, and ebay.com. Once the crude corpus of HTML files has been gathered, it is cleaned so that only the required review details from each web page are kept and the rest is removed. The cleaned corpus is then processed again to produce the final output: XML files that contain only the important parts of the review details from the raw HTML page. These XML files are ready for further steps of opinion mining, such as part-of-speech (POS) tagging or other language processing for machine learning.
Keywords: Crude corpus; language processing; regular expression; XML; parts of speech tagging.
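
The pipeline outlined in the abstract (raw HTML page, regular-expression extraction of the review details, XML output) might look roughly like the Python sketch below. It is an illustration only, not the authors' implementation: the HTML markup (`<div class="review">`, `<h3>`, `<span class="rating">`, `<p>`), the XML element names, and the function names `clean` and `xmlify` are all assumptions made for this example. Real pages from each site would need their own patterns.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: every pattern below is an assumption for illustration.
# A non-greedy match up to the first </div> is enough for this flat layout,
# but it would break on review blocks that contain nested <div> elements.
REVIEW_RE = re.compile(
    r'<div class="review">(?P<block>.*?)</div>', re.DOTALL | re.IGNORECASE
)
FIELD_RES = {
    "title": re.compile(r"<h3[^>]*>(.*?)</h3>", re.DOTALL | re.IGNORECASE),
    "rating": re.compile(
        r'<span class="rating">(.*?)</span>', re.DOTALL | re.IGNORECASE
    ),
    "text": re.compile(r"<p[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE),
}
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip leftover tags, decode HTML entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", fragment)
    return " ".join(html.unescape(no_tags).split())


def xmlify(raw_html):
    """Keep only the review fields of a raw page and return them as XML text."""
    root = ET.Element("reviews")
    for match in REVIEW_RE.finditer(raw_html):
        block = match.group("block")
        review = ET.SubElement(root, "review")
        for name, pattern in FIELD_RES.items():
            found = pattern.search(block)
            if found:
                # ElementTree escapes &, <, > when the tree is serialized.
                ET.SubElement(review, name).text = clean(found.group(1))
    return ET.tostring(root, encoding="unicode")


# Made-up sample page: navigation markup is dropped, the review is kept.
sample = """<html><body><div id="nav">Home</div>
<div class="review"><h3>Great camera</h3>
<span class="rating">4</span>
<p>Sharp photos &amp; long battery life.</p></div>
</body></html>"""

print(xmlify(sample))
```

Building the output with `xml.etree.ElementTree` rather than by string concatenation keeps the result well-formed even when review text contains characters such as `&` or `<`, which matters if the XML files are later fed to a POS tagger or another language-processing tool.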