期刊名称:International Journal of Electronics and Computer Science Engineering
电子版ISSN:2277-1956
出版年度:2012
卷号:1
期号:2
页码:550-556
出版社:Buldanshahr : IJECSE
摘要:Text Mining techniques are mostly based on statistical analysis of a word or phrase. The statistical analysis of a term frequency captures the importance of the term without a document only. But two terms can have the same frequency in the same document. But the meaning that one term contributes might be more appropriate than the meaning contributed by the other term. Hence, the terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining is introduced. It analyses the terms based on the sentence, document and corpus level. The model consists of sentence-based concept analysis which calculates the conceptual term frequency (ctf), document-based concept analysis which finds the term frequency (tf), corpus-based concept analysis which determines the document frequency (df) and concept-based similarity measure. The process of calculating ctf, tf, df, measures in a corpus is attained by the proposed algorithm which is called Concept-Based Analysis Algorithm. By doing so we cluster the web documents in an efficient way and the quality of the clusters achieved by this model significantly surpasses the traditional single-term-base approaches.