Article Information

Title: Punjabi Documents Clustering System
Authors: Sharma, Saurabh; Gupta, Vishal
Journal: Journal of Emerging Technologies in Web Intelligence
Print ISSN: 1798-0461
Year: 2013
Volume: 5
Issue: 2
Pages: 171-187
DOI: 10.4304/jetwi.5.2.171-187
Language: English
Publisher: Academy Publisher
Abstract: Text document clustering draws on Natural Language Processing, Machine Learning and Information Retrieval. It is an effective method for unsupervised document organization, automatic topic extraction, and fast, accurate information filtering and retrieval. Many clustering algorithms are available for unsupervised document organization and retrieval. Traditional approaches treat the documents to be clustered merely as a collection of words. The semantic relationships between words should form the basis for clustering, yet this vital information is generally ignored. A new method for generating frequent phrases by analyzing the semantic relations between the words in a sentence is discussed. The Karaka list, a list of grammatical connectors that link nouns, pronouns and verbs in a sentence, captures these semantic relations. The new clustering method combines the ideas behind the Karaka Analyzer, Frequent Itemsets and Frequent Word Sequences. The results indicate that the new hybrid approach performs better in terms of the number of clusters, the meaningfulness of cluster labels, and the effectiveness of clustering for documents that do not have the desired information in frequent phrases. The use of semantic features is the key to better results.
Keywords: Punjabi Document Clustering; Karaka Theory; Frequent Phrases