Journal: International Journal of Database Management Systems
Print ISSN: 0975-5985
Online ISSN: 0975-5705
Year of publication: 2015
Volume: 7
Issue: 2
Page: 1
DOI: 10.5121/ijdms.2015.7201
Publisher: Academy & Industry Research Collaboration Center (AIRCC)
Abstract: Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim of our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only improves the internal evaluation measures of document clustering and reduces the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
Keywords: Document clustering; Ontology; Text Mining; Distributed Computing
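
For readers unfamiliar with the algorithm named in the abstract, the following is a minimal single-machine sketch of bisecting k-means in plain Python. It is not the authors' MapReduce implementation and does not include the WordNet step. The function names, the `trials` and `iterations` parameters, and the choice to always split the cluster with the largest sum of squared errors are assumptions made for illustration. In a MapReduce setting, the point-to-centroid assignment would typically correspond to the map phase and the centroid recomputation to the reduce phase.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid(points)
    return sum(squared_distance(p, center) for p in points)


def two_means(points, rng, iterations=20):
    """Split points into two groups with basic k-means (k = 2).

    Either returned group may be empty if the split degenerates,
    for example when the two sampled start points are identical.
    """
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for p in points:
            # Assignment step: nearest of the two current centers.
            if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]):
                left.append(p)
            else:
                right.append(p)
        if not left or not right:
            break
        # Update step: recompute both centers from their members.
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist.

    Assumption: the split criterion (largest SSE) and the number of
    trial splits are illustrative choices, not taken from the paper.
    """
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [i for i, c in enumerate(clusters) if len(c) >= 2]
        if not candidates:
            break
        target_index = max(candidates, key=lambda i: sse(clusters[i]))
        target = clusters[target_index]
        best = None
        for _ in range(trials):
            left, right = two_means(target, rng)
            if not left or not right:
                continue  # skip degenerate splits
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, left, right)
        if best is None:
            break  # the target cluster cannot be split further
        del clusters[target_index]
        clusters.extend([best[1], best[2]])
    return clusters
```

Calling `bisecting_kmeans(vectors, k=3)` on a list of equal-length numeric tuples returns a list of up to three clusters, each a list of the original vectors. For documents, the vectors would be term-weight or lexical-category features of the kind the abstract describes.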