Journal: International Journal of Computer Science & Technology
Print ISSN: 2229-4333
Electronic ISSN: 0976-8491
Year of publication: 2012
Volume: 3
Issue: 2
Pages: 1103-1109
Language: English
Publisher: Ayushmaan Technologies
Abstract: Email has become an important medium of communication. A user may receive tens or even hundreds of emails every day, and handling them takes considerable time. It is therefore necessary to provide automatic approaches that relieve the burden of processing emails. A straightforward method is to group similar emails by supervised classification over fields such as mail-id, to-mail-id, subject, message, sending-time, and attachments. Email mining is the process of discovering useful patterns from emails, and clustering techniques can be applied to email data to create groups of similar emails. In our algorithm, natural language processing techniques and frequent itemset mining techniques are used to automatically generate meaningful Generalized Addressing Patterns (GAPs) from the mail-id, to-mail-id, subject, message, sending-time, and attachments of emails. We then put forward a novel unsupervised approach that treats GAPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. The proposed algorithm is expected not only to improve clustering performance but also to provide meaningful descriptions of the resulting clusters through the GAPs. Experimental results on an open dataset and on a personal email dataset that we collected demonstrate that the proposed algorithm outperforms the K-means algorithm in terms of the widely used F1 measure. Furthermore, the readability of cluster names is improved by 68.5% on the personal email dataset.
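The pseudo-labeling idea in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the description, not the authors' implementation. It assumes each email has already been reduced to a set of field-tagged tokens (the `to:` and `subj:` prefixes, the function names, and the `min_support` and `max_size` parameters are all hypothetical), mines frequent itemsets by brute-force counting, and labels each email with the most specific frequent pattern it contains. The natural language processing step and the subsequent classifier training are not shown.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items and keep those that
    occur in at least min_support transactions (brute force, not Apriori)."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def assign_pseudo_labels(transactions, patterns):
    """Label each transaction with the largest pattern it contains,
    breaking ties by support and then alphabetically; None if no match."""
    labels = []
    for items in transactions:
        item_set = set(items)
        matches = [p for p in patterns if p <= item_set]
        if matches:
            best = max(matches, key=lambda p: (len(p), patterns[p], sorted(p)))
            labels.append(best)
        else:
            labels.append(None)
    return labels


# Hypothetical toy data: each email is a set of field-tagged tokens.
emails = [
    {"to:team@example.com", "subj:budget", "subj:report"},
    {"to:team@example.com", "subj:budget", "subj:meeting"},
    {"to:alice@example.com", "subj:lunch"},
    {"to:alice@example.com", "subj:lunch", "subj:friday"},
    {"to:bob@example.com", "subj:hello"},
]

patterns = mine_frequent_itemsets(emails, min_support=2)
labels = assign_pseudo_labels(emails, patterns)
for email, label in zip(emails, labels):
    print(sorted(email), "->", sorted(label) if label else None)
```

In this sketch the pattern that labels a group also serves as a readable cluster description, which mirrors the cluster-naming benefit claimed in the abstract. Emails that match no pattern are left unlabeled; a classifier trained on the labeled emails could then place them into clusters.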