Article Information

Title: Term-Centric Active Learning for Naïve Bayes Document Classification
Authors: Sunghwan Sohn; Donald C. Comeau; Won Kim, et al.
Journal: The Open Information Systems Journal
Electronic ISSN: 1874-1339
Publication year: 2009
Volume: 3
Pages: 54-67
DOI: 10.2174/1874133900903010054
Publisher: Bentham Open
Abstract:
In real-world document classification, a subset of documents often needs to be chosen for labeling as a training set for a machine learner. Random sampling is generally not the most effective approach for choosing documents to be labeled. Active learning selects useful examples for labeling to improve the efficiency of learning. We consider two factors to measure the usefulness of a document for labeling. Such a document should be 1) largely unknown to the current learner and 2) influential by being close to many other documents. These factors are stated from a document-centric viewpoint. A similar analysis can be made from a term-centric viewpoint. The purpose of this paper is to present this term-centric approach to active learning using a naïve Bayes classifier. We study both document-centric and our new term-centric active learning methods. We find good performance of the term-centric methods on numerous data sets with different characteristics. In addition, a genetic algorithm is employed to compare our results with estimated optimal performance at a fixed training set size; our results are between 84% and 99% of the estimated optimum.
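
The two document-centric factors named in the abstract can be illustrated with a minimal sketch. This is not the authors' method, and it does not cover the term-centric variant, which the abstract does not spell out. It assumes the current classifier already supplies a posterior probability for each unlabeled document, measures "unknown to the learner" as closeness of that posterior to 0.5, and measures "influential" as mean cosine similarity to the other unlabeled documents. The function names, the product used to combine the two factors, and the toy data are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def uncertainty(p):
    """Assumed uncertainty measure: 1.0 at p == 0.5, 0.0 at p == 0 or 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def density(i, docs):
    """Assumed influence measure: mean similarity of docs[i] to the others."""
    sims = [cosine(docs[i], d) for j, d in enumerate(docs) if j != i]
    return sum(sims) / len(sims) if sims else 0.0


def select_for_labeling(docs, posteriors):
    """Return the index of the unlabeled document with the highest
    uncertainty * density score (an assumed way to combine the factors)."""
    scores = [uncertainty(p) * density(i, docs)
              for i, p in enumerate(posteriors)]
    return max(range(len(docs)), key=scores.__getitem__)


# Toy unlabeled pool (hypothetical term counts) and hypothetical
# posteriors P(positive | document) from some current classifier.
docs = [
    {"gene": 2, "protein": 1},
    {"gene": 1, "protein": 1},
    {"stock": 3},
    {"gene": 1},
]
posteriors = [0.5, 0.9, 0.5, 0.55]

chosen = select_for_labeling(docs, posteriors)
print(chosen)
```

In this toy pool, the first and third documents are equally uncertain, but the third shares no terms with any other document, so its density is zero and the first is selected. This shows why uncertainty alone can pick outliers that teach the learner little about the rest of the pool.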