Abstract: In real-world document classification, a subset of documents often needs to be chosen for labeling as a training
set for a machine learner. Random sampling is generally not the most effective approach for choosing documents to be labeled.
Active learning selects useful examples for labeling to improve the efficiency of learning. We consider two factors
in order to measure the usefulness of a document for labeling. Such a document should be 1) largely unknown to the current
learner and 2) influential, in that it is close to many other documents. These factors are stated from a document-centric
viewpoint. A similar analysis can be made from a term-centric viewpoint. This paper presents such a
term-centric approach to active learning using a naïve Bayes classifier. We study both document-centric and our new
term-centric active learning methods. The term-centric methods perform well on numerous data sets with
different characteristics. In addition, we use a genetic algorithm to estimate the optimal performance
at a fixed training set size; our results reach between 84% and 99% of this estimated optimum.
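
The two document-centric factors above can be illustrated with a minimal sketch. It is not the paper's formulation. It assumes that "unknown to the current learner" is measured by the entropy of the classifier's class posterior, that "influential" is measured by the mean cosine similarity to the other unlabeled documents, and that the two are combined by multiplication. The function names (`entropy`, `cosine`, `usefulness_scores`, `pick_document`) and the toy data are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a class-probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity of two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def usefulness_scores(posteriors, vectors):
    """Score each unlabeled document as uncertainty * density.

    Assumption: uncertainty is posterior entropy and density is the mean
    cosine similarity to all other documents. Both choices are illustrative.
    """
    scores = []
    for i, (probs, vec) in enumerate(zip(posteriors, vectors)):
        sims = [cosine(vec, other) for j, other in enumerate(vectors) if j != i]
        density = sum(sims) / len(sims) if sims else 0.0
        scores.append(entropy(probs) * density)
    return scores


def pick_document(posteriors, vectors):
    """Return the index of the document with the highest usefulness score."""
    scores = usefulness_scores(posteriors, vectors)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: class posteriors from a hypothetical current classifier,
# and term-count vectors over a three-term vocabulary.
posteriors = [[0.5, 0.5], [0.5, 0.5], [0.99, 0.01], [0.9, 0.1]]
vectors = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 0]]

best = pick_document(posteriors, vectors)
```

In this toy data, documents 0 and 1 are equally uncertain, but document 1 shares no terms with any other document, so its density and therefore its score are zero. Document 0 is both uncertain and similar to its neighbors, so it is the one selected for labeling.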