Publisher: International Association for Computer Information Systems
Abstract: The PubMed database contains over 20 million research abstracts ranging from lab experiments and gene arrays to patient-facing research. We introduce a classification task that groups PubMed abstracts into two categories: basic science and clinical research. We present a conditional probability algorithm and a decision tree algorithm and compare the two on three different feature sets. The first feature set consists of semantic tags that appear as verbs in the abstract. The second consists of tags that are nouns and appear as subjects or objects within a sentence. The third combines the first two. The algorithms are evaluated using precision, recall, and F-measure. The decision tree algorithm with features made up of both verb tags and subject and object tags outperformed all other combinations, achieving a precision of 97 percent and a recall of 96.8 percent. The conditional probability algorithm's lack of fallback rules hurt its performance. The decision tree algorithm was more robust to test abstracts of different lengths and to unseen feature values.
Keywords: Information retrieval; document classification; text mining