Article Information

Title: A Comparative Study Of Word Representation Methods With Conditional Random Fields And Maximum Entropy Markov For Bio-Named Entity Recognition
Authors: Maan Tareq Abd ; Masnizah Mohd
Journal: Malaysian Journal of Computer Science
Print ISSN: 0127-9084
Year of publication: 2018
Volume: 31
Issue: 5
Publisher: University of Malaya * Faculty of Computer Science and Information Technology
Abstract: Bio-Named Entity Recognition (BioNER) is the process of identifying and semantically classifying biomedical technical terms and named entities in Biomedicine literature. Therefore, it is a major task in biomedical knowledge acquisition. Meanwhile, Natural Language Processing (NLP) plays an important role in BioNER in the biomedical domain. The first and most essential biomedical literature mining task incorporates biomedical entity recognition such as protein, gene, and chemicals. The most recent BioNER methods rely on predefined traditional features, which attempt to capture the specific surface properties of entity types. However, these empirically predefined feature sets differ between entity types and are manually constructed and complicated, which means developing them is costly. In this paper, we systematically present a comparative evaluation study of three methods, which are: the traditional feature representation method, the continuous bag-of-words (CBOW) model, and a new prototypical representation method with two popular sequence-labeling approaches (Conditional Random Fields (CRFs) and Maximum Entropy Markov Models (MEMM)). We evaluated these models with two major BioNER tasks, which involve the JNLPBA and GENETAG corpora. This paper examined the prototypical word representation method and found that Word2Vec can be successfully used for BioNER. Our results show that the new prototypical representation method improved the performance of the two machine learning models with different datasets. Also, the new prototypical representation method performed better than the traditional feature representation method and CBOW model for both datasets. Finally, our experiment proved that the CRF classifier with the new prototypical representation method achieved the best results when 90% data was used as training data, yielding overall F-measure values of 0.79% and 0.85% for the JNLPBA corpus and GENETAG corpus, respectively. In comparison, the results achieved using the ME classifier yielded overall F-measure values of 0.76% and 0.78% for the JNLPBA corpus and GENETAG corpus, respectively.
Keywords: biomedical named entity; prototypical representation; data representation methods; Word2Vec