Article Information

Title: Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields
Authors: Pallavi, K. P.; Sobha, L.; Ramya, M. M. et al.
Journal: Journal of Computer Science
Print ISSN: 1549-3636
Publication year: 2018
Volume: 14
Issue: 5
Pages: 645-653
DOI: 10.3844/jcssp.2018.645.653
Publisher: Science Publications
Abstract: Named Entities (NEs) in sentences are essential for building Natural Language Processing (NLP) applications that perform Information Extraction (IE) from large corpora. However, generating a large corpus is challenging for resource-poor languages such as Kannada, and no annotated corpus is available online. Annotating NEs with pre-defined classes faces two main challenges: NEs are morphologically joined with other words, and spelling variations are frequent in Kannada words. In addition, sentence structure varies according to the morphology, parts of speech (POS) and chunking of a language, and these parameters differ from one language to another. To address these challenges, a novel application system is proposed to identify NEs in Kannada using a large corpus of 73,676 tokens. The Named Entity Recognition (NER) system consists of a robust POS tagger and a Noun Phrase (NP) chunker developed for generic data. Five gazetteer lists were created from many orthographic patterns for each word. Context information, such as the previous two words, the next two words, word morphology and the gazetteer lists, was added to the feature lists. A unigram-bigram template was designed and incorporated into Conditional Random Fields (CRFs) to generate conditional feature functions. The proposed system achieved F-measures of 86.85% on gold test data and 71.01% on newspaper data.
Keywords: Named Entities; Natural Language Processing; Noun Phrase Chunker; Conditional Random Fields
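
Illustrative sketch: the abstract describes per-token features built from the previous two words, the next two words, word morphology and gazetteer membership. The Python below is a minimal sketch of that kind of context-window feature extraction, not the authors' implementation. The function names, the `<PAD>` boundary marker, the three-character suffix used as a stand-in for word morphology, the single `location` gazetteer and the placeholder tokens are all assumptions made for illustration; the record does not give the paper's actual feature template, its five gazetteer lists or its data.

```python
def token_features(tokens, i, gazetteers, window=2):
    """Build a feature dict for tokens[i] from its context window.

    Hypothetical helper: the feature names and the suffix-based
    morphology proxy are assumptions, not taken from the paper.
    """
    word = tokens[i]
    feats = {
        "word": word,
        # Crude stand-in for "word morphology": the last three characters.
        "suffix3": word[-3:],
    }
    # Previous and next `window` words, padded at sentence boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    # One boolean feature per gazetteer list.
    for name, entries in gazetteers.items():
        feats[f"in_gaz_{name}"] = word.lower() in entries
    return feats


def sentence_features(tokens, gazetteers):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i, gazetteers) for i in range(len(tokens))]


# Placeholder tokens; only "Bengaluru" is a real name, used to show a gazetteer hit.
example_tokens = ["w0", "w1", "Bengaluru", "w3", "w4"]
example_gazetteers = {"location": {"bengaluru"}}
example_feats = sentence_features(example_tokens, example_gazetteers)
```

Each resulting dict could be passed to a CRF toolkit that accepts per-token feature mappings. The unigram-bigram template mentioned in the abstract would then determine how these observations combine with the current label and the previous label.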