Article information

Title: RULE-BASED ANNOTATION OF LITHUANIAN TEXT CORPORA
Authors: Jurgita Kapočiūtė ; Gailius Raškinis
Journal: Public Policy And Administration
Print ISSN: 2029-2872
Publication year: 2015
Volume: 34
Issue: 3
DOI: 10.5755/j01.itc.34.3.12012
Language: English
Publisher: Kaunas University of Technology
Abstract: In this paper we present an algorithm that automatically recognizes and annotates person and place names, contractions, acronyms, foreign-language phrases, dates, and sentence boundaries in Lithuanian texts. The algorithm is based on a set of manually developed template-matching rules and a few specialized lexicons. It performs annotation by making several passes over the text and can operate in automatic or semi-automatic mode. In semi-automatic mode, the user can intervene in cases where the automatic decision is uncertain, and the user's feedback is stored in the lexicons. The rules and lexicons were developed after careful examination of a text corpus of 600 thousand words. The algorithm was evaluated on a separate corpus of 400 thousand words and achieved approximately 93% annotation accuracy.
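
The multi-pass, lexicon-backed approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' rule set: the function name `annotate`, the labels (`DATE`, `ABBR`, `NAME`, `SENT_END`), the regular expressions, and the `ask_user` callback are all assumptions made for illustration, and the real system presumably uses far richer templates for Lithuanian.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def annotate(text: str,
             abbreviations: Set[str],
             names: Set[str],
             ask_user: Optional[Callable[[str], bool]] = None) -> List[Span]:
    """Hypothetical multi-pass annotator; later passes see earlier spans."""
    spans: List[Span] = []

    def covered(pos: int) -> bool:
        # True if an earlier pass already annotated this character.
        return any(start <= pos < end for start, end, _ in spans)

    # Pass 1: dates matching a simple YYYY-MM-DD template (assumed format).
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", text):
        spans.append((m.start(), m.end(), "DATE"))

    # Pass 2: abbreviations, i.e. letter runs plus a period found in the lexicon.
    for m in re.finditer(r"[^\W\d_]+\.", text):
        if m.group().lower() in abbreviations:
            spans.append((m.start(), m.end(), "ABBR"))

    # Pass 3: capitalized words. Lexicon hits are accepted automatically;
    # unknown candidates are "uncertain" and only resolved if a user callback
    # is supplied (semi-automatic mode). Confirmed words update the lexicon.
    for m in re.finditer(r"[^\W\d_]+", text):
        word = m.group()
        if covered(m.start()) or not word[0].isupper() or word.isupper():
            continue
        if word in names:
            spans.append((m.start(), m.end(), "NAME"))
        elif ask_user is not None and ask_user(word):
            names.add(word)
            spans.append((m.start(), m.end(), "NAME"))

    # Pass 4: sentence boundaries. Punctuation inside an earlier span
    # (for example the period of an abbreviation) is not a boundary.
    for m in re.finditer(r"[.!?]", text):
        if covered(m.start()):
            continue
        rest = text[m.end():]
        if rest.strip() == "" or (rest[:1].isspace() and rest.lstrip()[:1].isupper()):
            spans.append((m.start(), m.end(), "SENT_END"))

    return sorted(spans)


if __name__ == "__main__":
    lexicon = {"Jonas"}
    sample = "Prof. Jonas atvyko 2015-03-01. Jis gyvena Kaune."
    # Semi-automatic run: the callback stands in for an interactive prompt.
    for span in annotate(sample, {"prof."}, lexicon, ask_user=lambda w: w == "Kaune"):
        print(span, repr(sample[span[0]:span[1]]))
```

Ordering the passes so that dates and abbreviations are annotated first lets the sentence-boundary pass ignore periods that belong to those spans. Because confirmed words are added to the `names` set, a later fully automatic run with the same lexicon recognizes them without asking again.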