Basic article information

Title: METHODS FOR IDENTIFYING LEXICAL AND GRAMMATICAL DIFFERENCES IN MEDICAL APPLIED TEXTS
Other title: MEETODEID TEKSTIDE LEKSIKAALSETE JA GRAMMATILISTE ERINEVUSTE TUVASTAMISEKS MEDITSIINILISTE TARBETEKSTIDE NÄITEL
Author: Raul Sirel
Journal: Eesti Rakenduslingvistika Ühingu Aastaraamat
Print ISSN: 1736-2563
Electronic ISSN: 2228-0677
Publication year: 2013
Volume: 9
Pages: 265-278
DOI: 10.5128/ERYa9.17
Language: English
Publisher: Eesti Rakenduslingvistika Ühing (Estonian Association for Applied Linguistics)
Abstract: This paper introduces some transparent statistical methods for identifying the characteristics that distinguish patient information leaflets and specification leaflets for human medicines. Although the patient information leaflets and specifications for human medicines have been published by the Estonian State Agency of Medicines and have been digitally available for some time, they have neither been linguistically analysed nor used in the development of language technology applications. It is generally accepted that improving the quality of language technology applications often requires genre-specific approaches, because a model trained on one genre commonly does not produce equally good results when applied to another genre. The aim of the present paper is to identify the linguistic features that differentiate the patient information leaflets and specifications for human medicines from each other and from the language represented in the Balanced Corpus of Estonian. To achieve this, two text corpora containing the texts of 3977 patient information leaflets and 3977 specifications for human medicines were created and statistically compared with each other and with the Balanced Corpus of Estonian. The comparison revealed that patient information leaflets and specifications for human medicines have a relatively limited lexicon compared with the Balanced Corpus. This finding is relevant because confined lexicons tend to facilitate tasks such as information mining and automatic summarisation. Furthermore, the language of patient information leaflets appeared to be somewhat closer to the language represented in the Balanced Corpus than the language of specification leaflets was. Undoubtedly, the collected corpora of patient information leaflets and specifications for human medicines are valuable resources and should be the subject of further research.
Other abstract (translated from Estonian): The article examines a previously unstudied Estonian-language resource: the information leaflets included in medicine packages and the summaries that introduce medicines to doctors. To analyse this material, some transparent statistical methods are used that make it easy to identify the lexical and grammatical differences arising from the type and genre of texts. The aim of such an analysis is, on the one hand, to test how effective these methods are at finding the characteristics that distinguish texts and, on the other, to gather data-driven background knowledge for building language technology applications more effectively.
Keywords: corpus linguistics; text linguistics; text corpora; genre analysis; language technology
Other keywords (translated from Estonian): corpus linguistics; text linguistics; text corpora; genre analysis; language technology; Estonian language