Publisher: Eesti Rakenduslingvistika Ühing (Estonian Association for Applied Linguistics)
Abstract: The aim of this study was to assess different statistical methods for the automatic extraction of collocations from a corpus. To extract the collocations, association measures (AMs) were applied and association scores (ASs) were calculated for the collocation candidates found in the corpus. An AS indicates the collocational strength between two words. An advantage of AMs is that, in addition to the co-occurrence frequency, they also take the marginal frequencies of the collocating words into account. Calculating an AS requires the following data: the co-occurrence frequency, the marginal frequencies of the collocating words, the expected frequency and the sample size. There are different approaches to applying AMs: words can be considered collocational only if they appear within the same collocational span, within one text unit (clause, sentence, utterance), or if they jointly fulfil a syntactic function. This paper attempts to apply AMs to phrasal verb detection in the Corpus of Estonian Dialects (CED). The texts of the CED were morphologically tagged and parsed. Combinations of adverbs and verbs were extracted, and an AS was calculated for every collocation candidate. Experiments were run on three different dialect groups using four different association measures: t-score, Mutual Information (MI), the chi-squared test and log-likelihood. The results indicate that log-likelihood and t-score outperform MI and the chi-squared test. The outcomes of the different measures vary the most in the Northern dialect group. The best measure for dialect data in general is log-likelihood. However, MI and the chi-squared test work well with low-frequency data. In the Northern dialect group the best AM for detecting low-frequency phrasal verbs is MI, whereas in the North-Eastern and Southern groups the chi-squared test works well for the same purpose. To achieve better results, different scores should be combined.
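As an illustration of how the four scores named in the abstract can be derived from the data it lists (co-occurrence frequency, marginal frequencies, expected frequency and sample size), here is a minimal Python sketch. It uses the common textbook formulations of t-score, MI, chi-squared and log-likelihood over a 2×2 contingency table; the abstract does not give the exact variants used in the study, so these formulas are an assumption. The function name `association_scores` and its parameter names are hypothetical, and the sketch assumes a co-occurrence frequency greater than zero and marginal frequencies strictly between 0 and the sample size.

```python
import math


def association_scores(o11, f1, f2, n):
    """Compute four association scores for one word pair.

    o11 -- co-occurrence frequency of the pair (assumed > 0)
    f1  -- marginal frequency of the first word (e.g. the adverb)
    f2  -- marginal frequency of the second word (e.g. the verb)
    n   -- sample size (total number of candidate pairs)
    """
    # Observed 2x2 contingency table derived from the marginals.
    o12 = f1 - o11
    o21 = f2 - o11
    o22 = n - f1 - f2 + o11

    # Expected frequencies under the assumption of independence.
    e11 = f1 * f2 / n
    e12 = f1 * (n - f2) / n
    e21 = (n - f1) * f2 / n
    e22 = (n - f1) * (n - f2) / n

    observed = (o11, o12, o21, o22)
    expected = (e11, e12, e21, e22)

    # t-score: difference between observed and expected, scaled by sqrt(observed).
    t_score = (o11 - e11) / math.sqrt(o11)
    # Mutual Information: log ratio of observed to expected co-occurrence.
    mi = math.log2(o11 / e11)
    # Pearson chi-squared statistic over all four cells.
    chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Log-likelihood ratio; empty cells contribute nothing to the sum.
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

    return {
        "t_score": t_score,
        "mi": mi,
        "chi_squared": chi_squared,
        "log_likelihood": log_likelihood,
    }


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    for name, value in association_scores(30, 100, 200, 10000).items():
        print(f"{name}: {value:.3f}")
```

Because MI divides the observed by the expected frequency, it tends to reward rare pairs, while t-score and log-likelihood weight the amount of evidence more heavily; this is one plausible reading of why the abstract reports MI and chi-squared doing well on low-frequency data.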
Other abstract (translated from Estonian): Statistics that measure the strength of association between words are used in computational linguistics to detect fixed expressions. For two words in a corpus, these statistics yield a value for the strength of the association between them, on the basis of which one can decide whether or not the pair is a fixed expression. An advantage of using these statistics is that they take into account not only the co-occurrence frequency of the words but also the frequencies with which the component words occur separately. In this article I attempt to apply the statistics to the automatic detection of two-member phrasal verbs in the Corpus of Estonian Dialects. Four statistics were tested separately on three dialect groups: the t-score, the mutual information value (MI), the chi-squared statistic and the log-likelihood function.
Keywords: computational linguistics; corpus linguistics; dialectology; methods and tools; statistics; Estonian
Other keywords (Estonian): arvutilingvistika; korpuslingvistika; murdeuurimine; meetodid ja vahendid; statistika; eesti keel