Article information

Title: Classifying Estonian Web Texts
Other title: Eestikeelsete veebitekstide automaatne liigitamine
Authors: Kristiina Vaik; Kadri Muischnek
Journal: Eesti Rakenduslingvistika Ühingu Aastaraamat
Print ISSN: 1736-2563
Electronic ISSN: 2228-0677
Publication year: 2018
Volume: 14
Pages: 215-227
DOI: 10.5128/ERYa14.13
Language: English
Publisher: Eesti Rakenduslingvistika Ühing (Estonian Association for Applied Linguistics)
Abstract: Due to the size of the Internet and the multitude of traditional and new genres, there has been an increasing interest in automatic genre classification. Labelling texts in natural language processing is essential because it allows us to select more appropriate language models for the analysis. The aim of the article is to describe and present the results of automatically classifying Estonian Web 2013 texts. We evaluated the quality of different classification models on our training set and on a manually labelled test set. Most of the research on automatic classification has focused on classifying multiple genres, while our objective was to do a binary classification. We set out to classify Estonian Web 2013 texts based on whether they are canonical or not. For training we used the Balanced Corpus to represent canonical language and the New Media Corpus to represent non-canonical language. Because no binary labelled subcorpus of Estonian Web 2013 texts was available, we compiled one ourselves by labelling it manually. For classification we used different supervised machine learning algorithms, and for features a simple Bag of Words method. The results obtained from the preliminary experiments show that neural networks outperformed other machine learning algorithms, achieving an accuracy of over 0.7. The overall results of this study indicate that in order to increase the accuracy of the classifiers, new features should be added (e.g., POS count, sentences per paragraph, words per sentence, uppercase and lowercase letters per sentence, etc.). Our best model, the neural network classifier, achieved an accuracy of 0.99 on the training set but only a little over 0.74 on the test set. This suggests that future work requires a bigger and more appropriate training set. The manual labelling task showed us that the transition from canonical to non-canonical is very smooth. Current models produce a score between 0 and 1, defining if the item belongs to a class or not. Therefore, the classification models must be programmed to be more predictive so that the predictions can be tuned by selecting a threshold.
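The idea of a Bag of Words classifier that outputs a score between 0 and 1, which is then cut at a tunable threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a tiny multinomial naive Bayes model instead of the neural network named in the abstract, uses invented English toy sentences in place of the Balanced Corpus and the New Media Corpus, and the names `train`, `score`, `classify` and `TRAIN` are hypothetical.

```python
import math
from collections import Counter

# Toy stand-ins for the two training corpora (assumption, not real data):
# label 0 = canonical language, label 1 = non-canonical language.
TRAIN = [
    ("the committee approved the annual report", 0),
    ("the minister presented the new budget", 0),
    ("lol omg this is sooo cool", 1),
    ("omg cant wait lol", 1),
]

def tokenize(text):
    """Bag of Words tokenisation: lowercase and split on whitespace."""
    return text.lower().split()

def train(docs):
    """Collect per-class word counts and document counts."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter()
    for text, label in docs:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab

def score(model, text):
    """Return the estimated probability (0..1) that text is non-canonical."""
    counts, n_docs, vocab = model
    total_docs = n_docs[0] + n_docs[1]
    log_p = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        lp = math.log(n_docs[label] / total_docs)  # class prior
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothed word likelihood
                lp += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        log_p[label] = lp
    # Normalise the two log scores into a probability in a stable way.
    m = max(log_p.values())
    e0 = math.exp(log_p[0] - m)
    e1 = math.exp(log_p[1] - m)
    return e1 / (e0 + e1)

def classify(model, text, threshold=0.5):
    """Label text as non-canonical when its score reaches the threshold."""
    return score(model, text) >= threshold

model = train(TRAIN)
```

Raising `threshold` makes the sketch label fewer texts as non-canonical, and lowering it labels more, which is one way to handle the smooth transition between the two classes that the abstract mentions.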
Other abstract (Estonian): Internet on oluline keeleressurss, mille üheks keeleteaduslikuks ja keeletehnoloogiliseks kasutusvõimaluseks on seal leiduvate tekstide koondamine keelekorpuseks. Kuid täisautomaatselt korjatud korpusega seistakse uudse situatsiooni ees: olemas on palju andmeid, ent pole täpselt teada, millist keelematerjali need sisaldavad. Loomuliku keele uurimise ja töötlemise seisukohalt on vajalik tekstide eristamine tekstiliigiti, sest sellest sõltub sobivate töötlusvahendite valik. Artiklis kirjeldame tekstiliikide eristamise lihtsustatud versiooni: korpuse Estonian Web 2013 (etTenTen13) binaarse klassifitseerimise katset, mille eesmärk oli liigitada tekstid kirjakeele normi järgivateks ja mittejärgivateks. Treeningandmetes kasutasime kirjakeele esindajana Tasakaalus korpust ja kirjakeele normi mitte järgivate tekstide esindajana Uue meedia korpust ning testandmetena käsitsi liigitatud Estonian Web 2013 alamkorpust. Klassifitseerimismudelite loomisel rakendasime erinevaid juhendatud masinõppe algoritme ning tunnustena sõnehulkasid. Klassifitseerimismudelite kvaliteeti hindasime 10-kordse ristvalideerimise teel, kus parima tulemuse andis tehisnärvivõrkudel põhinev algoritm, mis 99% täpsusega liigitas dokumendi õigesse klassi. Seejärel katsetasime mudeleid käsitsi liigitatud Estonian Web 2013 testkorpusel, kus parima tulemuse andis taas tehisnärvivõrkudel põhinev algoritm täpsusega 74%.
Keywords: corpus linguistics; automatic classification; natural language processing; machine learning; genre; corpus; Estonian
Other keywords (Estonian): korpuslingvistika; automaatne liigitamine; klassifitseerimine; keeletöötlus; masinõpe; tekstiliik; keelekorpus; eesti keel