Article Information

Title: Determining Unintelligible Words from their Textual Contexts
Authors: Balázs Pintér; Gyula Vörös et al.
Journal: Procedia - Social and Behavioral Sciences
Print ISSN: 1877-0428
Publication year: 2013
Volume: 73
Pages: 101-108
DOI: 10.1016/j.sbspro.2013.02.028
Language: English
Publisher: Elsevier
Abstract: We propose a method to determine unintelligible words based on the textual context of the word to be determined. As there can be many different possibilities for the word, a robust, large-scale method is needed. The large scale makes the problem sensitive to spurious similarities of contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation to a k-nearest neighbor and a support vector machine based approach, and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 words, both on the Brown corpus and on the British National Corpus. Unintelligible words are often the result of errors in Optical Character Recognition (OCR) algorithms. As the proposed method uses information independent of the information used in OCR, we expect that a combined approach could be very successful, since OCR and the proposed method complement each other.
Keywords: Natural language processing; Structured sparse coding; Word recognition; Distributional hypothesis