Basic article information

  • Title: Determining Unintelligible Words from their Textual Contexts
  • Authors: Balázs Pintér; Gyula Vörös
  • Journal: Procedia - Social and Behavioral Sciences
  • Print ISSN: 1877-0428
  • Publication year: 2013
  • Volume: 73
  • Pages: 101-108
  • DOI: 10.1016/j.sbspro.2013.02.028
  • Language: English
  • Publisher: Elsevier
  • Abstract: We propose a method to determine an unintelligible word from its textual context. As there can be many different candidates for the word, a robust, large-scale method is needed. The large scale makes the problem sensitive to spurious similarities of contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation to a k-nearest neighbor approach and a support vector machine based approach, and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 candidate words, both on the Brown corpus and on the British National Corpus. Unintelligible words are often the result of errors in Optical Character Recognition (OCR) algorithms. As the proposed method uses information independent of the information used in OCR, we expect that a combined approach could be very successful, since OCR and the proposed method complement each other.
  • Keywords: Natural language processing; Structured sparse coding; Word recognition; Distributional hypothesis
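
The group Lasso formulation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each training context is a column of a dictionary matrix `D`, that columns are grouped by the word they belong to, and that the unknown word is predicted as the group with the largest coefficient norm. The function names, the regularization weight `lam`, and the proximal-gradient solver are all hypothetical choices made for illustration.

```python
import numpy as np

def group_soft_threshold(a, groups, thresh):
    # Proximal operator of the group Lasso penalty: shrink each group's
    # coefficient block toward zero, and zero it out if its norm <= thresh.
    out = np.zeros_like(a)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(a[idx])
        if norm > thresh:
            out[idx] = (1.0 - thresh / norm) * a[idx]
    return out

def group_lasso_predict(D, groups, x, lam=0.1, n_iter=500):
    # Minimize 0.5 * ||x - D a||^2 + lam * sum_g ||a_g||_2 by proximal
    # gradient descent (an assumed solver, chosen for brevity).
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = group_soft_threshold(a - step * grad, groups, step * lam)
    # Assumed decision rule: pick the word whose group is most active.
    labels = np.unique(groups)
    norms = np.array([np.linalg.norm(a[groups == g]) for g in labels])
    return labels[np.argmax(norms)], a
```

Because the penalty acts on whole groups, contexts of a word that match the query only by chance tend to be switched off together with the rest of that word's group, which is the structured sparsity the abstract refers to.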