Abstract: We propose a method for determining unintelligible words from their textual context. Because there can be many candidates for a given word, a robust, large-scale method is needed. The large scale makes the problem sensitive to spurious similarities between contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation with k-nearest neighbor and support vector machine based approaches and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 candidate words, on both the Brown corpus and the British National Corpus. Unintelligible words are often the result of errors in Optical Character Recognition (OCR) algorithms. Because the proposed method uses information independent of that used in OCR, we expect that a combined approach could be very successful, as the two complement each other.
Keywords: Natural language processing; Structured sparse coding; Word recognition; Distributional hypothesis
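
As a point of reference for the group Lasso formulation named in the abstract, the generic form of such an objective is sketched here. The abstract does not state the objective, so every symbol is an assumption made for illustration: $x$ stands for a feature vector describing the context of the unknown word, $D$ for a dictionary whose columns are context vectors of known words, $G_w$ for the index set of the columns belonging to candidate word $w$ (out of $W$ candidates), and $\lambda > 0$ for a regularization weight.

```latex
\[
\hat{\alpha} \;=\; \arg\min_{\alpha}\;
\frac{1}{2}\,\lVert x - D\alpha \rVert_2^2
\;+\; \lambda \sum_{w=1}^{W} \lVert \alpha_{G_w} \rVert_2
\]
```

Under these assumptions the penalty tends to set whole groups $\alpha_{G_w}$ to zero, so only a few candidate words remain active. One plausible decision rule, not confirmed by the abstract, is to select the word whose group has the largest norm $\lVert \hat{\alpha}_{G_w} \rVert_2$.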