Journal: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Publication year: 2009
Volume: 2009
Publisher: ACL Anthology
Abstract: As the arm of NLP technologies extends
beyond a small core of languages, techniques
for working with instances of language
data across hundreds to thousands
of languages may require revisiting and recalibrating
the tried and true methods that
are used. One of the NLP techniques that has
been treated as “solved” is language identification
(language ID) of written text.
However, we argue that language ID is
far from solved when one considers input
spanning not dozens of languages, but
rather hundreds to thousands, a number
that one approaches when harvesting language
data found on the Web. We formulate
language ID as a coreference resolution
problem and apply it to a Web harvesting
task for a specific linguistic data type
and achieve a much higher accuracy than
long-accepted language ID approaches.