Journal name: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Publication year: 2012
Volume: 2012
Publisher: ACL Anthology
Abstract: The problem addressed in this paper is to split a given multilingual document into monolingual segments and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.
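
The minimum-description-length formulation solved by dynamic programming can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the language models or the encoding, so everything below is an assumption made for illustration. In particular, the smoothed character-unigram models, the `alphabet_size` and `alpha` parameters, the per-segment cost of `log2(n) + log2(number of languages)` bits, and the names `train_char_model` and `segment` are all hypothetical.

```python
import math
from collections import Counter


def train_char_model(sample, alphabet_size=256, alpha=0.1):
    """Return a function giving the cost in bits of one character.

    Assumption: an add-alpha smoothed character unigram model;
    alphabet_size is a crude stand-in for the true alphabet.
    """
    counts = Counter(sample)
    total = len(sample) + alpha * alphabet_size

    def bits(ch):
        # Counter returns 0 for unseen characters, so smoothing applies.
        return -math.log2((counts[ch] + alpha) / total)

    return bits


def segment(text, models, switch_bits=None):
    """Return [(start, end, lang)] minimising the total description length.

    Assumption: each segment costs switch_bits (boundary position plus
    language label) on top of the bits for its characters.
    """
    langs = list(models)
    n = len(text)
    if n == 0:
        return []
    if switch_bits is None:
        switch_bits = math.log2(n) + math.log2(len(langs))

    # cost[l]: fewest bits to encode text[:i + 1] with text[i] in language l.
    cost = {l: switch_bits + models[l](text[0]) for l in langs}
    back = [dict.fromkeys(langs)]  # back[i][l]: language of text[i - 1]
    for i in range(1, n):
        best_prev = min(langs, key=cost.get)
        new_cost, ptr = {}, {}
        for l in langs:
            stay = cost[l]
            switch = cost[best_prev] + switch_bits
            if stay <= switch:
                new_cost[l], ptr[l] = stay, l
            else:
                new_cost[l], ptr[l] = switch, best_prev
            new_cost[l] += models[l](text[i])
        cost = new_cost
        back.append(ptr)

    # Trace back the cheapest labelling, then merge runs into segments.
    lang = min(langs, key=cost.get)
    labels = [lang]
    for i in range(n - 1, 0, -1):
        lang = back[i][lang]
        labels.append(lang)
    labels.reverse()

    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments


if __name__ == "__main__":
    models = {
        "en": train_char_model("the quick brown fox jumps over the lazy dog"),
        "ru": train_char_model("съешь ещё этих мягких французских булок да выпей чаю"),
    }
    for start, end, lang in segment("hello world привет мир", models):
        print(lang, repr("hello world привет мир"[start:end]))
```

The per-segment cost is what keeps the solution from switching language at every character: a new segment is opened only when the bits saved on the following characters exceed the bits needed to describe the boundary. Character unigrams separate languages only when their character distributions differ clearly, as with different scripts, so a realistic system would need stronger models.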