Journal: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Year: 2012
Volume: 2012
Publisher: ACL Anthology
Abstract: We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).