Abstract: MULTEXT-East is a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, which define the features that describe word-level syntactic annotations; medium-scale morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The most important component is the linguistically annotated corpus consisting of Orwell's novel "1984" in the English original and translations. MULTEXT-East has gone through several editions; the latest is Version 3, whose most important addition is the set of Serbian language resources: the structurally annotated "1984", the morphosyntactic specifications, the morphosyntactic lexicon, and the linguistically annotated "1984". The complete dataset, unique in terms of languages and the wealth of encoding, is extensively documented and freely available for research purposes.
Keywords: natural language processing; language resources; Serbian language; multilinguality