Journal: The International Arab Journal of Information Technology
Print ISSN: 1683-3198
Publication year: 2011
Volume: 8
Issue: 2
Publisher: Zarqa Private University
Abstract: Multilingual natural language processing systems increasingly rely on parallel corpora to improve their output. Parallel corpora are the basic building block for training a statistical natural language processing system and for creating translation and language models. Several systems have been devised that automatically align the words of a pair of sentences, each in a different language. Such systems have been used successfully with European languages. In this paper, one such system is used to align sentences in an English-Arabic corpus. The system performs poorly when given raw, unaligned English-Arabic sentence pairs. This prompted the development of a preprocessing step to be applied to the Arabic sentences. The same corpus was then preprocessed, and a significant improvement is reported when alignment is attempted on the preprocessed, unaligned sentences.
Keywords: Word alignment; sentence alignment; parallel corpora; statistical natural language processing.