Article Information

Title: Perplexity Method on the N-gram Language Model Based on Hadoop Framework
Authors: Tahani Mahmoud Allam; Hatem Abdelkader; Elsayed Sallam et al.
Journal: International Arab Journal of e-Technology
Print ISSN: 1997-6364
Publication year: 2015
Volume: 4
Issue: 2
Publisher: Arab Open University
Abstract: The N-gram language model is used in statistical natural language processing tasks such as machine translation and speech recognition. Evaluating N-gram probabilities requires a testing process. We use a distributed computing platform based on the MapReduce algorithm and HBase tables in Hadoop, an open-source implementation of the MapReduce framework. The comparative query process depends on the NoSQL database, which stores the testing data sets in tables with different structures. The evaluation process applies a MapReduce algorithm to the testing process, acting as a distributed decoder that can process multiple testing texts together. There are two ways to perform the MapReduce query on the testing data: the first is called the forward query and the second the hiding query. We focus on the query response time for single-user runs over three different corpora in the N-gram model. The perplexity method is an appropriate way to estimate the performance of the language model. The perplexity of the testing set is compared with that obtained from the traditional language modeling package SRILM Toolkit. The results are discussed with respect to the choice of HBase table. They demonstrate that the proposed framework provides enhanced performance, such as lower time cost and smaller memory size.
Keywords: Perplexity model; Distributed language models; N-gram model; MapReduce; Hadoop framework; HBase tables; SRILM Toolkit.
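
Note: The abstract names perplexity as the evaluation measure but does not give its formula. The sketch below uses the standard definition, PP = 2^(-(1/N) * sum of log2 P(w_i | w_{i-1})), on a bigram model. It is a minimal single-machine illustration and not the authors' Hadoop/HBase implementation; the add-one smoothing and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count bigram histories and bigrams from a list of training tokens."""
    histories = Counter(tokens[:-1])                 # counts of w_{i-1}
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # counts of (w_{i-1}, w_i)
    vocab_size = len(set(tokens))
    return histories, bigrams, vocab_size


def bigram_prob(prev, word, histories, bigrams, vocab_size):
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size)


def perplexity(test_tokens, histories, bigrams, vocab_size):
    """Perplexity of a test token sequence under the smoothed bigram model."""
    log_sum = 0.0
    n = 0
    for prev, word in zip(test_tokens[:-1], test_tokens[1:]):
        log_sum += math.log2(bigram_prob(prev, word, histories, bigrams, vocab_size))
        n += 1
    if n == 0:
        raise ValueError("need at least two test tokens")
    return 2 ** (-log_sum / n)


if __name__ == "__main__":
    model = train_bigram("a b a b a b".split())
    print(perplexity("a b a".split(), *model))
```

A lower value means the model assigns higher probability to the test text, which is why the measure can be used to compare one model's output against another toolkit's on the same test set.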