Article Information

Title: Big Data Processing Model for Authorship Identification
Authors: Toh Chin Eng; Shafaatunnur Hasan; Siti Mariyam Shamsuddin et al.
Journal: International Journal of Advances in Soft Computing and Its Applications
Print ISSN: 2074-8523
Publication year: 2017
Volume: 9
Issue: 3
Page: 1
Publisher: International Center for Scientific Research and Studies
Abstract: The era of Big Data has arrived, and data volumes in the quintillions are produced daily on average. Data can take many forms, such as images, documents, or movies. Document files include both digitized and handwritten documents, which often raise issues of copyright or ownership. These issues stem from improper authentication, which leads to illegitimate authorship claims over a particular handwritten document. Authorship identification is a sub-area of Document Image Analysis and Recognition (DIAR), whose aim is to analyze and identify document authorship. However, for large-scale collections of document text images, processing time becomes a crucial issue for authorship identification. Therefore, in this study we propose an alternative solution for handling massive amounts of document text images: integrating Hadoop MapReduce and Spark’s MLlib to perform authorship identification through parallelized data processing. MapReduce is used as the platform to pre-process these huge document text images in the Hadoop Distributed File System (HDFS), followed by authorship identification through the Apache Spark machine learning library. The experiments show that the integration is successfully implemented for large volumes of document text images. However, further improvement is needed in the post-analytics of the reduced document text images for better identification.
Keywords: Big Data; Hadoop MapReduce; Spark’s MLlib; Authorship Identification; Handwritten Text