
Article Information

  • Title: MapReduce Platform for Parallel Machine Learning on Large-scale Dataset
  • Authors: Toshihiko Yanase; Keiichi Hiroki; Akihiro Itoh
  • Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
  • Print ISSN: 1346-0714
  • Online ISSN: 1346-8030
  • Publication year: 2011
  • Volume: 26
  • Issue: 5
  • Pages: 621-637
  • DOI: 10.1527/tjsai.26.621
  • Publisher: The Japanese Society for Artificial Intelligence
  • Abstract: We propose a computing platform for parallel machine learning. Learning from large-scale data has become common, so parallelization techniques are increasingly applied to machine learning algorithms in order to reduce calculation time. The main problems of parallelization are implementation cost and calculation overhead. First, we formulate a MapReduce programming model specialized for parallel machine learning. It represents learning algorithms as iterations of the following two phases: applying data to machine learning models and updating model parameters. This model can describe various kinds of machine learning algorithms, such as k-means clustering, the EM algorithm, and linear SVM, with an implementation cost comparable to that of the original MapReduce. Second, we propose a fast machine learning platform that reduces the processing overhead of the iterative procedures of machine learning. Machine learning algorithms repeatedly read the same training data in the data-application phase, so our platform keeps the training data in the local memory of each worker throughout the iterations, which accelerates data access. We evaluate the performance of our platform in three experiments. Our platform executes k-means clustering 2.85 to 118 times faster than the MapReduce approach and shows a 9.51-fold speedup with 40 processing cores compared with 8 cores. We also report the performance of Variational Bayes clustering and linear SVM implemented on our platform. (An illustrative sketch of the two-phase iteration follows this list.)
  • Keywords: large-scale data; machine learning scalability; MapReduce; distributed computing; programming model
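
The abstract describes learning algorithms as iterations of two phases: a data-application phase (a map over training data cached in worker memory) and a parameter-update phase (a reduce that merges the workers' partial results). The Python sketch below is a hypothetical, single-process illustration of that pattern for k-means clustering; it is not the authors' platform or API, and all names (map_phase, reduce_phase, kmeans, workers, iterations) are assumptions made for the example. Each simulated worker keeps its data partition resident across iterations, which is the access pattern the platform is said to accelerate.

# Illustrative sketch only (not the paper's actual API): a single-process
# emulation of the two-phase iterative model described in the abstract,
# using k-means clustering as the example algorithm.
import random

def map_phase(points, centroids):
    # Data-application phase: each "worker" scans its cached partition,
    # assigns every point to the nearest centroid, and emits partial sums.
    partials = {}  # centroid index -> (component-wise sum, count)
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        s, n = partials.get(idx, ([0.0] * len(p), 0))
        partials[idx] = ([a + b for a, b in zip(s, p)], n + 1)
    return partials

def reduce_phase(all_partials, old_centroids):
    # Parameter-update phase: merge the partial sums from every worker
    # and recompute the centroid positions.
    merged = {}
    for partials in all_partials:
        for idx, (s, n) in partials.items():
            ms, mn = merged.get(idx, ([0.0] * len(s), 0))
            merged[idx] = ([a + b for a, b in zip(ms, s)], mn + n)
    new_centroids = []
    for idx, old in enumerate(old_centroids):
        if idx in merged:
            s, n = merged[idx]
            new_centroids.append([x / n for x in s])
        else:
            new_centroids.append(old)  # no points assigned; keep old centroid
    return new_centroids

def kmeans(points, k, workers=4, iterations=10):
    # Partition the training data once; each partition stays resident,
    # mimicking in-memory caching across iterations instead of re-reading
    # the input from storage for every MapReduce job.
    partitions = [points[i::workers] for i in range(workers)]
    centroids = random.sample(points, k)
    for _ in range(iterations):
        all_partials = [map_phase(part, centroids) for part in partitions]
        centroids = reduce_phase(all_partials, centroids)
    return centroids

if __name__ == "__main__":
    random.seed(0)
    data = [[random.gauss(cx, 0.3), random.gauss(cy, 0.3)]
            for cx, cy in [(0.0, 0.0), (3.0, 3.0), (0.0, 3.0)] for _ in range(100)]
    print(kmeans(data, k=3))

In this sketch, only the small per-worker dictionaries of partial sums cross the map/reduce boundary each iteration, while the training points themselves never move after the initial partitioning; that separation is what the abstract's two-phase formulation is meant to capture.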