Article Information

  • Title: An implementation methodology for Neural Network on a Low-end FPGA Board
  • Authors: Kaijie Wei; Koki Honda; Hideharu Amano
  • Journal: International Journal of Networking and Computing
  • Print ISSN: 2185-2847
  • Year: 2021
  • Volume: 11
  • Issue: 2
  • Pages: 172-197
  • Language: English
  • Publisher: International Journal of Networking and Computing
  • Abstract: Artificial Intelligence (AI) has achieved unprecedented success in fields including image, speech, and even video recognition. Because of the models' high computation and storage complexity, most systems process data on power-hungry devices such as CPUs, GPUs, or even TPUs. CPU platforms are weak in computation capacity, while the energy budgets and cost of GPUs and TPUs are often unaffordable for industrial edge computing. Recently, FPGA-based Neural Network (NN) accelerators have become a popular research topic; with their purpose-built architecture, they are regarded as a promising way to surpass GPUs in both speed and energy efficiency. Our work targets a low-end FPGA board, a platform better suited to the energy-efficiency and computational-resource constraints of an autonomous driving car. In this paper, we propose a methodology that maps an NN model onto the board using an HLS description. The design consists of algorithm-level downscaling and hardware optimization. The former shrinks the model through pruning and binarization, balancing model size against accuracy. The latter applies HLS design techniques to each NN component, such as loop unrolling and inter-/intra-level pipelining, to speed up the application on the target board (a minimal sketch of these techniques appears after this list). In a case study of tiny YOLO (You Only Look Once) v3, the model running on a PYNQ-Z1 achieves up to 22x acceleration compared with the PYNQ's ARM CPU, and 3x better energy efficiency than a Xeon E5-2667. To verify the flexibility of our methodology, we extend the work to BinaryConnect and DoReFaNet. Notably, BinaryConnect achieves around 100x acceleration compared with running purely on the PYNQ-Z1 ARM core.
  • Keywords: Neural Network; model compression; HLS streaming; FPGA; object detection
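
Since the abstract names loop unrolling, pipelining, and binarization as its key techniques, the sketch below illustrates how these typically look in Vivado/Vitis HLS C++. It is not the authors' code: the function names (conv_mac, bnn_dot) and the sizes N and K are hypothetical, chosen only to make the example self-contained.

```cpp
// Hedged sketch, not the paper's implementation. Shows the HLS techniques the
// abstract names: loop unrolling, loop pipelining, and a BinaryConnect-style
// binarized multiply-accumulate. All names and sizes are hypothetical.
#include <cstdint>

constexpr int N = 64; // hypothetical number of output channels
constexpr int K = 9;  // hypothetical flattened 3x3 kernel

// Multiply-accumulate for one output pixel of a small convolution layer.
void conv_mac(const int8_t win[N][K], const int8_t in[K], int32_t out[N]) {
    // Partition the arrays so the unrolled iterations can read all taps at once.
#pragma HLS ARRAY_PARTITION variable=win complete dim=2
#pragma HLS ARRAY_PARTITION variable=in complete dim=1
OutCh:
    for (int n = 0; n < N; ++n) {
#pragma HLS PIPELINE II=1 // pipelining: start a new output channel each cycle
        int32_t acc = 0;
Tap:
        for (int k = 0; k < K; ++k) {
#pragma HLS UNROLL // unrolling: all K multiplies happen in parallel
            acc += win[n][k] * in[k];
        }
        out[n] = acc;
    }
}

// Binarized dot product: weights/activations in {-1,+1} packed one per bit
// (bit 1 means +1). Multiplication collapses to XNOR; the sum is a popcount.
int32_t bnn_dot(uint64_t w_bits, uint64_t x_bits, int n_bits) {
    uint64_t agree = ~(w_bits ^ x_bits);            // bit set where signs agree
    if (n_bits < 64) agree &= (1ULL << n_bits) - 1; // drop unused high bits
    int pop = __builtin_popcountll(agree);          // GCC/Clang builtin
    return 2 * pop - n_bits; // agreements minus disagreements = +/-1 sum
}
```

On a low-end part such as the PYNQ-Z1's Zynq-7020, kernels in this style are commonly wrapped in a streaming dataflow between layers, which matches the "HLS streaming" keyword above; the pragma names shown are the standard Vivado/Vitis HLS directives, though the loop structure here is only an assumption about how the paper's components might be organized.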