An FPGA-based Convolution IP Core for Deep Neural Networks Acceleration

Xuan-Quang Nguyen, Cuong Pham-Quoc


The development of machine learning has revolutionized applications such as object detection, image/video recognition, and semantic segmentation. Neural networks, a class of machine learning models, play a crucial role in this progress because of their remarkable improvements over traditional algorithms. However, neural networks are growing deeper and require a significant number of computation operations. Therefore, they usually perform poorly on edge devices, which have limited resources and low performance. In this paper, we investigate a solution for accelerating the neural network inference phase on FPGA-based platforms. We analyze neural network models, their mathematical operations, and the inference phase on various platforms, and we profile the characteristics that affect inference performance. Based on this analysis, we propose an architecture, built around parallelism, data reuse, and memory management, to accelerate the convolution operation, which is used in most neural networks and accounts for most of their computation. We conduct different experiments to validate the FPGA-based convolution core architecture and to compare performance. Experimental results show that the core is platform-independent and that it outperforms a quad-core ARM processor running at 1.2 GHz and a 6-core Intel CPU with speed-ups of up to 15.69× and 2.78×, respectively.
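To make concrete why convolution dominates inference cost, the following is a minimal sketch (in plain Python, with illustrative names; it is not the paper's implementation) of the direct convolution loop nest that FPGA accelerator cores typically parallelize and buffer:

```python
# Naive direct 2-D convolution, the loop nest that dominates CNN inference.
# Shapes and variable names here are illustrative assumptions, not taken
# from the paper's architecture.

def conv2d(ifmap, weights):
    """ifmap: C x H x W input, weights: M x C x K x K filters.
    Returns ofmap: M x (H-K+1) x (W-K+1), stride 1, no padding."""
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M, K = len(weights), len(weights[0][0])
    oh, ow = H - K + 1, W - K + 1
    ofmap = [[[0.0] * ow for _ in range(oh)] for _ in range(M)]
    for m in range(M):                # output channels are independent: parallelizable
        for y in range(oh):
            for x in range(ow):
                acc = 0.0
                for c in range(C):            # input channels
                    for ky in range(K):       # kernel window: weights are reused
                        for kx in range(K):   # across every (y, x) position
                            acc += ifmap[c][y + ky][x + kx] * weights[m][c][ky][kx]
                ofmap[m][y][x] = acc
    return ofmap
```

The six nested loops yield roughly M·C·K²·H·W multiply-accumulates per layer, and the same weights and overlapping input windows are read many times, which is why hardware designs unroll the inner loops into parallel MAC units and keep the reused data in on-chip buffers.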





Copyright (c) 2022 REV Journal on Electronics and Communications
