Optimization and hardware acceleration of the YOLO object detection post-processing algorithm

Zou Zhiwei (邹知炜); Sun Wenhao (孙文浩); Chen Song (陈松)

doi:10.19304/J.ISSN1000-7180.2022.0896

Algorithm optimization and hardware acceleration for YOLO post processing

Abstract

The YOLO family of object detection networks is widely used for its high accuracy and low latency, but accelerating its post-processing has not been studied thoroughly. This work exploits the computational characteristics of YOLO to optimize the post-processing algorithm: (1) the detect layer and post-processing are fused by moving the confidence threshold check ahead of the detect layer computation, which avoids useless computation and communication; (2) combined with model quantization, post-processing is accelerated in hardware on a systolic array. Experiments show that the detect layer convolution workload of YOLOv3 and YOLOv5 is reduced by 87.3% to 99.9%. The accelerator is implemented on a Virtex UltraScale+ VCU112 at a clock frequency of 100 MHz. For YOLOv3, detect layer and post-processing computation is 7.2 to 9.3 times faster than before optimization, with a latency of 1 736 μs when selecting the 5 best boxes out of 3 000 candidates. Compared with existing work, detect layer and post-processing computation is 4.7 to 5.0 times faster, while post-processing requires only 9.9% to 10.5% of the FF resources. Relative to the unoptimized post-processing, the overall inference speed of the sparsified YOLOv3 network improves by 1.2% to 1.3%.
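The abstract does not spell out how the threshold check is moved ahead of the detect layer, so the following is only a minimal sketch of one way the idea could work, not the authors' implementation. It assumes the detect layer is a 1x1 convolution, that channel 4 of each prediction is the objectness logit, and that the confidence test is sigmoid(objectness) > threshold, which is equivalent to comparing the raw logit against logit(threshold). If the final confidence is objectness multiplied by a class probability, filtering on objectness alone is still safe, because that product can never exceed the objectness. The function and variable names (`detect_with_early_threshold`, `cells`, `macs_done`) are hypothetical.

```python
import math


def logit(p):
    """Inverse sigmoid: sigmoid(x) > p is equivalent to x > logit(p)."""
    return math.log(p / (1.0 - p))


def dot(row, features):
    """Multiply-accumulate of one 1x1-convolution output channel."""
    return sum(w * f for w, f in zip(row, features))


def detect_with_early_threshold(cells, weights, biases, conf_threshold):
    """Run a 1x1-convolution detect head, skipping cells that fail the threshold.

    cells: list of per-cell input feature vectors (one anchor per cell here).
    weights, biases: one row per output channel. Channel 4 is ASSUMED to be
        the objectness logit; channels 0-3 the box, the rest class scores.
    Returns (kept, macs_done, macs_dense), where kept is a list of
    (cell_index, output_vector) pairs for cells that passed the threshold.
    """
    obj_channel = 4
    logit_threshold = logit(conf_threshold)
    n_out = len(weights)
    n_in = len(cells[0])
    kept = []
    macs_done = 0
    for index, features in enumerate(cells):
        # Step 1: compute only the objectness channel.
        obj = dot(weights[obj_channel], features) + biases[obj_channel]
        macs_done += n_in
        if obj <= logit_threshold:
            continue  # remaining channels are never computed or transferred
        # Step 2: the cell passed, so compute the remaining channels.
        outputs = [
            obj if c == obj_channel else dot(weights[c], features) + biases[c]
            for c in range(n_out)
        ]
        macs_done += (n_out - 1) * n_in
        kept.append((index, outputs))
    macs_dense = len(cells) * n_out * n_in
    return kept, macs_done, macs_dense


# Toy example: 4 cells, 2 input channels, 7 output channels
# (4 box values + 1 objectness + 2 class scores). All numbers are made up.
cells = [[-3.0, 1.0], [-2.0, 0.5], [2.0, 1.0], [-1.0, 0.0]]
weights = [[0.5, 0.5] for _ in range(7)]
weights[4] = [1.0, 0.0]  # objectness depends on the first input channel only
biases = [0.0] * 7
kept, macs_done, macs_dense = detect_with_early_threshold(cells, weights, biases, 0.5)
print([i for i, _ in kept], macs_done, macs_dense)  # prints: [2] 20 56
```

In this toy case only one of four cells passes, so 20 multiply-accumulates are performed instead of 56. The saving grows with the number of output channels per prediction and with the fraction of cells rejected, which is consistent in spirit with the large reductions reported in the abstract, although the figures here are illustrative only.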
