Introduction
In real-world scenarios, small target faces often appear under challenging conditions such as complex backgrounds, occlusion, and significant scale changes. These factors can cause missed or false detections in face detection results. To address this issue, this article proposes an improved small target face detection algorithm called 4AC-YOLOv5.
The key improvements of the 4AC-YOLOv5 algorithm are as follows:
- Small Target Detection Layer: The algorithm introduces a new layer dedicated to detecting faces at much smaller sizes. By fusing more shallow-layer information, this layer enhances the network's perception of small objects and improves the accuracy of small target detection.
- Asymptotic Feature Pyramid Network (AFPN): The AFPN structure replaces the traditional FPN + PAN approach, avoiding the large information gap between non-adjacent feature levels and better retaining and integrating feature information across different scales.
- C3_MultiRes Module: A new multi-scale residual module improves the expressive power of the network through a multi-branched structure with gradually increasing resolution, while reducing the complexity of the model.
Through these enhancements, the 4AC-YOLOv5 algorithm aims to achieve a balance between detection accuracy and model complexity, enabling robust small target face detection in complex environments.
Benchmark Model: YOLOv5
As one of the widely recognized object detection models, YOLOv5 has made several improvements and optimizations based on YOLOv4, allowing it to better balance detection accuracy and speed. Compared to more recent models like YOLOv7 and YOLOv8, YOLOv5 has lower hardware requirements and faster computing speed, making it more suitable for deployment on intelligent terminal devices.
The YOLOv5 algorithm includes four different-sized models: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Among these, YOLOv5s is the smallest model in the series, with the fewest layers and the lowest computational complexity, making it the best suited to devices with limited computing resources. Therefore, the YOLOv5s algorithm is selected as the baseline for further research in this article.
The main structure of YOLOv5s includes the following components:
- Input: Responsible for receiving image data and preprocessing it to adapt to the model’s input requirements.
- Backbone: Adopts the CSPDarknet structure, which is responsible for extracting image features, transforming low-level pixel information into high-level semantic features.
- Neck: Employs the FPN + PAN structure to integrate and fuse feature maps of different scales.
- Head: Contains three detection layers of different sizes, using the GIoU loss function for bounding box regression (a brief GIoU sketch follows this list) and an enhanced NMS algorithm for final target localization.
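As a concrete reference for the regression loss mentioned above, the following is a minimal sketch of the GIoU computation for two axis-aligned boxes in (x1, y1, x2, y2) format. Function and variable names are illustrative and do not come from the YOLOv5 code base, which implements this in vectorized PyTorch.

```python
# Minimal GIoU sketch for two axis-aligned boxes (x1, y1, x2, y2).
# Names are illustrative; the actual YOLOv5 implementation differs in detail.
def giou(box_a, box_b, eps=1e-7):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    # Union area
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c_area = cw * ch

    # GIoU = IoU - |C \ (A ∪ B)| / |C|
    return iou - (c_area - union) / (c_area + eps)

# The regression loss is then 1 - GIoU, summed over matched anchors.
```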
While the YOLOv5 model performs well in general object detection tasks, it still faces challenges in accurately detecting small target faces, which are more prone to missed or false detection due to their small size and low resolution.
Proposed Improvements
To address the limitations of the YOLOv5 model in small target face detection, the 4AC-YOLOv5 algorithm introduces the following key improvements:
1. Small Target Detection Layer
The benchmark YOLOv5 model has three detection layers of different sizes (80×80, 40×40, and 20×20). To better handle small target faces, the 4AC-YOLOv5 algorithm introduces an additional small target detection layer with a size of 160×160.
This small target detection layer fuses shallower information, providing richer location details and enhancing the network’s perception of small targets. By incorporating this additional layer, the 4AC-YOLOv5 model can more accurately identify and locate small faces in complex scenes with significant scale variations.
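The relationship between stride and detection-grid size can be made concrete with a short calculation. The sketch below assumes the YOLOv5 default 640×640 input, under which the added head corresponds to a stride-4 feature map; these values are inferred from the grid sizes quoted above rather than stated explicitly in the original.

```python
# Detection-grid sizes for an assumed 640x640 input.
# Stride 4 corresponds to the added 160x160 small-target layer;
# strides 8, 16, and 32 give the original 80x80, 40x40, and 20x20 layers.
input_size = 640
strides = [4, 8, 16, 32]
grid_sizes = [input_size // s for s in strides]
print(grid_sizes)  # [160, 80, 40, 20]
```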
Furthermore, the K-means clustering algorithm is used to regenerate 12 preset anchor box sizes, three for each of the four detection layers. These anchors better match the target scene, reduce the deviation between targets and anchor boxes, and further improve detection precision.
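A minimal sketch of K-means anchor clustering over ground-truth box sizes is shown below. It uses plain Euclidean distance and NumPy for brevity, whereas YOLOv5's own autoanchor routine uses an IoU-based metric with a genetic refinement step, so this illustrates the idea rather than the exact procedure; all names are hypothetical.

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor sizes.

    wh: array of shape (N, 2) with box widths and heights in pixels.
    Returns k anchors sorted by area (smallest first), to be split
    three per detection layer from the 160x160 level down to 20x20.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to the nearest anchor center.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned boxes.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]

# Example usage with synthetic box sizes:
# anchors = kmeans_anchors(np.random.uniform(8, 320, size=(5000, 2)))
```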
2. Asymptotic Feature Pyramid Network (AFPN)
The YOLOv5 model adopts the classic FPN + PAN structure for feature fusion in the neck. FPN extracts rich feature information at different scales through a top-down path, and PAN aggregates features at different scales through a bottom-up path. However, the upsampling involved can cause important details in the original feature maps to be lost or degraded, weakening the transfer and retention of information between non-adjacent levels.
To address this issue, the 4AC-YOLOv5 algorithm introduces the Asymptotic Feature Pyramid Network (AFPN) structure. In AFPN, low-level features are gradually fused with high-level features through upsampling and residual connections, allowing high-level semantic information to be better integrated into the fusion process.
Moreover, the AFPN structure dynamically adjusts the weights of features through adaptive spatial feature fusion (ASFF) operations, ensuring that targets of different scales receive appropriate attention, further improving the detection performance of small targets.
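To make the adaptive weighting idea concrete, here is a minimal PyTorch sketch of an ASFF-style fusion of three feature maps that have already been resized to a common resolution and channel count. Module and parameter names are illustrative assumptions, not the exact 4AC-YOLOv5 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFuse(nn.Module):
    """ASFF-style fusion: learn per-pixel weights for three feature maps
    that share the same spatial size and channel count."""

    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution produces one weight map per input level.
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)
        )

    def forward(self, feats):  # feats: list of 3 tensors of shape (B, C, H, W)
        # Stack the per-level weight logits and normalize across levels.
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        weights = F.softmax(logits, dim=1)  # (B, 3, H, W)
        # Weighted sum of the input feature maps.
        return sum(weights[:, i:i + 1] * feats[i] for i in range(3))

# Example: three aligned feature maps of shape (1, 128, 40, 40)
# fused = ASFFFuse(128)([f0, f1, f2])
```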
3. C3_MultiRes Module
The C3 module used in the YOLOv5 model consists of convolutional layers arranged in multiple branches with cross-layer connections. It extracts features from the input data while reducing the parameters and computational complexity of the convolutional layers, improving the model's efficiency.
Building on the C3 module, the 4AC-YOLOv5 algorithm proposes a new multi-scale residual module called C3_MultiRes. This module incorporates the multi-scale residual structure of the Res2Net model, which can effectively capture features at different spatial scales by decomposing the convolution into multiple sub-modules and connecting them.
The C3_MultiRes module improves the expressive power of the network by introducing a multi-branched structure and gradually increasing resolution, while ensuring the efficient transmission of features and the full utilization of information. This allows the model to better adapt to faces of different scales, gestures, and complex backgrounds, further enhancing the detection accuracy of small target faces.
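The multi-scale residual idea borrowed from Res2Net can be sketched as follows: the input channels are split into groups, each group after the first is convolved and added to the output of the previous group, and the results are concatenated. This is a simplified illustration under assumed names, not the exact C3_MultiRes module.

```python
import torch
import torch.nn as nn

class MultiResBlock(nn.Module):
    """Simplified Res2Net-style block: split the channels into `scales` groups,
    pass them through a chain of 3x3 convolutions with hierarchical residual
    connections, then concatenate the outputs."""

    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        # One 3x3 convolution per group except the first, which is passed through.
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1) for _ in range(scales - 1)
        )

    def forward(self, x):
        splits = torch.chunk(x, self.scales, dim=1)
        outputs = [splits[0]]
        prev = None
        for i, conv in enumerate(self.convs):
            # Each branch receives its own split plus the previous branch's output.
            inp = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = conv(inp)
            outputs.append(prev)
        # Concatenate the groups and keep a residual connection to the input.
        return torch.cat(outputs, dim=1) + x

# Example: y = MultiResBlock(128)(torch.randn(1, 128, 40, 40))
```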
Experimental Evaluation
The 4AC-YOLOv5 algorithm was evaluated using the WiderFace dataset, a widely recognized face detection benchmark. The dataset includes over 32,000 images and nearly 400,000 high-precision labeled faces, with each face marked with detailed information such as illumination, occlusion, and posture.
The faces in the WiderFace dataset are divided into three scales based on face height in pixels: small (10-50 pixels), medium (50-300 pixels), and large (more than 300 pixels). The dataset is further divided into three difficulty levels: easy, medium, and hard, with the hard subset containing a large number of small target faces, aligning well with the focus of this research.
The experimental results show that the 4AC-YOLOv5 algorithm achieves detection accuracies of 94.54%, 93.08%, and 84.98% on the easy, medium, and hard levels of the WiderFace dataset, respectively. These results outperform the benchmark YOLOv5 model, which had accuracies of 94.43%, 92.82%, and 83.08% on the same levels.
Furthermore, the 4AC-YOLOv5 model demonstrates reduced parameters (7.013 million) and computational effort (5.909 GFLOPs) compared to the benchmark YOLOv5 model, indicating that the proposed improvements have enhanced the model’s efficiency and comprehensive performance.
To further validate the robustness of the 4AC-YOLOv5 model, it was also evaluated on the FDDB face detection dataset, which contains 5,171 labeled faces with various poses, expressions, and lighting conditions. The results show that the 4AC-YOLOv5 model achieves a true positive rate (TPR) of 0.990 at 1,000 false positives (FP), outperforming the benchmark YOLOv5 model.
Conclusion
In this article, we have presented the 4AC-YOLOv5 algorithm, an improved small target face detection model based on the YOLOv5 framework. The key enhancements include:
- Small Target Detection Layer: Introducing an additional layer dedicated to detecting smaller faces, improving the model’s perception and localization of small targets.
- Asymptotic Feature Pyramid Network (AFPN): Replacing the traditional FPN + PAN structure with AFPN, reducing information loss between non-adjacent feature levels and enhancing the integration of multi-scale feature information.
- C3_MultiRes Module: Proposing a new multi-scale residual module to improve the expressive power of the network while reducing the complexity of the model.
The experimental results on the WiderFace and FDDB datasets demonstrate that the 4AC-YOLOv5 algorithm achieves higher detection accuracy, especially for small target faces, while maintaining a smaller model size and lower computational requirements compared to the benchmark YOLOv5 model.
The improvements made in the 4AC-YOLOv5 algorithm provide a practical and efficient solution for small target face detection in complex real-world environments, addressing the challenges of occlusion, scale changes, and intricate backgrounds. This research contributes to the ongoing advancements in computer vision and object detection, paving the way for more robust and accurate face detection applications.
In the future, we plan to further explore the integration of the 4AC-YOLOv5 model into embedded devices and its applications in various outdoor scenarios, leveraging its strong performance and lightweight design.