Modular YOLOv8 Optimization for Real-Time UAV Maritime Rescue

Introduction

The task of UAV-based maritime rescue object detection faces two significant challenges: accuracy and real-time performance. The YOLO series models, known for their streamlined structure and fast inference, offer promising solutions for this task. However, existing YOLO-based UAV maritime rescue object detection methods tend to prioritize high accuracy, often at the expense of real-time performance, ease of implementation, and extensibility.

This study proposes a modular plug-and-play optimization approach based on the YOLOv8 framework, aiming to enhance real-time performance while maintaining high accuracy for UAV maritime rescue object detection. The proposed optimization modules are flexible, easy to implement, and extendable. In experiments on the large-scale, publicly available SeaDronesSee dataset, our method achieved a 13.53% improvement in accuracy over YOLOv8x while reducing computational cost by 85.63%. It also surpassed the detection speed of the two-stage detector in the official SeaDronesSee codebase by more than 20 times while maintaining comparable accuracy.

Furthermore, our analysis of the experimental results highlights differences in detection difficulty among various objects and potential biases within the dataset.

Background and Challenges

The task of UAV-based maritime rescue object detection poses two primary challenges for the field of computer vision: accuracy and real-time performance.

Compared with conventional images of everyday scenes, maritime rescue images captured by UAVs from a high-altitude “bird’s-eye view” lack a fixed “sky above, ground below” orientation. The visual appearance and size of boats, personnel, and rescue equipment in this unique scene vary with time, weather, altitude, and camera angle, posing significant challenges to detection accuracy.

Moreover, the high real-time requirements of rescue missions and the constraints of UAV equipment demand streamlined model structures and rapid detection performance. To address these challenges, the SeaDronesSee dataset, specifically focused on this scenario and task, was constructed and has attracted extensive research.

Among the proposed methods, YOLO-based models are known for their streamlined structure and real-time performance. However, possibly because the SeaDronesSee leaderboard ranks by accuracy alone, current YOLO-based research emphasizes refining model structures in pursuit of maximum accuracy, somewhat neglecting real-time performance and ease of implementation and extension.

Proposed Modular Optimization Approach

In response to these issues, our research is based on the latest stable version of the YOLO series, YOLOv8. Leveraging YOLOv8’s modular design, we propose a modular plug-and-play optimization approach, focusing primarily on enhancing real-time performance while maintaining accuracy.

The key components of our proposed approach are:

1. Enhancing the PANet Structure in the Neck Section

We add an extra block to both the top-down and bottom-up pathways of the Path Aggregation Network (PANet), concatenating the upsampled features with the backbone FPN’s P2 features and adding corresponding detection heads. This enhancement directly exploits the P2 features, whose low-level local detail benefits accurate small-object detection.
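
As an illustration of the extra top-down fusion step, the sketch below shows P3-level features being upsampled and concatenated with the higher-resolution P2 map. The channel widths and the single 3×3 fusion convolution are illustrative assumptions, not the study’s actual configuration.

```python
import torch
import torch.nn as nn

class P2TopDownFusion(nn.Module):
    """One extra top-down PAN step: upsample the P3-level map and concatenate
    it with the backbone's high-resolution P2 map, so a P2-level detection
    head can see low-level local detail. Channel sizes are illustrative."""
    def __init__(self, p3_ch: int = 128, p2_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(p3_ch + p2_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, p3: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.up(p3), p2], dim=1))

# For a 640x640 input: P3 is at stride 8 (80x80), P2 at stride 4 (160x160).
p3, p2 = torch.randn(1, 128, 80, 80), torch.randn(1, 64, 160, 160)
print(P2TopDownFusion()(p3, p2).shape)  # torch.Size([1, 64, 160, 160])
```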

2. Incorporating Attention Mechanisms Post-Upsample Layers in the Neck Section

Adding attention mechanisms, specifically the Convolutional Block Attention Module (CBAM), after each upsample layer in the Neck section helps the model focus on important features in both the channel and spatial dimensions while suppressing irrelevant and redundant information.
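
CBAM itself is a published module (Woo et al., 2018). A minimal PyTorch sketch follows, using the commonly cited defaults of a 16× channel reduction and a 7×7 spatial kernel; these hyperparameters are assumptions, not settings reported by this study.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sa(self.ca(x))

print(CBAM(64)(torch.randn(1, 64, 160, 160)).shape)  # torch.Size([1, 64, 160, 160])
```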

3. Integrating Swin Transformer at the P3 Position of the Backbone FPN

Introducing the Swin Transformer at the P3 position of the Backbone improves the model’s feature extraction and generalization capabilities, aiming to enhance the detection of small and multi-scale objects.
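
The key ingredient of the Swin Transformer is self-attention computed within local windows, which keeps the cost linear in image size. The sketch below shows plain (non-shifted) window attention with illustrative dimensions; a full Swin block also adds shifted windows, relative position bias, an MLP, and residual connections.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Swin-style windowed self-attention sketch: split the feature map into
    non-overlapping windows and run multi-head self-attention inside each."""
    def __init__(self, dim: int = 128, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        ws = self.window
        # Partition into (B * num_windows, ws*ws, C) token sequences.
        t = x.view(b, c, h // ws, ws, w // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        t, _ = self.attn(t, t, t)
        # Reverse the partition back to (B, C, H, W).
        t = t.reshape(b, h // ws, w // ws, ws, ws, c)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

print(WindowAttention()(torch.randn(1, 128, 80, 80)).shape)  # (1, 128, 80, 80)
```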

4. Introducing GAM at the P5 Position of the Backbone FPN

Incorporating the Global Attention Mechanism (GAM) at the P5 position enables the model to capture long-range dependencies, thereby enhancing its ability to detect large objects.
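
A minimal sketch of GAM as described by Liu et al. (2021): a channel sub-module that applies a shared MLP across the channel dimension at every spatial position, followed by a spatial sub-module built from two 7×7 convolutions. The 4× reduction ratio is the paper’s common default, not necessarily this study’s setting.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Global Attention Mechanism sketch: channel attention via a shared MLP
    over the channel axis, then spatial attention via two 7x7 convolutions."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, channels)
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, 7, padding=3), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 7, padding=3), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: apply the MLP to each position's channel vector.
        ca = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(ca)
        # Spatial attention: large 7x7 kernels capture wider spatial context.
        return x * torch.sigmoid(self.spatial(x))

print(GAM(128)(torch.randn(1, 128, 20, 20)).shape)  # torch.Size([1, 128, 20, 20])
```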

These four modular plug-and-play optimization modules encompass various parts of the YOLOv8 backbone’s FPN structure and the Neck section’s PANet structure. Each module is independent, allowing for flexible disassembly and combination in ablation experiments.

Experiments and Results

We utilized the large-scale, publicly accessible UAV-based maritime search and rescue object detection dataset, SeaDronesSee, to validate our proposed method.

The SeaDronesSee dataset covers ships, several types of personnel, and rescue equipment in this specialized scenario. Its images were captured over open water by representative UAVs at varying altitudes, angles, and times of day.

Our statistical analysis of the dataset revealed significant size differences among various objects, with the boat category being 4 to 7 times larger than the swimmer and floater categories, and 31 times larger than the life jacket category. Additionally, due to varying UAV shooting altitudes and angles, objects of the same category exhibit considerable size differences in different images.
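
Per-category size statistics like these can be reproduced directly from the dataset’s COCO-style annotations. The sketch below is a minimal version; the annotation path is a hypothetical placeholder that depends on where the dataset is unpacked.

```python
import json
from collections import defaultdict

# Hypothetical path to a SeaDronesSee COCO-style annotation file.
with open("annotations/instances_train.json") as f:
    coco = json.load(f)

names = {c["id"]: c["name"] for c in coco["categories"]}
areas = defaultdict(list)
for ann in coco["annotations"]:
    w, h = ann["bbox"][2], ann["bbox"][3]  # COCO bbox: [x, y, width, height]
    areas[ann["category_id"]].append(w * h)

for cid, vals in sorted(areas.items()):
    mean = sum(vals) / len(vals)
    print(f"{names[cid]:>16}: mean area {mean:10.1f} px^2  (n={len(vals)})")
```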

We conducted experiments on a Windows 11 machine equipped with an RTX 4090 GPU, utilizing Python 3.9, PyTorch 2.3.0, and the official YOLOv8 code repository. The experiments were configured with a maximum of 350 training epochs and an early stopping criterion set at 100 epochs.
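
With the official Ultralytics package, this setup corresponds to a training call like the one below; the dataset config filename is a hypothetical placeholder for a YAML file pointing at the local SeaDronesSee split.

```python
from ultralytics import YOLO

# Baseline run with the settings reported above.
model = YOLO("yolov8s.pt")
model.train(
    data="seadronessee.yaml",  # hypothetical dataset config
    epochs=350,                # maximum training epochs
    patience=100,              # early stopping after 100 epochs without improvement
    device=0,                  # single GPU (RTX 4090 in our setup)
)
```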

To comprehensively evaluate the model’s performance, we used the following metrics:

  1. GFLOPs: Measuring the computational complexity of the model.
  2. AP50: Representing the average precision at an IoU threshold of 0.5.
  3. AP50-95: Providing a more comprehensive evaluation of the model’s detection performance by averaging the precision across multiple IoU thresholds, ranging from 0.5 to 0.95.
  4. FPS: Indicating the number of image frames the model processes per second during inference (a minimal measurement sketch follows this list).
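
FPS figures depend heavily on how they are measured. The sketch below is a generic way to benchmark a PyTorch detector, with GPU warmup and explicit synchronization; it is not the study’s exact benchmarking code.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, img_size: int = 640,
                n: int = 200, warmup: int = 20) -> float:
    """Time n single-image forward passes after warming up the device."""
    device = next(model.parameters()).device
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):          # warmup passes stabilize clocks and caches
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()     # ensure all queued GPU work is done
    start = time.perf_counter()
    for _ in range(n):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return n / (time.perf_counter() - start)
```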

Our ablation study results showed that the modular plug-and-play optimization approach significantly enhances the performance of the YOLOv8 model. The YOLOv8-PA-CBAM-s model, which incorporates the PANet enhancement and CBAM attention modules, achieved the highest average precision (AP50-95) of 59.7%, outperforming the default YOLOv8-s model by 13.53%.

Furthermore, when comparing the optimized models with various YOLOv8 scales and two-stage detectors, we found that our optimized models maintained comparable accuracy while significantly improving inference speed. The YOLOv8-PA-STP3-GAM-CBAM-s model, for instance, surpassed the detection speed of the two-stage detector in the official SeaDronesSee codebase by more than 20 times while maintaining similar accuracy.

Our analysis of the experimental results also revealed differences in detection difficulty among various object categories. The boat category was the easiest to detect, while the swimmer on boat, floater on boat, and life jacket categories were the most challenging.

Conclusion and Future Work

In this study, we proposed a modular plug-and-play optimization approach based on the YOLOv8 framework, aiming to enhance real-time performance while maintaining high accuracy for UAV maritime rescue object detection. Our experiments on the SeaDronesSee dataset demonstrated the effectiveness of our approach, achieving significant improvements in both accuracy and speed compared to the default YOLOv8 model and state-of-the-art two-stage detectors.

However, our research is not without limitations. The discrepancy between experimental data and real-world scenarios, as well as the integration and deployment of the algorithm with specific UAV hardware and control systems, are areas that require further investigation.

Future research directions include re-dividing and validating the SeaDronesSee dataset to ensure fairness and stability in model evaluation, further exploring and optimizing the modular plug-and-play YOLOv8 models to achieve even higher speed and accuracy, and deploying the models in real-world applications to verify their actual performance.

By leveraging the modular design of YOLOv8 and our proposed optimization approach, we aim to provide a flexible and scalable solution for fast and accurate maritime rescue object detection, ultimately contributing to the advancement of UAV-based emergency response and search and rescue operations.
