MMA-Net: Multi-Modal Attention Network for 2-D Object Detection in Autonomous Driving

Abstract

Autonomous driving relies heavily on sensor data for environment perception. Heterogeneous sensors such as lidar, radar, and cameras have complementary strengths and limitations, so relying on any single sensor restricts the effectiveness of an autonomous driving system. Integrating data from such heterogeneous sensors, however, is challenging because of the differences in their representations. This article presents a deep learning network built on a modality-agnostic multi-modal fusion architecture. We first learn fine-grained representations of each modality independently using modality-specific feature encoders. A multi-modal attention network (MMA-Net) is then proposed to fuse the data from the heterogeneous modalities. MMA-Net fuses multi-modal sensor data by jointly exploiting the inter-modality and intra-modality relationships among the camera, lidar, and radar sensors. The effectiveness of the proposed fusion architecture is demonstrated with 2-D object detection metrics through extensive experiments on a dataset generated using the CARLA simulator.
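
The abstract does not give implementation details, but the core idea (joint attention over tokens produced by modality-specific encoders, which lets one attention map capture both intra-modality and inter-modality relationships) can be sketched minimally as below. All module names, dimensions, and encoder choices here are illustrative assumptions, not the paper's actual MMA-Net design.

```python
# A minimal sketch of attention-based multi-modal fusion in PyTorch.
# Everything here (token counts, feature dimensions, linear encoders)
# is a hypothetical stand-in for the paper's unspecified components.
import torch
import torch.nn as nn


class MultiModalAttentionFusion(nn.Module):
    """Fuse per-modality feature tokens with joint self-attention:
    tokens attend within their own sensor (intra-modality) and
    across sensors (inter-modality) in a single attention map."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Hypothetical modality-specific encoders; real ones would be
        # e.g. a CNN backbone (camera) or a point network (lidar).
        self.cam_enc = nn.Linear(512, dim)
        self.lidar_enc = nn.Linear(128, dim)
        self.radar_enc = nn.Linear(64, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam, lidar, radar):
        # Each input: (batch, num_tokens_m, raw_dim_m). Project all
        # modalities into a shared embedding space, then concatenate.
        tokens = torch.cat(
            [self.cam_enc(cam), self.lidar_enc(lidar), self.radar_enc(radar)],
            dim=1,
        )
        # Joint self-attention over the concatenated token sequence.
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)  # residual connection + norm


if __name__ == "__main__":
    model = MultiModalAttentionFusion()
    cam = torch.randn(2, 100, 512)    # e.g. flattened image features
    lidar = torch.randn(2, 64, 128)   # e.g. pooled point features
    radar = torch.randn(2, 16, 64)    # e.g. radar detection features
    print(model(cam, lidar, radar).shape)  # torch.Size([2, 180, 256])
```

Concatenating the tokens before self-attention is one common way to model intra- and inter-modality relations jointly; the published MMA-Net may instead use separate cross-attention branches or another fusion scheme.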

Publication
2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Shubh Goel