Spatiotemporal Audio-Visual Robot Learning


The instinctive spatial audio-visual learning ability of humans and some animals has long been a source of inspiration for embodied AI and robotics. However, adopting this bio-inspired ability in artificial agents introduces major theoretical and technical challenges. Processing multiple data streams with diverse spatiotemporal representations, such as spatial audio-visual data, is beyond the capability of conventional Machine Learning and unimodal Deep Learning methods. These challenges expose hidden barriers to multimodal data integration, leading us to a novel Multimodal Machine Learning perspective for facilitating spatial audio-visual interactions.

Date: 2024 - 2028

Persons participating in the project:

  • PIs: Dr. Francisco Cruz, A/Prof. Vidhyasaharan Sethu, Dr. Shadi Abpeikar
  • Associates: Hadha Afrisal
  • Corresponding contact: hadha.afrisal@unsw.edu.au

Research areas:
  • Multimodal Machine Learning
  • Bio-Inspired Robot Learning
  • Spatiotemporal Audio-Visual Segmentation
  • Human-Robot Collaboration

Description:
Although humans and some animals can instinctively perform spatial audio-visual learning, adopting this capability in artificial agents, such as robots, poses theoretical and technical challenges. On the theoretical side, few theories address how to integrate the audio-visual modalities optimally and efficiently for robots’ spatial learning using a computational approach. On the technical side, audio-visual learning must also be robust and adaptive to noise and non-ideal conditions for deployment in real-world settings.

One of the most active research areas shedding light on spatiotemporal audio-visual learning is Audio Visual Segmentation (AVS). AVS aims to segment, at the pixel level, the regions of objects that emit sounds. However, implementing AVS for robot learning in real-world environments faces several challenges: suboptimal audio-visual cross-modal fusion, overlapping and misalignment in audio-visual matching, complex spatiotemporal correspondences, and instability in complex audio-visual scenarios. Current mainstream AVS methods also focus solely on extracting 2D pixel-wise segments of sounding objects, without estimating the relative distance of the sounding object, which limits their use in 3D real-world settings.
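
To make the gap between a 2D sounding-object mask and a distance estimate concrete, the following is a minimal illustrative sketch, not the project's method. It assumes a toy setting in which each pixel already has a visual feature vector, the audio clip has a single embedding, and a per-pixel depth map is available. The names `sounding_mask` and `relative_distance`, the dot-product scoring, and the median-depth heuristic are all hypothetical choices made for illustration.

```python
import math
from statistics import median


def sounding_mask(pixel_feats, audio_feat, threshold=0.5):
    """Toy AVS step (assumption): mark a pixel as 'sounding' when the sigmoid
    of the dot product between its visual feature and the audio embedding
    exceeds the threshold."""
    mask = []
    for row in pixel_feats:
        mask_row = []
        for feat in row:
            score = sum(v * a for v, a in zip(feat, audio_feat))
            prob = 1.0 / (1.0 + math.exp(-score))
            mask_row.append(prob > threshold)
        mask.append(mask_row)
    return mask


def relative_distance(mask, depth):
    """Hypothetical heuristic: summarize the sounding object's distance as the
    median depth over the masked pixels (None if the mask is empty)."""
    values = [
        d
        for mask_row, depth_row in zip(mask, depth)
        for m, d in zip(mask_row, depth_row)
        if m
    ]
    return median(values) if values else None


# Toy 2x2 "image": each pixel carries a 2-dimensional visual feature.
pixel_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, -1.0]],
]
audio_feat = [1.0, 0.0]            # assumed audio embedding
depth = [[2.0, 5.0], [3.0, 7.0]]   # assumed per-pixel depth values

mask = sounding_mask(pixel_feats, audio_feat)
distance = relative_distance(mask, depth)
print(mask)
print(distance)
```

In this toy setup the mask alone says which pixels belong to the sounding object, while the second function attaches a distance to that region. A real system would replace both steps with learned components.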



Selected Publications:
Afrisal, H., Abpeikar, S., & Cruz, F. (2025, November). Depth-Aware Audio Visual Segmentation with Geometry-Heuristic Cross Attention. In Australasian Joint Conference on Artificial Intelligence (pp. 187-199). Singapore: Springer Nature Singapore.