A2R2 Group | Autonomous Agents and Robotics Research Group

Spatiotemporal Audio-Visual Robot Learning

The instinctive spatial audio-visual learning ability of humans and some animals has long been a source of inspiration for embodied AI and robotics. However, adopting this bio-inspired ability in artificial agents introduces major theoretical and technical challenges. Processing multiple data streams with diverse spatiotemporal representations, such as spatial audio-visual data, is beyond the capability of conventional Machine Learning and unimodal Deep Learning methods. These challenges expose hidden barriers to multimodal data integration, leading us to a novel perspective on Multimodal Machine Learning that facilitates spatial audio-visual interactions.

Date: 2024 - 2028

Persons participating in the project:

PIs: Dr. Francisco Cruz, A/Prof. Vidhyasaharan Sethu, Dr. Shadi Abpeikar
Associates: Hadha Afrisal
Corresponding contact: hadha.afrisal@unsw.edu.au

Research areas:

Multimodal Machine Learning
Bio-Inspired Robot Learning
Spatiotemporal Audio-Visual Segmentation
Human-Robot Collaboration

Description:
Although humans and some animals can instinctively perform spatial audio-visual learning, adopting this capability in artificial agents, such as robots, poses theoretical and technical challenges. On the theoretical side, few theories explain how to integrate the audio-visual modalities optimally and efficiently for robots' spatial learning using a computational approach. On the technical side, audio-visual learning must also be robust and adaptive to noise and non-ideal conditions before it can be deployed in real-world settings.

One of the most active research areas shedding light on spatiotemporal audio-visual learning is Audio Visual Segmentation (AVS). AVS aims to segment, at the pixel level, the regions of objects that emit sound. However, implementing AVS for robot learning in real-world environments raises several challenges: suboptimal audio-visual cross-modal fusion, overlap and misalignment in audio-visual matching, complex spatiotemporal correspondences, and instability in complex audio-visual scenarios. Current mainstream AVS methods also focus solely on extracting 2D pixel-wise segments of sounding objects without estimating the relative distance of the sounding object, which limits their applicability in 3D real-world settings.
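To make the AVS task concrete, the following toy sketch marks the pixels whose visual feature aligns with an audio embedding, then reads a relative distance for the sounding region from a depth map. It is only an illustration under stated assumptions and is not the method of the publication listed on this page: the function names (`sounding_mask`, `mean_depth`), the shared embedding space, the fixed threshold, and the availability of a per-pixel depth map are all hypothetical, and plain cosine-similarity thresholding stands in for the learned cross-modal fusion that real AVS models use.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sounding_mask(visual_feats, audio_emb, threshold=0.5):
    """Binary H x W mask: 1 where a pixel's feature aligns with the audio embedding.

    visual_feats: H x W grid of per-pixel feature vectors (hypothetical encoder output).
    audio_emb: one audio feature vector, assumed to live in the same embedding space.
    """
    return [[1 if cosine(px, audio_emb) > threshold else 0 for px in row]
            for row in visual_feats]

def mean_depth(mask, depth):
    """Average depth over masked pixels, or None if the mask is empty."""
    values = [d for mrow, drow in zip(mask, depth)
              for m, d in zip(mrow, drow) if m]
    return sum(values) / len(values) if values else None

# Toy 2 x 3 scene with 2-D features; the audio embedding points along the first axis.
visual = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.2], [0.1, 0.9]],
]
audio = [1.0, 0.0]
depth = [  # assumed per-pixel depth in metres
    [2.0, 2.2, 5.0],
    [5.0, 1.8, 5.0],
]

mask = sounding_mask(visual, audio)
print(mask)                     # [[1, 1, 0], [0, 1, 0]]
print(mean_depth(mask, depth))  # approximately 2.0
```

The 2D mask alone answers "which pixels are sounding"; pairing it with depth is one simple way to express the missing distance estimate described above.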


Selected Publications:
Afrisal, H., Abpeikar, S., & Cruz, F. (2025, November). Depth-Aware Audio Visual Segmentation with Geometry-Heuristic Cross Attention. In Australasian Joint Conference on Artificial Intelligence (pp. 187-199). Singapore: Springer Nature Singapore.

A2R2 Research Group: Autonomous Agents and Robotics Research, School of Computer Science and Engineering, UNSW Sydney
Contact: f.cruz@unsw.edu.au, Room 510J, Ainsworth Building (J17), Kensington NSW 2052, Australia