Spatiotemporal Audio-Visual Robot Learning
The instinctive spatial audio-visual learning ability of humans and some animals has long been a source of inspiration for embodied AI and robotics.
However, adopting this bio-inspired ability in artificial agents introduces major theoretical and technical challenges.
Processing multiple data streams with diverse spatiotemporal representations, such as spatial audio-visual data, is beyond the capability of conventional Machine Learning and unimodal Deep Learning methods.
These challenges expose hidden barriers to multimodal data integration and lead us to a novel Multimodal Machine Learning perspective for facilitating spatial audio-visual interactions.
Date: 2024 - 2028
Persons participating in the project:
- PIs: Dr. Francisco Cruz, A/Prof. Vidhyasaharan Sethu, Dr. Shadi Abpeikar
- Associates: Hadha Afrisal
- Corresponding contact: hadha.afrisal@unsw.edu.au
Research areas:
- Multimodal Machine Learning
- Bio-Inspired Robot Learning
- Spatiotemporal Audio-Visual Segmentation
- Human-Robot Collaboration
Description:
Although humans and some animals can instinctively perform spatial audio-visual learning, adopting this capability in artificial agents, such as robots, poses theoretical and technical challenges.
On the theoretical side, few theories explain how to integrate the audio-visual modalities optimally and efficiently for robots’ spatial learning using a computational approach.
Audio-visual learning methods should also be robust and adaptive to noise and non-ideal conditions so that they can be deployed in real-world settings.
One of the most vibrant research areas that sheds light on spatiotemporal audio-visual learning is Audio Visual Segmentation (AVS).
AVS aims to segment fine-grained pixel-based regions of objects that emit sounds.
However, there are some challenges in implementing AVS for robot learning in real-world environments,
such as suboptimal audio-visual cross-modal fusion, overlapping and misalignment in audio-visual matching, complex spatiotemporal correspondences, and instability in complex audio-visual scenarios.
Current mainstream AVS methods also focus solely on extracting 2D pixel-wise segments of sounding objects, without estimating the relative distance of the sounding object,
which limits their usefulness in 3D real-world settings.
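To make the gap between a 2D sounding-object mask and a distance estimate concrete, here is a minimal toy sketch. It is not the project's method: plain cosine similarity stands in for learned audio-visual cross-modal fusion, and the per-pixel depth map is assumed to come from some depth sensor or estimator. All function names, the threshold, and the toy values are hypothetical and chosen only for illustration.

```python
import math
from statistics import median


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sounding_mask(visual_feats, audio_emb, threshold=0.5):
    """Return an H x W binary mask of pixels whose visual feature matches the audio.

    visual_feats: H x W grid (nested lists) of per-pixel feature vectors.
    audio_emb:    one audio embedding in the same feature space (an assumption;
                  real AVS models learn this alignment).
    """
    return [
        [1 if cosine(pixel, audio_emb) > threshold else 0 for pixel in row]
        for row in visual_feats
    ]


def masked_median_depth(mask, depth):
    """Median depth over masked pixels, or None if the mask is empty."""
    values = [
        d
        for mask_row, depth_row in zip(mask, depth)
        for m, d in zip(mask_row, depth_row)
        if m
    ]
    return median(values) if values else None


# Toy 2 x 3 frame with 2-dimensional features (hypothetical values).
audio_emb = [1.0, 0.0]
visual_feats = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.2], [0.1, 1.0]],
]
# Assumed per-pixel depth map in metres.
depth = [
    [2.0, 2.2, 5.0],
    [5.1, 2.4, 4.9],
]

mask = sounding_mask(visual_feats, audio_emb)
distance = masked_median_depth(mask, depth)
print(mask)
print(distance)
```

The mask alone only says where in the image the sounding object appears. Combining it with depth gives a single distance value per object, which is the kind of information a robot would need to act in 3D.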
Selected publications:
- Afrisal, H., Abpeikar, S., & Cruz, F. (2025, November). Depth-Aware Audio Visual Segmentation with Geometry-Heuristic Cross Attention. In Australasian Joint Conference on Artificial Intelligence (pp. 187-199). Singapore: Springer Nature Singapore.