AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Swapnil Bhosale1, Haosen Yang1, Diptesh Kanojia*1, Jiankang Deng2, Xiatian Zhu1
1University of Surrey, UK, 2Imperial College London, UK

Sound propagation point patterns between a listener (blue sphere) and an emitter (yellow sphere) captured by our AV-GS. Notice the points outside the direct propagation path (points behind the speaker, points behind rigid walls). Note that we slice the scene in half along the y-axis (omitting the points from the ceiling) for better visibility.


Abstract

Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to the low efficiency originating from heavy NeRF rendering, these methods all have a limited ability to characterize the entire scene environment, such as room geometry, material properties, and the spatial relation between the listener and sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the spatial relations to the listener and sound source. To make the visual scene model audio adaptive, we propose a point densification and pruning strategy that optimally distributes the Gaussian points according to each point's contribution to sound propagation (e.g., more points are needed on texture-less wall surfaces since they divert the sound path). Extensive experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.
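The sketch below illustrates the kind of audio-guided densification and pruning rule described above. Using the gradient norm of the audio-guidance parameter as the per-point "contribution" signal, the threshold values, and the cloning perturbation are all assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of audio-guided point densification and pruning.
# Assumptions: contribution is measured by the audio-guidance gradient norm;
# thresholds and the cloning jitter are placeholder values.
import torch

def densify_and_prune(xyz, guidance, grad_norm, densify_thresh=0.01, prune_thresh=1e-4):
    """Clone points whose audio-guidance gradients are large (they matter for sound
    propagation, e.g., on texture-less walls) and drop points that contribute little."""
    clone_mask = grad_norm > densify_thresh
    new_xyz = torch.cat([xyz, xyz[clone_mask] + 0.01 * torch.randn_like(xyz[clone_mask])])
    new_guidance = torch.cat([guidance, guidance[clone_mask]])
    new_grad = torch.cat([grad_norm, grad_norm[clone_mask]])
    keep_mask = new_grad > prune_thresh
    return new_xyz[keep_mask], new_guidance[keep_mask]

# Toy example: 10k points with 16-dim audio-guidance parameters and their gradient norms.
xyz, guidance = torch.randn(10_000, 3), torch.randn(10_000, 16)
grad_norm = torch.rand(10_000)
xyz, guidance = densify_and_prune(xyz, guidance, grad_norm)
```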


Method

Overview of our proposed AV-GS. Our model comprises a 3D Gaussian Splatting model $G$, an acoustic field network $F$, and an audio binauralizer $B$. We first train $G$ to capture the scene geometry. Next, we construct an audio-focused point representation $G_a$, with the locations $X$ and audio-guidance parameters $a$ initialized from the pre-trained $G$. The acoustic field network $F$ then processes the $a$ parameters of all Gaussian points in the vicinity of the listener and the sound source (in 3D space). The output of $F$ is finally used to condition the audio binauralizer $B$, which transforms the mono audio into binaural audio w.r.t. the listener and sound source locations.
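A minimal sketch of this pipeline is shown below. The module architectures, feature dimensions, the nearest-neighbour rule for selecting nearby Gaussian points, and the mask-based binauralization are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the AV-GS inference pipeline: select Gaussian points near
# the listener and source, map their audio-guidance parameters to a condition
# vector with F, and use it to drive the binauralizer B on a mono spectrogram.
import torch
import torch.nn as nn


class AcousticFieldNetwork(nn.Module):
    """F: maps audio-guidance parameters of nearby points to a scene condition vector."""
    def __init__(self, guidance_dim: int = 16, cond_dim: int = 64):
        super().__init__()
        # +6 input dims for the relative offsets to the listener and the source
        self.mlp = nn.Sequential(nn.Linear(guidance_dim + 6, 128), nn.ReLU(), nn.Linear(128, cond_dim))

    def forward(self, guidance, rel_listener, rel_source):
        feat = self.mlp(torch.cat([guidance, rel_listener, rel_source], dim=-1))
        return feat.mean(dim=0)  # aggregate over points into a single condition vector


class Binauralizer(nn.Module):
    """B: predicts left/right magnitude masks for the mono STFT, conditioned on F's output."""
    def __init__(self, cond_dim: int = 64, n_freq: int = 257):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(cond_dim, 128), nn.ReLU(), nn.Linear(128, 2 * n_freq))

    def forward(self, mono_mag, cond):
        masks = torch.sigmoid(self.head(cond)).view(2, -1, 1)  # (left/right, freq, 1)
        return mono_mag.unsqueeze(0) * masks                    # (2, freq, time)


def nearby_points(xyz, center, k=1024):
    """Pick the k Gaussian points closest to a given 3D location (listener or source)."""
    dist = torch.cdist(xyz, center.unsqueeze(0)).squeeze(-1)
    return torch.topk(dist, k=min(k, xyz.shape[0]), largest=False).indices


# Toy example: random Gaussians, listener/source positions, and a mono magnitude spectrogram.
xyz = torch.randn(5_000, 3)          # point locations X from the pre-trained G
guidance = torch.randn(5_000, 16)    # per-point audio-guidance parameters a
listener, source = torch.tensor([1.0, 0.0, 0.0]), torch.tensor([-1.0, 0.0, 0.0])

idx = torch.unique(torch.cat([nearby_points(xyz, listener), nearby_points(xyz, source)]))
F_net, B_net = AcousticFieldNetwork(), Binauralizer()
cond = F_net(guidance[idx], xyz[idx] - listener, xyz[idx] - source)
binaural_mag = B_net(torch.rand(257, 100), cond)  # (2, 257, 100) left/right magnitudes
```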

Results

Comparison with state-of-the-art methods on the RWAVS dataset.


In the presence of (a) complex geometry and (b) meaningless views, AV-NeRF makes errors in binaural synthesis compared to our AV-GS. For both scenarios we show the corresponding listener view used by AV-NeRF, as well as the learned holistic scene representation used by AV-GS, which is unaffected by either scenario.


BibTeX

@article{AV_GS,
    title={AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis},
    author={Bhosale, Swapnil and Yang, Haosen and Kanojia, Diptesh and Deng, Jiankang and Zhu, Xiatian},
    journal={arXiv preprint arXiv:2406.08920},
    year={2024}
}