AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Swapnil Bhosale1, Haosen Yang1, Diptesh Kanojia1, Jiankang Deng2, Xiatian Zhu1
1University of Surrey, UK, 2Imperial College London, UK

Sound propagation point patterns between a listener (blue sphere) and an emitter (yellow sphere) captured by our AV-GS. Notice the points outside the propagation path (points behind the speaker, points behind rigid walls). For better visibility, we slice the scene in half along the y-axis, omitting the points from the ceiling.


Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene. Existing methods use NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to the low efficiency caused by heavy NeRF rendering, these methods all have a limited ability to characterize the entire scene environment, such as room geometry, material properties, and the spatial relation between the listener and the sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the spatial relation between the listener and the sound source. To make the visual scene model audio-adaptive, we propose a point densification and pruning strategy that distributes the Gaussian points according to each point's contribution to sound propagation (e.g., more points are needed for texture-less wall surfaces, as they affect sound path diversion). Extensive experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAVS and simulation-based SoundSpaces datasets.
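The densification and pruning idea can be illustrated with a minimal sketch. It assumes each Gaussian point already has a scalar score for its contribution to sound propagation. The function name `densify_and_prune`, the two thresholds, and the fixed clone offset are hypothetical choices for illustration and are not taken from the AV-GS implementation.

```python
def densify_and_prune(points, contributions,
                      prune_below=0.01, clone_above=0.5, offset=0.01):
    """Toy audio-guided point update (illustrative assumption, not the paper's rule).

    points        : list of (x, y, z) tuples
    contributions : list of non-negative scores, one per point (assumed given)
    """
    if len(points) != len(contributions):
        raise ValueError("need exactly one contribution score per point")

    updated = []
    for (x, y, z), score in zip(points, contributions):
        if score < prune_below:
            # Point barely affects sound propagation: drop it.
            continue
        updated.append((x, y, z))
        if score > clone_above:
            # Point matters a lot: add a nearby copy to increase local density.
            updated.append((x + offset, y, z))
    return updated


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
scores = [0.001, 0.2, 0.9]
new_points = densify_and_prune(points, scores)
```

In this toy run the first point is pruned, the second is kept unchanged, and the third is kept and cloned, so three points remain.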


Overview of our proposed AV-GS. Our model comprises a 3D Gaussian Splatting model G, an acoustic field network F, and an audio binauralizer B. We first train G to capture the scene geometry information. Next, we construct an audio-focused point representation G_a, with the location X and the audio-guidance parameter a initialized from the pre-trained G. The acoustic field network F then processes the a parameters of all Gaussian points in the vicinity of the listener and the sound source (in 3D space). Finally, the output of F conditions the audio binauralizer B, which transforms the mono audio into binaural audio with respect to the listener and sound source locations.
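The data flow in this overview can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: mean pooling stands in for the learned network F, a pair of per-channel gains stands in for the learned binauralizer B, and the names `acoustic_context`, `binauralize`, and `radius` are hypothetical rather than part of the AV-GS code.

```python
import math


def acoustic_context(points, listener, source, radius=1.0):
    """Stand-in for F: mean-pool the audio-guidance vectors `a` of all
    points within `radius` of the listener or the sound source."""
    near = [
        p for p in points
        if math.dist(p["x"], listener) <= radius
        or math.dist(p["x"], source) <= radius
    ]
    if not near:
        return None  # no scene points close enough to condition on
    dim = len(near[0]["a"])
    return [sum(p["a"][i] for p in near) / len(near) for i in range(dim)]


def binauralize(mono, context):
    """Stand-in for B: treat the first two context values as left and
    right gains and apply them to the mono signal."""
    left_gain, right_gain = context[0], context[1]
    left = [left_gain * s for s in mono]
    right = [right_gain * s for s in mono]
    return left, right


# Each point carries a location `x` and an audio-guidance vector `a`.
g_a = [
    {"x": (0.0, 0.0, 0.0), "a": [1.0, 0.0]},
    {"x": (0.5, 0.0, 0.0), "a": [0.0, 1.0]},
    {"x": (10.0, 10.0, 10.0), "a": [5.0, 5.0]},  # far away, ignored
]
context = acoustic_context(g_a, listener=(0.0, 0.0, 0.0), source=(0.5, 0.0, 0.0))
left, right = binauralize([1.0, 2.0], context)
```

The point of the sketch is the ordering of the stages: select points near the listener and source, summarize their audio-guidance parameters, then use that summary to condition the mono-to-binaural transform.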


Comparison with state-of-the-art methods on RWAVS dataset.


In the presence of (a) complex geometry and (b) meaningless views, AV-NeRF makes errors in binaural synthesis that our AV-GS avoids. For both scenarios, we show the corresponding listener view used by AV-NeRF, as well as the learned holistic scene representation used by AV-GS, which is unaffected by either scenario.


    title={AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis},
    author={Swapnil Bhosale and Haosen Yang and Diptesh Kanojia and Jiankang Deng and Xiatian Zhu},
    journal={arXiv preprint arXiv:2406.08920},