Meta's latest auditory AIs promise a more immersive AR/VR experience

The Metaverse, as Meta CEO Mark Zuckerberg envisions it, will be a fully immersive virtual experience that rivals reality, at least from the waist up. But the visuals are only part of the overall Metaverse experience.

“Getting spatial audio right is key to delivering a realistic sense of presence in the metaverse,” Zuckerberg wrote in a Friday blog post. “If you're at a concert, or just talking with friends around a virtual table, a realistic sense of where sound is coming from makes you feel like you're actually there.”

That concert, the blog post notes, will sound very different if performed in a full-sized concert hall than in a middle school auditorium on account of the differences between their physical spaces and acoustics. To that end, Meta’s AI and Reality Lab (MAIR, formerly FAIR) is collaborating with researchers from UT Austin to develop a trio of open-source audio “understanding tasks” that will help developers build more immersive AR and VR experiences with more lifelike audio.

The first is MAIR’s Visual Acoustic Matching model, which can adapt a sample audio clip to any given environment using just a picture of the space. Want to hear what the NY Philharmonic would sound like inside San Francisco’s Boom Boom Room? Now you can. Previous simulation models could recreate a room’s acoustics from its layout, but only if the precise geometry and material properties were already known, or from audio sampled within the space. Neither approach produced particularly accurate results.

MAIR’s solution is a model called AViTAR, which “learns acoustic matching from in-the-wild web videos, despite their lack of acoustically mismatched audio and unlabeled data,” according to the post.
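For a sense of what "matching" a clip to a room means in signal terms, the classical approach treats a room as an impulse response and convolves a dry recording with it. The sketch below is only an illustration of that textbook idea, not Meta's method: AViTAR learns the transformation from images, and the helper names and decay values here are made up for the example.

```python
def convolve(dry, impulse_response):
    """Return the full discrete convolution of two lists of samples."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, sample in enumerate(dry):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out


def toy_impulse_response(length, decay):
    """Hypothetical room response: a direct path followed by decaying echoes."""
    return [decay ** n for n in range(length)]


# A single dry "click" with no room sound at all.
dry_click = [1.0, 0.0, 0.0, 0.0]

# Assumed decay values: echoes fade fast in a small room, slowly in a hall.
small_room = toy_impulse_response(3, 0.2)
large_hall = toy_impulse_response(3, 0.8)

wet_small = convolve(dry_click, small_room)
wet_large = convolve(dry_click, large_hall)

print("small room:", wet_small)
print("large hall:", wet_large)
```

The same click carries far more echo energy after passing through the "large hall" response. Dereverberation is the reverse problem: recovering the dry click when only the echo-laden version is available.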

“One future use case we are interested in involves reliving past memories,” Zuckerberg wrote, betting on nostalgia. “Imagine being able to put on a pair of AR glasses and see an object with the option to play a memory associated with it, such as picking up a tutu and seeing a hologram of your child’s ballet recital. The audio strips away reverberation and makes the memory sound just like the time you experienced it, sitting in your exact seat in the audience.”

MAIR’s Visually-Informed Dereverberation model (VIDA), on the other hand, will strip the echo from a recording of an instrument played in a large, open space like a subway station or cathedral. You’ll hear just the violin, not the reverberation of it bouncing off distant surfaces. Specifically, it “learns to remove reverberation based on both the observed sounds and the visual stream, which reveals cues about room geometry, materials, and speaker locations,” the post explained. This technology could be used to more effectively isolate vocals and spoken commands, making them easier for both humans and machines to understand.

VisualVoice does the same as VIDA but for voices. It uses both visual and audio cues to learn how to separate voices from background noise during its self-supervised training sessions. Meta anticipates this model getting a lot of work in machine understanding applications and in improving accessibility. Think more accurate subtitles, Siri understanding your request even when the room isn't dead silent, or the acoustics in a virtual chat room shifting as the people speaking move around the digital room. Again, just ignore the lack of legs.

“We envision a future where people can put on AR glasses and relive a holographic memory that looks and sounds the exact way they experienced it from their vantage point, or feel immersed by not just the graphics but also the sounds as they play games in a virtual world,” Zuckerberg wrote, noting that AViTAR and VIDA can currently apply their tasks only to the one picture they were trained on and will need a lot more development before public release. “These models are bringing us even closer to the multimodal, immersive experiences we want to build in the future.”