Making Immersive Audio Work for VR

The Society of Motion Picture and Television Engineers® interviewed Jaunt Software Engineering Manager Adam Somers on the state of virtual reality audio engineering.

As the virtual reality (VR) revolution marches onward, much of the technical discussion revolves around realtime rendering of visuals to make an immersive, interactive environment seem natural and intuitive for users. Increasingly, the industry’s attention is focused on the listening experience for VR applications — how, in other words, to construct and present the audio portion of the immersive experience so that what the user hears naturally and precisely matches what the user is seeing and interacting with. Immersive 3D audio techniques — many evolving out of the cinema world, including binaural recordings and object-based audio — are being applied skillfully across the industry to address these challenges, but the issue’s complexity means a lot more work is left to be done.
Indeed, according to Adam Somers, Software Engineering Manager at Jaunt, there are particularly unique considerations involved when producing audio content for immersive, 360-degree environments.
“I like to start [the discussion] with traditional cinema,” Somers explains. “People generally have a good understanding of immersive audio in the form of surround sound. All movie theaters have it. Many homes have it, and there are even headphone devices and sound bars that make [surround sound] more accessible. But those applications involve playing sound at specific locations, rather than relative to the listener’s position. That works very well in traditional 2D media to heighten the sense of immersion. But it is the first roadblock you hit when trying to produce VR content.
“That’s because a channel-based surround sound format plays audio in some sort of fixed position relative to where you want the user to be listening [which is different than a VR application]. You can automate the ‘trajectory’ of sound within that sound field, but [for VR], you also need the sound field to react to the user as they are turning their heads, facing in different directions, and so on. Once the user is immersed in a 360-degree experience, you need that reactive component. They have a head-mounted display on, probably experiencing sound over headphones. Therefore, it’s critical that the sound field be able to rotate with head movement.”

Somers calls that “challenge №1” regarding the creation of immersive VR sound — the need for “a rotational component” for the audio track, meaning it can move in response to a user’s movements. But, he adds, there is also a second challenge, which is “the accurate positioning of sound sources in a 360-degree by 180-degree or a full spherical environment.”

“That [challenge] is not addressed by the concept of surround sound,” he relates. “Therefore, you need some spherical sound authoring medium in which to create, transmit, deliver, and render that content to the user.”
Somers predicts “the next generation” of audio technologies aimed at this market will “better integrate perceptual elements.” He explains this using “the cocktail party” analogy.
“That refers to being surrounded by a group of people, with everyone mostly talking at the same volume,” he says. “In that situation, the person chooses to whom he is going to listen. This phenomenon goes into the field of psychoacoustics, which looks at how people perceive or distinguish individual sound sources by focusing their attention on those sound sources.”
Thus, the problem for virtual reality applications, he suggests, is that even when the user is placed into a 360-degree audio environment, the sound is still transmitted to him. “He is not in the room with the people who are talking [in a VR application],” Somers says. “Therefore, he doesn’t have the same ability to do that kind of perceptual focusing as he would have in the real world.” The challenge for the VR community, as a result, is “how to fully capture the sense of immersion and presence in an environment audibly as well as visually.”
A connected issue is the fact that rich, immersive, real-time sound experiences for virtual reality applications need to be streamed, leading to compression and bandwidth challenges, Somers says. “A lot of work in psychoacoustics initially went into perceptual audio coding, which is how we ended up with [various] perceptual audio compression codecs like AAC, MP3, Opus, and others.”

Somers suggests that lossy signal compression has great relevance for virtual reality because the medium is built around figuring out which parts of a digital audio signal a human listener needs to hear at the highest quality to believe in the authenticity of what they are hearing, versus which parts of the signal can be removed or compressed without impacting the perceptual response. He says this kind of work will need to push further as virtual reality advances to bring the industry to what he calls “the next step, which involves providing immersive audio in a low-bandwidth environment.”
In other words, VR companies need to provide audio experiences that “not only provide efficient, low-bandwidth, high-quality audio, but do so in such a way as to make it entirely immersive and, ideally, also interactive. That way, when you move around a space as a user, you get six degrees of freedom (6DoF), giving you a true-to-reality representation of the audio component of the experience.”
This is where so-called “audio localization,” better known as binaural audio, became important for VR developers as they pursued 3D audio recordings. Somers calls binaural audio “a very hot topic within VR right now,” even though it’s an idea that has been around a long time. At the base level, this involves making “dummy head” recordings with two microphones arranged to capture 3D sound by making a distinction between left ear and right ear sounds, building what Somers calls “head-related transfer functions” (HRTFs) into the audio recording.
This, he says, is an area where the industry is currently doing “huge amounts of research.” But, he adds, the approach is not entirely practical yet for VR applications because “the way we hear is based on our own physiology — the way your head or ears are shaped. In fact, our brains are finely attuned to listening to the world through our personal anatomy.” 
Therefore, binaural audio for VR applications would ideally work best if it were practical “to capture unique HRTFs for every individual, and to convey all that audio content and render it for the particular individual at the time of delivery.” 
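At its core, binaural rendering from an HRTF works by convolving a mono source with a pair of head-related impulse responses (the time-domain form of the HRTF) measured for the source’s direction. The sketch below is illustrative only — the function name is hypothetical, and in practice the impulse responses would come from a measured (or generalized) HRTF database rather than being passed in directly:

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono source for two ears by convolving it with the
    head-related impulse responses (HRIRs) for the desired direction.
    The HRIR arrays are assumed inputs; real systems pull them from a
    measured or generalized HRTF data set."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right
```

The inter-ear differences encoded in the two impulse responses — level, delay, and spectral shaping — are exactly what the listener’s brain uses to localize the source, which is why mismatched (non-personalized) HRTFs degrade the illusion.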
He adds there has been progress in the ability to capture individual HRTFs more rapidly and efficiently than in the past, such as methods pioneered by a Maryland company called Visisonics. “The vision there is for VR enthusiasts to walk into a calibration room, and almost instantly have their precise measurements captured,” he says. “And then, a file of their calibration can be uploaded to whatever device they use to consume content. In that way, immersive audio content will sound genuinely realistic to that person.”
Still, in the wider picture, using such a process for all consumers “is not practical, though it is technically possible,” he adds. That’s why the industry is looking at what Somers calls “a generalized HRTF solution.”
“In that case, you have many companies or research institutions capturing many different samples of head shapes, anatomies, and they average them together to create one solution — a one-size-fits-all that does not require calibration for every individual user.”

Many leading industry players are pursuing this avenue, but even there, an obvious limitation exists, Somers says. That limitation is the simple fact that “it might work reasonably well for one subset of the population, but it might not work at all for another. There is some bell curve distribution for people for whom it works acceptably well, but companies are researching ways to widen that net as much as possible.”
To that end, companies are incorporating the so-called Ambisonics surround sound technique to pursue what Somers calls “the construction of virtual microphones, pointed in any direction that you choose after the material is recorded.”
Ambisonics has also been around a long time. It’s essentially a microphone placement technique “that facilitates the capture of sound in multiple dimensions,” Somers explains. “Rather than just capturing stereo audio using a pair of microphones, or maybe 5.1 sound using an array of microphones, it involves positioning microphones in a certain way so that any microphone could be synthesized from the captured audio signal. Another interesting property of Ambisonics is that it can be rotated. This is nice for VR, because if you are interacting with 360 audio content using a head-mounted display, you can effectively take the head-tracking data coming out of the display, plug it right into the audio engine, and everything will rotate cleanly. That’s why Ambisonics stood out as a potential format for creating and delivering 360 content.”
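The rotation Somers describes is inexpensive to compute. For first-order Ambisonics (B-format, channels W/X/Y/Z), a yaw rotation only mixes the two horizontal components; the omnidirectional W and vertical Z channels are untouched. A minimal per-sample sketch, assuming the common convention that X points front and Y points left, with positive yaw rotating the field counter-clockwise (the function name and conventions are illustrative, not taken from any particular SDK):

```python
import math

def rotate_bformat_yaw(w, x, y, z, yaw_rad):
    """Rotate a first-order (B-format) ambisonic sample about the
    vertical axis by yaw_rad radians. Applied per sample to every
    frame of the four channel signals; W (omni) and Z (height) are
    invariant under a pure yaw rotation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return w, c * x - s * y, s * x + c * y, z
```

This is the step a player performs each audio frame with the latest head-tracker yaw, which is why Ambisonics content reacts to head turns with essentially no extra bandwidth or decoding cost.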
Indeed, Jaunt, Google, and many others have “adopted Ambisonics” as something of a quasi-standard for VR content, he adds, thanks to rendering capabilities in their client software that enable playback of such content in realtime.
This need for a full-sphere sound that can be combined with open-source compression technologies to transport audio where it needs to go for immersive content applications has made Ambisonics attractive. But, Somers adds that it, too, has limitations, and that is one reason the industry is exploring object-based techniques.
A key limitation, Somers says, involves the fact that “Ambisonics is fundamentally designed for three degrees of freedom playback. In other words, you can rotate a sound field, which represents sound in all directions, and if you turn your head, you can move the relative positions of any sound source. But if you were to walk around within that sphere, the sound field would not be able to adapt to the motion of your head or body within that space.”

That’s a concern for VR use, and so the industry is examining object-based audio, which is essentially a way of conveying an audio mix in which each independent element in the scene is represented as a point in space.

“The transmission of object audio requires you to convey all the individual tracks of that mix, as well as any metadata associated with each object’s position at any given point in time,” he says. “But the limitation there is that object audio needs a large bandwidth footprint — you have to transmit large numbers of audio files and metadata all the way to the point of rendering to create a mix that someone can listen to in realtime. Many companies are working on this right now.”
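The structure Somers describes — individual tracks plus positional metadata, mixed only at the point of rendering — can be sketched very simply. The class and function names below are hypothetical, positions are reduced to a single static azimuth, and a constant-power stereo pan stands in for a real binaural renderer:

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One element of an object-based mix: its own audio track plus
    positional metadata (a static azimuth here for brevity; real
    formats carry a full position timeline per object)."""
    samples: list       # mono PCM track for this object
    azimuth_rad: float  # direction relative to the scene, 0 = front

def render_objects(objects, head_yaw_rad):
    """Mix objects to stereo at render time, applying the listener's
    current head yaw so each object stays anchored in the scene."""
    n = max(len(o.samples) for o in objects)
    left, right = [0.0] * n, [0.0] * n
    for o in objects:
        # object direction relative to where the head now points
        rel = o.azimuth_rad - head_yaw_rad
        # constant-power pan: +1 = hard left, -1 = hard right
        pan = max(-1.0, min(1.0, rel / (math.pi / 2)))
        gl = math.cos((1 - pan) * math.pi / 4)
        gr = math.cos((1 + pan) * math.pi / 4)
        for i, s in enumerate(o.samples):
            left[i] += gl * s
            right[i] += gr * s
    return left, right
```

Because every object arrives separately with its metadata, the renderer can also translate positions for 6DoF movement — the flexibility Somers cites — but each object is one more track to transmit, which is the bandwidth cost he describes.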
There is a wide range of powerful tools available now to help the industry accomplish this, including a variety of off-the-shelf production toolkits capable of building a “good generalized solution,” as Somers puts it, from Dolby, Blue Ripple Sound, and Facebook’s Audio 360 VR system, among others. He also points out that industry knowledge about how to record, mix, transmit, and play back rich, immersive sound for VR applications has expanded widely across the board. In fact, his company, Jaunt, has published what it calls a Field Guide for Cinematic VR Production — a guide available to anyone that offers pragmatic advice on finding and using hardware and software for cinematic VR production.
While it’s true that the challenge of the cocktail party effect is not fully resolved by any of these approaches, he says such solutions rest on the notion that every individual’s anatomy does not have to be accounted for to deliver a highly immersive experience. “The goal is that the only requirement [for the user to adhere to] be that the headphones are worn correctly, that the left speaker is on the left side and the right speaker is on the right side.”
Head-tracking, he says, is critical to achieving this, and that requires technically linking the audio system with a VR technology’s visual component.

“We are talking about virtual reality after all,” he says. “That is always accompanied by a visual component, and so, you have a head-mounted display — a stereoscopic display system. One of the fundamental requirements of such a system is that it has a head-tracker built into it and that the picture can be rotated to your point of view. On the audio side, we sort of hijack that signal and make use of it to do audio rotation, and so the head-tracking piece is critical. But display-less 3D audio with head tracking would be useful in production environments, where you do not want or need to wear a head-mounted display to try to produce audio to go along with a picture.”
There, he points to work by companies like Dysonics, which offers enthusiasts highly immersive audio-only listening experiences with its stand-alone, head-tracking headphone sensor system, the Rondo 360. Somers suggests such technology offers great promise for adaptation into the production of VR audio content.
The other thing that will help the situation, he argues, is the eventual development of industry standards for immersive audio formats. That’s hard to achieve at the moment, he says, because “there are new developments almost every day.”
“But building a standard that is extensible and can adapt to the changing landscape of what is technically possible, and what people’s expectations are, would be helpful,” he suggests. “That will be a challenge for standards bodies as [VR moves further along]. We already have many different perceptual audio codecs, and then we have lossless formats, as well. There are many different ways of transmitting audio, and device software engineers have to take them all into account. So the creation of an absolute standard would be an excellent baseline to organize thought leaders in each of these areas.”

This piece originally appeared in the July 2017 edition of the SMPTE Newswatch. Thanks to Michael Goldman for conducting the interview.