What codecs are there for immersive and 3D audio?

Immersive audio is a three-dimensional (3D) sound field created by combining lateral and overhead speakers. A variety of industry standard and custom codecs are available for implementing immersive audio.

This FAQ reviews the operation of the MPEG-H Audio codec (universal immersive audio coding) and the still-under-development MPEG-I Immersive Audio codec (a compressed representation for augmented and virtual reality (AR/VR) applications). It then looks at a custom immersive audio codec from Dolby. It closes by briefly considering a series of documents from the Society of Motion Picture and Television Engineers (SMPTE) intended to help standardize immersive audio across multiple implementations.

MPEG-H was developed by the ISO/IEC Moving Picture Experts Group (MPEG) and Fraunhofer IIS. It supports from 8 to 64 speakers and up to 128 codec core channels. The channels can be conventional audio channels, audio objects with 3D location metadata, or a fully spherical ‘ambisonics’ surround sound format. It can support a range of listening environments, from large surround systems to headphones and virtual reality goggles.
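An audio object pairs a signal with 3D location metadata that the renderer uses to place the sound in the listening space. The sketch below illustrates the concept only; the field names and units are assumptions for illustration, not the actual MPEG-H metadata syntax.

```python
from dataclasses import dataclass
from typing import List

# Illustrative audio object: a mono signal plus 3D location metadata.
# Field names and conventions are assumptions, not MPEG-H bitstream syntax.
@dataclass
class AudioObject:
    samples: List[float]
    azimuth: float    # degrees; 0 = straight ahead
    elevation: float  # degrees; 0 = ear level, positive = overhead
    distance: float   # meters from the listener

# An overhead object, e.g., rain placed directly above the listener.
overhead_rain = AudioObject(samples=[0.1, -0.1], azimuth=0.0,
                            elevation=90.0, distance=3.0)
```

Because the position travels as metadata rather than baked-in speaker feeds, the same object can be rendered to any target layout, from a large surround system down to headphones.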

MPEG-H encoding begins with perceptual compression of the input signal classes, including channels, objects, and higher-order ambisonics (HOA), using extended MPEG Unified Speech and Audio Coding for three dimensions (USAC-3D). On playback, the channel signals, objects, and HOA coefficients are decoded and rendered to the target reproduction loudspeaker layout through dedicated renderers. The resulting virtual signals are downmixed to the physical speakers or sent through a binaural renderer for listening on headphones and similar devices (Figure 1).

Figure 1. MPEG-H immersive audio decoding structure (Image: Cambridge University Press).
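The render-then-downmix flow described above can be sketched in highly simplified form. All function names, the static per-speaker gains, and the raw-sample representation below are illustrative assumptions; a real MPEG-H decoder operates on USAC-3D bitstreams with far more sophisticated rendering.

```python
# Conceptual sketch of the MPEG-H render/downmix stage (not the
# reference implementation): each decoded source is rendered to
# per-speaker feeds, then all feeds are summed per speaker.

def render_object(samples, gains):
    """Render one decoded source to N loudspeaker feeds via static gains."""
    return [[g * s for s in samples] for g in gains]

def downmix(feeds_list):
    """Sum per-source loudspeaker feeds into final speaker signals."""
    n_speakers = len(feeds_list[0])
    n_samples = len(feeds_list[0][0])
    mix = [[0.0] * n_samples for _ in range(n_speakers)]
    for feeds in feeds_list:
        for spk in range(n_speakers):
            for i in range(n_samples):
                mix[spk][i] += feeds[spk][i]
    return mix

# Two decoded sources rendered to a 2-speaker layout, then downmixed.
obj_a = render_object([1.0, 2.0], gains=[0.5, 0.25])  # spread across both
obj_b = render_object([4.0, 8.0], gains=[0.0, 1.0])   # second speaker only
speakers = downmix([obj_a, obj_b])  # -> [[0.5, 1.0], [4.25, 8.5]]
```

The same rendered feeds could instead be passed to a binaural stage for headphone playback, which is the branch Figure 1 shows for headset listening.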

MPEG-I uses MPEG-H as its foundation and adds features for AR/VR. It combines an MPEG-H bitstream, which carries the audio, with an MPEG-I bitstream that describes the AR/VR scene. The renderer incorporates information about the environment, such as its acoustic and geometric properties, along with dynamic updates of the user's orientation and position. It also consumes Scene State data, which holds the current values of all 6DoF metadata describing the six mechanical degrees of freedom of the listener's head in three-dimensional space (Figure 2).

Figure 2. MPEG-I immersive audio codec architecture (Image: Audio Engineering Society).
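The 6DoF listener state the renderer tracks amounts to three position coordinates plus three orientation angles, updated as the user moves. The structure below is an illustrative assumption, not the actual MPEG-I Scene State syntax.

```python
from dataclasses import dataclass

# Illustrative 6DoF listener state: three translational and three
# rotational degrees of freedom. Field names are assumptions for
# illustration, not MPEG-I metadata syntax.
@dataclass
class ListenerPose:
    x: float = 0.0      # position, meters
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0    # orientation, degrees
    pitch: float = 0.0
    roll: float = 0.0

    def update(self, **deltas):
        """Apply incremental position/orientation updates from a tracker."""
        for name, delta in deltas.items():
            setattr(self, name, getattr(self, name) + delta)

pose = ListenerPose()
pose.update(x=1.5, yaw=90.0)  # listener steps forward and turns left
```

Each such update would cause the renderer to recompute how the scene's audio elements reach the listener's ears, which is what distinguishes 6DoF AR/VR rendering from fixed-position playback.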

Dolby Atmos
MPEG-H and Dolby Atmos AC-4 are both considered Next-Generation Audio (NGA) systems: they are object-based and support immersive audio. They are similar but not identical. For example, both support interactivity, but MPEG-H uses 'presets' while AC-4 uses 'presentations.' Dialog enhancement is an important feature of AC-4; it includes scalable bitrates for the side information that enables user control of the relative level of the dialog channel. The Speech Spectral Frontend (SSF), a prediction-based coding tool, reduces the bitrate needed for speech content, while the Audio Spectral Frontend (ASF) is used for general audio. Other AC-4 features include video frame-synchronized coding, loudness management, hybrid delivery over broadcast and broadband connections, dynamic range control, and extensible metadata delivery format (EMDF) elements for carrying incremental metadata.
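User control of the relative dialog level can be pictured as a gain applied to a separately carried dialog signal before it is mixed with the main audio. This is a conceptual sketch only; the actual AC-4 dialog-enhancement algorithm is driven by encoder-supplied side information, not a simple broadband gain.

```python
# Conceptual sketch of user-controlled dialog level (not the AC-4
# algorithm): boost or cut a separate dialog signal, then mix it
# into the main ("bed") audio.

def mix_with_dialog(bed, dialog, gain_db):
    """Mix dialog into the bed with a user-selected gain in dB."""
    gain = 10 ** (gain_db / 20.0)  # dB to linear amplitude
    return [b + gain * d for b, d in zip(bed, dialog)]

bed = [0.25, 0.5]
dialog = [0.5, 0.25]
neutral = mix_with_dialog(bed, dialog, gain_db=0.0)  # -> [0.75, 0.75]
louder = mix_with_dialog(bed, dialog, gain_db=6.0)   # dialog boosted ~2x
```

The key point is that the dialog stays a separate element until the decoder, so the listener, not the broadcaster, chooses its level.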

Standardization is an important consideration for NGA systems. AC-4’s core technology has been standardized by the European Telecommunications Standards Institute (ETSI) as TS 103 190. Digital Video Broadcasting (DVB) has incorporated it into TS 101 154, and it’s been adopted by the Advanced Television Systems Committee (ATSC) for ATSC 3.0. The SMPTE has taken extensive steps to develop compatibility of immersive audio across a variety of codecs.

SMPTE 2098
The ST 2098 suite of documents from SMPTE is aimed at standardizing immersive audio. Some of the elements of the ST 2098 suite include:

  • 2098-1 defines immersive audio metadata.
  • 2098-2, the primary document, specifies the Immersive Audio Bitstream (IAB).
  • 2098-3 describes immersive audio renderer operating expectations and testing recommendations.
  • 2098-4 covers immersive audio renderer interoperability testing.
  • 2098-5 defines digital cinema immersive audio channels and sound field groups.

ST 2098 is primarily based on Dolby Atmos but was created to be extensible and backward-compatible. Several immersive audio systems, including Dolby Atmos, Barco AuroMax, and DTS:X, have successfully tested interoperability.

Summary
Several codecs, including MPEG-H and Dolby AC-4, are available for immersive audio. More advanced implementations like MPEG-I are under development, and there's an industry-wide effort led by the SMPTE to develop interoperability standards for immersive audio codecs.

References
Dolby AC-4: Audio delivery for next-generation entertainment services, Dolby
Immersive audio, capture, transport, and rendering, Cambridge University Press
MPEG-I Immersive Audio – Reference Model For The Virtual/Augmented Reality Audio Standard, Audio Engineering Society
MPEG Standards for Compressed Representation of Immersive Audio, IEEE
SMPTE ST 2098-2:2019, IEEE