Google Introduces Low-Bitrate Speech Codec for Smoother Communication


The Lyra codec’s basic architecture is fairly simple. The features themselves are log mel spectrograms, lists of numbers representing the speech energy in various frequency bands, traditionally used for their perceptual relevance because they are modeled on human auditory response. At the other end, a generative model uses these features to recreate the voice signal. In this sense, Lyra is much like other traditional parametric codecs, such as MELP.

Currently designed to run at 3 kbps, Lyra outperforms other codecs at that bit rate and compares favourably with Opus at 8 kbps, a bandwidth reduction of more than 60%. Lyra can be used where bandwidth is insufficient for higher bit rates and where existing low bit rate codecs do not deliver adequate quality.
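The bandwidth figure follows directly from the two bit rates quoted above. The short Python sketch below works through the arithmetic; the helper names are illustrative, and it assumes 1 kbps means 1000 bits per second and counts only the codec payload, ignoring packet headers and other transport overhead.

```python
def bandwidth_reduction(new_kbps, reference_kbps):
    """Fraction of bandwidth saved by moving from reference_kbps to new_kbps."""
    return 1.0 - new_kbps / reference_kbps


def bytes_per_minute(kbps):
    """Payload bytes for one minute of audio (assumes 1 kbps = 1000 bit/s)."""
    return kbps * 1000 * 60 / 8


LYRA_KBPS = 3.0  # bit rate quoted for Lyra
OPUS_KBPS = 8.0  # Opus bit rate used in the comparison

print(f"reduction: {bandwidth_reduction(LYRA_KBPS, OPUS_KBPS):.1%}")  # 62.5%
print(f"Lyra payload per minute: {bytes_per_minute(LYRA_KBPS):.0f} bytes")
print(f"Opus payload per minute: {bytes_per_minute(OPUS_KBPS):.0f} bytes")
```

Under these assumptions, a one-minute call carries about 22.5 kB of speech payload at 3 kbps, against 60 kB at 8 kbps.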

Codecs in general, however, still struggle to deliver high-quality, low-latency real-time communication with minimal data. Although it may seem counterintuitive, some high-quality voice codecs require a higher bit rate than newer video codecs. Lowering the bit rate of an audio codec makes the voice less intelligible and more robotic.

Every 40 ms, Lyra extracts features, or distinctive attributes of speech, from the input (a list of numbers representing the speech energy in various frequency bands, called log mel spectrograms) and compresses them before transmission. At the receiving end, a generative model converts the features back into a speech signal.
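The feature-extraction step can be sketched as follows: split the audio into 40 ms frames, compute each frame's power spectrum, pool it into bands spaced on the mel scale, and take the logarithm. This is a minimal illustration in plain Python, not Lyra's actual implementation. The function names, the sample rate, the number of bands, the naive DFT, and the absence of windowing or frame overlap are all assumptions made for readability.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (an FFT would be used in practice)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(num_bands, frame_len, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(i * top_mel / (num_bands + 1)) for i in range(num_bands + 2)]
    bin_hz = sample_rate / frame_len
    filters = []
    for b in range(num_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        row = []
        for k in range(frame_len // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))  # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def log_mel_frames(samples, sample_rate=16000, frame_ms=40, num_bands=8):
    """Return one list of log mel band energies per non-overlapping frame."""
    frame_len = sample_rate * frame_ms // 1000
    filters = mel_filterbank(num_bands, frame_len, sample_rate)
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        power = power_spectrum(samples[start:start + frame_len])
        features.append([
            math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
            for row in filters
        ])
    return features


# Usage: 80 ms of a 375 Hz tone at 8 kHz (kept short because the DFT is naive).
tone = [math.sin(2 * math.pi * 375 * t / 8000) for t in range(640)]
frames = log_mel_frames(tone, sample_rate=8000)
print(len(frames), "frames of", len(frames[0]), "band energies each")
```

At 3 kbps, a 40 ms frame has a budget of 3000 × 0.040 = 120 bits, so whatever features are extracted must be compressed aggressively before transmission.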

Lyra’s new and improved “natural-sounding” generative model maintains the low bit rate of a parametric codec while achieving high quality, comparable to the newer waveform codecs commonly used on streaming platforms.

However, one drawback of these generative models is computational complexity. To remedy this, Lyra uses a cheaper variant of WaveRNN, a recurrent generative model. Although it works at a lower rate, it generates multiple signals in parallel in different frequency ranges. These signals are then combined into a single output signal at the desired sample rate. As a result, Lyra runs both on cloud servers and on mid-range phones, with 90 ms processing latency. According to the Google blog, this generative model is trained on thousands of hours of speech data and optimized to generate accurate audio output.

Google trained Lyra on thousands of hours of audio from speakers in more than 70 languages using open source audio libraries, then tested the audio quality with experts and crowdsourced listeners. A Google spokesperson said Lyra aims to create a universally accessible, high-quality audio experience.

