Sonos has unveiled a groundbreaking AI-powered sound processing feature with the launch of AI Speech Enhancement for the Sonos Arc Ultra. My interest piqued, I visited the Sonos UK audio development center to gain insight into the creation of this fascinating innovation.
In short, the feature gives users four distinct levels of dialogue enhancement, setting itself apart from Sonos’ previous offerings by skillfully isolating speech from ambient sound. This intelligent design amplifies dialogue without sacrificing the dynamic range and immersive Dolby Atmos experience that make the Arc Ultra a premier soundbar. Having tested it, I can attest that it enhances clarity while retaining impressive bass and intricate sound detail.
A significant element of this feature’s development involved collaboration with the Royal National Institute for Deaf People (RNID), the leading UK charity advocating for those with hearing impairments. Over a year, the team worked closely with individuals facing varying levels of hearing loss to refine this enhancement.
This upgrade is not merely an ‘accessibility’ feature but a widely available Speech Enhancement tool that can be easily accessed through the Now Playing screen on the Sonos app. Arc Ultra users now have four options to choose from, a marked increase from the previous two settings. The higher levels are specifically designed for users with hearing difficulties, while the lower levels provide a general boost in dialogue audibility.
To further understand the AI development process and the partnership with RNID, I headed to the Sonos UK product development center for discussions with key team members, including Matt Benatan, Principal Audio Researcher and AI project lead; Harry Jones, Sound Experience Engineer; Lauren Ward, Lead RNID Researcher; and Alastair Moore, RNID Researcher.
Incorporating AI into Audio
You may wonder why this feature is specifically designed for the Sonos Arc Ultra. Benatan elaborated that “the Arc Ultra’s superior CPU capabilities” are essential for processing the advanced AI algorithms involved.
“This technology is known as source separation,” Benatan explains. “It focuses on isolating a specific audio signal from a more complex track. The foundation of this concept lies in telecommunications, where significant development has occurred.”
He further explained, “Traditional methods often aim to remove background noise—like the hum of an air conditioning unit or traffic sounds. Our focus is on enhancement rather than elimination, making it a fundamentally different challenge.”
“In film and TV, audio is deliberately layered to create captivating experiences, blending elements like music and explosions. While these layers are important, dialogue must still be clear for the audience to fully appreciate the content. We aimed to use advanced neural network methods to enhance the processing of digital signals, applying dynamic ‘masks’ that adapt to varying audio frames,” Benatan added.
This innovative technique offers an adaptability that traditional methods lack. The AI performs a frame-by-frame analysis of incoming audio feeds, allowing for effective dialogue isolation.
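As a rough illustration of that frame-by-frame masking idea, the sketch below splits a signal into overlapping frames, applies a per-frame, per-frequency mask, and resynthesizes the result. The `toy_speech_mask` stand-in is a fixed band-favoring mask invented for this example; in the real feature a neural network predicts a fresh mask for every frame, and none of the parameters here are Sonos’ own.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Split a mono signal into windowed, FFT'd frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=512, hop=256):
    """Overlap-add the inverse-FFT'd frames back into a signal."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
    return out

def toy_speech_mask(spec, lo_bin=5, hi_bin=40):
    """Stand-in for the neural network: a fixed ratio mask that favors
    a 'speech band' of bins. A real model predicts one mask per frame."""
    mask = np.full(spec.shape, 0.2)
    mask[:, lo_bin:hi_bin] = 1.0
    return mask

sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 300 * t)        # stand-in for dialogue
noise = 0.5 * np.sin(2 * np.pi * 3000 * t)  # stand-in for effects/music
spec = stft(speech + noise)
enhanced = istft(spec * toy_speech_mask(spec))
```

The key property is adaptability: because the mask is recomputed for every frame, the processing can pass a quiet conversation through untouched and clamp down on a loud effects bed a fraction of a second later.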
Training the AI
Training neural networks and AI sound processing techniques requires diverse sample audio files. Sonos honed its AI model with an impressive 20,000 hours of realistic audio data, carefully avoiding copyright issues by not using actual movie content.
This decision was crucial, given the potential legal complexities surrounding the use of copyrighted films, a topic currently under review in the UK.
“Diversity in training data is critical,” Benatan emphasizes. “We collaborated with an award-winning sound designer to create the training materials that ensured our AI gained the insights necessary for future enhancements.”
Data augmentation was also vital, with sound samples processed in varying formats—stereo, 5.1 surround, and Dolby Atmos—allowing the neural network to gain a broader understanding of different audio types.
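One simple augmentation of this kind, rendering the same material into different channel layouts, can be illustrated with a textbook 5.1-to-stereo downmix using the standard ITU-R BS.775 coefficients. This is a generic recipe sketched for illustration, not Sonos’ actual pipeline, and Dolby Atmos rendering is considerably more involved.

```python
import numpy as np

def downmix_51_to_stereo(channels):
    """Fold a 5.1 mix (L, R, C, LFE, Ls, Rs) down to stereo using the
    common ITU-R BS.775 coefficients. The LFE channel is dropped, as in
    a typical downmix."""
    L, R, C, LFE, Ls, Rs = channels
    g = 1.0 / np.sqrt(2.0)  # -3 dB gain for center and surrounds
    left = L + g * C + g * Ls
    right = R + g * C + g * Rs
    return np.stack([left, right])
```

Note how dialogue, usually carried by the center channel, lands in both stereo channels at reduced level; exposing the network to the same dialogue in multiple layouts like this helps it generalize across formats.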
One might wonder if it would have been easier to train the AI using a library of copyrighted films.
“While that would streamline data collection, it would bypass valuable insights,” Benatan replied.
By collaborating with sound designers, Sonos obtained critical understanding about audio mixing and the intricacies of sound creation, enriching their AI model’s knowledge base.
“Engaging with sound designers led us to explore scene compositions and expectations, providing invaluable feedback that shaped our approach to training on open-source data,” he stated.
Reaching a Broader Audience
Even with extensive data, optimizing the AI model was a top priority. The decision to collaborate with RNID stemmed from personal discussions among the Sonos team members regarding their family members’ challenges with dialogue comprehension.
Benatan noted, “Conversations revealed how much more engaging our content could be for those experiencing hearing difficulties.”
This insight solidified the idea that the model could enhance not just speech clarity but also accessibility for those within the hearing health community.
Lauren Ward shared, “For the first time, we engaged RNID early in our development process, enabling us to gather crucial feedback before finalizing the product.”
With RNID’s involvement, Sonos was able to draw input from experts in both audio technology and hearing loss, significantly influencing the development process.
Ward emphasized the value of insights from individuals who understand both audio tech and hearing loss, making them essential contributors during testing.
As testing progressed, these conversations proved critical: participants’ experiences of different audio configurations were complex, and not every tester found them easy to articulate.
Ward noted, “During one test, understanding how to communicate the experience was challenging. When I asked if the voice sounded natural at higher enhancement levels, a participant who had been deaf from birth replied, ‘What is natural?’”
Navigating Loudness Recruitment
A core focus was understanding how sound manipulation can affect user experiences—while adding clarity can be beneficial, it may also lead to discomfort. Benatan explained, “There’s a phenomenon known as loudness recruitment, where softer sounds become nearly inaudible, while louder sounds can become harsh or painful. This understanding was integral in designing our new feature, ensuring dialogue remains comfortably audible without detracting from the overall listening experience.”
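Loudness recruitment is why simply turning the volume up does not help: it is typically managed with dynamic range compression, which narrows the gap between quiet dialogue and loud peaks. A minimal static compressor in the log domain might look like the sketch below; the threshold and ratio are illustrative values I have chosen, not anything Sonos has published.

```python
def compress_db(level_db, threshold_db=-30.0, ratio=3.0):
    """Simple downward compressor on a signal level in dBFS.

    Levels below the threshold pass through unchanged; levels above it
    are reduced by (1 - 1/ratio) of the overshoot, so loud effects are
    pulled closer to quiet dialogue instead of becoming painful.
    """
    over = max(level_db - threshold_db, 0.0)
    return level_db - over * (1.0 - 1.0 / ratio)

# A -40 dB whisper is untouched; a -10 dB explosion is tamed to about -23 dB.
quiet = compress_db(-40.0)
loud = compress_db(-10.0)
```

The design goal Benatan describes maps onto this trade-off: keep dialogue comfortably inside the listener’s usable dynamic range without flattening the mix for everyone else.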
“Speech enhancement isn’t just for those with hearing impairments; it strives to offer an enjoyable viewing experience for everyone,” Benatan added.
Balancing Customization with Usability
As we move into an era of personalized audio—much like Denon’s PERL Pro, which customizes sound based on individual hearing profiles—Sonos offers a more straightforward approach with four presets: Low, Medium, High, and Max. I questioned whether this could limit its adaptability to different user preferences.
Ward acknowledged, “Finding the balance between simplicity and personalization can be challenging. Initially, we planned for three settings, but feedback indicated a demand for an additional level.”
“It’s crucial to differentiate between products meant to replicate hearing aids and those intended for entertainment. Our options allow for flexibility; users may prefer different settings based on what they are watching,” she explained.
“One day they may desire maximum immersion and choose to omit speech enhancements; the next, they might prioritize dialogue clarity. With multiple viewers, the combinations are endless.”
An essential aspect of this feature is its user-friendliness.
Ward commented, “There are many variables in the speech enhancement process, and analyzing them all simultaneously can be overwhelming. Integrating the feature into the Now Playing screen ensures it is user-friendly and frequently utilized.”
She stressed the importance of ensuring that accessibility also prioritizes ease of use.
When exploring whether users modified their viewing habits due to the enhanced audio quality, Ward noted that the upgrades not only changed how individuals watched content but also allowed them to explore new types of shows.
“In one test involving a tumultuous sci-fi battle scene, a participant shared that they typically shied away from such content. However, after experiencing the speech enhancement, they felt empowered to engage with it. The enhancements helped them stay involved without feeling overwhelmed,” Ward remarked.
Alastair Moore added that about 50% of individuals over 50 experience some form of hearing loss, highlighting the significant impact that improving audio clarity can have.
Dynamic Solutions for Varied Soundscapes
The culmination of these efforts leads to a well-refined system that knows when to process audio and when to let it be. Harry Jones noted, “We sought to determine when enhancement was necessary. We aimed to avoid altering scenes that didn’t require it, like an exciting chase sequence, while still assisting those that did.”
“Through our work with RNID, we learned that users don’t just want to amplify speech; they want to savor the whole audio experience,” he added.
A significant aspect of the design involved analyzing each sound scene before processing it. The system examines incoming audio in frames of roughly 5.8 milliseconds, adapting its response based on whether dialogue is masked by louder sounds.
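That per-frame decision can be sketched as a gate on the ratio of speech energy to everything else in the frame. Everything below is an assumption for illustration: the function name, thresholds, and linear ramp are mine, since Sonos has not published its detection logic. (For scale, 5.8 milliseconds corresponds to roughly 278 samples at 48 kHz.)

```python
import math

def enhancement_gain(speech_energy, residual_energy, floor_db=-10.0, clear_db=10.0):
    """Decide how much to boost the isolated speech in one frame.

    If speech already dominates the residual (music/effects), leave the
    frame alone; as the residual increasingly masks the speech, ramp the
    boost up toward a maximum. Thresholds are illustrative values.
    """
    eps = 1e-12
    smr_db = 10.0 * math.log10((speech_energy + eps) / (residual_energy + eps))
    if smr_db >= clear_db:   # dialogue clearly audible: no processing
        return 1.0
    if smr_db <= floor_db:   # dialogue heavily masked: full boost
        return 2.0
    # linear ramp between the two regimes
    frac = (clear_db - smr_db) / (clear_db - floor_db)
    return 1.0 + frac
```

The point of a gate like this is exactly what Jones describes: a chase sequence with no dialogue passes through untouched, while a mumbled line under an explosion gets the strongest intervention.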
Sonos identified 15 potential causes of unclear dialogue in media, ranging from mastering errors to external interference. While the technology cannot rectify every issue, it mitigates many of the most common underlying ones.
They also classified various sound mixes—ranging from isolated dialogue segments to scenes rich in music and effects.
“The challenge was determining when enhancement should occur versus when merely cleaning up the audio would be sufficient. Dialogue clarity can range from nonexistent to crystal clear. For instance, unclear dialogue amidst background noise requires more intensive intervention, while conversations over music call for subtler adjustments,” Jones pointed out.
“By effectively isolating speech, we can now pinpoint the optimal moments for enhancement,” he noted.
If you’re a Sonos Arc Ultra owner, you’ll soon be able to access this new feature. For many, it may not seem essential, given that the Arc Ultra already provides excellent dialogue clarity. However, I believe that the ‘Low’ setting will resonate with those who have been hesitant to adopt speech enhancement. I’m excited to see if the High and Max settings deliver the level of support that Sonos and RNID envision.