YouTube automates sound effect captions with AI

Its AI can detect laughter, applause and music for the deaf or hard of hearing.

YouTube has used algorithms to automatically caption speech for eight years now in an effort to make its billions of videos more accessible for the deaf and hard of hearing. While the feature was pretty rough at first, the company has significantly improved it over time, getting "closer and closer to human transcription error rates," Google said on its developers blog. Since speech is just one part of the audio picture, though, YouTube has launched automatic sound effect captioning for the first time.

For now, the system can show only three classes of sounds: applause, music and laughter. "These were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing," the company wrote.

As with the automatic captions, Google uses machine learning to pick out sounds and display them as text. It developed a "deep neural network (DNN)" model for ambient sound, and trained it with "thousands of hours of videos" to get the best results. The toughest part, it wrote in a technical blog, was separating and displaying events that tend to occur at the same time, like laughter and applause.

You can see what that looks like in the clip from America's Got Talent below. The sound effects are merged with the automatic speech recognition and "shown as part of the standard automatic captions," much as you'd see in a closed-captioned TV show.
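
To make the merging step concrete, here is a minimal Python sketch of how timed sound labels might be interleaved with speech captions on a single track. It is an illustration only, not Google's implementation: the `Cue` type, the `merge_captions` helper, the bracketed uppercase label style and the sample timings are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float
    text: str


def merge_captions(speech, sounds):
    """Interleave speech cues and sound-effect cues, ordered by start time.

    Hypothetical helper: sound labels are wrapped in brackets so that
    viewers can tell them apart from transcribed speech.
    """
    labeled = [Cue(s.start, s.end, f"[{s.text.upper()}]") for s in sounds]
    return sorted(speech + labeled, key=lambda c: (c.start, c.end))


# Made-up sample data: two speech cues and two overlapping sound events.
speech = [Cue(0.0, 2.5, "Thank you so much."), Cue(6.0, 8.0, "That was incredible.")]
sounds = [Cue(2.5, 5.5, "applause"), Cue(5.0, 6.0, "laughter")]

track = merge_captions(speech, sounds)
for cue in track:
    print(f"{cue.start:05.1f}-{cue.end:05.1f}  {cue.text}")
```

In this sketch the applause and laughter cues overlap for half a second and are simply listed one after the other. A real system would still have to decide how to display simultaneous events, which is the hard part the company describes.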

YouTube's team said it's aware that the captions are "simplistic," but adding features will be easier now that it has built a solid back-end foundation. In the future, it'll introduce common sounds like barking, knocking or ringing. That will pose new challenges, as the AI will need to figure out if a ringing sound is coming from an alarm, phone or doorbell, for example.

It'll be worth the effort, though, as Google says that two-thirds of participants in a study found that sound effect captions enhance the video experience. And while it's bound to make mistakes no matter how good it gets (even humans are only about 95 percent accurate), users think that the odd error won't detract from the benefits.