It's hard enough for us to keep track of who's talking at a loud or crowded party, so imagine how difficult it is for automated systems to follow. Speech recognition at a reasonable quality has really only been mastered in the last decade or two. Add in conflicting sounds as people talk over each other, and an already tricky problem becomes much harder.
Fortunately (or unfortunately) for us, researchers at Google have been working on isolating sources of audio like speech in videos, and the results they showed off yesterday are kind of incredible and simultaneously terrifying.
Separating audio like speech from ambient voices or sounds is something people are good at, but which automated systems have difficulty with. In the right circumstances, we're easily able to mentally tune things out to focus on a single speaker, but a microphone picking up sound from multiple sources can't do the same thing. At least, not by itself.
Researchers at Google have built a machine learning-powered system that can pick out specific sounds like speech in a video. And I don't just mean isolating spoken words from background audio sources like ambient noise (though it can do that, too), but entirely separating the speech of two people talking simultaneously. And based on the results, it can do a better job than we can.
The method the researchers used for training the network is pretty ingenious, too. After all, the hardest part of machine learning is figuring out how best to "teach" a model to get the results you want. In this case, they built "fake cocktail parties," composed of manually spliced "clean" sources of audio and video, overlaid with similarly clean background noise. That data is then fed to the network, training it with facial movements from the video and spectrograms of the merged audio track. The network is then able to determine which frequencies at which times are most likely to correspond to a given speaker, and that data is extracted into a new isolated audio track: the ultimate result.
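To make the "which frequencies at which times" idea a little more concrete, here is a minimal sketch of time-frequency masking on a synthetic mixture. It is not Google's model: there is no video and no neural network here, and every name in it (`FRAME`, `ratio_mask`, `tone`, and so on) is made up for illustration. Two pure tones stand in for the two speakers, and the mask is computed directly from the known clean sources, which is only possible because the mixture is artificial. A trained network would have to estimate that mask from the mixed audio and the speaker's face instead.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (hypothetical choice for this toy)


def dft(frame):
    """Plain discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, keeping only the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectrogram(signal):
    """Split a signal into non-overlapping frames and transform each one."""
    return [dft(signal[i:i + FRAME]) for i in range(0, len(signal) - FRAME + 1, FRAME)]


def ratio_mask(target_spec, other_spec):
    """Per frame and frequency bin: the share of magnitude owned by the target."""
    eps = 1e-12  # avoids division by zero in silent bins
    return [
        [abs(a) / (abs(a) + abs(b) + eps) for a, b in zip(frame_t, frame_o)]
        for frame_t, frame_o in zip(target_spec, other_spec)
    ]


def apply_mask(mix_spec, mask):
    """Weight the mixture's spectrogram by the mask and return audio samples."""
    samples = []
    for frame, weights in zip(mix_spec, mask):
        samples.extend(idft([c * w for c, w in zip(frame, weights)]))
    return samples


def tone(freq_bin, length):
    """A sine wave that falls exactly on one DFT bin (a stand-in for a voice)."""
    return [math.sin(2 * math.pi * freq_bin * t / FRAME) for t in range(length)]


# Build a "fake cocktail party" from two clean sources.
length = FRAME * 4
speaker_a = tone(3, length)
speaker_b = tone(10, length)
mixture = [a + b for a, b in zip(speaker_a, speaker_b)]

# Oracle mask from the clean sources; a network would have to predict this.
mask_a = ratio_mask(spectrogram(speaker_a), spectrogram(speaker_b))
recovered_a = apply_mask(spectrogram(mixture), mask_a)

# Largest sample-wise difference between the recovered and the clean signal.
error = max(abs(r - a) for r, a in zip(recovered_a, speaker_a))
print(f"max reconstruction error for speaker A: {error:.2e}")
```

Because the two tones sit in completely separate frequency bins, the mask recovers speaker A almost perfectly. Real voices overlap in frequency, which is exactly where a simple approach like this one breaks down and a learned model earns its keep.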
Of course, the concept seems easy enough when the two speakers have drastically different voices, like the two examples above. If it's isolating audio based on frequency, the bigger the pitch difference between the speakers' voices, the better the results. But what about when you splice together two videos of the same speaker and try to isolate them?
Google unfortunately (and inexplicably) took down the spliced Sundar video, which was the best example of similar frequency use, so you'll have to trust my assessment. In the video, you could hear a few irregularities as the two overlaid Sundars were using similar frequencies at the same time, but the results were pretty stunning. Frankly, I've had phone calls without background noise that sounded worse.
The privacy implications of something like this are honestly pretty serious. If performance can be improved, a system like this might even be able to pick a single voice out of a crowd on the street. Even in the seeming privacy of a loud public group, what you say could be individually picked out by a third-party observer. Right now it doesn't quite seem up to that task, but given a large enough array of microphones and cameras, who knows? It might not be far off.
Google took down the video of Sundar talking over himself (which was the best example out of all the demonstrations), so some portions of this article that made reference to it while it was embedded have been rewritten.