Google Neural Network Can Isolate Individual Voices in Videos

The bleeding edge of computer science these days is all about making computers more like humans. We’re using neural networks to help machines recognize objects, play games, and even speak in a more realistic way. In a new feat of machine learning magic, Google Research has developed a system that can replicate the “cocktail party effect,” where your brain focuses on a single audio source in a crowded room. The results are impressive — almost worryingly so.

Google calls this technique “Looking to Listen” because it watches videos with multiple speakers to split up the audio. It uses both auditory and visual signals, just like your brain does. There’s nothing special about these videos, either. They’re just ordinary videos with a single audio track in which more than one person is speaking.

To build a tool capable of this, Google started with 100,000 samples of high-quality lectures and talks from YouTube. Engineers chopped up the videos to get segments of clean speech with clearly visible speakers and no background noise. That left Google Research with 2,000 hours of video, each clip consisting of a single person speaking (they call this the AVSpeech data set). The trick was using these clean samples to create “fake” cocktail parties. The researchers mixed the clean clips together so that multiple people were speaking at once. Because the original clips were still on hand, the correct separated audio for every mixture was already known. That’s the data Google used to train its neural network.
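The “fake cocktail party” idea can be sketched in a few lines of Python. This is only an illustration, not Google's pipeline: sine tones stand in for clean speech clips, and the names `tone`, `mix_clips`, and `make_training_example` are made up for this example.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate in Hz, chosen for illustration


def tone(freq_hz, num_samples, amplitude=0.5):
    """Stand-in for a clean, single-speaker clip: a pure sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(num_samples)]


def mix_clips(clips):
    """Sum clips sample by sample, truncating to the shortest clip."""
    length = min(len(clip) for clip in clips)
    return [sum(clip[i] for clip in clips) for i in range(length)]


def make_training_example(clips):
    """Return (mixture, targets): the noisy input and the known clean answers."""
    length = min(len(clip) for clip in clips)
    targets = [clip[:length] for clip in clips]
    return mix_clips(targets), targets


speaker_a = tone(220.0, num_samples=160)
speaker_b = tone(330.0, num_samples=320)
mixture, targets = make_training_example([speaker_a, speaker_b])
```

The point of the sketch is that the mixture is built from the clean clips, so a training example always comes with its own answer key.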

Like many other Google Research breakthroughs, this one used a convolutional neural network. The input to the network consists of visual features of the speakers as well as the spectrogram of the video’s soundtrack. By processing both, the network learns to produce a “time-frequency mask” for each speaker. Each output mask is multiplied by the spectrogram of the input audio to generate a separate audio track for that speaker.
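To make the masking step concrete, here is a toy sketch in Python. It rests on several assumptions: the spectrograms are tiny hand-written grids, magnitudes are treated as simply adding up in the mixture (which ignores phase), and the mask is computed directly from the known sources rather than predicted by a network from audio and video. The function names are hypothetical.

```python
def ratio_mask(target, others, eps=1e-8):
    """Oracle mask: the target's share of the energy in each time-frequency bin."""
    return [[t / (t + o + eps) for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target, others)]


def apply_mask(mask, spectrogram):
    """Multiply a mask into a spectrogram, bin by bin."""
    return [[m * s for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(mask, spectrogram)]


# Toy magnitude spectrograms: rows are frequency bins, columns are time frames.
speaker_a = [[4.0, 0.0, 2.0],
             [0.0, 1.0, 0.0]]
speaker_b = [[0.0, 3.0, 2.0],
             [5.0, 1.0, 0.0]]

# Simplifying assumption: the mixture's magnitude is the sum of the sources.
mixture = [[a + b for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(speaker_a, speaker_b)]

mask_a = ratio_mask(speaker_a, speaker_b)
mask_b = ratio_mask(speaker_b, speaker_a)

# Each mask keeps one speaker's share of every bin and suppresses the rest.
estimate_a = apply_mask(mask_a, mixture)
estimate_b = apply_mask(mask_b, mixture)
```

In this toy setup each estimate matches its source almost exactly, because the mask was derived from the answers. The hard part, and the job of the trained network, is predicting a good mask when only the mixture and the video are available. Converting a masked spectrogram back into a waveform is left out here.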

With the training done, Google unleashed the network on new videos. As you can see in Google’s examples, this works surprisingly well. The Looking to Listen model can identify which audio is coming from a given speaker and filter out everything else. This technology could have applications in video conferencing, hearing aids, and video surveillance.

On that last point, this technology could be so powerful that it’s not hard to imagine scenarios where it’s abused. With future speed and accuracy improvements, an observer could pick out your voice on a crowded street to find out what you said. There’s no indication Google has any intention of doing that, but it’s not alone in doing neural network research.
