Voice camouflage algorithm blocks smart microphones from eavesdropping

Technology News
By Rich Pell

Microphones are embedded in nearly everything today – from phones, watches, and televisions to voice assistants – and, say the scientists, they are always listening. Computers constantly use neural networks and AI to process speech and glean information about users.

To prevent this – and as an alternative to masking conversations with loud background music or other noise – the scientists developed a new system that generates whisper-quiet sounds that can be played in any room, in any situation, to block smart devices from acoustic spying. The system, they say, is easy to implement on hardware such as computers and smartphones, giving people agency over the privacy of their voice.

“A key technical challenge to achieving this was to make it all work fast enough,” says Carl Vondrick, assistant professor of computer science. “Our algorithm, which manages to block a rogue microphone from correctly hearing your words 80% of the time, is the fastest and the most accurate on our testbed. It works even when we don’t know anything about the rogue microphone, such as the location of it, or even the computer software running on it. It basically camouflages a person’s voice over-the-air, hiding it from these listening systems, and without inconveniencing the conversation between people in the room.”

While this approach to corrupting automatic speech recognition systems has long been known to be theoretically possible, generating such attacks fast enough for practical use has remained a major bottleneck. The problem, say the researchers, is that a sound that breaks a person’s speech now – at this specific moment – isn’t a sound that will break speech a second later.

As people talk, their voices constantly change as they form different words, and they speak quickly. These rapid variations make it almost impossible for a machine to keep up with the pace of a person’s speech.

“Our algorithm is able to keep up by predicting the characteristics of what a person will say next, giving it enough time to generate the right whisper to make,” says Mia Chiquier, lead author of the study and a PhD student in Vondrick’s lab. “So far our method works for the majority of the English language vocabulary, and we plan to apply the algorithm on more languages, as well as eventually make the whisper sound completely imperceptible.”

The researchers needed to design an algorithm that could break neural networks in real time, could be generated continuously as speech is spoken, and would apply to the majority of a language’s vocabulary. While earlier work had successfully met at least one of these three requirements, none had achieved all three.

The new algorithm uses what the researchers call “predictive attacks” – a signal that can disrupt any word that automatic speech recognition models are trained to transcribe. In addition, when attack sounds are played over the air, they need to be loud enough to disrupt any rogue “listening-in” microphone, which could be far away: the attack sound must carry the same distance as the voice.

The researchers’ approach achieves real-time performance by forecasting an attack on the future of the signal, or word, conditioned on two seconds of input speech. The researchers optimized the attack so it has a volume similar to normal background noise, allowing people in a room to converse naturally and without being successfully monitored by an automatic speech recognition system. The researchers say they successfully demonstrated that their method works inside real-world rooms with natural ambient noise and complex scene geometries.
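The forecasting idea described above can be sketched as a streaming loop: the system never perturbs audio it has already heard, but uses a two-second context window to predict a perturbation for the chunk that has not yet been spoken, which buys the time needed to compute and play the attack. The predictor below is a stand-in (the paper trains a neural network for this); all names, sample rates, and parameters here are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz, a common rate for speech models (assumption)
CONTEXT_SEC = 2.0      # attack is conditioned on two seconds of past speech
CHUNK = 1_024          # samples emitted per step (illustrative)

CONTEXT_LEN = int(CONTEXT_SEC * SAMPLE_RATE)

def predict_attack(context: np.ndarray, chunk_len: int) -> np.ndarray:
    """Stand-in predictor: forecast a quiet perturbation for the NEXT chunk.

    A real system would run a trained network here; this toy version just
    shapes low-amplitude noise by the recent signal energy, so the attack
    stays near background-noise volume.
    """
    rms = np.sqrt(np.mean(context ** 2)) if context.size else 0.0
    rng = np.random.default_rng(0)
    return 0.1 * rms * rng.standard_normal(chunk_len)

def stream_camouflage(speech: np.ndarray):
    """Yield (clean_chunk, attack_chunk) pairs in streaming order.

    The attack for chunk t is computed from audio up to chunk t-1 only,
    mirroring the causality constraint of a real-time predictive attack.
    """
    buffer = np.zeros(0)
    for start in range(0, len(speech) - CHUNK + 1, CHUNK):
        context = buffer[-CONTEXT_LEN:]
        attack = predict_attack(context, CHUNK)  # forecast before hearing it
        chunk = speech[start:start + CHUNK]
        buffer = np.concatenate([buffer, chunk])
        yield chunk, attack

# A rogue microphone records chunk + attack; people in the room hear
# mostly the chunk, since the attack sits at background-noise volume.
```

The key design point the sketch preserves is causality: because real audio hardware and networks introduce delay, an attack computed *after* a word is heard arrives too late, so the perturbation must be predicted ahead of the speech it is meant to mask.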

For more, see “Real-Time Neural Voice Camouflage.”
