Status Report #2: 10/5 (Eugene)

  • Analyzed last week’s basic audio I/O demo to identify ways to decrease the latency between hearing a sound and responding
  • Analyzed query waveforms to find volume-based cues the program can use to recognize that a query has started
  • Set up Alexa and Google Home for testing (an hour with computing services got nowhere; I learned later that Google Home can’t be set up with G Suite, and no one had said otherwise)
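As a sketch of the volume-based query detection above: one simple approach is an RMS threshold with a short debounce run, so a single click or pop doesn't trigger a false start. The threshold and run length below are illustrative placeholders, not our tuned parameters:

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one frame of samples (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_query_start(frames, threshold=0.1, run_length=3):
    """Return the index of the first frame of a run of `run_length`
    consecutive frames whose RMS volume is at or above `threshold`,
    or None if no such run exists. Requiring a short run filters out
    one-frame transients that are not the start of speech."""
    run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            run += 1
            if run == run_length:
                return i - run_length + 1
        else:
            run = 0
    return None
```

For example, five silent frames followed by loud ones yields index 5, the first loud frame of the run.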

Status Report #2: 10/5 (Spencer)

  • Because I am presenting this week for design review, I focused on the design presentation slides and presentation preparation. 
  • I spent my time thinking about the overall narrative for the design presentation as well as making block diagrams for various components.
  • I also did a preliminary investigation into which NLP system and which filter designs we might want to use for the optimizations we plan to try.

Status Report #1: 9/28 (Group Report)

  • Talked to Prof. Vyas about how to reframe the problem, since the problem space we specified is much larger than we can handle in a semester. He was concerned that even our normal goals were quite difficult to do and suggested we reframe the problem into attacking either the wake words or select query phrases.
  • How do we reduce the latency between voice detection and audio playback? It seems to take slightly longer than the 8 ms observed in our timing code. We are looking into how latency and buffers affect it: http://digitalsoundandmusic.com/5-2-3-latency-and-buffers/. We tried lowering the sample rate to fill the audio buffer more quickly, but it did not seem to make a difference.
  • Risk management: Professor Vyas suggested we consider jamming one or two specific commands instead of the wake word as a backup. This could be a good alternative if the latency is too high for the current version of the problem, because we would not need to generate the jamming input until the user speaks after saying the wake word (which gives us more time).
  • Updated schedule: breaking project into 3 phases to reflect the updated project.
    • First phase: Determining jamming inputs (research phase)
      • Defining sample voice inputs and generate voice recordings 
      • Reducing latency after detection of audio 
      • Set up various black box systems 
      • Testing sample inputs on Siri/Google Home/Alexa
    • Second phase: Wake word detection
      • Building model for wake word detection 
      • Training model to recognize wake word 
      • Generating noise after wake word detected 
      • Detecting when user has stopped speaking
    • Third phase: Timing optimization / generalization of attack
      • Setting up timing infrastructure for testing attack 
      • Investigating a model to predict the time delay between the wake phrase and the query 
      • Building model for wake phrase length prediction 
      • Training/Testing model for wake phrase length prediction 
      • Integration 
      • Performance Tuning
      • Obfuscation from User
  • Next week: we need to find better metrics on how often our voice activated systems correctly interpret queries without attempted interference.
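One way to reason about the latency question above: with a blocking read, the program cannot see new audio until a whole buffer has been captured, so the buffer size in frames, not the sample rate, is the knob that matters. A back-of-the-envelope sketch (the buffer sizes here are illustrative, not our actual settings) may also explain why lowering the sample rate didn't help: at a fixed buffer size in frames, a lower rate means each buffer spans *more* wall-clock time.

```python
def buffer_latency_ms(frames_per_buffer, sample_rate):
    """Wall-clock time to fill one audio buffer, in milliseconds.

    With a blocking read, no new audio is visible until a whole buffer
    has been captured, so this is a lower bound on input latency
    (output adds at least one more buffer on top).
    """
    return 1000.0 * frames_per_buffer / sample_rate

# Typical PyAudio-style defaults: 1024-frame buffers at 44.1 kHz
default = buffer_latency_ms(1024, 44100)   # ~23.2 ms per buffer
# Shrinking the buffer is what cuts latency...
small = buffer_latency_ms(128, 44100)      # ~2.9 ms per buffer
# ...while lowering the sample rate at a fixed buffer size makes it worse:
low_rate = buffer_latency_ms(1024, 16000)  # 64 ms per buffer
```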

Status Report #1: 9/28 (Eugene)

This week, we decided to work on black box systems in an experimental phase. After some experimentation in lab, Spencer and I decided to investigate further by meeting with Professor Stern, an expert on voice recognition systems. He directed us to an article published by Apple explaining the underlying mechanics of Hey Siri, and cited it as evidence that obfuscating mature black-box systems, whose server-side processing has been honed by trillion-dollar companies over the past decade, would be difficult given our relatively limited solution space. 

After meeting with Spencer and Cyrus, we decided to pivot the focus and challenge of the exploit to low-latency responses. One new solution/exploit we are now considering is building an NLP system that can react to “Hey Siri”, “OK Google”, and other wake words as fast as possible. To verify the feasibility of such a system, I wrote baseline code using PyAudio to listen for noise and play music as quickly as possible. I’ve linked it to a GitHub repo, and you can clone the file to try with any .WAV file (unfortunately, the one I’ve been testing with is a copyrighted song, and I don’t want to get arrested for this project).
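The core loop of a listen-then-play baseline like that can be sketched with the I/O injected as plain functions, so the logic is testable without a sound card; in the real script, `read_chunk` and `play` would wrap PyAudio input/output streams. The names here are hypothetical, not the actual repo code:

```python
def listen_then_play(read_chunk, play, is_loud, max_chunks=1000):
    """Poll audio chunks until one is loud, then fire playback immediately.

    read_chunk() -> next chunk of samples (would wrap stream.read in PyAudio)
    play()       -> starts the jamming .WAV (keep it pre-loaded in memory so
                    playback does not wait on disk I/O)
    is_loud(c)   -> volume-threshold test on a chunk
    Returns the number of chunks read before playback triggered, or None
    if nothing loud arrived within `max_chunks`.
    """
    for n in range(max_chunks):
        chunk = read_chunk()
        if is_loud(chunk):
            play()          # fire as soon as the first loud chunk lands
            return n + 1
    return None
```

Injecting the I/O also makes it easy to swap a neural-net wake-word detector in for the simple `is_loud` check later without touching the loop.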

Next week, I hope to continue identifying jamming signals using a more methodical, experimental approach to better quantify how the success rate of a given signal depends on loudness, distance from the person, the speaker, and other factors. I also want to use Apple’s machine learning paper as a foundation for our NLP system to identify wake words. Finally, I hope to meet with our advisors and professors to further hone the project’s details, feasibility, and specification.


Status Report #1: 9/28 (Spencer)

  • Carried out experiments on how to “jam” the wake word on Siri, since we did not have Google Home/Alexa yet. Tests were successful with human voices. However, playing a voice recording of the jamming voice in a loop seemed to give only around a 50% success rate. (Done with Cyrus)
  • Spoke to Prof. Stern about challenges associated with black box systems / discussed current research in speech processing. (Done with Eugene)
  • Latency testing for the program that Eugene wrote using Python and PyAudio: good results. It is very fast to detect input and emit a predefined output, even without a neural net in the middle. This establishes that what we are doing is possible. (Done with Cyrus)
  • Next week: 
    • Conduct experiments with Alexa (just arrived) and Google Home (if it arrives soon). Observe if there are differences between how they are activated.
    • Discuss how to refine our solution / handle problems with professors.
    • Work on design presentation, since I am presenting it. 
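For the latency testing mentioned above, a small helper around the detect-to-output path gives repeatable numbers to compare against the ~8 ms figure from our timing code. `time.perf_counter` is the right clock for short intervals; `detect_and_respond` below is a placeholder name, not a function in our repo:

```python
import time

def timed_ms(fn, *args):
    """Call fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# e.g. wrap the whole detect->respond path and log the latency:
#   _, latency = timed_ms(detect_and_respond)
```

Running this over many trials (and reporting the distribution, not just one number) would give us the better metrics the group report asks for.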

Status Report #1: 9/28 (Cyrus)

  • Carried out experiments on how to “jam” the wake word on Siri, since we did not have Google Home/Alexa yet. Tests were successful with human voices. However, playing a voice recording of the jamming voice in a loop seemed to give only around a 50% success rate. (Done with Spencer)
  • Redefined the problem as a latency problem: how do we obfuscate the wake word effectively? We need to hit the “s” sound at the same time as Siri.  
  • Latency testing for the program that Eugene wrote using Python and PyAudio: good results. It is very fast to detect input and emit a predefined output, even without a neural net in the middle. This establishes that what we are doing is possible. (Done with Spencer)

Project Introduction & Summary

This is the blog page for Team B2’s capstone project. This semester, our team will build a computer program designed to output a signal when someone speaks to a smart speaker, obfuscating their command. While many smart speaker exploits exist, they take advantage of unrealistic setups, such as multiple speakers in one room or loud ultrasonic signals that can penetrate walls. Our exploit attempts to prevent smart speaker interaction using commodity hardware, thus preventing access to a home’s IoT network. By the end of the semester, we hope to have created a low-footprint background process that runs maliciously on a nearby computer, listening for audio and obfuscating commands when someone speaks nearby.