- Met with Prof. Stern regarding how to proceed with our attack.
- Helped Spencer a little with rewriting the code in MATLAB.
- Developed a testing framework in MATLAB and laid out a clear design for the metrics and validation component of our project.
- Worked on the final presentation slides.
Author: cdaruwal
Status Report #9: 11/23 (Cyrus)
- Met with Stern on Wednesday to try to fix issues with our MFCC + DTW code.
- The librosa library we were using in Python does not provide enough options for tweaking the algorithm to our specific application, and hence we decided to switch to MATLAB after a second meeting with Professor Stern.
- Verification of our algorithm requires the use of spectrograms, and we are currently having trouble replicating spectrograms in MATLAB for audio samples whose spectrograms are already known.
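- For reference, a rough Python sketch of the knobs that have to match before a computed spectrogram lines up with a published one (window, hop, FFT size, and dB scaling); the file name and parameter values below are placeholders, not our actual settings:
      import numpy as np
      import scipy.signal
      from scipy.io import wavfile

      # Placeholder clip: a test sample whose spectrogram is already known.
      fs, samples = wavfile.read("hey_siri_sample.wav")
      samples = samples.astype(np.float64)
      if samples.ndim > 1:
          samples = samples[:, 0]  # keep one channel

      # Window type/length, hop (via noverlap), and FFT size all have to
      # match the reference, or the two spectrograms will look different.
      f, t, Sxx = scipy.signal.spectrogram(
          samples, fs=fs, window="hann",
          nperseg=512, noverlap=512 - 128, nfft=512,
      )

      # Reference plots are usually in dB, so scaling alone can make two
      # otherwise identical spectrograms appear to disagree.
      Sxx_db = 10 * np.log10(Sxx + 1e-12)
      print(Sxx_db.shape)  # (frequency bins, time frames)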
Status Report #8: 11/16 (Cyrus)
- Worked on using DTW to map the time series of one signal to the time series of another signal.
- Wrote DTW code that returns a path with this mapping, but we are unsure how to use this path to actually compare two signals and determine whether they match. Hence, we decided to meet with Stern to figure out what conclusions we can draw from DTW, and how it addresses the issues we experienced with our initial MFCC implementation (one candidate scoring idea is sketched below).
- We also started testing in different environments and realised that ambient noise might be a problem, so we are thinking of using adaptive filters to remove it.
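- One scoring idea we plan to run by Stern, sketched in Python: collapse the warping path into a single number by averaging the frame distances along it, then threshold that score. The threshold below is a placeholder that would have to be tuned on real recordings:
      import numpy as np

      def path_score(path, x, y):
          """Average per-step distance along a DTW warping path.

          path: list of (i, j) index pairs into the frame sequences
          x, y: arrays of shape (num_frames, num_features), e.g. MFCC frames
          """
          dists = [np.linalg.norm(x[i] - y[j]) for i, j in path]
          # Dividing by the path length keeps the score comparable for
          # utterances spoken at different speeds.
          return sum(dists) / len(path)

      MATCH_THRESHOLD = 50.0  # placeholder; would be tuned on real data

      def is_match(path, x, y):
          return path_score(path, x, y) < MATCH_THRESHOLD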
Status Report #7: 11/9 (Cyrus)
- Updated the schedule for the team to reflect the goals for the next 4 weeks.
- Looking into dynamic time warping with Spencer for more accurate speech recognition; our MFCC-based matching seems unpredictable when "Hey Siri" is said at different speeds.
- Integrating the FastDTW module with Spencer: https://pypi.org/project/fastdtw/
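- The basic fastdtw call looks roughly like this; the arrays below are stand-ins for MFCC frame sequences, since the real feature extraction is still being wired in:
      import numpy as np
      from scipy.spatial.distance import euclidean
      from fastdtw import fastdtw

      # Stand-ins for two MFCC frame sequences (num_frames x num_coefficients);
      # in the real pipeline these would come from the recorded audio.
      template = np.random.rand(80, 13)
      utterance = np.random.rand(95, 13)

      # fastdtw returns an approximate DTW distance plus the warping path,
      # which is what lets us compare "Hey Siri" spoken at different speeds.
      distance, path = fastdtw(template, utterance, dist=euclidean)
      print(distance, len(path))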
Status Report #6: 11/2 (Cyrus)
- Had to miss the meeting with Professor Stern due to an onsite interview. Also looked into dynamic time warping (Professor Stern suggested it, so I had to ramp up on it on my own after missing the meeting).
- Worked with Spencer and Eugene to create the demo for the upcoming week. The demo uses a prediction model based on MFCC coefficients.
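- The demo code itself isn't reproduced here; below is only a deliberately crude Python sketch of what a prediction based on MFCC coefficients can look like (librosa-based, with placeholder file names and threshold), not our actual model:
      import numpy as np
      import librosa

      # Placeholder clips; the real demo works on live recordings.
      y_template, sr = librosa.load("hey_siri_template.wav", sr=16000)
      y_test, _ = librosa.load("test_clip.wav", sr=16000)

      # 13 MFCCs per frame is a common starting point.
      mfcc_template = librosa.feature.mfcc(y=y_template, sr=sr, n_mfcc=13)
      mfcc_test = librosa.feature.mfcc(y=y_test, sr=sr, n_mfcc=13)

      # Crude prediction: compare the average MFCC vectors against a distance
      # threshold. This ignores timing entirely, which is the weakness that
      # is pushing us toward DTW.
      dist = np.linalg.norm(mfcc_template.mean(axis=1) - mfcc_test.mean(axis=1))
      print("wake word" if dist < 25.0 else "no match")  # placeholder threshold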
Status Report #5: 10/26 (Cyrus)
- Looked into audio transcription using C++, which made us realise that audio transcription is I/O bound. This confirmed our suspicion that a signal-processing-based approach was the only way to move forward.
- Set up the time sync infrastructure with Eugene (sketched below) and fixed numerous bugs across the two Python scripts; also read through some of the PyAudio source code to understand why some of our programs weren't working as expected.
- With a better understanding of PyAudio, Eugene and I were able to reduce the lower-bound latency even further, to around 100 ms.
- Looked into MFCC coefficients with Spencer, but we were unable to come up with an accurate way of comparing these coefficients across two different recordings. We are meeting with Professor Stern on Monday to get clarity on this.
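- The sync boils down to both scripts referencing the same wall clock on one machine; a stripped-down Python sketch of the idea (the file name and hand-off mechanism are placeholders, not the actual scripts):
      import time

      SYNC_FILE = "playback_start.txt"  # placeholder hand-off between the two scripts

      # Player side: record when the test utterance starts playing.
      def mark_playback_start():
          with open(SYNC_FILE, "w") as f:
              f.write(repr(time.time()))  # wall clock shared by both processes

      # Detector side: compute latency once input has been detected.
      def detection_latency():
          with open(SYNC_FILE) as f:
              start = float(f.read())
          return time.time() - start  # seconds from playback start to detection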
Status Report #4: 10/19 (Cyrus)
- Set up a venv to handle the speech recognition module.
- Looked at Spencer's code for audio-to-text conversion for potential improvements and optimizations.
- Looked into compiled Python as a way to improve performance over interpreted Python; there was minimal difference, which hints that the program is I/O bound (a rough way to check this is sketched below).
- Next steps: looking to replicate this in C++ to enhance performance. Spencer and I are diverging at this point to try two different approaches and see which one works. My approach should be sufficient if audio-to-text is compute bound; otherwise, signal processing might be required to reduce the dependence on I/O.
- Looking to use TensorFlow.
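- A rough way to sanity-check the I/O-bound hypothesis is to compare wall time against CPU time for a single recognition call; the sketch below is generic, and recognize_once is a placeholder for whatever call is being timed:
      import time

      def classify_bottleneck(fn, *args):
          """Compare wall time to CPU time for one call to fn.

          If wall time is much larger than CPU time, the call spends most of
          its time waiting (I/O bound); if they are close, it is compute bound.
          """
          wall_start = time.perf_counter()
          cpu_start = time.process_time()
          fn(*args)
          wall = time.perf_counter() - wall_start
          cpu = time.process_time() - cpu_start
          verdict = "likely I/O bound" if wall > 2 * cpu else "likely compute bound"
          return wall, cpu, verdict

      # Usage (recognize_once is a placeholder):
      # print(classify_bottleneck(recognize_once, "clip.wav"))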
Status Report #3: 10/12 (Cyrus)
- Went through the design review feedback and started researching more concrete metrics for NLP systems.
- Worked on the design review document.
- Looking into setting up time sync for our testing infrastructure.
Status Report #2: 10/5 (Cyrus)
- Did code analysis on the basic audio input/output demo to reduce latency.
- Designed time sync infrastructure to accurately measure the time taken by our program to detect input once the user has started speaking.
- Designed block diagrams for the various components and figured out a general flow for the presentation.
Status Report #1: 9/28 (Cyrus)
- Carried out experiments on how to "jam" the wake word on Siri, since we did not have a Google Home/Alexa yet. Tests were successful with human voices; however, playing a recording of the jamming voice in a loop seemed to give only around a 50% success rate. (Done with Spencer.)
- Redefined the problem as a latency problem: how do we obfuscate the wake word effectively? We need to hit the "s" sound at the same time it occurs in "Siri."
- Latency testing for the program Eugene wrote using Python and PyAudio gave good results: it is very fast at detecting input and spitting out a predefined output, even without a neural net in the middle. This establishes that what we are doing is possible. (Done with Spencer.)
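- For the record, a minimal Python sketch of the kind of detect-and-respond loop being tested, not Eugene's actual code; the energy threshold, buffer size, and the noise-burst "output" are all placeholders:
      import numpy as np
      import pyaudio

      CHUNK = 256        # small buffer keeps input-to-output latency low
      RATE = 16000
      THRESHOLD = 500    # placeholder energy threshold for "input detected"

      pa = pyaudio.PyAudio()
      mic = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)
      speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                        output=True, frames_per_buffer=CHUNK)

      # Predefined output: a short noise burst standing in for the jamming sound.
      jam = (np.random.randn(RATE // 4) * 3000).astype(np.int16).tobytes()

      try:
          while True:
              samples = np.frombuffer(mic.read(CHUNK), dtype=np.int16)
              if np.abs(samples).mean() > THRESHOLD:  # crude input detection
                  speaker.write(jam)                  # immediately play the output
                  break
      finally:
          mic.close()
          speaker.close()
          pa.terminate()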