swyu – Team B² – Jamming Attack on Voice Recognition Systems

Status Report #11: (12/7) Spencer

Wrote final report document.
Experimented with Apple HomePod and was unsuccessful in jamming it with our program.
Experimented with the directional mic and found that our computers cannot play from the built-in speaker while another device (the directional mic) is attached to the headphone jack.
Measured the final volume of our exploit’s output.

Last week, Eugene delivered our final presentation. Find the slides here!

Met with Prof. Stern about how to proceed with our attack.
Rewrote the Matlab version of the attack code.
Finished several DTW + MFCC implementations:
- Constantly polling in 30ms frames vs. triggering only at a given threshold
- Incorporating DTW gives a much lower false positive rate. We can play music or some verbal sounds without automatically triggering the attack, which is far better than our midpoint demo.
- In practice it seems slightly too slow to jam an iPhone. This could be because my hardware is older than Eugene’s, since the attack seems to work on his computer.
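
To illustrate the difference between constant polling and threshold triggering, here is a minimal Python sketch. It is not our attack code: the function names, the RMS energy measure, and the threshold value are assumptions chosen for illustration. It splits audio into 30 ms frames and reports only the frames loud enough to be worth passing on to the more expensive MFCC + DTW matching.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=30):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("sample_rate is too low for the requested frame length")
    # Any trailing partial frame is dropped.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def triggered_frames(samples, sample_rate, threshold, frame_ms=30):
    """Return the indices of frames whose RMS energy meets the (hypothetical) threshold."""
    frames = frame_signal(samples, sample_rate, frame_ms)
    return [i for i, frame in enumerate(frames) if rms(frame) >= threshold]

# Example: silence, a loud burst, then silence again (1 kHz sample rate for brevity).
audio = [0.0] * 30 + [1.0] * 30 + [0.0] * 30
loud = triggered_frames(audio, sample_rate=1000, threshold=0.5)
```

Polling would run the matcher on every frame, while the triggered variant runs it only on the frames returned here, trading a small risk of missed quiet speech for less work per second.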

Met with Prof. Stern multiple times to try resolving the issues we had with our pipeline.
Verifying our results with the Librosa MFCC + FastDTW chain is difficult because the code is opaque.
We switched to Matlab to try making a more transparent, verifiable, and well supported pipeline.
Currently we are having trouble creating spectrograms that verify the results we care about: the warped data should have a spectrogram similar to the reference sample's, with the utterances in the same locations.
Will continue to explore fixes with this pipeline.

Implemented a version of the code that combines DTW and MFCCs. It doesn’t seem to be effective, so we need to figure out whether this is an implementation bug or whether the approach, as currently implemented, should not work even in theory.
Did further reading on DTW to try to understand why this method is not working as expected.
Thinking about adding adaptive filter (https://pypi.org/project/adaptfilt/) to potentially improve performance.
Will talk to Prof. Stern next week about DTW and figure out how to proceed.

Explored viability of using dynamic time warping for more accurate MFCC prediction.
Read the FastDTW paper: it computes a good approximation of DTW in O(n) time and space, rather than the O(n^2) time and space of the standard algorithm. https://pdfs.semanticscholar.org/05a2/0cde15e172fc82f32774dd0cf4fe5827cad2.pdf
Exploring integration of the FastDTW module in Python: https://pypi.org/project/fastdtw/
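
For reference, here is a minimal sketch of the standard O(n^2) DTW that FastDTW approximates. It is written in plain Python for illustration and is not the FastDTW module's code. The function names are hypothetical, and each sequence is assumed to be a list of feature vectors (for example, one MFCC vector per frame).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(ref, query, dist=euclidean):
    """Classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (ref_index, query_index)
    pairs aligning the two sequences. Uses O(n*m) time and space.
    """
    n, m = len(ref), len(query)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of ref[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], query[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance ref only
                                 cost[i][j - 1],      # advance query only
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# The query repeats the middle value, so DTW stretches the reference to match it.
reference = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
distance, alignment = dtw(reference, stretched)
```

A low total cost means the two sequences match up to timing differences. The returned path shows which reference frame each query frame was aligned to, which is the information needed when checking that utterances land in the same locations after warping.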

Talked to Prof. Stern about MFCCs, signal processing background on what they are, and how to apply them. Discussed several different approaches with varying speeds (dynamic time warping, HMMs, deep learning).
Working on end-to-end system integration for the in-lab demo.

Since audio transcription is very slow, investigated a signal-processing-based approach to speed up the system.
Researched MFCCs and their significance for speech recognition.
Ran tests to check the speed of the MFCC library (librosa).
Worked on integrating librosa with the audio input from previous weeks. Added timing code: librosa can process an audio chunk from the previous system in 0.005 sec, which is good news for us.
Next steps: talking to Prof. Stern about MFCC & best way to recognize matching speech. Integration of simple end to end system for in lab demo.
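
The kind of timing measurement described above can be sketched as follows. This is a hypothetical helper, not the timing code we actually added. The stand-in `fake_feature_extractor` is an assumption that keeps the example self-contained; in the real pipeline the timed call would be something like `librosa.feature.mfcc(y=chunk, sr=sample_rate)`.

```python
import time

def median_runtime(fn, *args, repeats=20, **kwargs):
    """Call fn repeatedly and return the median wall-clock time per call, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    durations.sort()
    return durations[len(durations) // 2]

def fake_feature_extractor(chunk):
    """Stand-in for the MFCC call (assumption): just does some arithmetic on the chunk."""
    return sum(s * s for s in chunk)

# Time the stand-in on one chunk of 1024 samples.
chunk = [0.1] * 1024
seconds_per_chunk = median_runtime(fake_feature_extractor, chunk)
```

Taking the median over several runs makes the figure less sensitive to one-off delays such as a library's first-call setup cost.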

Set up a venv to handle the speech recognition module.
Created a basic audio-to-text proof-of-concept pipeline using the speech recognition module in Python.
Measured performance of compiled vs. interpreted python & found no noticeable difference in performance. Performance of this pipeline is really poor and takes > 1 second to run consistently.
Next steps: Investigating ways to use signal processing techniques to enhance performance/response time of basic pipeline. Ex: using MFCC coefficients may be faster than audio to text.
Possible library to look at: (https://github.com/MycroftAI/sonopy)

Based on our progress, we’ve created a design document for our project. Check it out here!