- Worked on using dynamic time warping (DTW) to map the time series of one signal onto the time series of another.
- Wrote DTW code that returns a path with this mapping, but we are unsure how to use this path to actually compare two signals and decide whether they match (one possible approach is sketched after this group of updates). Hence, we decided to meet with Stern to figure out what conclusions we can draw from DTW, and how it actually solves some of the issues we ran into with our initial MFCC implementation.
- We also started testing under different environments and realised that ambient noise might be a problem, so we are considering adaptive filters to remove the noise.
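One way to get a yes/no answer out of the DTW path is to normalise the accumulated alignment cost by the path length and compare it against a threshold. The sketch below is a rough illustration of that idea in plain NumPy, not our final implementation; `dtw_path`, `signals_match`, and the 0.5 threshold are placeholder names and values that would have to be tuned on real recordings.

```python
# Minimal sketch: turn a DTW alignment path into a match/no-match decision.
import numpy as np

def dtw_path(a, b):
    """Classic O(n*m) DTW between two 1-D feature sequences.
    Returns (total_cost, path) where path is a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the corner to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

def signals_match(a, b, threshold=0.5):
    """Normalise the accumulated cost by the path length so the score does not
    grow simply because the signals are long, then threshold it."""
    total, path = dtw_path(np.asarray(a, float), np.asarray(b, float))
    return (total / len(path)) < threshold
```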
- Wrote a version of the jammer demo that averages MFCC samples before comparing (a rough sketch of the idea is below).
- Ran benchmark tests with DTW, but we’re currently blocked on understanding how to use it. We plan on meeting with Stern to figure out its use cases.
- Initial tests don’t yield better performance than comparing against multiple samples individually. I’m going to investigate adding more samples to the average to see whether that changes the results.
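A rough sketch of the averaging idea, assuming librosa for MFCC extraction (the file names and helper names are placeholders): reference clips are truncated to a common frame count so their MFCC frames can be averaged, and an incoming clip is scored against the averaged reference with mean squared error.

```python
import numpy as np
import librosa

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Load a clip and return its MFCCs as an array of shape (frames, coeffs)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def averaged_reference(paths):
    """Average several reference recordings frame-by-frame."""
    feats = [mfcc_frames(p) for p in paths]
    n = min(f.shape[0] for f in feats)        # truncate to the shortest clip
    return np.mean([f[:n] for f in feats], axis=0)

def mse_score(reference, candidate):
    """Mean squared error between two MFCC frame sequences (lower = closer)."""
    n = min(len(reference), len(candidate))
    return float(np.mean((reference[:n] - candidate[:n]) ** 2))

# ref = averaged_reference(["hey_siri_1.wav", "hey_siri_2.wav", "hey_siri_3.wav"])
# print(mse_score(ref, mfcc_frames("incoming.wav")))
```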
- Implemented a version of the code that combines DTW and MFCCs. It doesn’t seem to be effective, so we need to figure out whether this is an implementation bug or whether the approach should not, in theory, work the way we currently have it implemented.
- Did further reading on DTW to try to understand why this method is not working as expected.
- Thinking about adding an adaptive filter (https://pypi.org/project/adaptfilt/) to potentially improve performance; a rough noise-cancellation sketch follows this group of updates.
- Will talk to Prof. Stern next week re: DTW and figure out how to proceed.
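For the adaptive-filter idea, the sketch below shows a plain-NumPy NLMS noise canceller of the kind the adaptfilt package provides ready-made. It assumes we can record a separate noise-reference channel, which we have not verified for our setup yet; the tap count and step size are illustrative.

```python
import numpy as np

def nlms_cancel(noise_ref, primary, taps=64, mu=0.5, eps=1e-6):
    """Adapt an FIR filter so noise_ref predicts the noise in primary;
    the error signal e is the (hopefully) de-noised output."""
    w = np.zeros(taps)
    e = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        u = noise_ref[n - taps:n][::-1]               # most recent samples first
        y = np.dot(w, u)                              # filter's noise estimate
        e[n] = primary[n] - y                         # cleaned output sample
        w += (mu / (eps + np.dot(u, u))) * u * e[n]   # NLMS weight update
    return e
```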
- Updated the team’s schedule to reflect the goals for the next four weeks.
- Looking into dynamic time warping with Spencer for more accurate speech recognition; MFCC comparison seems to be unpredictable when “Hey Siri” is said at different speeds.
- Integrating the FastDTW module with Spencer (usage sketch below): https://pypi.org/project/fastdtw/
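Based on the fastdtw package’s documented example, this is roughly how we expect to call it on two MFCC frame sequences; `mfcc_frames` refers to the helper sketched earlier, and dividing by the path length is our own addition to keep scores comparable across clip lengths.

```python
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

def dtw_distance(mfcc_a, mfcc_b):
    """mfcc_a, mfcc_b: arrays of shape (frames, coefficients)."""
    distance, path = fastdtw(mfcc_a, mfcc_b, dist=euclidean)
    return distance / len(path)   # normalise by path length

# score = dtw_distance(mfcc_frames("hey_siri_ref.wav"), mfcc_frames("incoming.wav"))
```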
- Configured the project demo, tweaking the exploit’s output volume and the delay before recognition.
- Looking into normalization of signals to help factor out volume differences when recognizing wake words (see the sketch below).
- Further research into what to do with MFCCs: correlation is pretty low for audio sample analysis: https://www.researchgate.net/post/Why_we_take_only_12-13_MFCC_coefficients_in_feature_extraction
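A sketch of the volume normalization we are considering applying before MFCC extraction, so that loud and quiet utterances produce comparable features; the target RMS level here is an arbitrary placeholder.

```python
import numpy as np

def rms_normalize(signal, target_rms=0.1):
    """Scale a float waveform so its RMS level matches target_rms."""
    rms = np.sqrt(np.mean(np.square(signal)))
    if rms < 1e-8:                 # avoid dividing by ~zero on silent input
        return signal
    return signal * (target_rms / rms)

def peak_normalize(signal):
    """Alternative: scale so the loudest sample sits at full scale."""
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal
```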
- Explored the viability of using dynamic time warping for more accurate MFCC-based prediction.
- Read the FastDTW paper: it computes a good approximation of DTW in O(n) time and space rather than the O(n^2) time and space of the exact algorithm. https://pdfs.semanticscholar.org/05a2/0cde15e172fc82f32774dd0cf4fe5827cad2.pdf
- Exploring integration of the FastDTW module in Python: https://pypi.org/project/fastdtw/
- Met with Professor Stern to talk about the motivation behind MFCCs and their applications to speech detection. As of now, mean squared error is not an excellent indication of correlation between two audio samples, so he recommended that we look into dynamic time warping. Vyas told us that this might extend past the scope of our project in terms of capturing every possible utterance of “Hey Siri”, but it might be useful if MFCCs continue to prove unhelpful.
- Worked on designing our in-lab demo end-to-end. Investigating the use of bash scripting to handle time synchronization, because research into system time sync through Python has been unfruitful.
- Had to miss the meeting with Professor Stern due to an onsite interview. Also looked into dynamic time warping (Professor Stern suggested it, so I had to ramp up on it on my own after missing the meeting).
- Worked with Spencer and Eugene to create the demo for the upcoming week. The demo uses a prediction model based on MFCCs.
- Talked to Prof. Stern about MFCCs: the signal-processing background on what they are and how to apply them. Discussed several different approaches for handling utterances spoken at varying speeds (dynamic time warping, HMMs, deep learning).
- Working on end-to-end system integration for the in-lab demo.
- Looked into audio transcription using C++, and this made us realise that audio transcription is I/O bound. This confirmed our suspicion that a signal-processing-based approach was the only way to move forward.
- Set up the time sync infrastructure with Eugene and fixed numerous bugs across the two Python scripts, as well as reading through some of the PyAudio source code to understand why some of our programs weren’t working as expected.
- With a better understanding of PyAudio, Eugene and I were able to reduce the lower-bound latency even further, to around 100 ms (see the capture sketch below).
- Looked into MFCC coefficients with Spencer, but we were unable to come up with an accurate way of comparing these coefficients across two different recordings. We are meeting with Professor Stern on Monday to get clarity on this.
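For reference, the sketch below shows the kind of low-latency PyAudio capture we converged on: a callback stream with a small buffer so audio arrives in roughly 16 ms chunks instead of large blocking reads. The exact sample rate and buffer size are illustrative, and `process_chunk` is a placeholder for our detection pipeline.

```python
import time
import pyaudio

RATE = 16000
FRAMES_PER_BUFFER = 256          # 256 / 16000 Hz = 16 ms per callback

def process_chunk(raw_bytes):
    """Placeholder for the MFCC/DTW detection pipeline."""
    pass

def on_audio(in_data, frame_count, time_info, status):
    # Hand the raw chunk to whatever does the MFCC/DTW processing.
    process_chunk(in_data)
    return (None, pyaudio.paContinue)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE, input=True,
                 frames_per_buffer=FRAMES_PER_BUFFER, stream_callback=on_audio)
stream.start_stream()

# Keep the main thread alive while the callback runs in the background.
while stream.is_active():
    time.sleep(0.1)
```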