This week I worked on improving the robustness of audio alignment. Unfortunately, dynamic time warping the entire reference audio against the live audio recorded from the player took too long. A significant amount of time therefore went into implementing a function we call MidiAlign. This function takes the chroma vectors of the live audio and scans them for sustained harmonic frequencies. The resulting list of notes is then matched against the reference MIDI file to find every place where that sequence of notes occurs. To choose an instance in the reference MIDI to align to, the confidence of each candidate is weighted by its distance from where the user is currently playing and by the number of missed notes in the sequence. As a result, even if the user plays a wrong note, the function will not align to a drastically different section of the piece.
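The candidate selection described above can be illustrated with a rough sketch. This is not the actual MidiAlign code: the function names, the use of MIDI pitch numbers, the linear cost, and the weight values are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def find_candidates(played: List[int], reference: List[int],
                    max_misses: int = 1) -> List[Tuple[int, int]]:
    """Return (start_index, misses) for every window of the reference
    that matches the played notes with at most max_misses mismatches."""
    candidates = []
    n = len(played)
    for start in range(len(reference) - n + 1):
        window = reference[start:start + n]
        misses = sum(1 for p, r in zip(played, window) if p != r)
        if misses <= max_misses:
            candidates.append((start, misses))
    return candidates


def choose_alignment(candidates: List[Tuple[int, int]], current_index: int,
                     distance_weight: float = 1.0,
                     miss_weight: float = 4.0) -> Optional[int]:
    """Pick the candidate with the lowest cost. The cost grows with the
    distance from the current position and with the number of missed
    notes (both weights are hypothetical values)."""
    if not candidates:
        return None

    def cost(candidate: Tuple[int, int]) -> float:
        start, misses = candidate
        return distance_weight * abs(start - current_index) + miss_weight * misses

    return min(candidates, key=cost)[0]


# The same five-note phrase appears twice in this toy reference.
reference = [60, 62, 64, 65, 67, 60, 62, 64, 65, 67]
played = [62, 64, 66]  # the last note is wrong (66 instead of 65)

candidates = find_candidates(played, reference)
near_second_phrase = choose_alignment(candidates, current_index=5)
near_first_phrase = choose_alignment(candidates, current_index=0)
```

In this toy example both occurrences of the phrase match with one missed note, so the distance term decides: the player near index 5 is aligned to the second occurrence, and the player near index 0 to the first.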
Another point of difficulty was dealing with latency from various points in the system. For example, librosa is a Python library that we use to split the audio into frames and compute chroma vectors. However, the first call to this function runs caching in the background, which raises the delay from 20 ms to 900 ms. This made our first audio alignment lag the system and led to undefined behavior. We fixed this by making the first librosa call during setup. Another source of latency was the recurring call that updates the variables for the webpage. This call was originally made every 20 ms, which caused the system to lag. We raised the interval to 50 ms to give the backend more time to process while still keeping the frontend cursor moving smoothly.
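The warm-up fix can be sketched as follows. This is a minimal illustration rather than our actual setup code: `SlowFirstCall` is a hypothetical stand-in that simulates a one-time cost with a short sleep, and `warm_up` and `time_call` are helper names made up for this example.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Run fn once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


class SlowFirstCall:
    """Hypothetical stand-in for a library call that does one-time
    setup work (such as caching) the first time it is used."""

    def __init__(self, setup_seconds: float = 0.05) -> None:
        self.setup_seconds = setup_seconds
        self.ready = False
        self.calls = 0

    def __call__(self) -> None:
        if not self.ready:
            time.sleep(self.setup_seconds)  # simulated one-time cost
            self.ready = True
        self.calls += 1


def warm_up(fn: Callable[[], object]) -> float:
    """Call fn once during setup so the one-time cost is paid before
    the real-time loop starts. Returns how long the warm-up took."""
    return time_call(fn)


process_frame = SlowFirstCall()
setup_cost = warm_up(process_frame)   # paid once, before the player starts
live_cost = time_call(process_frame)  # what the real-time loop then sees
```

In the real system the callable passed to `warm_up` would be the librosa call itself, for example `librosa.feature.chroma_stft(y=silence, sr=sr)` on a short buffer of zeros, where `silence` and `sr` are placeholder names.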
The final demos are this upcoming week. We therefore hope to create a demo in which the whole system works, along with several other modes that demonstrate the individual subsystems. Unfortunately, because eye tracking and audio alignment are weighted together to determine a single page turn, it is hard to notice the individual contribution of each subsystem. We hope to have one mode that makes the behavior of eye tracking obvious and another in which only audio alignment is used to turn the page. This will help the audience better understand how the system as a whole works.
Overall, we are mostly on track and will continue to work to create an enjoyable demo for the exhibition.